Abstract
In many real-world applications, data are ordered and only partially labelled, largely because labelling is costly. Existing methods handle such data poorly, especially high-dimensional datasets: when the data change, they typically reprocess everything from scratch, which is highly inefficient. To address this, we introduce an incremental semi-supervised feature selection algorithm grounded in neighborhood discernibility that incorporates pseudolabel granular balls and matrix updating techniques. The approach evaluates the significance of features for labelled and unlabelled data independently, using neighborhood discernibility to identify the best feature subset. To improve computational efficiency on large datasets, we adopt a pseudolabel granular-ball technique that partitions the dataset into more manageable samples before feature selection. For high-dimensional data, we store neighborhood information in matrices, with distance functions and matrix structures tailored to both low-dimensional and high-dimensional settings. We also present a matrix updating method that accommodates changes in the number of features. Experiments on 12 datasets, 4 of which have more than 2000 features, show that our algorithm outperforms existing methods on large-sample and high-dimensional datasets and reduces running time by a factor of more than six on average compared with similar semi-supervised algorithms. Accuracy also improves by an average of 1.4%, 0.6%, and 0.2% per dataset for the SVM, KNN, and Random Forest classifiers, respectively, relative to the best-performing competing algorithm.
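The idea of storing neighborhood information in a matrix and updating it when features are added or removed can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a Chebyshev-style neighborhood (two samples are neighbors when they differ by at most `delta` on every selected feature), and the names `per_feature_hits` and `NeighborhoodMatrix` are hypothetical. The sketch keeps a count matrix of how many features place each pair within `delta`, so a change in the feature set only requires processing the changed columns.

```python
import numpy as np


def per_feature_hits(X, delta):
    """For each sample pair, count the features on which they differ by at most delta."""
    diff = np.abs(X[:, None, :] - X[None, :, :])  # shape (n, n, m)
    return (diff <= delta).sum(axis=2)            # shape (n, n)


class NeighborhoodMatrix:
    """Hypothetical helper: neighborhood relation kept as a per-pair count matrix.

    Assumption (not from the paper): x and y are neighbors iff
    |a(x) - a(y)| <= delta holds for every selected feature a.
    """

    def __init__(self, X, delta):
        self.delta = delta
        self.n_features = X.shape[1]
        self.hits = per_feature_hits(X, delta)

    def add_features(self, X_new):
        # Only the newly arrived columns are processed.
        self.hits += per_feature_hits(X_new, self.delta)
        self.n_features += X_new.shape[1]

    def remove_features(self, X_old):
        # Subtract the contribution of the departing columns.
        self.hits -= per_feature_hits(X_old, self.delta)
        self.n_features -= X_old.shape[1]

    def relation(self):
        # A pair is in the relation when all selected features agree within delta.
        return self.hits == self.n_features


rng = np.random.default_rng(0)
X = rng.random((6, 5))

incremental = NeighborhoodMatrix(X[:, :3], delta=0.3)
incremental.add_features(X[:, 3:])        # features 3 and 4 arrive
after_add = incremental.relation().copy()

incremental.remove_features(X[:, :1])     # feature 0 is dropped
after_remove = incremental.relation().copy()
```

In this sketch the incremental result matches a full recomputation on the same feature set, while the cost of each update depends only on the number of changed features. The intermediate `(n, n, m)` array is wasteful for very wide data; a real implementation would process columns in chunks.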
Notes
Data sources: https://jundongl.github.io/scikit-feature/
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 62376229) and the Natural Science Foundation of Chongqing (No. CSTB2023NSCQ-LZX0027).
Author information
Contributions
Weihua Xu: Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation. Jinlong Li: Data curation, Methodology, Formal analysis, Software, Visualization, Writing - original draft, Writing - review & editing.
Ethics declarations
Competing interests
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Publication ethics
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
Intellectual property
We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, W., Li, J. Granular-ball-matrix-based incremental semi-supervised feature selection approach to high-dimensional variation using neighbourhood discernibility degree for ordered partially labelled dataset. Appl Intell 55, 268 (2025). https://doi.org/10.1007/s10489-024-06134-1
DOI: https://doi.org/10.1007/s10489-024-06134-1