Abstract
To deal with semi-supervised feature selection tasks, this paper presents a recursive feature retention (RFR) method that builds on the neighborhood discriminant index (NDI) method (a supervised feature selection method) and a forward iterative Laplacian score (FILS) method (an unsupervised method), where FILS is designed specifically for RFR. The goal of RFR is to determine an optimal feature subset that has both a high discriminant ability and a strong ability to preserve the local structure of the data. The discriminant ability of a feature is measured by NDI, and its ability to preserve the local structure of the data is described by FILS. RFR combines these two scores into a balanced score for each feature, then iteratively selects the feature with the smallest balanced score and moves it into the current optimal feature subset. This paper also provides a theoretical analysis that speeds up the iterative procedure. Extensive experiments are conducted on toy and real-world data sets, and the results confirm that RFR achieves better performance than state-of-the-art semi-supervised methods.
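For intuition only, the following is a minimal sketch of the retention loop described above. The callables ndi_score and fils_score, the weighting parameter alpha, and the simple weighted sum used to balance the two criteria are illustrative placeholders, not the exact formulas of the paper.

def rfr_select(n_features, n_select, ndi_score, fils_score, alpha=0.5):
    # Greedy sketch of recursive feature retention: at every step, retain the
    # remaining feature whose balanced score is smallest and enlarge the
    # current optimal subset with it.
    selected = []                        # current optimal feature subset
    remaining = set(range(n_features))   # candidate features not yet retained
    for _ in range(min(n_select, n_features)):
        # balanced score: a compromise between the supervised (NDI-style)
        # discriminant score and the unsupervised (FILS-style) locality score
        best = min(remaining,
                   key=lambda f: alpha * ndi_score(f, selected)
                   + (1.0 - alpha) * fils_score(f, selected))
        selected.append(best)            # feature with the smallest balanced score
        remaining.remove(best)
    return selected

In practice, ndi_score would be computed from the labeled samples and fils_score from the local graph structure over all samples; both are assumed here to return values for which smaller means better, matching the "smallest balanced score" selection rule above.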
Acknowledgements
This work was supported in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 19KJA550002, by the Six Talent Peak Project of Jiangsu Province of China under Grant No. XYDXX-054, by the Priority Academic Program Development of Jiangsu Higher Education Institutions, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Proof of Theorem 1
Proof
For a given non-empty set A, we need to prove that \(J(A)\ge 0\) holds for \(1\le t\le n\). When \(A \ne \emptyset\), its Laplacian score is described as
The numerator of J(A) can be rewritten as
where \(\widetilde{{\mathbf {z}}}_{f_m}\) is a column of \(\widetilde{{\mathbf {Z}}}_{A}\). Because the Laplacian matrix \({\mathbf {L}}\) is symmetric and positive semi-definite, we have
which indicates that the numerator (20) of \(J(A^*(t))\) is nonnegative, or
Similarly, the denominator of J(A) can be rewritten as
Moreover, the matrix \({\mathbf {D}}\) is a diagonal matrix that is positive definite. Thus we have
By (22) and (24), when \(A\ne \emptyset\) we can draw the conclusion:
which completes the proof of Theorem 1. \(\square\)
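For readability, the argument can be condensed into a single chain. The ratio form of \(J(A)\) below is inferred from the numerator form (20) and from the column-wise notation used in this proof and in the proof of Theorem 2, so it should be read as a reconstruction rather than a reproduction of the displayed equations:
\[
J(A)=\frac{\sum_{f_m\in A}\widetilde{\mathbf{z}}_{f_m}^{T}\mathbf{L}\,\widetilde{\mathbf{z}}_{f_m}}{\sum_{f_m\in A}\widetilde{\mathbf{z}}_{f_m}^{T}\mathbf{D}\,\widetilde{\mathbf{z}}_{f_m}}\;\ge\;0,
\]
since every term \(\widetilde{\mathbf{z}}_{f_m}^{T}\mathbf{L}\,\widetilde{\mathbf{z}}_{f_m}\) is nonnegative (\(\mathbf{L}\) is positive semi-definite) and every term \(\widetilde{\mathbf{z}}_{f_m}^{T}\mathbf{D}\,\widetilde{\mathbf{z}}_{f_m}\) is positive (\(\mathbf{D}\) is positive definite, assuming \(\widetilde{\mathbf{z}}_{f_m}\ne \mathbf{0}\)).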
Proof of Theorem 2
Proof
To prove the inequalities (13) in Theorem 2, we use mathematical induction.
When \(t=1\), \(A^*(0)=\emptyset\) and \(A^*(1)\ne \emptyset\). Since \(J(A^*(0))=-\infty\) and \(J(A^*(1))\ge 0\) according to (11), \(J(A^*(0))\le J(A^*(1))\) is true. For simplification, let
When \(t=2\), the Laplacian score of a feature subset A(1) with \(|A(1)|=1\) can be reduced to:
where \(a_1^*\) and \(b_1^*\) are the corresponding \(a_{f_m}\) and \(b_{f_m}\) under the optimal solution, respectively.
where \(a_2^*=a_1^*+a_{p_1}\) and \(b_2^*=b_1^*+b_{p_1}\), and \({p_1}\) corresponds to the optimal feature index in the second iteration. Combining (26) and (27), we have
According to (26), we know
Thus, we have
Considering \(b_1^*(b_1^*+b_{p_1})>0\) and (30), we conclude that \(J(A^*(1))\le J(A^*(2))\) is true.
Assume that \(J(A^*(N-1))\le J(A^*(N))\) is true for \(t=N<n\). Without loss of generality, let
and
where \({p_{N-1}}\) is the optimal feature index in the N-th iteration. According to \(J(A^*(N-1))\le J(A^*(N))\), (31) and (32), for \(\forall f_k \in \overline{A^*(N-1)}\) we have
Further, we have
where \(f_k \in \overline{A^*(N-1)}\).
When \(t=N+1 \le n\), we want to prove that \(J(A^*(N))\le J(A^*(N+1))\) is true. Let
where \({p_{N}}\) is the optimal feature index in the \((N+1)\)-th iteration. We compute \(J(A^*(N))-J(A^*(N+1))\), and have
Substituting (36) into the last equation in (38) and replacing \(f_k\) with \(p_{N}\) owing to the arbitrariness of \(f_k\), we have
which shows that \(J(A^*(N))\le J(A^*(N+1))\) is true when \(t=N+1\).
Consequently, by the principle of mathematical induction, \(J(A^*(t-1))\le J(A^*(t))\) holds for \(1 \le t\le n\). \(\square\)
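Each induction step above ultimately rests on a mediant-type inequality. The following side note, stated with \(a^{*},b^{*}\) playing the role of the current subset's terms and \(a_{p},b_{p}\) the newly added feature's terms, is a standard fact added here for readability, not one of the paper's numbered equations: if \(b^{*}>0\), \(b_{p}>0\) and \(a^{*}/b^{*}\le a_{p}/b_{p}\), then
\[
\frac{a^{*}}{b^{*}}\;\le\;\frac{a^{*}+a_{p}}{b^{*}+b_{p}}\;\le\;\frac{a_{p}}{b_{p}},
\]
because
\[
\frac{a^{*}+a_{p}}{b^{*}+b_{p}}-\frac{a^{*}}{b^{*}}=\frac{a_{p}b^{*}-a^{*}b_{p}}{b^{*}\left(b^{*}+b_{p}\right)}\ge 0
\quad\text{and}\quad
\frac{a_{p}}{b_{p}}-\frac{a^{*}+a_{p}}{b^{*}+b_{p}}=\frac{a_{p}b^{*}-a^{*}b_{p}}{b_{p}\left(b^{*}+b_{p}\right)}\ge 0.
\]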
Proof of Theorem 3
Proof
Note that the set B satisfies \(|B|=1\). We follow the notation used in the proof of Theorem 2. In the t-th iteration, assume that
For the \((t+1)\)-th iteration, we have
where \({p_{t}}\) is the optimal feature index in the \((t+1)\)-th iteration, that is, \(A^*(t+1)-A^*(t)=B=\{f_{p_t}\}\).
According to Theorem 2, we know that \(J(A^*(t))-J(A^*(t+1))\le 0\), and have
Thus, according to \(a_{t}^*b_{p_{t}}-b_{t}^*a_{p_{t}}\le 0\), we have
Since \(a_{p_{t}}=\widetilde{{\mathbf {z}}}_{f_{p_{t}}}^T{\mathbf {L}} \widetilde{{\mathbf {z}}}_{f_{p_{t}}}=\widetilde{{\mathbf {Z}}}^T_B{\mathbf {L}} \widetilde{{\mathbf {Z}}}_B\) and \(b_{p_{t}}=\widetilde{{\mathbf {z}}}_{f_{p_{t}}}^T{\mathbf {D}} \widetilde{{\mathbf {z}}}_{f_{p_{t}}}=\widetilde{{\mathbf {Z}}}^T_B{\mathbf {D}} \widetilde{{\mathbf {Z}}}_B\), (43) can be rewritten as
which completes the proof of Theorem 3. \(\square\)
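Put together, the steps above give the following chain of inequalities; this is a reconstruction in the notation of this proof (the numbered display equations themselves are not reproduced):
\[
J(A^*(t))=\frac{a_t^{*}}{b_t^{*}}\;\le\;\frac{a_{p_t}}{b_{p_t}}
=\frac{\widetilde{\mathbf{Z}}_B^{T}\mathbf{L}\,\widetilde{\mathbf{Z}}_B}{\widetilde{\mathbf{Z}}_B^{T}\mathbf{D}\,\widetilde{\mathbf{Z}}_B},
\]
which follows from \(a_t^{*}b_{p_t}-b_t^{*}a_{p_t}\le 0\) together with \(b_t^{*}>0\) and \(b_{p_t}>0\).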
Proof of Theorem 5
Proof
According to the definition of neighborhood relation, we have
and
On the basis of \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)\), the distance function \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right)\) can be rewritten as
For any \(\left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A^k}^{\varepsilon }\), we have \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\). According to (47), \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\) also holds true, and thus \(\left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A}^{\varepsilon }\) by (45). In other words, \(R_{A^k}^{\varepsilon } \subseteq R_{A}^{\varepsilon }\), which completes the proof of Theorem 5. \(\square\)
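To make the key step concrete, here is a minimal illustration under the assumption that \(A^k\) denotes \(A\) augmented with one additional feature \(f_k\) and that \(\Delta\) is a Euclidean-type distance computed over the listed features (equation (47) itself is not reproduced here):
\[
\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right)
=\sqrt{\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)^2+\left( x_{ik}-x_{jk}\right)^2}
\;\ge\;\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right),
\]
so \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\) immediately forces \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\), which is exactly the inclusion \(R_{A^k}^{\varepsilon } \subseteq R_{A}^{\varepsilon }\).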
Proof of Theorem 6
Proof
To prove the rules in Theorem 6, we can use the relationship between \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right)\) and \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)\):
For the rule (1), we have
For the rule (2), we have
For the rule (3), we have
In summary, the three rules hold true. This completes the proof of Theorem 6. \(\square\)