
A recursive feature retention method for semi-supervised feature selection

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

To deal with semi-supervised feature selection tasks, this paper presents a recursive feature retention (RFR) method built on a neighborhood discriminant index (NDI) method (a supervised feature selection method) and a forward iterative Laplacian score (FILS) method (an unsupervised method), where FILS is designed specifically for RFR. The goal of RFR is to determine an optimal feature subset that has both a high discriminant ability and a strong ability to preserve the local structure of the data. The discriminant ability of a feature is measured by NDI, and its ability to preserve the local structure of the data is described by FILS. RFR combines these two scores into a balanced score for each feature, then iteratively selects the feature with the smallest balanced score and moves it into the current optimal feature subset. This paper also provides theoretical analysis to speed up the iterations. Extensive experiments are conducted on toy and real-world data sets, and the results confirm that RFR achieves better performance than state-of-the-art semi-supervised methods.
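
For intuition, the following Python sketch mimics the greedy loop described above: at every iteration the feature with the smallest balanced score is moved into the current optimal subset. The callables `ndi_score` and `fils_score` stand in for the NDI and FILS criteria defined in the paper, and the convex combination with weight `beta` is only an assumed way of balancing the two scores, not the paper's exact formula.

```python
def rfr_select(features, ndi_score, fils_score, k, beta=0.5):
    """Greedy RFR-style selection sketch (illustrative only).

    features   : iterable of feature indices
    ndi_score  : callable (selected, f) -> discriminant score of f (smaller is better)
    fils_score : callable (selected, f) -> local-structure score of f (smaller is better)
    k          : number of features to retain
    beta       : hypothetical trade-off weight between the two scores
    """
    selected, remaining = [], list(features)
    for _ in range(min(k, len(remaining))):
        # Balanced score of each candidate given the currently selected subset
        balanced = {f: beta * ndi_score(selected, f)
                       + (1.0 - beta) * fils_score(selected, f)
                    for f in remaining}
        best = min(balanced, key=balanced.get)  # feature with the smallest balanced score
        selected.append(best)
        remaining.remove(best)
    return selected
```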


References

  1. Shang R, Chang J, Jiao L et al (2019) Unsupervised feature selection based on self-representation sparse regression and local similarity preserving. Int J Mach Learn Cybern 10(4):757–770

    Article  Google Scholar 

  2. Karagoz GN, Yazici A, Dökeroglu T et al (2021) A new framework of multi-objective evolutionary algorithms for feature selection and multi-label classification of video data. Int J Mach Learn Cybern 12(1):53–71

    Article  Google Scholar 

  3. Zhang W, Kang P, Fang X et al (2019) Joint sparse representation and locality preserving projection for feature extraction. Int J Mach Learn Cybern 10(7):1731–1745

    Article  Google Scholar 

  4. Valiant LG (1984) A Theory of the learnable. Commun ACM 27(11):1134–1142

    Article  Google Scholar 

  5. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188

    Article  Google Scholar 

  6. Smallman L, Artemiou A, Morgan J (2018) Sparse generalised principal component analysis. Pattern Recogn 83:443–455

    Article  Google Scholar 

  7. Lai Z, Xu Y, Chen Q et al (2014) Multilinear sparse principal component analysis. IEEE Trans Neural Netw Learn Syst 25(10):1942–1950

    Article  Google Scholar 

  8. Wang S, Lu J, Gu X et al (2016) Semi-supervised linear discriminant analysis for dimension reduction and classification. Pattern Recogn 57:179–189

    Article  Google Scholar 

  9. Sheikhpour R, Sarram MA, Gharaghani S et al (2017) A Survey on semi-supervised feature selection methods. Pattern Recogn 64:141–158

    Article  Google Scholar 

  10. Wang X, Chen RC, Hong C et al (2018) Unsupervised feature analysis with sparse adaptive learning. Pattern Recogn Lett 102:89–94

    Article  Google Scholar 

  11. Benabdeslem K, Hindawi M (2011) Constrained Laplacian score for semi-supervised feature selection, in Machine Learning and Knowledge Discovery in Databases, pp 204-218

  12. Xu J, Tang B, He H et al (2017) Semisupervised feature selection based on relevance and redundancy criteria. IEEE Trans Neural Netw Learn Syst 28(9):1974–1984

    Article  MathSciNet  Google Scholar 

  13. Zhao J, Lu K, He X (2008) Locality sensitive semi-supervised feature selection. Neurocomputing 71(10):1842–1848

    Article  Google Scholar 

  14. Yang M, Chen Y, Ji G (2010) Semi-Fisher score: a semisupervised method for feature selection. In: International conference on machine learning and cybernetics. IEEE, pp 527–532

  15. Gu Q, Li Z, Han J (2011) Generalized Fisher score for feature selection. In: Twenty-seventh conference on uncertainty in arti cial intelligence. AUAI Press, pp 266-273

  16. Bishop CM (1996) Neural networks for pattern recognition. Oxford University Press, USA

    MATH  Google Scholar 

  17. Lv S, Jiang H, Zhao L et al. (2013) Manifold based Fisher method for semi-supervised feature selection. In: International conference on fuzzy systems and knowledge discovery. IEEE, pp 664-668

  18. Li Z, Liao B, Cai L et al (2018) Semi-supervised maximum discriminative local margin for gene selection. Sci Rep 8:8619

    Article  Google Scholar 

  19. He X, Cai D, Han J (2008) Learning a maximum margin subspace for image retrieval. IEEE Trans Knowl Data Eng 20(2):189–201

    Article  Google Scholar 

  20. He X, Cai D, Niyog P (2005) Laplacian score for feature selection. Neural Inf Process Syst 18:507–514

    Google Scholar 

  21. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

    Article  Google Scholar 

  22. Sedgwick P (2012) Pearson’s correlation coefficient, BMJ (online), 345(jul04 1): e4483-e4483

  23. Tang B, Zhang L (2019) Multi-class semi-supervised Logistic I-RELIEF feature selection based on nearest neighbor, knowledge discovery and data mining, lecture notes in computer science. Lect Notes Comput Sci 11440:281–292

    Article  Google Scholar 

  24. Sun Y, Todorovic S, Goodison S (2010) Local learning-based feature selection for high-dimensional data analysis. IEEE Trans Pattern Anal Mach Intell 32(9):1610–1626

    Article  Google Scholar 

  25. Tang B, Zhang L (2020) Local preserving logistic I-relief for semi-supervised feature selection. Neurocomputing 399:48–64

    Article  Google Scholar 

  26. Zhu L, Miao L, Zhang D (2012) Iterative Laplacian score for feature felection, Chinese conference on pattern recognition, 507–541

  27. Wang C, Hu Q, Wang X et al (2018) Feature selection based on neighborhood discrimination index. IEEE Trans Neural Netw Learn Syst 29(7):2986–2999

    MathSciNet  Google Scholar 

  28. Zelnik-manor L, Perona P (2004) Self-tuning spectral clustering. In: Advances in neural information processing systems. Vol 17, MIT Press, Cambridge pp 1601–1608

  29. Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository, UCI machine learning repository. URL http://archive.ics.uci.edu/ml

  30. Monti S, Tamayo P, Mesirov J et al (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1–2):91–118

    Article  Google Scholar 

  31. Golub TR, Slonim DK, Tamayo P et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression. Science 286(5439):531–537

    Article  Google Scholar 

  32. Cooke MP, Ching KA, Hakak Y et al (2002) Large-scale analysis of the human and house transcriptomes. Proc Natl Acad Sci 99(7):4465–4470

    Article  Google Scholar 

  33. Yeoh EJ, Ross ME, Shurtle SA et al (2002) Classification, subtype discovery, and prediction of outcome in pediatricacute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1(2):133–143

    Article  Google Scholar 

  34. Bhattacharjee A, Richards WG, Staunton J et al (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinomas sub-classes. Proc Natl Acad Sci 98(24):13790–13795

    Article  Google Scholar 

  35. Pomeroy S, Tamayo P, Gaasenbeek M et al (2001) Gene expression-based classification and outcome prediction of central nervous system embryonal tumors. Nature 415(6870):436–442

    Article  Google Scholar 

  36. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York

    MATH  Google Scholar 

  37. Shieh MD, Yang CC (2008) Multiclass SVM-REF for product from feature selection. Expert Syst Appl 35(1–2):531–541

    Article  Google Scholar 

  38. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

    Article  Google Scholar 

  39. Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64

    Article  MathSciNet  Google Scholar 

  40. Chen H, Tiňo P, Yao X (2009) Predictive ensemble pruning by expectation propagation. IEEE Trans Knowl Data Eng 21(7):999–1013

    Article  Google Scholar 

  41. Zhang L, Huang X, Zhou W (2019) Logistic local hyperplane-Relief: A feature weighting method for classification. Knowl Based Syst 181:1

    Google Scholar 

  42. Huang X, Zhang L, Li F et al (2018) Feature weight estimation based on dynamic representation and neighbor sparse reconstruction. Pattern Recogn 81(9):338–403

    Google Scholar 

Download references

Acknowledgements

This work was supported in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 19KJA550002, by the Six Talent Peak Project of Jiangsu Province of China under Grant No. XYDXX-054, by the Priority Academic Program Development of Jiangsu Higher Education Institutions, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Corresponding author

Correspondence to Li Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Theorem 1

Proof

For a given non-empty set \(A\), we need to prove that \(J(A)\ge 0\); since \(A^*(t)\ne \emptyset\) for \(1\le t\le n\), this implies \(J(A^*(t))\ge 0\). When \(A \ne \emptyset\), its Laplacian score is given by

$$\begin{aligned} J(A)=\frac{trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {L}} \widetilde{{\mathbf {Z}}}_{A}\right) }{trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {D}}\widetilde{{\mathbf {Z}}}_{A}\right) } \end{aligned}$$
(19)

The numerator of J(A) can be rewritten as

$$\begin{aligned} trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {L}}\widetilde{{\mathbf {Z}}}_{A}\right) = \sum _{f_m\in A}\widetilde{{\mathbf {z}}}_{f_m}^T{\mathbf {L}}\widetilde{{\mathbf {z}}}_{f_m} \end{aligned}$$
(20)

where \(\widetilde{{\mathbf {z}}}_{f_m}\) is a column of \(\widetilde{{\mathbf {Z}}}_{A}\). Because the Laplacian matrix \({\mathbf {L}}\) is symmetric and positive semi-definite, we have

$$\begin{aligned} \widetilde{{\mathbf {z}}}_{f_m}^T{\mathbf {L}}\widetilde{{\mathbf {z}}}_{f_m} \ge 0 \end{aligned}$$
(21)

which indicates that the numerator (20) of \(J(A)\) is nonnegative, that is,

$$\begin{aligned} trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {L}}\widetilde{{\mathbf {Z}}}_{A}\right) \ge 0 \end{aligned}$$
(22)

Similarly, the denominator of J(A) can be rewritten as

$$\begin{aligned} trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {D}}\widetilde{{\mathbf {Z}}}_{A}\right) = \sum _{f_m\in A}\widetilde{{\mathbf {z}}}_{f_m}^T{\mathbf {D}}\widetilde{{\mathbf {z}}}_{f_m} \end{aligned}$$
(23)

Moreover, the matrix \({\mathbf {D}}\) is a diagonal matrix that is positive definite. Thus we have

$$\begin{aligned} trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {D}}\widetilde{{\mathbf {Z}}}_{A}\right) > 0 \end{aligned}$$
(24)

Combining (22) and (24), for \(A\ne \emptyset\) we conclude that

$$\begin{aligned} J(A)\ge 0 \end{aligned}$$
(25)

which completes the proof of Theorem 1. \(\square\)
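
A small numerical check of Theorem 1 follows. It builds an RBF similarity graph on random data, forms \({\mathbf {L}}\) and \({\mathbf {D}}\), applies the usual \({\mathbf {D}}\)-weighted centering of each feature column, and verifies that \(J(A)\ge 0\) for several non-empty subsets \(A\). The graph construction and the centering are assumptions made for illustration, since this excerpt does not restate them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))                      # 30 samples, 8 features

# Similarity graph and its Laplacian (illustrative RBF construction)
dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-dist2 / dist2.mean())
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W                                         # symmetric positive semi-definite

# D-weighted centering of each feature column (standard Laplacian-score centering)
ones = np.ones(X.shape[0])
Z = X - np.outer(ones, (X.T @ D @ ones) / (ones @ D @ ones))

def J(A):
    """Subset Laplacian score J(A) = trace(Z_A^T L Z_A) / trace(Z_A^T D Z_A)."""
    ZA = Z[:, list(A)]
    return np.trace(ZA.T @ L @ ZA) / np.trace(ZA.T @ D @ ZA)

for A in [{0}, {1, 3}, {0, 2, 5, 7}]:
    assert J(A) >= 0.0                            # Theorem 1
    print(sorted(A), J(A))
```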

Proof of Theorem 2

Proof

To prove the inequalities (13) in Theorem 2, we use mathematical induction.

When \(t=1\), \(A^*(0)=\emptyset\) and \(A^*(1)\ne \emptyset\). Since \(J(A^*(0))=-\infty\) and \(J(A^*(1))\ge 0\) according to (11), \(J(A^*(0))\le J(A^*(1))\) holds. For simplicity, let

$$\begin{aligned} trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {L}}\widetilde{{\mathbf {Z}}}_{A}\right) =\sum _{f_m\in A} a_{f_m},~~trace\left( \widetilde{{\mathbf {Z}}}_{A}^T{\mathbf {D}}\widetilde{{\mathbf {Z}}}_{A} \right) =\sum _{f_m\in A} b_{f_m} \end{aligned}$$

When \(t=2\), the Laplacian score of the optimal feature subset \(A^*(1)\) with \(|A^*(1)|=1\) reduces to

$$\begin{aligned} J(A^*(1))=\frac{a_1^*}{b_1^*}=\min _{f_k\in F}\frac{a_{f_k}}{b_{f_k}} \end{aligned}$$
(26)

where \(a_1^*\) and \(b_1^*\) are the values of \(a_{f_m}\) and \(b_{f_m}\) corresponding to the optimal solution, respectively. Similarly, we have

$$\begin{aligned} J(A^*(2))=\frac{a_2^*}{b_2^*}=\frac{a_1^*+a_{p_1}}{b_1^*+b_{p_1}}=\min _{f_k\in \overline{A^*(1)}}\frac{a_1^*+a_{f_k}}{b_1^*+b_{f_k}} \end{aligned}$$
(27)

where \(a_2^*=a_1^*+a_{p_1}\), \(b_2^*=b_1^*+b_{p_1}\), and \({p_1}\) is the index of the optimal feature in the second iteration. Combining (26) and (27), we have

$$\begin{aligned} J(A^*(1))-J(A^*(2))= & {} \frac{a_1^*}{b_1^*}-\frac{a_2^*}{b_2^*} \nonumber \\= & {} \frac{a_1^*}{b_1^*}-\frac{a_1^*+a_{p_1}}{b_1^*+b_{p_1}}\nonumber \\= & {} \frac{a_1^*\left( b_1^*+b_{p_1}\right) -b_1^*\left( a_1^*+a_{p_1}\right) }{b_1^*\left( b_1^*+b_{p_1}\right) }\nonumber \\= & {} \frac{a_1^*b_{p_1}-b_1^*a_{p_1}}{b_1^*(b_1^*+b_{p_1})} \end{aligned}$$
(28)

According to (26), we know

$$\begin{aligned} \frac{a_1^*}{b_1^*}\le \frac{a_{p_1}}{b_{p_1}} \end{aligned}$$
(29)

Thus, we have

$$\begin{aligned} a_1^*b_{p_1}-b_1^*a_{p_1}\le 0 \end{aligned}$$
(30)

Considering \(b_1^*(b_1^*+b_{p_1})>0\) and (30), we conclude that \(J(A^*(1))\le J(A^*(2))\) holds.

Assume that \(J(A^*(N-1))\le J(A^*(N))\) is true for \(t=N<n\). Without loss of generality, let

$$\begin{aligned} J(A^*(N-1))=\frac{a_{N-1}^*}{b_{N-1}^*} \end{aligned}$$
(31)

and

$$\begin{aligned} J(A^*(N))=\frac{a_{N}^*}{b_{N}^*}=\frac{a_{N-1}^*+a_{p_{N-1}}}{b_{N-1}^*+b_{p_{N-1}}}= \min _{f_k\in \overline{A^*(N-1)}} \frac{a_{N-1}^*+a_{f_k}}{b_{N-1}^*+b_{f_k}} \end{aligned}$$
(32)

where \({p_{N-1}}\) is the index of the optimal feature in the N-th iteration. According to \(J(A^*(N-1))\le J(A^*(N))\), (31), and (32), for any \(f_k \in \overline{A^*(N-1)}\) we have

$$\begin{aligned} \frac{a_{N-1}^*}{b_{N-1}^*}\le \frac{a_{N-1}^*+a_{p_{N-1}}}{b_{N-1}^*+b_{p_{N-1}}}\le \frac{a_{N-1}^*+a_{f_k}}{b_{N-1}^*+b_{f_k}} \end{aligned}$$
(33)

Further, we have

$$\begin{aligned} \frac{a_{N-1}^*}{b_{N-1}^*} \le \frac{a_{N-1}^*+a_{p_{N-1}}}{b_{N-1}^*+b_{p_{N-1}}} \Rightarrow \frac{a_{N-1}^*}{b_{N-1}^*} \le \frac{a_{p_{N-1}}}{b_{p_{N-1}}} \Rightarrow \nonumber \\ a_{N-1}^*b_{p_{N-1}}-b_{N-1}^*a_{p_{N-1}} \le 0 \end{aligned}$$
(34)
$$\begin{aligned} \frac{a_{N-1}^*}{b_{N-1}^*}\le \frac{a_{N-1}^*+a_{f_k}}{b_{N-1}^*+b_{f_k}} \Rightarrow \frac{a_{N-1}^*}{b_{N-1}^*}\le \frac{a_{f_{k}}}{b_{f_{k}}} \Rightarrow \nonumber \\ a_{N-1}^*b_{f_{k}}-b_{N-1}^*a_{f_{k}}\le 0 \end{aligned}$$
(35)
$$\begin{aligned} \frac{a_{N-1}^*+a_{p_{N-1}}}{b_{N-1}^*+b_{p_{N-1}}} \le \frac{a_{N-1}^*+a_{f_k}}{b_{N-1}^* +b_{f_k}} \Rightarrow \nonumber \\ (a^*_{N-1}b_{f_k}-b^*_{N-1}a_{f_k})+ \left( a_{p_{N-1}}b_{f_k}-b_{p_{N-1}}a_{f_k}\right) \le \nonumber \\ \left( a^*_{N-1}b_{{p_{N-1}}}-b^*_{N-1}a_{{p_{N-1}}} \right) \end{aligned}$$
(36)

where \(f_k \in \overline{A^*(N-1)}\).

When \(t=N+1 \le n\), we want to prove that \(J(A^*(N))\le J(A^*(N+1))\) is true. Let

$$\begin{aligned} J(A^*(N+1))=\frac{a_{N+1}^*}{b_{N+1}^*}=\frac{a_{N}^*+a_{p_{N}}}{b_{N}^*+b_{p_{N}}} \end{aligned}$$
(37)

where \({p_{N}}\) is the index of the optimal feature in the \((N+1)\)-th iteration. Computing \(J(A^*(N))-J(A^*(N+1))\), we have

$$\begin{aligned}&J(A^*(N))-J(A^*(N+1)) \nonumber \\= & {} \frac{a_{N}^*}{b_{N}^*}-\frac{a_{N+1}^*}{b_{N+1}^*} \nonumber \\= & {} \frac{a_{N}^*}{b_{N}^*}-\frac{a_{N}^*+a_{p_{N}}}{b_{N}^*+b_{p_{N}}}\nonumber \\= & {} \frac{a_{N}^*\left( b_{N}^*+b_{p_{N}}\right) -b_{N}^*\left( a_{N}^*+a_{p_{N}}\right) }{b_{N}^*(b_{N}^*+b_{p_{N}})}\nonumber \\= & {} \frac{a_{N}^*b_{p_{N}}-b_{N}^*a_{p_{N}}}{b_{N}^*(b_{N}^*+b_{p_{N}})}\nonumber \\= & {} \frac{\left( a_{N-1}^*b_{p_{N}}-b_{N-1}^*a_{p_{N}}\right) +\left( a_{p_{N-1}}b_{p_{N}} -b_{p_{N-1}}a_{p_{N}}\right) }{b_{N}^*(b_{N}^*+b_{p_{N}})} \end{aligned}$$
(38)

Since \(p_{N}\in \overline{A^*(N)}\subseteq \overline{A^*(N-1)}\), we may set \(f_k=p_{N}\) in (36) owing to the arbitrariness of \(f_k\); substituting (36) into the last line of (38) gives

$$\begin{aligned} J(A^*(N))-J(A^*(N+1)) \le \frac{\left( a^*_{N-1}b_{{p_{N-1}}}-b^*_{N-1}a_{{p_{N-1}}}\right) }{b_{N}^*(b_{N}^* +b_{p_{N}})}\le 0 \end{aligned}$$
(39)

which shows that \(J(A^*(N))\le J(A^*(N+1))\) is true when \(t=N+1\).

Consequently, by the principle of mathematical induction, \(J(A^*(t-1))\le J(A^*(t))\) holds for \(1 \le t\le n\). \(\square\)
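
The monotonicity in Theorem 2 is easy to verify numerically. The sketch below precomputes \(a_{f}=\widetilde{{\mathbf {z}}}_{f}^T{\mathbf {L}}\widetilde{{\mathbf {z}}}_{f}\) and \(b_{f}=\widetilde{{\mathbf {z}}}_{f}^T{\mathbf {D}}\widetilde{{\mathbf {z}}}_{f}\) for each feature, runs the forward greedy step that always adds the feature minimizing \((a^*+a_{f})/(b^*+b_{f})\), and checks that the resulting sequence \(J(A^*(t))\) never decreases; the graph construction is the same illustrative assumption as in the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-dist2 / dist2.mean()); np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1)); L = D - W
ones = np.ones(X.shape[0])
Z = X - np.outer(ones, (X.T @ D @ ones) / (ones @ D @ ones))

a = np.einsum('if,ij,jf->f', Z, L, Z)   # a_f = z_f^T L z_f
b = np.einsum('if,ij,jf->f', Z, D, Z)   # b_f = z_f^T D z_f

selected, a_star, b_star, scores = [], 0.0, 0.0, []
remaining = list(range(Z.shape[1]))
while remaining:
    # Greedy step: add the feature minimizing the updated subset score
    f = min(remaining, key=lambda m: (a_star + a[m]) / (b_star + b[m]))
    a_star, b_star = a_star + a[f], b_star + b[f]
    selected.append(f); remaining.remove(f)
    scores.append(a_star / b_star)       # J(A*(t))

# Theorem 2: J(A*(t)) is nondecreasing along the greedy iterations
assert all(s1 <= s2 + 1e-12 for s1, s2 in zip(scores, scores[1:]))
print(scores)
```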

Proof of Theorem 3

Proof

Note that the set \(B\) contains a single feature, i.e., \(|B|=1\). We follow the notation used in the proof of Theorem 2. In the t-th iteration, assume that

$$\begin{aligned} J(A^*(t))=\frac{a_{t}^*}{b_{t}^*} \end{aligned}$$
(40)

For the \((t+1)\)-th iteration, we have

$$\begin{aligned} J(A^*(t+1))=\frac{a_{t+1}^*}{b_{t+1}^*}=\frac{a_{t}^*+a_{p_{t}}}{b_{t}^*+b_{p_{t}}} \end{aligned}$$
(41)

where \({p_{t}}\) is the index of the optimal feature in the \((t+1)\)-th iteration, i.e., \(A^*(t+1)\setminus A^*(t)=B=\{f_{p_t}\}\).

According to Theorem 2, we know that \(J(A^*(t))-J(A^*(t+1))\le 0\), and have

$$\begin{aligned} J(A^*(t))-J(A^*(t+1))= & {} \frac{a_{t}^*b_{p_{t}}-b_{t}^*a_{p_{t}}}{b_{t}^*(b_{t}^*+b_{p_{t}})} \end{aligned}$$
(42)

Since the denominator \(b_{t}^*(b_{t}^*+b_{p_{t}})>0\), it follows that \(a_{t}^*b_{p_{t}}-b_{t}^*a_{p_{t}}\le 0\), and thus

$$\begin{aligned} \frac{a_{t}^*}{b_{t}^*}\le \frac{a_{p_{t}}}{b_{p_{t}}} \end{aligned}$$
(43)

Since \(a_{p_{t}}=\widetilde{{\mathbf {z}}}_{f_{p_{t}}}^T{\mathbf {L}} \widetilde{{\mathbf {z}}}_{f_{p_{t}}}=\widetilde{{\mathbf {Z}}}_B^T{\mathbf {L}} \widetilde{{\mathbf {Z}}}_B\) and \(b_{p_{t}}=\widetilde{{\mathbf {z}}}_{f_{p_{t}}}^T{\mathbf {D}} \widetilde{{\mathbf {z}}}_{f_{p_{t}}}=\widetilde{{\mathbf {Z}}}_B^T{\mathbf {D}} \widetilde{{\mathbf {Z}}}_B\), (43) can be rewritten as

$$\begin{aligned} J(A^*(t))\le \frac{\widetilde{{\mathbf {Z}}}_B^T{\mathbf {L}}\widetilde{{\mathbf {Z}}}_B}{\widetilde{{\mathbf {Z}}}_B^T{\mathbf {D}}\widetilde{{\mathbf {Z}}}_B} \end{aligned}$$
(44)

which completes the proof of Theorem 3. \(\square\)
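
Theorem 3 bounds the current subset score by the single-feature score of the feature that the next greedy step adds. Since the statement only depends on \(a_{f}\ge 0\) and \(b_{f}>0\), the check below uses random nonnegative surrogates for \(a_{f}\) and \(b_{f}\) instead of a real graph; this setup is purely an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.0, 1.0, size=20)     # surrogate a_f = z_f^T L z_f >= 0
b = rng.uniform(0.5, 1.5, size=20)     # surrogate b_f = z_f^T D z_f > 0

a_star, b_star, remaining = 0.0, 0.0, list(range(20))
while remaining:
    # Feature p_t chosen by the greedy rule of the forward iteration
    p = min(remaining, key=lambda m: (a_star + a[m]) / (b_star + b[m]))
    if b_star > 0:                                      # skip the first step (A*(0) is empty)
        assert a_star / b_star <= a[p] / b[p] + 1e-12   # Theorem 3 bound
    a_star, b_star = a_star + a[p], b_star + b[p]
    remaining.remove(p)
print("Theorem 3 bound verified on surrogate scores")
```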

Proof of Theorem 5

Proof

According to the definition of neighborhood relation, we have

$$\begin{aligned} R_A^{\varepsilon }=\left\{ \left( \mathbf{x }_i,\mathbf{x }_j\right) |\Delta ^A \left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon ,\left( \mathbf{x }_i,y_i\right) ,\left( \mathbf{x }_j,y_j\right) \in X_L\right\} \end{aligned}$$
(45)

and

$$\begin{aligned} R_{A^k}^{\varepsilon }=\left\{ \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) |\Delta ^{A^k} \left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon ,\left( \mathbf{x }_i,y_i\right) ,\left( \mathbf{x }_j,y_j\right) \in X_L\right\} \end{aligned}$$
(46)

On the basis of \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)\), the distance function \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right)\) can be rewritten as

$$\begin{aligned} \Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) =&\max _{f_q\in A^k} {\left| x_{iq}-x_{jq}\right| } \nonumber \\ =&\max \left( \Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) ,\left| x_{ik}-x_{jk} \right| \right) \end{aligned}$$
(47)

For any \(\left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A^k}^{\varepsilon }\), we have \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\). According to (47), \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon\) also holds. Thus, \(\left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A}^{\varepsilon }\) by (45). In other words, \(R_{A^k}^{\varepsilon } \subseteq R_{A}^{\varepsilon }\), which completes the proof of Theorem 5. \(\square\)
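
Theorem 5 can be checked directly: under the max (Chebyshev) distance over a feature subset, adding a feature never creates new \(\varepsilon\)-neighbour pairs. In the sketch below, `X` plays the role of the labelled samples \(X_L\), and the data, the subset `A`, the new feature index `k`, and `eps` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X, eps = rng.uniform(size=(25, 6)), 0.3

def relation(A):
    """Pairs (i, j) within eps under the max-distance restricted to features in A."""
    XA = X[:, sorted(A)]
    d = np.max(np.abs(XA[:, None, :] - XA[None, :, :]), axis=-1)   # Chebyshev distance
    return {(i, j) for i in range(len(X)) for j in range(len(X)) if d[i, j] <= eps}

A, k = {0, 2, 4}, 5
assert relation(A | {k}) <= relation(A)        # Theorem 5: R_{A^k} is a subset of R_A
print(len(relation(A)), len(relation(A | {k})))
```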

Proof of Theorem 6

Proof

To prove the rules in Theorem 6, we can use the relationship between \(\Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right)\) and \(\Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)\):

$$\begin{aligned} \Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) = \max \left( \Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) ,\left| x_{ik} -x_{jk}\right| \right) \end{aligned}$$
(48)

For rule (1), we have

$$\begin{aligned}&\forall \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \notin R_{A}^{\varepsilon } \Leftrightarrow \Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right)>\varepsilon \\&\Rightarrow \Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) >\varepsilon \Leftrightarrow \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \notin R_{A^k}^{\varepsilon } \end{aligned}$$

For rule (2), we have

$$\begin{aligned}&\forall \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A}^{\varepsilon } \wedge \left| {x}_{ik}-x_{jk}\right|>\varepsilon \Leftrightarrow \Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon \wedge \left| {x}_{ik}-x_{jk}\right|>\varepsilon \\&\Rightarrow \Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) >\varepsilon \Leftrightarrow \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \notin R_{A^k}^{\varepsilon } \end{aligned}$$

For rule (3), we have

$$\begin{aligned}&\forall \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A}^{\varepsilon } \wedge \left| {x}_{ik}-x_{jk}\right| \le \varepsilon \Leftrightarrow \Delta ^{A}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon \wedge \left| {x}_{ik}-x_{jk}\right| \le \varepsilon \\&\Rightarrow \Delta ^{A^k}\left( \mathbf{x }_i,\mathbf{x }_j\right) \le \varepsilon \Leftrightarrow \left( {\mathbf {x}}_i,{\mathbf {x}}_j\right) \in R_{A^k}^{\varepsilon } \end{aligned}$$

In summary, the three rules hold true. This completes the proof of Theorem 6. \(\square\)
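
Theorem 6 suggests an incremental update: once \(R_{A}^{\varepsilon }\) is known, \(R_{A^k}^{\varepsilon }\) is obtained by keeping only the pairs of \(R_{A}^{\varepsilon }\) whose distance on the new feature alone is within \(\varepsilon\) (rules (2) and (3)), while pairs outside \(R_{A}^{\varepsilon }\) remain outside by rule (1), so no full recomputation over \(A^k\) is needed. The sketch below verifies this against a direct recomputation; the data and parameters are again arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
X, eps = rng.uniform(size=(25, 6)), 0.3

def relation(A):
    """Direct computation of the eps-neighborhood relation over features in A."""
    XA = X[:, sorted(A)]
    d = np.max(np.abs(XA[:, None, :] - XA[None, :, :]), axis=-1)
    return {(i, j) for i in range(len(X)) for j in range(len(X)) if d[i, j] <= eps}

A, k = {1, 3}, 0
R_A = relation(A)
# Incremental update: filter R_A by the new feature only (rules (2) and (3));
# pairs not in R_A stay excluded by rule (1).
R_Ak_inc = {(i, j) for (i, j) in R_A if abs(X[i, k] - X[j, k]) <= eps}
assert R_Ak_inc == relation(A | {k})
print(len(R_A), len(R_Ak_inc))
```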

About this article


Cite this article

Pang, Q., Zhang, L. A recursive feature retention method for semi-supervised feature selection. Int. J. Mach. Learn. & Cyber. 12, 2639–2657 (2021). https://doi.org/10.1007/s13042-021-01346-0
