
Classifier selection using geometry preserving feature

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Selecting a proper classifier for a given data set is challenging. A critical problem in classifier selection is how to extract features from data sets. This paper proposes a new method for extracting a feature of a data set. Our method not only preserves the geometrical structure of a data set, but also characterizes the decision boundary of classification problems. Specifically, the extracted feature can recover a data set that has the same Euclidean geometrical structure as the original data set. We present an efficient algorithm to compute the similarity between data set features. We theoretically analyze how the similarity between our features affects the performance of the support vector machine, a well-known classifier. The empirical results show that our method is effective in finding suitable classifiers.
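
The geometry-preserving property can be made concrete through the Gram matrix: if the extracted feature behaves like a square root \(\textbf{G}^{1/2}\) of the Gram matrix \(\textbf{G}\) of a data set (the representation \(\textbf{x}_i = \textbf{G}^{1/2}\textbf{e}_i\) used in Appendix A), then the recovered points reproduce all pairwise Euclidean distances of the original points. The sketch below illustrates only this property and is not the paper's feature-extraction method; the function name gram_feature and the toy data are illustrative.

```python
import numpy as np

def gram_feature(X):
    """Return F with F @ F.T equal to the Gram matrix of X, so the rows of F
    reproduce all pairwise inner products and Euclidean distances of the rows
    of X (up to an orthogonal transform)."""
    G = X @ X.T                                # Gram matrix of the data set
    w, V = np.linalg.eigh(G)                   # G is symmetric positive semidefinite
    w = np.clip(w, 0.0, None)                  # guard against tiny negative eigenvalues
    return V @ np.diag(np.sqrt(w)) @ V.T       # symmetric square root G^(1/2)

def pairwise_distances(A):
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                    # toy data set: 6 points in R^3
F = gram_feature(X)

# Euclidean geometry is preserved: all pairwise distances match.
assert np.allclose(pairwise_distances(X), pairwise_distances(F), atol=1e-8)
```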


Data availability

The datasets analysed during the current study are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/index.php.

Notes

  1. For simplicity, we omit the class labels.


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61602308 and the Interdisciplinary Innovation Team of Shenzhen University.

Author information

Corresponding author

Correspondence to Wen-Sheng Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 2

Proof

Let \(h(\textbf{x}) = \textbf{w}^\top \textbf{x}\), \(h^\prime (\textbf{x}^\prime ) = {\textbf{w}^\prime }^\top \textbf{x}^\prime \) and \(\Delta \textbf{w} = \textbf{w} - \textbf{w}^\prime \). Then we have

$$\begin{aligned} \begin{aligned}&\ |h(\textbf{x})-h^\prime (\textbf{x}^\prime ) |= |\textbf{w}^\top \textbf{x} - {\textbf{w}^\prime }^\top \textbf{x}^\prime |\\ &=\ |(\textbf{w}^\top \textbf{x} - {\textbf{w}^\prime }^\top \textbf{x}) + ({\textbf{w}^\prime }^\top \textbf{x} - {\textbf{w}^\prime }^\top \textbf{x}^\prime )|\\ \leqslant&\ |\textbf{w}^\top \textbf{x} - {\textbf{w}^\prime }^\top \textbf{x} |+ |{\textbf{w}^\prime }^\top \textbf{x} - {\textbf{w}^\prime }^\top \textbf{x}^\prime |\\ \leqslant&\ \Vert \textbf{w} - {\textbf{w}^\prime }\Vert _2\Vert \textbf{x}\Vert _2 + \Vert \textbf{w}^\prime \Vert _2\Vert \textbf{x} - \textbf{x}^\prime \Vert _2. \end{aligned} \end{aligned}$$
(A1)

Since \(\textbf{w}\) and \(\textbf{w}^\prime \) are the minimizers of the associated SVM problems (12), for all \(t \in [0,1]\), we have

$$\begin{aligned} \begin{aligned}&\frac{1}{2}\Vert \textbf{w}\Vert ^2_2 + C \sum _{i=1}^n L(y_i\textbf{w}^\top \textbf{x}_i) \\ \leqslant&\ \frac{1}{2}\Vert \textbf{w} + t \Delta \textbf{w} \Vert ^2_2 + C \sum _{i=1}^n L(y_i(\textbf{w} + t \Delta \textbf{w})^\top \textbf{x}_i), \end{aligned} \end{aligned}$$
(A2)

and

$$\begin{aligned} \begin{aligned}&\frac{1}{2}\Vert \textbf{w}^\prime \Vert ^2_2 + C \sum _{i=1}^n L(y_i^\prime {\textbf{w}^\prime }^\top \textbf{x}_i^\prime ) \\ \leqslant&\ \frac{1}{2}\Vert \textbf{w}^\prime - t \Delta \textbf{w} \Vert ^2_2 + C \sum _{i=1}^n L(y_i^\prime (\textbf{w}^\prime - t \Delta \textbf{w})^\top \textbf{x}_i^\prime ). \end{aligned} \end{aligned}$$
(A3)

Summing (A2) and (A3), we obtain

$$\begin{aligned} \begin{aligned} t(1-t) \Vert \Delta \textbf{w} \Vert ^2_2&\leqslant C \sum _{i=1}^n \left[ \left( L(y_i(\textbf{w} + t \Delta \textbf{w})^\top \textbf{x}_i) - L(y_i\textbf{w}^\top \textbf{x}_i) \right) \right. \\&\ + \left. \left( L(y_i^\prime (\textbf{w}^\prime - t \Delta \textbf{w})^\top \textbf{x}_i^\prime ) - L(y_i^\prime {\textbf{w}^\prime }^\top \textbf{x}_i^\prime ) \right) \right] . \end{aligned} \end{aligned}$$
(A4)

Using the convexity of the hinge loss, we have

$$\begin{aligned} \begin{aligned}&L(y_i(\textbf{w} + t \Delta \textbf{w})^\top \textbf{x}_i) - L(y_i\textbf{w}^\top \textbf{x}_i) \\ \leqslant&\ t \left( L(y_i {\textbf{w}^\prime }^\top \textbf{x}_i) - L(y_i\textbf{w}^\top \textbf{x}_i) \right) \end{aligned} \end{aligned}$$
(A5)

and

$$\begin{aligned} \begin{aligned}&L(y_i^\prime (\textbf{w}^\prime - t \Delta \textbf{w})^\top \textbf{x}_i^\prime ) - L(y_i^\prime {\textbf{w}^\prime }^\top \textbf{x}_i^\prime ) \\ \leqslant&\ -t \left( L(y_i^\prime {\textbf{w}^\prime }^\top \textbf{x}_i^\prime ) - L(y_i^\prime \textbf{w}^\top \textbf{x}_i^\prime ) \right) . \end{aligned} \end{aligned}$$
(A6)

Combining (A4), (A5) and (A6) and taking the limit \(t \rightarrow 0\) leads to

$$\begin{aligned} \begin{aligned}&\Vert \textbf{w}^\prime -\textbf{w}\Vert ^2_2 \\ \leqslant&\ C\sum _{i=1}^n \left[ \left( L(y_i{\textbf{w}^\prime }^\top \textbf{x}_i)- L(y^\prime _i{\textbf{w}^\prime }^\top \textbf{x}^\prime _i) \right) \right. \\&\ + \left. \left( L(y^\prime _i \textbf{w}^\top \textbf{x}^\prime _i)- L(y_i \textbf{w}^\top \textbf{x}_i) \right) \right] \\ \leqslant&\ C\sum _{i=1}^n \Bigl [ \Vert \textbf{w}^\prime \Vert _2 \cdot \Vert y_i\phi (\textbf{x}_i) - y^\prime _i\phi (\textbf{x}^\prime _i)\Vert _2 \Bigr . \\&\ + \Bigl . \Vert \textbf{w}\Vert _2 \cdot \Vert y_i\phi (\textbf{x}_i) - y^\prime _i\phi (\textbf{x}^\prime _i)\Vert _2 \Bigr ]. \end{aligned} \end{aligned}$$
(A7)

The second inequality follows from the Lipschitz continuity of the hinge loss. We write \(\textbf{w}\) in terms of the dual variables \(\alpha _i\), namely \(\textbf{w} = \sum _{i=1}^n \alpha _i \textbf{x}_i\). Note that \(0 \leqslant \alpha _i \leqslant C\) and \(\textbf{x}_i = \textbf{G}^{1/2} \textbf{e}_i\). Then we obtain

$$\begin{aligned} \begin{aligned} \Vert \textbf{w} \Vert _2&= \Vert \sum _{i=1}^n \alpha _i \textbf{x}_i\Vert _2 \leqslant \sum _{i=1}^n |\alpha _i |\cdot \Vert \textbf{x}_i \Vert _2 \\&\leqslant C \sum _{i=1}^n \Vert \textbf{G}^{1/2} \textbf{e}_i \Vert _2 \leqslant n C \Vert \textbf{G}^{1/2} \Vert _2. \end{aligned} \end{aligned}$$
(A8)

Analogously, we have

$$\begin{aligned} \Vert \textbf{w}^\prime \Vert _2 \leqslant n C \Vert {\textbf{G}^\prime }^{1/2} \Vert _2. \end{aligned}$$
(A9)

Thus, (A7) can be rewritten as

$$\begin{aligned} \begin{aligned}&\Vert \textbf{w}^\prime -\textbf{w}\Vert ^2_2 \\ \leqslant&\ C_0^2 \sum _{i=1}^n\Vert y_i \textbf{x}_i -y^\prime _i \textbf{x}^\prime _i\Vert _2\\ &=\ C_0^2 \sum _{i=1}^n\Vert y_i\textbf{G}^{1/2}\textbf{e}_i - y^\prime _i{\textbf{G}^\prime }^{1/2}\textbf{e}_i\Vert _2\\ &=\ C_0^2 \sum _{i=1}^n\Vert (y_i\textbf{G}^{1/2}\textbf{e}_i - y^\prime _i\textbf{G}^{1/2}\textbf{e}_i) + (y^\prime _i\textbf{G}^{1/2}\textbf{e}_i - y^\prime _i{\textbf{G}^\prime }^{1/2}\textbf{e}_i)\Vert _2\\ \leqslant&\ C_0^2 \Bigl ( \Vert \textbf{G}^{1/2}\Vert _2 \sum _{i=1}^n |y_i - y^\prime _i|+ \sum _{i=1}^n \Vert \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\Vert _2 \Bigr ) \\ \leqslant&\ C_0^2 \Vert \textbf{G}^{1/2}\Vert _2 \Vert \textbf{y} - \textbf{y}^\prime \Vert _1 + nC_0^2 \Vert \textbf{G} - \textbf{G}^\prime \Vert ^{1/2}_2 \end{aligned} \end{aligned}$$
(A10)

Here we use the inequality \(\Vert \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\Vert _2 \leqslant \Vert \textbf{G}-\textbf{G}^\prime \Vert ^{1/2}_2\).
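
This is a standard perturbation bound for square roots of positive semidefinite matrices; a brief justification: let \(\textbf{C} = \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\) and let \(\lambda \) be an eigenvalue of \(\textbf{C}\) with the largest magnitude and unit eigenvector \(\textbf{v}\). Since \(\textbf{G} - \textbf{G}^\prime = \textbf{G}^{1/2}\textbf{C} + \textbf{C}{\textbf{G}^\prime }^{1/2}\),

$$\begin{aligned} \Vert \textbf{G}-\textbf{G}^\prime \Vert _2 \geqslant |\textbf{v}^\top (\textbf{G}-\textbf{G}^\prime )\textbf{v} |= |\lambda |\, \textbf{v}^\top (\textbf{G}^{1/2}+{\textbf{G}^\prime }^{1/2})\textbf{v} \geqslant |\lambda |\cdot |\textbf{v}^\top \textbf{C}\textbf{v} |= \lambda ^2, \end{aligned}$$

so \(\Vert \textbf{G}^{1/2}-{\textbf{G}^\prime }^{1/2}\Vert _2 = |\lambda |\leqslant \Vert \textbf{G}-\textbf{G}^\prime \Vert ^{1/2}_2\).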

We also have

$$\begin{aligned} \begin{aligned} \Vert \textbf{x} - \textbf{x}^\prime \Vert _2 &=\ \Vert \sum _{i=1}^n \beta _i \textbf{x}_i - \sum _{i=1}^n \beta _i \textbf{x}_i^\prime \Vert _2\\ &=\ \Vert \sum _{i=1}^n \beta _i (\textbf{G}^{1/2} - {\textbf{G}^\prime }^{1/2}) \textbf{e}_i\Vert _2 \\ \leqslant&\ \Vert \textbf{G}^{1/2} - {\textbf{G}^\prime }^{1/2}\Vert _2\Vert \varvec{\beta }\Vert _2 \\ \leqslant&\ \Vert \textbf{G} - {\textbf{G}^\prime }\Vert ^{1/2}_2\Vert \varvec{\beta }\Vert _2 \end{aligned} \end{aligned}$$
(A11)

and

$$\begin{aligned} \Vert \textbf{x} \Vert _2 = \Vert \sum _{i=1}^n \beta _i \textbf{x}_i \Vert _2 \leqslant \Vert \textbf{G}^{1/2}\Vert _2\Vert \varvec{\beta }\Vert _2. \end{aligned}$$
(A12)

Substituting (A9)–(A12) into (A1) yields

$$\begin{aligned} \begin{aligned}&|h(\textbf{x}) - h^\prime (\textbf{x}^\prime )|\\ \leqslant&\ C_0 (\Vert \textbf{G}^{1/2}\Vert _2\Vert \textbf{y} - \textbf{y}^\prime \Vert _1 + n \Vert \textbf{G} - \textbf{G}^\prime \Vert ^{1/2}_2)^{1/2} \Vert \textbf{G}^{1/2} \Vert _2 \Vert \varvec{\beta } \Vert _2 \\&+ nC \Vert {\textbf{G}^\prime }^{1/2} \Vert _2 \Vert \varvec{\beta }\Vert _2\Vert \textbf{G} - \textbf{G}^\prime \Vert ^{1/2}_2\\ \leqslant&\ C_0 (\Vert \textbf{G}^{1/2}\Vert _2^{1/2} \Vert \textbf{y} - \textbf{y}^\prime \Vert _1^{1/2} + \sqrt{n} \Vert \textbf{G} - \textbf{G}^\prime \Vert ^{1/4}_2) \Vert \textbf{G}^{1/2} \Vert _2 \Vert \varvec{\beta } \Vert _2 \\&+ nC \Vert {\textbf{G}^\prime }^{1/2} \Vert _2 \Vert \varvec{\beta }\Vert _2\Vert \textbf{G} - \textbf{G}^\prime \Vert ^{1/2}_2. \end{aligned} \end{aligned}$$
(A13)

The last inequality results from \((a+b)^{1/2}\leqslant \sqrt{a}+\sqrt{b}\). \(\square \)
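
The bound can be illustrated numerically (this is not one of the paper's experiments): train the hinge-loss linear SVM on the features \(\textbf{x}_i = \textbf{G}^{1/2}\textbf{e}_i\) for an original Gram matrix \(\textbf{G}\) and a slightly perturbed \(\textbf{G}^\prime \), then compare \(\Vert \textbf{w}-\textbf{w}^\prime \Vert _2\) with the driving term \(\Vert \textbf{G}-\textbf{G}^\prime \Vert ^{1/2}_2\). The sketch below assumes problem (12) is the standard bias-free hinge-loss linear SVM and uses scikit-learn's LinearSVC as a stand-in; the toy data and constants are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sqrtm_psd(G):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(G)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 5))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1, -1)

G = X @ X.T
E = rng.normal(size=(n, n))
G_prime = G + 0.01 * (E + E.T)                  # small symmetric perturbation, labels unchanged

F, F_prime = sqrtm_psd(G), sqrtm_psd(G_prime)   # rows are x_i = G^(1/2) e_i, as in the proof

C = 1.0
svm_1 = LinearSVC(C=C, loss="hinge", fit_intercept=False, max_iter=100000).fit(F, y)
svm_2 = LinearSVC(C=C, loss="hinge", fit_intercept=False, max_iter=100000).fit(F_prime, y)
w, w_prime = svm_1.coef_.ravel(), svm_2.coef_.ravel()

print("||w - w'||_2       =", np.linalg.norm(w - w_prime))
print("||G - G'||_2^(1/2) =", np.linalg.norm(G - G_prime, 2) ** 0.5)
```

Setting fit_intercept=False matches the bias-free decision function \(h(\textbf{x}) = \textbf{w}^\top \textbf{x}\) used in the proof; the theorem guarantees an upper bound of this form, not equality.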

Appendix B: Incomplete Cholesky decomposition (ICD) algorithm

The ICD algorithm is shown in Algorithm 2.

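A minimal sketch of incomplete Cholesky with symmetric pivoting is given below. It follows the standard textbook procedure and the update rules used in the Appendix C proof (pivot on the largest residual diagonal, then fill one column of the factor), but it is not claimed to be identical to the authors' Algorithm 2; the function name icd, the rank budget k, and the stopping tolerance tol are illustrative.

```python
import numpy as np

def icd(G, k, tol=1e-10):
    """Incomplete Cholesky with symmetric pivoting (a sketch, not the authors'
    exact Algorithm 2).  Returns an n x s factor L (s <= k) and the pivot order
    piv such that L @ L.T approximates G[piv][:, piv]."""
    G = np.array(G, dtype=float, copy=True)
    n = G.shape[0]
    L = np.zeros((n, k))
    piv = np.arange(n)
    d = np.diag(G).copy()                      # residual diagonal entries
    for s in range(k):
        p = s + int(np.argmax(d[s:]))          # symmetric pivot: largest residual diagonal
        if d[p] <= tol:                        # remaining residual is numerically zero
            return L[:, :s], piv
        # swap positions s and p in everything that is tracked
        piv[[s, p]] = piv[[p, s]]
        d[[s, p]] = d[[p, s]]
        G[[s, p], :] = G[[p, s], :]
        G[:, [s, p]] = G[:, [p, s]]
        L[[s, p], :s] = L[[p, s], :s]
        # fill column s of the factor
        L[s, s] = np.sqrt(d[s])
        L[s + 1:, s] = (G[s + 1:, s] - L[s + 1:, :s] @ L[s, :s]) / L[s, s]
        d[s + 1:] -= L[s + 1:, s] ** 2
        d[s] = 0.0
    return L, piv
```

For \(k = n\) and a positive definite \(\textbf{G}\), the loop runs to completion and \(\textbf{L}\textbf{L}^\top \) reproduces the symmetrically permuted Gram matrix exactly.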

Appendix C: Proof of Theorem 3

Proof

The second and the last statements can be inferred from the first and the third ones. Thus, we only need to prove (i) and (iii). Since \(\textbf{G}\) and \(\textbf{G}^\prime \) are isomorphic, there exists a bijection \(\pi _0: \{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\) such that

$$\begin{aligned} \textbf{G}(i,j) = \textbf{G}^\prime (\pi _0(i),\pi _0(j)),\ 1\leqslant i,j\leqslant n. \end{aligned}$$

We will argue by mathematical induction.

1) For \(s = 1\), let \(p_1 = \arg \max _{1 \leqslant i \leqslant n} \textbf{G}(i,i)\) and \(q_1 = \arg \max _{1 \leqslant i \leqslant n} \textbf{G}^\prime (i,i)\). Since the largest diagonal elements of \(\textbf{G}\) and \(\textbf{G}^\prime \) are unique, we have \(q_1 = \pi _0(p_1)\). We construct bijections

$$\begin{aligned} \omega _1(i) = \left\{ \begin{array}{ll} i, &{} i\ne 1, p_1;\\ p_1, &{} i = 1;\\ 1, &{} i = p_1. \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} \omega ^\prime _1(i) = \left\{ \begin{array}{ll} i, &{} i\ne 1, q_1;\\ q_1, &{} i = 1;\\ 1, &{} i = q_1. \end{array}\right. \end{aligned}$$

If \(p_1 = 1\) (\(q_1 = 1\)), then \(\omega _1\) (\(\omega ^\prime _1\)) is an identity mapping. Define a composite function \(\pi _1: \{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\):

$$\begin{aligned} \pi _1(i) = \omega ^\prime _1\circ \pi _0\circ \omega _1(i). \end{aligned}$$
(C14)

Then

$$\begin{aligned} \pi _1(1) = \omega ^\prime _1\circ \pi _0\circ \omega _1(1) = \omega ^\prime _1\circ \pi _0(p_1) = \omega ^\prime _1(q_1) = 1. \end{aligned}$$

The first diagonal element of \(\textbf{A}^{(1)}\) is

$$\begin{aligned} \begin{aligned} \textbf{A}^{(1)}(1,1)&= \sqrt{\textbf{G}(p_1, p_1)} = \sqrt{\textbf{G}^\prime (\pi _0(p_1),\pi _0( p_1))} \\&= \sqrt{\textbf{G}^\prime (q_1, q_1)} = \textbf{B}^{(1)}(1,1) = \textbf{B}^{(1)}(\pi _1(1),1). \end{aligned} \end{aligned}$$

The remaining elements of the first column of \(\textbf{A}^{(1)}\) are

$$\begin{aligned} \begin{aligned} \textbf{A}^{(1)}(i,1)&= \left\{ \begin{array}{ll} \frac{\textbf{G}(i, p_1)}{\textbf{A}^{(1)}(1,1)}, &{} i\ne p_1;\\ \frac{\textbf{G}(1, p_1)}{\textbf{A}^{(1)}(1,1)}, &{} i = p_1.\\ \end{array}\right. \\&= \frac{\textbf{G}(\omega _1(i), p_1)}{\textbf{A}^{(1)}(1,1)}, \ 2\leqslant i \leqslant n. \end{aligned} \end{aligned}$$

In a similar way,

$$\begin{aligned} \textbf{B}^{(1)}(i,1) = \frac{\textbf{G}^\prime (\omega _1^\prime (i), q_1)}{\textbf{B}^{(1)}(1,1)}, \ 2\leqslant i \leqslant n. \end{aligned}$$

Note that \(\textbf{A}^{(1)}(1,1) = \textbf{B}^{(1)}(1,1)\), \(\omega ^\prime _1\circ \omega ^\prime _1(i) = i\) and \(\textbf{G}(i,p_1) = \textbf{G}^\prime (\pi _0(i),\pi _0(p_1))\). Then we obtain

$$\begin{aligned} \begin{aligned} \textbf{A}^{(1)}(i,1) = \ {}&\frac{\textbf{G}(\omega _1(i), p_1)}{\textbf{A}^{(1)}(1,1)} = \frac{\textbf{G}^\prime (\pi _0\circ \omega _1(i), \pi _0(p_1))}{\textbf{B}^{(1)}(1,1)}\\ = \ {}&\frac{\textbf{G}^\prime (\omega ^\prime _1\circ \omega ^\prime _1\circ \pi _0\circ \omega _1(i), q_1)}{\textbf{B}^{(1)}(1,1)} \\ = \ {}&\frac{\textbf{G}^\prime (\omega ^\prime _1\circ \pi _1(i), q_1)}{\textbf{B}^{(1)}(1,1)}\\ = \ {}&\textbf{B}^{(1)}(\pi _1(i),1), \ 2\leqslant i \leqslant n. \end{aligned} \end{aligned}$$

Let \(\textbf{P}^{(1)} = \textbf{P}[1,p_1]\) and \(\textbf{Q}^{(1)} = \textbf{P}[1,q_1]\). Denote \(\textbf{G}^{(1)} = \textbf{P}^{(1)}\textbf{G}\textbf{P}^{(1)}\) and \({\textbf{G}^\prime }^{(1)} = \textbf{Q}^{(1)}\textbf{G}^\prime \textbf{Q}^{(1)}\), then

$$\begin{aligned} \textbf{G}^{(1)}(i,j) = (\textbf{P}^{(1)}\textbf{G}\textbf{P}^{(1)})(i,j) = \textbf{G}(\omega _1(i), \omega _1(j)) \end{aligned}$$
(C15)

and

$$\begin{aligned} \begin{aligned} {\textbf{G}^\prime }^{(1)}(\pi _1(i),\pi _1(j)) = \ {}&(\textbf{Q}^{(1)} \textbf{G}^\prime \textbf{Q}^{(1)})(\pi _1(i),\pi _1(j))\\ = \ {}&\textbf{G}^\prime (\omega ^\prime _1\circ \pi _1(i),\omega ^\prime _1\circ \pi _1(j))\\ = \ {}&\textbf{G}^\prime (\pi _0\circ \omega _1(i),\pi _0\circ \omega _1(j))\\ = \ {}&\textbf{G}(\omega _1(i), \omega _1(j)), \end{aligned} \end{aligned}$$
(C16)

where the third equality results from (C14). Combining (C15) and (C16) yields \(\textbf{G}^{(1)}(i,j) = {\textbf{G}^\prime }^{(1)}(\pi _1(i),\pi _1(j))\). Therefore, the theorem holds for \(s=1\).

2) Assume that the theorem holds for \(s = m\). There exists a bijection \(\pi _m:\{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\) such that \(\pi _m(i) = i\) (\(1\leqslant i\leqslant m\)) and for every i,

$$\begin{aligned} \textbf{A}^{(m)}(i,j) = \textbf{B}^{(m)}(\pi _m(i),j), \ 1\leqslant j\leqslant m. \end{aligned}$$
(C17)

There also exist permutation matrices \(\textbf{P}^{(m)}\) and \(\textbf{Q}^{(m)}\) such that

$$\begin{aligned} \textbf{G}^{(m)}(i,j) = {\textbf{G}^\prime }^{(m)}(\pi _m(i),\pi _m(j)), \end{aligned}$$
(C18)

where \(\textbf{G}^{(m)} = \textbf{P}^{(m)} \textbf{G} \textbf{P}^{(m)}\) and \({\textbf{G}^\prime }^{(m)} = \textbf{Q}^{(m)} \textbf{G}^\prime \textbf{Q}^{(m)}\).

3) We will show that the theorem holds for \(s = m+1\). The \((m+1)\)th to the \(n\)th diagonal elements of \(\textbf{A}^{(m)}\) are

$$\begin{aligned} \begin{aligned} \textbf{A}^{(m)}&(i,i) = \ \textbf{G}^{(m)}(i,i) - \sum _{j=1}^m\textbf{A}^{(m)}(i,j)^2\\ &=\ {\textbf{G}^\prime }^{(m)}(\pi _m(i),\pi _m(i)) - \sum _{j=1}^m\textbf{B}^{(m)}(\pi _m(i),j)^2\\ &=\ \textbf{B}^{(m)}(\pi _m(i),\pi _m(i)), \ m+1\leqslant i \leqslant n. \end{aligned} \end{aligned}$$

Note that \(\pi _m(i) = i\) (\(1\leqslant i\leqslant m\)) and \(\pi _m\) is a bijection. Let \(p_{m+1} = \arg \max _{m+1\leqslant i\leqslant n}\textbf{A}^{(m)}(i,i)\) and \(q_{m+1} = \arg \max _{m+1\leqslant i\leqslant n} \textbf{B}^{(m)}(i,i)\). We have \(q_{m+1} = \pi _m(p_{m+1})\) because the largest residual diagonal element selected by symmetric pivoting is unique. We construct bijections

$$\begin{aligned} \omega _{m+1}(i) = \left\{ \begin{array}{ll} i, &{} i\ne m+1, p_{m+1};\\ p_{m+1}, &{} i = m+1;\\ m+1, &{} i = p_{m+1}. \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} \omega ^\prime _{m+1}(i) = \left\{ \begin{array}{ll} i, &{} i\ne m+1, q_{m+1};\\ q_{m+1}, &{} i = m+1;\\ m+1, &{} i = q_{m+1}. \end{array}\right. \end{aligned}$$

Define a composite function \(\pi _{m+1}:\{1,2,\cdots ,n\}\rightarrow \{1,2,\cdots ,n\}\):

$$\begin{aligned} \pi _{m+1}(i) = \omega ^\prime _{m+1}\circ \pi _m\circ \omega _{m+1}(i). \end{aligned}$$
(C19)

For \(1 \leqslant i \leqslant m\),

$$\begin{aligned} \begin{aligned}&\pi _{m+1}(i) = \omega ^\prime _{m+1}\circ \pi _m\circ \omega _{m+1}(i)\\ &=\omega ^\prime _{m+1}\circ \pi _m(i) = \omega ^\prime _{m+1}(i) = i. \end{aligned} \end{aligned}$$

We also have

$$\begin{aligned} \begin{aligned}&\pi _{m+1}(m+1) = \omega ^\prime _{m+1}\circ \pi _m\circ \omega _{m+1}(m+1) \\ &=\omega ^\prime _{m+1}\circ \pi _m(p_{m+1}) = \omega ^\prime _{m+1}(q_{m+1}) = m+1. \end{aligned} \end{aligned}$$

Thus, \(\pi _{m+1}(i) = i \ (1\leqslant i\leqslant m+1)\).

Let \(\textbf{P}^{(m+1)} = \textbf{P}[m+1,p_{m+1}] \textbf{P}^{(m)}\) and \(\textbf{Q}^{(m+1)} = \textbf{P}[m+1,q_{m+1}] \textbf{Q}^{(m)}\). Denote \(\textbf{G}^{(m+1)} = \textbf{P}^{(m+1)} \textbf{G} \textbf{P}^{(m+1)} \) and \({\textbf{G}^\prime }^{(m+1)} = \textbf{Q}^{(m+1)} {\textbf{G}^\prime } \textbf{Q}^{(m+1)}\), then

$$\begin{aligned} \begin{aligned} \textbf{G}^{(m+1)}(i,j) &=\ (\textbf{P}^{(m+1)} \textbf{G} \textbf{P}^{(m+1)})(i,j) \\ &=\ \textbf{G}^{(m)}(\omega _{m+1}(i), \omega _{m+1}(j)), \end{aligned} \end{aligned}$$
(C20)

and

$$\begin{aligned} \begin{aligned}&\ {\textbf{G}^\prime }^{(m+1)}(\pi _{m+1}(i),\pi _{m+1}(j)) \\ &=\ (\textbf{Q}^{(m+1)} {\textbf{G}^\prime } \textbf{Q}^{(m+1)})(\pi _{m+1}(i),\pi _{m+1}(j))\\ &=\ {\textbf{G}^\prime }^{(m)}(\omega ^\prime _{m+1}\circ \pi _{m+1}(i),\omega ^\prime _{m+1}\circ \pi _{m+1}(j))\\ &=\ {\textbf{G}^\prime }^{(m)}(\pi _m\circ \omega _{m+1}(i),\pi _m\circ \omega _{m+1}(j))\\ &=\ \textbf{G}^{(m)}(\omega _{m+1}(i), \omega _{m+1}(j)), \end{aligned} \end{aligned}$$
(C21)

where the last two equalities follow from (C19) and (C18), respectively. Combining (C20) and (C21) yields

$$\begin{aligned} \textbf{G}^{(m+1)}(i,j) = {\textbf{G}^\prime }^{(m+1)}(\pi _{m+1}(i), \pi _{m+1}(j)). \end{aligned}$$

For \(1 \leqslant i \leqslant n\) and \(1 \leqslant j \leqslant m\),

$$\begin{aligned} \begin{aligned} \textbf{A}^{(m+1)}(i,j) &=\left\{ \begin{array}{ll} \textbf{A}^{(m)}(i,j), &{} i\ne m+1, p_{m+1};\\ \textbf{A}^{(m)}(p_{m+1},j), &{} i = m+1;\\ \textbf{A}^{(m)}(m+1,j), &{} i = p_{m+1}.\\ \end{array}\right. \\ = \ {}&\textbf{A}^{(m)}(\omega _{m+1}(i),j) \\ = \ {}&\textbf{B}^{(m)}(\pi _m\circ \omega _{m+1}(i),j)\\ = \ {}&\textbf{B}^{(m)}(\omega ^\prime _{m+1}\circ \pi _{m+1}(i),j)\\ = \ {}&\textbf{B}^{(m+1)}(\pi _{m+1}(i),j), \end{aligned} \end{aligned}$$

where the third and the fourth equalities result from (C17) and (C19), respectively. For the \((m+1)\)th diagonal element of \(\textbf{A}^{(m+1)}\),

$$\begin{aligned} \begin{aligned}&\textbf{A}^{(m+1)}(m+1,m+1) = \sqrt{\textbf{G}^{(m)}(p_{m+1}, p_{m+1})} \\ = \ {}&\sqrt{{\textbf{G}^\prime }^{(m)}(\pi _m(p_{m+1}),\pi _m( p_{m+1}))} \\ = \ {}&\sqrt{{\textbf{G}^\prime }^{(m)}(q_{m+1}, q_{m+1})} \\ = \ {}&\textbf{B}^{(m+1)}(m+1,m+1) \end{aligned} \end{aligned}$$

and for \(m + 2 \leqslant i \leqslant n\),

$$\begin{aligned} \begin{aligned}&\ \textbf{A}^{(m + 1)}(i,m + 1) \\ &=\ \frac{{\textbf{G}^{(m + 1)}(i,m + 1)}}{{\textbf{A}^{(m + 1)}(m + 1,m + 1)}} -\\&\ \frac{\sum \limits _{l = 1}^m {\textbf{A}^{(m + 1)}(i,l)} \textbf{A}^{(m + 1)}(m + 1,l)}{\textbf{A}^{(m + 1)}(m + 1,m + 1)}\\ &=\ \frac{{\textbf{G}^\prime }^{(m + 1)}({\pi _{m + 1}}(i),{\pi _{m + 1}}(m + 1))}{\textbf{B}^{(m + 1)}(m + 1,m + 1)} -\\&\ \frac{\sum \limits _{l = 1}^m {\textbf{B}^{(m + 1)}({\pi _{m + 1}}(i),l)} \textbf{B}^{(m + 1)}({\pi _{m + 1}}(m + 1),l)}{\textbf{B}^{(m + 1)}(m + 1,m + 1)}\\ &=\ \frac{{\textbf{G}^\prime }^{(m+1)}({\pi _{m + 1}}(i),m + 1)}{\textbf{B}^{(m + 1)}(m + 1,m + 1)} -\\&\ \frac{\sum \limits _{l = 1}^m {\textbf{B}^{(m + 1)}({\pi _{m + 1}}(i),l)} \textbf{B}^{(m + 1)}(m + 1,l)}{\textbf{B}^{(m + 1)}(m + 1,m + 1)}\\ &=\ \textbf{B}^{(m + 1)}({\pi _{m + 1}}(i),m + 1). \end{aligned} \end{aligned}$$

Thus, for every i we obtain

$$\begin{aligned} \textbf{A}^{(m+1)}(i,j) = \textbf{B}^{(m+1)}(\pi _{m+1}(i),j),\ 1\leqslant j\leqslant m+1. \end{aligned}$$

The proof is complete by mathematical induction.

\(\square \)
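
A small numerical sanity check of this permutation property, reusing the icd sketch from Appendix B (the toy data and the permutation are illustrative; distinct diagonal entries are used so that the uniqueness assumption on the pivots holds):

```python
import numpy as np
# Assumes the icd() sketch from Appendix B is in scope.

rng = np.random.default_rng(2)
n = 8
X = rng.normal(size=(n, 4))
G = X @ X.T + np.diag(rng.uniform(0.1, 1.0, n))    # generic PSD matrix with distinct diagonal

pi = rng.permutation(n)                            # an isomorphism between G and G'
P = np.eye(n)[pi]
G_prime = P @ G @ P.T                              # G'(i, j) = G(pi(i), pi(j))

A, pivA = icd(G, n)
B, pivB = icd(G_prime, n)

# Statement (i): every row of A reappears as a row of B, matched through the
# underlying data point each pivot refers to.
row_of_B = {int(pi[pivB[j]]): j for j in range(n)} # original index -> row index in B
for i in range(n):
    assert np.allclose(A[i], B[row_of_B[int(pivA[i])]])
print("Rows of the two ICD factors agree up to a permutation.")
```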

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Pan, B., Chen, WS., Deng, L. et al. Classifier selection using geometry preserving feature. Neural Comput & Applic 35, 20955–20976 (2023). https://doi.org/10.1007/s00521-023-08828-y

