
Discriminative least squares regression for multiclass classification based on within-class scatter minimization

Abstract

Least squares regression has been widely used in pattern classification due to its compact form and efficient solution. However, two main issues limit its performance on multiclass classification problems. First, employing hard discrete labels as the regression targets is inappropriate for multiclass classification. Second, it focuses only on exactly fitting the instances to the target matrix while ignoring the within-class similarity of the instances, which results in overfitting. To address these issues, we propose a discriminative least squares regression for multiclass classification based on within-class scatter minimization (WCSDLSR). Specifically, an ε-dragging technique is first introduced to relax the hard discrete labels into slack soft labels, which enlarges the between-class margins of the soft labels as much as possible. The within-class scatter of the soft labels is then constructed as a regularization term to make transformed instances of the same class closer to each other. These factors ensure that WCSDLSR learns a more compact and discriminative transformation for classification, thus avoiding overfitting. Furthermore, the proposed WCSDLSR obtains a closed-form solution in each iteration with low computational cost. Experimental results on benchmark datasets demonstrate that the proposed WCSDLSR achieves better classification performance with lower computational costs.
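To make the two ingredients concrete, the following is a minimal NumPy sketch of the ε-dragging target relaxation and the within-class scatter regularizer. It assumes a ±1 label encoding and uses the standard ε-dragging closed form; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def epsilon_dragging_targets(X, W, Y):
    """Relax hard +/-1 labels Y into slack soft labels Y + Y * E (Hadamard).

    Standard epsilon-dragging closed form: each entry is dragged away from
    its hard label in the margin-enlarging direction (upwards for the true
    class, downwards for the others), with nonnegative dragging amounts E.
    """
    E = np.maximum(Y * (X @ W - Y), 0.0)   # nonnegative dragging matrix
    return Y + Y * E                       # relaxed regression targets

def within_class_scatter(T, labels):
    """Within-class scatter of the transformed instances (rows of T).

    Sum over classes of squared distances to the class mean; penalizing this
    quantity keeps transformed instances of the same class close together.
    """
    total = 0.0
    for k in np.unique(labels):
        Tk = T[labels == k]
        total += np.sum((Tk - Tk.mean(axis=0)) ** 2)
    return total
```

In the full model these quantities are optimized jointly rather than evaluated once for a fixed transformation; the sketch only illustrates the two terms that enter the objective.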


Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61772020).

Author information

Corresponding author

Correspondence to Shuisheng Zhou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proof of Theorem 1

Proof

For simplicity, let L denote the augmented Lagrangian of the optimization problem (8). The KKT conditions for (8) are derived as follows (since the subproblems for W and E do not involve the Lagrange multipliers, the KKT conditions for them are omitted):

$$ \begin{array}{@{}rcl@{}} \textbf{U}=\textbf{T}. \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} \frac{\partial{L}}{\partial{\textbf{T}}}&=&(1+\alpha+\mu)\textbf{T}\\&&-\left( \textbf{X}\textbf{W}+\alpha(\textbf{Y}+\textbf{Y}\odot\textbf{E}) +\mu\textbf{U}-\textbf{H}\right)=0. \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} \frac{\partial{L}}{\partial{\textbf{U}_{j}}}&=&\left( (\beta+\mu)\textbf{I}_{n_{j}}-\frac{\beta}{n_{j}}\textbf{1}_{n_{j}}\right)\textbf{U}_{j} \\&&-(\mu\textbf{T}_{j}+\textbf{H}_{j})=0. \end{array} $$
(22)

First, the update of the Lagrange multiplier H can be obtained from Algorithm 1 as follows

$$ \begin{array}{@{}rcl@{}} \textbf{H}^{k}=\textbf{H}+\mu(\textbf{T}-\textbf{U}). \end{array} $$
(23)

Since (23) gives \(\textbf{H}^{k}-\textbf{H}=\mu(\textbf{T}-\textbf{U})\) with \(\mu>0\), if the sequence \(\{\textbf {H}^{k}\}^{\infty }_{k=1}\) converges to a stationary point, i.e., \((\textbf {H}^k-\textbf {H})\rightarrow 0\), then \((\textbf {T}-\textbf {U})\rightarrow 0\). Thus, the first KKT condition (20) is proved.

For the second KKT condition, the following equation can be obtained from (10)

$$ \begin{array}{@{}rcl@{}} \textbf{T}^{k}-\textbf{T} &=&(1+\alpha+\mu)^{-1}\left( \textbf{X}\textbf{W}+\alpha(\textbf{Y}+\textbf{Y}\odot\textbf{E})\right.\\&&\left.+\mu\textbf{U}-\textbf{H}\right)-\textbf{T} \end{array} $$
(24)

From (24), the following equation can be obtained

$$ \begin{array}{@{}rcl@{}} &&(1+\alpha+\mu)(\textbf{T}^{k}-\textbf{T})\\ &=&\left( \textbf{X}\textbf{W}+\alpha(\textbf{Y}+\textbf{Y}\odot\textbf{E})+\mu\textbf{U}-\textbf{H}\right)-(1+\alpha+\mu)\textbf{T} \end{array} $$
(25)

Equation (25) indicates that when the sequence \(\{\textbf {T}^{k}\}^{\infty }_{k=1}\) converges to a stationary point, i.e., \((\textbf {T}^k-\textbf {T})\rightarrow 0\), the second KKT condition (21) holds.

For the third KKT condition (22), the following equation is obtained by using the results of (12)

$$ \begin{array}{@{}rcl@{}} \textbf{U}_{j}^{k} - \textbf{U}_{j} = \left( (\beta+\mu)\textbf{I}_{n_{j}} - \frac{\beta}{n_{j}}\textbf{1}_{n_{j}}\right)^{-1} (\mu\textbf{T}_{j} + \textbf{H}_{j}) - \textbf{U}_{j} \end{array} $$
(26)

which is equivalent to

$$ \begin{array}{@{}rcl@{}} &\left( (\beta+\mu)\textbf{I}_{n_{j}}-\frac{\beta}{n_{j}}\textbf{1}_{n_{j}}\right)(\textbf{U}_{j}^{k}-\textbf{U}_{j})\\ &=(\mu\textbf{T}_{j}+\textbf{H}_{j})-\left( (\beta+\mu)\textbf{I}_{n_{j}}-\frac{\beta}{n_{j}}\textbf{1}_{n_{j}}\right)\textbf{U}_{j} \end{array} $$
(27)

Equation (27) indicates that when the sequence \(\{\textbf {U}^{k}\}^{\infty }_{k=1}\) converges to a stationary point, i.e., \((\textbf {U}^k-\textbf {U})\rightarrow 0\), the third KKT condition (22) holds.

In summary, if the sequence of solutions \(\{{{\varTheta }}^k\}^{\infty }_{k=1}\) is bounded and \(\lim _{k\rightarrow \infty }({{\varTheta }}^{k+1}-{{\varTheta }}^{k})=0\), then the limit points of \(\{{{\varTheta }}^k\}^{\infty }_{k=1}\) are Karush-Kuhn-Tucker (KKT) points of problem (8). Thus, the proof is complete. □
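For illustration, here is a minimal NumPy sketch of an ADMM-style loop built from the closed-form updates (21)-(23) quoted above, with the stopping test based on the KKT residual T − U of condition (20). It is a sketch under stated assumptions rather than the authors' implementation: the label matrix Y is assumed ±1-coded, `labels` holds each instance's class index, \(\textbf{1}_{n_{j}}\) is read as the n_j × n_j all-ones matrix, and the W- and E-updates (whose KKT conditions are omitted above) are replaced by a plain ridge solve with weight `lam` and the standard ε-dragging closed form.

```python
import numpy as np

def wcsdlsr_admm_sketch(X, Y, labels, alpha=1.0, beta=1.0, lam=1e-3,
                        mu=1.0, n_iter=50, tol=1e-6):
    """ADMM-style iteration sketch following the closed-form updates (21)-(23).

    Assumptions not taken from the paper: the W-update is a plain ridge solve
    and the E-update uses the standard epsilon-dragging closed form; both are
    placeholders for the updates omitted in the appendix.
    """
    n, c = Y.shape
    W = np.zeros((X.shape[1], c))
    T = Y.astype(float).copy()
    U = T.copy()
    H = np.zeros((n, c))
    classes = [np.flatnonzero(labels == k) for k in np.unique(labels)]

    XtX = X.T @ X + lam * np.eye(X.shape[1])          # assumed ridge term
    for _ in range(n_iter):
        W = np.linalg.solve(XtX, X.T @ T)             # assumed W-update
        E = np.maximum(Y * (X @ W - Y), 0.0)          # assumed E-update
        # T-update, cf. (21): closed form in each iteration
        T = (X @ W + alpha * (Y + Y * E) + mu * U - H) / (1.0 + alpha + mu)
        # U-update per class, cf. (22): within-class scatter regularization
        for idx in classes:
            nj = idx.size
            A = (beta + mu) * np.eye(nj) - (beta / nj) * np.ones((nj, nj))
            U[idx] = np.linalg.solve(A, mu * T[idx] + H[idx])
        # multiplier update, cf. (23); T - U -> 0 is the KKT residual (20)
        H = H + mu * (T - U)
        if np.linalg.norm(T - U) < tol:
            break
    return W
```

Note that the per-class system matrix has eigenvalues μ (on the all-ones direction) and β + μ (elsewhere), so it is invertible whenever μ > 0; being a rank-one modification of a scaled identity, it could also be inverted analytically via the Sherman-Morrison formula.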


Cite this article

Ma, J., Zhou, S. Discriminative least squares regression for multiclass classification based on within-class scatter minimization. Appl Intell 52, 622–635 (2022). https://doi.org/10.1007/s10489-021-02258-w
