
An explainable multi-sparsity multi-kernel nonconvex optimization least-squares classifier method via ADMM

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Convex optimization techniques are widely applied in the models, algorithms, and applications of machine learning and data mining. For optimization-based classification methods, the sparsity principle helps to select simple classifier models, while single- and multi-kernel methods effectively handle nonlinearly separable problems. However, limited sparsity and kernel methods hinder improvements in predictive accuracy, efficiency, iterative updating, and the interpretability of the classification model. In this paper, we propose a new Explainable Multi-sparsity Multi-kernel Nonconvex Optimization Least-squares Classifier (EM2NOLC) model to address these issues: an optimization problem with a least-squares objective function and multi-sparsity multi-kernel nonconvex constraints. Based on reconstructed multiple kernel learning (MKL), the proposed model extracts important instances and features by finding sparse coefficient and kernel weight vectors, which are used to compute their importance or contribution to classification and to obtain explainable predictions. The corresponding EM2NOLC algorithm is implemented with the Alternating Direction Method of Multipliers (ADMM). On real classification datasets, compared with three ADMM classifiers (a linear Support Vector Machine classifier, SVMC, and a Least Absolute Shrinkage and Selection Operator classifier), two MKL classifiers (SimpleMKL and EasyMKL), and a gradient descent classifier (Feature Selection for SVMC), the proposed EM2NOLC generally obtains the best predictive performance and explainable results with the fewest important instances and features, each having a different contribution percentage.


Notes

  1. https://github.com/sagedavid/EM2NOLC.


Acknowledgements

The authors would like to thank anonymous reviewers for their valuable comments and suggestions. This research has been partially supported by the Key Program of National Natural Science Foundation of China under grant 92046026, in part by the National Natural Science Foundation of China (No. 61877061, 71271191, 71871109, 91646204, 71701089), in part by the Jiangsu Provincial Key Research and Development Program under Grant BE2020001-3, in part by the Jiangsu Provincial Policy Guidance Program under grant BZ2020008, and in part by the High-End Foreign Experts Projects under grant G2021194011L.

Author information

Contributions

ZZ contributed to conceptualization, methodology, validation, formal analysis, writing—original draft, funding acquisition. JH contributed to methodology, investigation, resources, validation. JC contributed to data curation, supervision, funding acquisition. Shuqing Li contributed to validation, software, visualization. XL contributed to formal analysis, visualization, funding acquisition. KZ contributed to data curation, visualization, validation. PW contributed to formal analysis, writing—checking, resources. SY contributed to writing—review and editing, supervision, project administration.

Corresponding author

Correspondence to Zhiwang Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

1.1 Proof of equality (6)

For any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) (\(i,j = 1, \cdots ,n\)) from the training set \({\varvec{T}}\), suppose that a basis function \(\phi ( \cdot )\) maps any feature value \(x_{jm}\) (\(m = 1, \cdots ,d\)) from the input space to a new feature space. If their product \(\phi (x_{jm} ) \phi (x_{im} )\) in the feature space can be replaced with the kernel function \(\kappa (x_{jm} , x_{im} )\) for the \(m{\text{th}}\) feature, then the kernel vector \({\varvec{u}}_{ji}\) (\({\varvec{u}}_{ji} \in {\mathbb{R}}^{d}\)) over the \(d\) features is denoted as

$$\begin{gathered} {\varvec{u}}_{ji} = \left[ {\phi (x_{j1} ), \cdots ,\phi (x_{jd} )} \right] \odot \left[ {\phi (x_{i1} ), \cdots ,\phi (x_{id} )} \right] \hfill \\ \, \quad \quad = \left[ {\phi (x_{j1} )\phi (x_{i1} ), \cdots ,\phi (x_{jd} )\phi (x_{id} )} \right] \hfill \\ \, \quad \quad= \left[ {\kappa (x_{j1} ,x_{i1} ), \cdots ,\kappa (x_{jd} ,x_{id} )} \right]. \hfill \\ \end{gathered}$$

Given the kernel weight vector \({\varvec{\mu}}^{t}\) at iteration \(t\), the row-wise multi-kernel vector \({\varvec{A}}_{j}\) (\({\varvec{A}}_{j} \in {\mathbb{R}}^{n}\)) is defined as a weighted similarity between the input point \({\varvec{x}}_{j}\) and the other input points in the training set; that is, we have the equality

$${\varvec{A}}_{j} = \left( {{\varvec{u}}_{j1} {\varvec{\mu}}^{t} , \cdots ,{\varvec{u}}_{jn} {\varvec{\mu}}^{t} } \right)^{T} ,\;\;j = 1, \cdots ,n$$

The row-wise multi-kernel matrix \({\varvec{A}}\) (\({\varvec{A}} \in {\mathbb{R}}^{n \times n}\)) then has the form

$${\varvec{A}} = \left( {{\varvec{A}}_{1} , \cdots ,{\varvec{A}}_{n} } \right)$$

So, for any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\), the element of the matrix \({\varvec{A}}\) is \({\varvec{A}}_{ji} = \sum\nolimits_{m = 1}^{d} {\mu_{m}^{t} \kappa ({\varvec{x}}_{jm} , \, {\varvec{x}}_{im} )}\) for all \(i,j = 1, \cdots ,n\).
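
To make the construction of \({\varvec{A}}\) concrete, the following sketch builds the row-wise multi-kernel matrix with NumPy. It assumes, purely for illustration, a per-feature Gaussian kernel \(\kappa (a,b) = \exp ( - \gamma (a - b)^{2} )\); the kernel choice, the bandwidth `gamma`, and the function names are ours, not fixed by the paper.

```python
import numpy as np

def feature_wise_rbf(X, gamma=1.0):
    """Per-feature scalar kernel: K[j, i, m] = exp(-gamma * (X[j, m] - X[i, m])**2).

    The Gaussian form and the single bandwidth gamma are illustrative choices;
    any valid per-feature kernel kappa(x_jm, x_im) could be substituted.
    """
    diff = X[:, None, :] - X[None, :, :]          # shape (n, n, d)
    return np.exp(-gamma * diff ** 2)

def row_wise_multikernel(X, mu, gamma=1.0):
    """Row-wise multi-kernel matrix A with A[j, i] = sum_m mu[m] * kappa(x_jm, x_im)."""
    U = feature_wise_rbf(X, gamma)                # (n, n, d) tensor of the vectors u_{ji}
    return U @ mu                                 # weighted sum over features -> (n, n)

# toy usage
X = np.random.randn(5, 3)
mu = np.full(3, 1.0 / 3.0)                        # current kernel weight vector mu^t
A = row_wise_multikernel(X, mu)
print(A.shape)                                    # (5, 5)
```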

1.2 Proof of equality (8)

For any feature \({\varvec{f}}_{m}\) (\({\varvec{f}}_{m} \in {\mathbb{R}}^{n}\), \(m = 1, \cdots ,d\)) from the training set \({\varvec{T}}\), assume that the mapping function \(\phi ( \cdot )\) transforms each feature value \(x_{jm}\) (\(j = 1, \cdots ,n\)) from the input space into a new feature space. The Kronecker (outer) product \({\varvec{P}}\) (\({\varvec{P}} \in {\mathbb{R}}^{n \times n}\)) for the feature \({\varvec{f}}_{m}\) is then computed by

$$\begin{gathered} {\varvec{P}} = {\varvec{f}}_{m} {\varvec{f}}_{m}^{T} \hfill \\ \quad = \left[ {\phi (x_{1m} ), \cdots ,\phi (x_{nm} )} \right]^{T} \left[ {\phi (x_{1m} ), \cdots ,\phi (x_{nm} )} \right] \hfill \\ \quad = \left( {\begin{array}{*{20}c} {\phi (x_{1m} )\phi (x_{1m} )} & \cdots & {\phi (x_{1m} )\phi (x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\phi (x_{nm} )\phi (x_{1m} )} & \cdots & {\phi (x_{nm} )\phi (x_{nm} )} \\ \end{array} } \right) \hfill \\ \quad = \left( {\begin{array}{*{20}c} {\kappa (x_{1m} ,x_{1m} )} & \cdots & {\kappa (x_{1m} ,x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\kappa (x_{nm} ,x_{1m} )} & \cdots & {\kappa (x_{nm} ,x_{nm} )} \\ \end{array} } \right). \hfill \\ \end{gathered}$$

If the coefficient vector \({\varvec{\lambda}}^{t}\) at iteration \(t\) is given, then the column-wise multi-kernel vector \({\varvec{B}}_{m}\) (\({\varvec{B}}_{m} \in {\mathbb{R}}^{n}\)) with respect to the \(m{\text{th}}\) feature can be obtained by multiplying the transposed matrix \({\varvec{P}}^{T}\) by the vector \({\varvec{\lambda}}^{t}\):

$$\begin{gathered} {\varvec{B}}_{m} = {\varvec{P}}^{T} {\varvec{\lambda}}^{t} \hfill \\ { = }\left( {\begin{array}{*{20}c} {\kappa (x_{1m} ,x_{1m} )} & \ldots & {\kappa (x_{1m} ,x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\kappa (x_{nm} ,x_{1m} )} & \cdots & {\kappa (x_{nm} ,x_{nm} )} \\ \end{array} } \right)^{T} \left[ \begin{gathered} \lambda_{1}^{t} \hfill \\ \vdots \hfill \\ \lambda_{n}^{t} \hfill \\ \end{gathered} \right] \hfill \\ \, = \left[ {\sum\limits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} ,x_{1m} )} , \cdots ,\sum\limits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} ,x_{nm} )} } \right]^{T} . \hfill \\ \end{gathered}$$

The column-wise multi-kernel matrix \({\varvec{B}}\) (\({\varvec{B}} \in {\mathbb{R}}^{n \times d}\)) is denoted as

$${\varvec{B}} = \left( {{\varvec{B}}_{1} , \cdots ,{\varvec{B}}_{d} } \right).$$

Thus, for any input point \({\varvec{x}}_{i}\) and the feature \({\varvec{f}}_{m}\), the element of the matrix \({\varvec{B}}\) is \({\varvec{B}}_{im} = \sum\nolimits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} , x_{im} )}\) for all \(i = 1, \cdots ,n\) and \(m = 1, \cdots ,d\).
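
Similarly, a minimal sketch of the column-wise multi-kernel matrix \({\varvec{B}}\), again assuming the same illustrative per-feature Gaussian kernel; `gamma` and the function name are our assumptions.

```python
import numpy as np

def column_wise_multikernel(X, lam, gamma=1.0):
    """Column-wise multi-kernel matrix B with B[i, m] = sum_j lam[j] * kappa(x_jm, x_im)."""
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d), entry [j, i, m] = x_jm - x_im
    K = np.exp(-gamma * diff ** 2)                # per-feature kernel values kappa(x_jm, x_im)
    # weighted sum over j with the current coefficient vector lambda^t
    return np.einsum('j,jim->im', lam, K)

# toy usage
X = np.random.randn(5, 3)
lam = np.random.randn(5)                          # current coefficient vector lambda^t
B = column_wise_multikernel(X, lam)
print(B.shape)                                    # (5, 3)
```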

1.3 Proof of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm via ADMM

For the \(\lambda {\text{ - step}}\) EM2NOLC model (7) and its ADMM optimization problem (13), with separable objective and equality constraint functions, the augmented Lagrangian with respect to the scaled dual variable \({\varvec{u}}\) (\({\varvec{u}} \in {\mathbb{R}}^{n}\)) and the penalty parameter \(\rho\) (\(\rho > 0\)) is defined as

$${\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}}) = f({\varvec{\lambda}}) + g({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - \, {\varvec{q}} + {\varvec{u}}} \right\|_{2}^{2} - \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\varvec{u}} \right\|_{2}^{2}.$$

The ADMM updates (17), (18), and (16) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm are obtained, respectively, from the partial derivatives of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}})\) with respect to its three variables \({\varvec{\lambda}}\), \({\varvec{q}}\), and \({\varvec{u}}\). We can express the \(\lambda {\text{ - minimization}}\) as the proximal operator:

$$\begin{gathered} {\varvec{\lambda}}^{k + 1} = \mathop {\arg \min }\limits_{{\lambda \in {\mathbb{R}}^{n} }} {\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{\lambda \in {\mathbb{R}}^{n} }} \left\{ {f({\varvec{\lambda}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Then, setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} )\) with respect to \({\varvec{\lambda}}\) to zero, we analytically obtain the resulting \(\lambda {\text{ - update}}\):

$$\begin{gathered} \, \nabla_{\lambda } {\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} ) = 0 \hfill \\ \Rightarrow \nabla_{\lambda } \left\{ {f({\varvec{\lambda}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow \nabla_{\lambda } \left\{ {({1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-\nulldelimiterspace} 2})\left\| {{\varvec{y}} \odot \left( {\user2{A\lambda } - b_{1} {\mathbf{1}}_{n} } \right) - 1_{n} } \right\|_{2}^{2} + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow {\varvec{A}}_{y}^{T} {\varvec{A}}_{y} {\varvec{\lambda}} - {\varvec{A}}_{y}^{T} \left( {b_{1} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho \lambda - \rho \left( {{\varvec{q}}^{k} - {\varvec{u}}^{k} } \right) = 0 \hfill \\ \Rightarrow \lambda^{k + 1} = \left( {{\varvec{A}}_{y}^{T} {\varvec{A}}_{y} + \rho I_{n} } \right)^{ - 1} \left\{ {{\varvec{A}}_{y}^{T} \left( {b_{1} y + {\mathbf{1}}_{n} } \right) + \rho \left( {{\varvec{q}}^{k} - {\varvec{u}}^{k} } \right)} \right\}.{ } \hfill \\ \end{gathered}$$
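
As a concrete illustration, the closed-form \(\lambda {\text{ - update}}\) above can be coded directly. The sketch assumes, consistently with the derivation, that \({\varvec{A}}_{y} = {\text{diag}}({\varvec{y}}){\varvec{A}}\); the function name `lambda_update` is ours.

```python
import numpy as np

def lambda_update(A, y, b1, q, u, rho):
    """Closed-form lambda-update of the lambda-step ADMM.

    Assumes A_y = diag(y) @ A, which matches the objective
    (1/2) * || y * (A @ lam - b1) - 1 ||_2^2 used in the derivation.
    """
    n = A.shape[0]
    Ay = y[:, None] * A                                    # diag(y) @ A without forming diag(y)
    lhs = Ay.T @ Ay + rho * np.eye(n)                       # A_y^T A_y + rho * I_n
    rhs = Ay.T @ (b1 * y + np.ones(n)) + rho * (q - u)      # A_y^T (b1*y + 1_n) + rho*(q^k - u^k)
    return np.linalg.solve(lhs, rhs)                        # lambda^{k+1}
```

Since \({\varvec{A}}_{y}^{T} {\varvec{A}}_{y} + \rho {\varvec{I}}_{n}\) does not change across the inner iterations, its factorization could be computed once and reused, a common ADMM implementation choice.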

The \(q{\text{ - minimization}}\) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm has the form

$$\begin{gathered} {\varvec{q}}^{k + 1} = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} {\varvec{L}}_{\rho } (\lambda^{k + 1} , \, {\varvec{q}}, \, {\varvec{u}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} \left\{ {g({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\lambda^{k + 1} - {\varvec{q}} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Because \(S_{0} (C_{\lambda } ) = \{ {\varvec{q}} \in {\mathbb{R}}^{n} |\left\| {\varvec{q}} \right\|_{0} \le C_{\lambda } \}\) is a nonconvex set, the \(q{\text{ - minimization}}\) may not converge to an optimal point. We can approximate the \(q{\text{ - update}}\) procedure with a projection onto \(S_{0} (C_{\lambda } )\). So, we have the \(q{\text{ - update}}\):

$$\begin{gathered} {\varvec{q}}^{k + 1} = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} \left\{ {I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}}^{k + 1} - {\varvec{q}} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} \hfill \\ \, \approx \mathop {\arg \min }\limits_{{{\varvec{q}} \in S_{0} (C_{\lambda } )}} \left\| {{\varvec{\lambda}}^{k + 1} + {\varvec{u}}^{k} - {\varvec{q}}} \right\|_{2}^{2} \hfill \\ \, = \Pi_{{S_{0} (C_{\lambda } )}} ({\varvec{\lambda}}^{k + 1} + {\varvec{u}}^{k} ), \hfill \\ \end{gathered}$$

where the indicator function satisfies \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = 0\) if \({\varvec{q}} \in S_{0} (C_{\lambda } )\) and \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = + \infty\) otherwise. For any vector \({\varvec{q}}\) (\({\varvec{q}} \in {\mathbb{R}}^{n}\)), the projection operator \(\Pi_{{S_{0} (C_{\lambda } )}} ({\varvec{q}})\) can be implemented by sorting the elements of the vector \({\varvec{q}}\) in descending order of their absolute values and setting all elements to zero except the largest \(C_{\lambda }\) of them.
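
A minimal sketch of the projection \(\Pi_{{S_{0} (C_{\lambda } )}}\) described above, i.e., hard thresholding that keeps the \(C_{\lambda }\) largest-magnitude entries; the function name is ours.

```python
import numpy as np

def project_l0(v, C):
    """Project v onto S_0(C) = { q : ||q||_0 <= C } by keeping the C entries of
    largest absolute value and zeroing the rest (hard thresholding)."""
    q = np.zeros_like(v)
    if C <= 0:
        return q
    keep = np.argsort(np.abs(v))[-C:]     # indices of the C largest |v_i|
    q[keep] = v[keep]
    return q

# q-update from the text: q^{k+1} = Pi_{S_0(C_lambda)}(lambda^{k+1} + u^k)
# q_next = project_l0(lam_next + u, C_lambda)
```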

The \(u{\text{ - update}}\) step accumulates the constraint residuals \({\varvec{\lambda}}^{k + 1} - {\varvec{q}}^{k + 1}\) of the optimization problem (13), which drives the ADMM iterations (17), (18), and (16) toward convergence.
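
In the scaled form used here, the \(u{\text{ - update}}\) is simply \({\varvec{u}}^{k + 1} = {\varvec{u}}^{k} + {\varvec{\lambda}}^{k + 1} - {\varvec{q}}^{k + 1}\), and a common way to monitor convergence is to track the primal and dual residuals. The sketch below shows this; the tolerances are illustrative values, not settings from the paper.

```python
import numpy as np

def u_update_with_residuals(lam, q, q_prev, u, rho):
    """Scaled dual update for the lambda-step plus the usual ADMM residual norms.

    r = lam - q is the primal residual and s = -rho * (q - q_prev) the dual
    residual; both shrinking toward zero is the standard ADMM stopping heuristic.
    """
    u_next = u + lam - q                          # running sum of constraint residuals
    r_norm = np.linalg.norm(lam - q)              # primal residual norm
    s_norm = np.linalg.norm(rho * (q - q_prev))   # dual residual norm
    return u_next, r_norm, s_norm

# e.g., stop the inner iterations once r_norm < 1e-4 and s_norm < 1e-4
```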

1.4 Proof of the \(\mu {\text{ - step}}\) EM2NOLC algorithm via ADMM

Similar to “Proof of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm via ADMM” in Appendix A, for the \(\mu {\text{ - step}}\) EM2NOLC model (9) and its ADMM optimization problem (20), with separable objective functions and equality constraints, the augmented Lagrangian is denoted as

$${\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}}) = h({\varvec{\mu}}) + l({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\varvec{z}} + {\varvec{v}}} \right\|_{2}^{2} - \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\varvec{v}} \right\|_{2}^{2},$$

with the scaled dual variable \({\varvec{v}}\) (\({\varvec{v}} \in {\mathbb{R}}^{d}\)).

The ADMM updates (24), (25), and (23) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm are generated, respectively, from the KKT optimality conditions of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}})\) with respect to its three variables \({\varvec{\mu}}\), \({\mathbf{z}}\), and \({\varvec{v}}\). The \(\mu {\text{ - minimization}}\) is defined as the proximal operator:

$$\begin{gathered} {\varvec{\mu}}^{k + 1} = \mathop {\arg \min }\limits_{{\mu \in {\mathbb{R}}^{d} }} {\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}^{k} , \, {\varvec{v}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{\mu \in {\mathbb{R}}^{d} }} \left\{ {h(\mu ) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\varvec{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}^{k} , \, {\varvec{v}}^{k} )\) with respect to \({\varvec{\mu}}\) to zero, we obtain the analytical solution, that is, the \(\mu {\text{ - update}}\):

$$\begin{gathered} \, \nabla_{{\varvec{\mu}}} {\varvec{L}}_{\rho } {(}{\varvec{\mu}}, \, {\mathbf{z}}^{k} , \, {\varvec{v}}^{k} {)} = 0 \hfill \\ \Rightarrow \nabla_{{\varvec{\mu}}} \left\{ {h({\varvec{\mu}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\mathbf{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow \nabla_{{\varvec{\mu}}} \left\{ {({1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-\nulldelimiterspace} 2})\left\| {{\varvec{y}} \odot \left( {\user2{B\mu } - b_{2} {\mathbf{1}}_{n} } \right) - {\mathbf{1}}_{n} } \right\|_{2}^{2} + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\mathbf{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}{ = 0} \hfill \\ \Rightarrow {\varvec{B}}_{y}^{T} {\varvec{B}}_{y} {\varvec{\mu}} - {\varvec{B}}_{y}^{T} \left( {b_{2} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho {\varvec{\mu}} - \rho \left( {{\mathbf{z}}^{k} - {\varvec{v}}^{k} } \right) = 0 \hfill \\ \Rightarrow {\varvec{\mu}}^{k + 1} = \left( {{\varvec{B}}_{y}^{T} {\varvec{B}}_{y} + \rho {\varvec{I}}_{d} } \right)^{ - 1} \left\{ {{\varvec{B}}_{y}^{T} \left( {b_{2} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho \left( {{\mathbf{z}}^{k} - {\varvec{v}}^{k} } \right)} \right\}{. } \hfill \\ \end{gathered}$$

The \(z{\text{ - minimization}}\) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm takes the form

$$\begin{gathered} {\mathbf{z}}^{k + 1} = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} {\varvec{L}}_{\rho } {(}{\varvec{\mu}}^{k + 1} , \, {\mathbf{z}}, \, {\varvec{v}}^{k} {)} \hfill \\ { } = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} \left\{ {l({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}}^{k + 1} - {\mathbf{z}} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Similarly, because \(S_{0} (C_{\mu } ) = \{ {\mathbf{z}} \in {\mathbb{R}}^{d} |\left\| {\mathbf{z}} \right\|_{0} \le C_{\mu } \}\) is a nonconvex set, we apply a projection step to the \(z{\text{ - minimization}}\) to obtain the \(z{\text{ - update}}\):

$$\begin{gathered} {\mathbf{z}}^{k + 1} = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} \left\{ {I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}}^{k + 1} - {\mathbf{z}} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\} \hfill \\ \, \approx \mathop {\arg \min }\limits_{{{\mathbf{z}} \in S_{0} (C_{\mu } )}} \left\| {{\varvec{\mu}}^{k + 1} + {\varvec{v}}^{k} - {\mathbf{z}}} \right\|_{2}^{2} \hfill \\ \, = \Pi_{{S_{0} (C_{\mu } )}} ({\varvec{\mu}}^{k + 1} + {\varvec{v}}^{k} ), \hfill \\ \end{gathered}$$

with the indicator function \(I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) = 0\) for \({\mathbf{z}} \in S_{0} (C_{\mu } )\) and \(I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) = + \infty\) for \({\mathbf{z}} \notin S_{0} (C_{\mu } )\). For any vector \({\mathbf{z}}\) (\({\mathbf{z}} \in {\mathbb{R}}^{d}\)), the projection operator \(\Pi_{{S_{0} (C_{\mu } )}} ({\mathbf{z}})\) can be carried out by sorting the elements of the vector \({\mathbf{z}}\) in descending order of their absolute values and setting all elements to zero except the largest \(C_{\mu }\) elements.

Finally, the \(v{\text{ - update}}\) step accumulates the constraint residuals \({\varvec{\mu}}^{k + 1} - {\mathbf{z}}^{k + 1}\) of the ADMM problem (20), which drives the ADMM updates (24), (25), and (23) toward convergence.
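
Because the \(\lambda {\text{ - step}}\) and \(\mu {\text{ - step}}\) share the same structure (a least-squares fit under an \(\ell_{0}\) constraint), the whole alternation can be sketched with a single generic ADMM routine. The sketch below is ours, not the authors' released MATLAB implementation; the kernel choice, the biases \(b_{1}\) and \(b_{2}\), sparsity levels, iteration counts, and all function names are illustrative assumptions.

```python
import numpy as np

def featurewise_kernel(X, gamma=1.0):
    """Per-feature Gaussian kernel tensor K[j, i, m] = kappa(x_jm, x_im) (an assumed choice)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-gamma * diff ** 2)

def admm_l0_ls(M, target, C, rho=1.0, iters=50):
    """Generic ADMM for  min 0.5*||M w - target||^2  s.t.  ||w||_0 <= C  (scaled form).

    The lambda-step (M = A_y, target = b1*y + 1) and the mu-step
    (M = B_y, target = b2*y + 1) both reduce to this template."""
    p = M.shape[1]
    w, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    lhs = M.T @ M + rho * np.eye(p)                  # fixed across inner iterations
    for _ in range(iters):
        w = np.linalg.solve(lhs, M.T @ target + rho * (z - u))        # w-update
        v = w + u
        z = np.zeros(p)
        keep = np.argsort(np.abs(v))[-C:]                             # projection onto the l0 ball
        z[keep] = v[keep]
        u = u + w - z                                                 # dual update
    return z                                                          # sparse iterate

def em2nolc_sketch(X, y, b1=0.0, b2=0.0, C_lam=10, C_mu=2, outer=5, gamma=1.0):
    """Outer alternation between the lambda-step and the mu-step (simplified)."""
    n, d = X.shape
    K = featurewise_kernel(X, gamma)                 # (n, n, d) per-feature kernel values
    lam, mu = np.zeros(n), np.full(d, 1.0 / d)
    ones = np.ones(n)
    for _ in range(outer):
        A = K @ mu                                   # row-wise multi-kernel matrix, (n, n)
        lam = admm_l0_ls(y[:, None] * A, b1 * y + ones, C_lam)        # lambda-step
        B = np.einsum('j,jim->im', lam, K)           # column-wise multi-kernel matrix, (n, d)
        mu = admm_l0_ls(y[:, None] * B, b2 * y + ones, C_mu)          # mu-step
    return lam, mu
```

A usage line such as `lam, mu = em2nolc_sketch(X, y)` for labels \(y_{i} \in \{ - 1, + 1\}\) returns the sparse coefficient and kernel weight vectors, whose nonzero entries mark the important instances and features discussed in the paper.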

Appendix B

2.1 KS curves in Fig. 2

See Fig. 2.

Fig. 2

KS curves of four ADMM classifiers on four medical test sets (the upper solid lines for AP and the lower dotted lines for NP)

2.2 ROC curves in Fig. 3

See Fig. 3.

Fig. 3

ROC curves of four ADMM classifiers on four medical test sets

2.3 IIs for EM2NOLC in Fig. 4

See Fig. 4.

Fig. 4

Important instances (IIs) with their percentage values of II (%) given by EM2NOLC on four training sets

2.4 IFs for EM2NOLC in Fig. 5

See Fig. 5.

Fig. 5

Important features (IFs) with their percentage values of FI (%) given by EM2NOLC on four training sets

2.5 Correlation analysis of selected features in Fig. 6

See Fig. 6.

Fig. 6

Correlation analysis of selected features across folds for the GC dataset

2.6 Correlation analysis of selected features in Fig. 7

See Fig. 7.

Fig. 7

Correlation analysis of selected features across folds for the MSJ dataset


About this article


Cite this article

Zhang, Z., He, J., Cao, J. et al. An explainable multi-sparsity multi-kernel nonconvex optimization least-squares classifier method via ADMM. Neural Comput & Applic 34, 16103–16128 (2022). https://doi.org/10.1007/s00521-022-07282-6

