
An explainable multi-sparsity multi-kernel nonconvex optimization least-squares classifier method via ADMM

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Convex optimization techniques are widely applied in the models, algorithms, and applications of machine learning and data mining. For optimization-based classification methods, the sparsity principle helps to select simple classifier models, while single- and multi-kernel methods effectively handle nonlinearly separable problems. However, limited sparsity and kernel methods hinder improvements in predictive accuracy, efficiency, iterative updating, and the interpretability of the classification model. In this paper, we propose a new Explainable Multi-sparsity Multi-kernel Nonconvex Optimization Least-squares Classifier (EM2NOLC) model to address these issues: an optimization problem with a least-squares objective function and multi-sparsity multi-kernel nonconvex constraints. Based on reconstructed multiple kernel learning (MKL), the proposed model extracts important instances and features by finding sparse coefficient and kernel weight vectors, which are used to compute their importance or contribution to classification and to obtain explainable predictions. The corresponding EM2NOLC algorithm is implemented with the Alternating Direction Method of Multipliers (ADMM). On real classification datasets, compared with three ADMM classifiers (a linear Support Vector Machine classifier, SVMC, and a Least Absolute Shrinkage and Selection Operator classifier), two MKL classifiers (SimpleMKL and EasyMKL), and a gradient descent classifier (Feature Selection for SVMC), the proposed EM2NOLC generally obtains the best predictive performance and explainable results with the fewest important instances and features, each having a different contribution percentage.


Notes

  1. https://github.com/sagedavid/EM2NOLC.


Acknowledgements

The authors would like to thank anonymous reviewers for their valuable comments and suggestions. This research has been partially supported by the Key Program of National Natural Science Foundation of China under grant 92046026, in part by the National Natural Science Foundation of China (No. 61877061, 71271191, 71871109, 91646204, 71701089), in part by the Jiangsu Provincial Key Research and Development Program under Grant BE2020001-3, in part by the Jiangsu Provincial Policy Guidance Program under grant BZ2020008, and in part by the High-End Foreign Experts Projects under grant G2021194011L.

Author information

Contributions

ZZ contributed to conceptualization, methodology, validation, formal analysis, writing—original draft, funding acquisition. JH contributed to methodology, investigation, resources, validation. JC contributed to data curation, supervision, funding acquisition. Shuqing Li contributed to validation, software, visualization. XL contributed to formal analysis, visualization, funding acquisition. KZ contributed to data curation, visualization, validation. PW contributed to formal analysis, writing—checking, resources. SY contributed to writing—review and editing, supervision, project administration.

Corresponding author

Correspondence to Zhiwang Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

1.1 Proof of equality (6)

For any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) (\(i,j = 1, \cdots ,n\)) from the training set \({\varvec{T}}\), suppose that a basis function \(\phi ( \cdot )\) maps any feature value \(x_{jm}\) (\(m = 1, \cdots ,d\)) from the input space to a new feature space. If their product \(\phi (x_{jm} ) \phi (x_{im} )\) in the feature space can be replaced with the kernel function \(\kappa (x_{jm} , x_{im} )\) for the \(m{\text{th}}\) feature, then the kernel vector \({\varvec{u}}_{ji}\) (\({\varvec{u}}_{ji} \in {\mathbb{R}}^{d}\)) over the \(d\) features is denoted as

$$\begin{gathered} {\varvec{u}}_{ji} = \left[ {\phi (x_{j1} ), \cdots ,\phi (x_{jd} )} \right] \odot \left[ {\phi (x_{i1} ), \cdots ,\phi (x_{id} )} \right] \hfill \\ \, \quad \quad = \left[ {\phi (x_{j1} )\phi (x_{i1} ), \cdots ,\phi (x_{jd} )\phi (x_{id} )} \right] \hfill \\ \, \quad \quad= \left[ {\kappa (x_{j1} ,x_{i1} ), \cdots ,\kappa (x_{jd} ,x_{id} )} \right]. \hfill \\ \end{gathered}$$

Given the kernel weight vector \({\varvec{\mu}}^{t}\) at iteration \(t\), the row-wise multi-kernel vector \({\varvec{A}}_{j}\) (\({\varvec{A}}_{j} \in {\mathbb{R}}^{n}\)) is defined as a weighted similarity between the input point \({\varvec{x}}_{j}\) and the other input points in the training set; that is, we have the equality

$${\varvec{A}}_{j} = \left( {{\varvec{u}}_{j1} {\varvec{\mu}}^{t} , \cdots ,{\varvec{u}}_{jn} {\varvec{\mu}}^{t} } \right)^{T} ,\;\;j = 1, \cdots ,n$$

The row-wise multi-kernel matrix \({\varvec{A}}\) (\({\varvec{A}} \in {\mathbb{R}}^{n \times n}\)) then has the form

$${\varvec{A}} = \left( {{\varvec{A}}_{1} , \cdots ,{\varvec{A}}_{n} } \right)$$

So, for any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\), the element of the matrix \({\varvec{A}}\) is \({\varvec{A}}_{ji} = \sum\nolimits_{m = 1}^{d} {\mu_{m}^{t} \kappa ({\varvec{x}}_{jm} , \, {\varvec{x}}_{im} )}\) for all \(i,j = 1, \cdots ,n\).
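
To make the construction of \({\varvec{A}}\) concrete, the following sketch builds the row-wise multi-kernel matrix with NumPy. It assumes, purely for illustration, a per-feature Gaussian kernel \(\kappa (a,b) = \exp ( - \gamma (a - b)^{2} )\); the kernel choice, the bandwidth `gamma`, and the function names are ours, not fixed by the paper.

```python
import numpy as np

def feature_wise_rbf(X, gamma=1.0):
    """Per-feature scalar kernel: K[j, i, m] = exp(-gamma * (X[j, m] - X[i, m])**2).

    The Gaussian form and the single bandwidth gamma are illustrative choices;
    any valid per-feature kernel kappa(x_jm, x_im) could be substituted.
    """
    diff = X[:, None, :] - X[None, :, :]          # shape (n, n, d)
    return np.exp(-gamma * diff ** 2)

def row_wise_multikernel(X, mu, gamma=1.0):
    """Row-wise multi-kernel matrix A with A[j, i] = sum_m mu[m] * kappa(x_jm, x_im)."""
    U = feature_wise_rbf(X, gamma)                # (n, n, d) tensor of the vectors u_{ji}
    return U @ mu                                 # weighted sum over features -> (n, n)

# toy usage
X = np.random.randn(5, 3)
mu = np.full(3, 1.0 / 3.0)                        # current kernel weight vector mu^t
A = row_wise_multikernel(X, mu)
print(A.shape)                                    # (5, 5)
```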

1.2 Proof of equality (8)

For any feature \({\varvec{f}}_{m}\) (\({\varvec{f}}_{m} \in {\mathbb{R}}^{n}\), \(m = 1, \cdots ,d\)) from the training set \({\varvec{T}}\), assume that the mapping function \(\phi ( \cdot )\) transforms each feature value \(x_{jm}\) (\(j = 1, \cdots ,n\)) from the input space into a new feature space. The Kronecker (outer) product \({\varvec{P}}\) (\({\varvec{P}} \in {\mathbb{R}}^{n \times n}\)) for the feature \({\varvec{f}}_{m}\) is then computed by

$$\begin{gathered} {\varvec{P}} = {\varvec{f}}_{m} {\varvec{f}}_{m}^{T} \hfill \\ \quad = \left[ {\phi (x_{1m} ), \cdots ,\phi (x_{nm} )} \right]^{T} \left[ {\phi (x_{1m} ), \cdots ,\phi (x_{nm} )} \right] \hfill \\ \quad = \left( {\begin{array}{*{20}c} {\phi (x_{1m} )\phi (x_{1m} )} & \cdots & {\phi (x_{1m} )\phi (x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\phi (x_{nm} )\phi (x_{1m} )} & \cdots & {\phi (x_{nm} )\phi (x_{nm} )} \\ \end{array} } \right) \hfill \\ \quad = \left( {\begin{array}{*{20}c} {\kappa (x_{1m} ,x_{1m} )} & \cdots & {\kappa (x_{1m} ,x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\kappa (x_{nm} ,x_{1m} )} & \cdots & {\kappa (x_{nm} ,x_{nm} )} \\ \end{array} } \right). \hfill \\ \end{gathered}$$

If the coefficient vector \({\varvec{\lambda}}^{t}\) at iteration \(t\) is given, then the column-wise multi-kernel vector \({\varvec{B}}_{m}\) (\({\varvec{B}}_{m} \in {\mathbb{R}}^{n}\)) with respect to the \(m{\text{th}}\) feature can be obtained by multiplying the transposed matrix \({\varvec{P}}^{T}\) by the vector \({\varvec{\lambda}}^{t}\):

$$\begin{gathered} {\varvec{B}}_{m} = {\varvec{P}}^{T} {\varvec{\lambda}}^{t} \hfill \\ { = }\left( {\begin{array}{*{20}c} {\kappa (x_{1m} ,x_{1m} )} & \ldots & {\kappa (x_{1m} ,x_{nm} )} \\ \vdots & \ddots & \vdots \\ {\kappa (x_{nm} ,x_{1m} )} & \cdots & {\kappa (x_{nm} ,x_{nm} )} \\ \end{array} } \right)^{T} \left[ \begin{gathered} \lambda_{1}^{t} \hfill \\ \vdots \hfill \\ \lambda_{n}^{t} \hfill \\ \end{gathered} \right] \hfill \\ \, = \left[ {\sum\limits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} ,x_{1m} )} , \cdots ,\sum\limits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} ,x_{nm} )} } \right]^{T} . \hfill \\ \end{gathered}$$

The column-wise multi-kernel matrix \({\varvec{B}}\) (\({\varvec{B}} \in {\mathbb{R}}^{n \times d}\)) is denoted as

$${\varvec{B}} = \left( {{\varvec{B}}_{1} , \cdots ,{\varvec{B}}_{d} } \right).$$

Thus, for any input point \({\varvec{x}}_{i}\) and the feature \({\varvec{f}}_{m}\), the element of the matrix \({\varvec{B}}\) is \({\varvec{B}}_{im} = \sum\nolimits_{j = 1}^{n} {\lambda_{j}^{t} \kappa (x_{jm} , x_{im} )}\) for all \(i = 1, \cdots ,n\) and \(m = 1, \cdots ,d\).
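
Similarly, a minimal sketch of the column-wise multi-kernel matrix \({\varvec{B}}\), again assuming the same illustrative per-feature Gaussian kernel; `gamma` and the function name are our assumptions.

```python
import numpy as np

def column_wise_multikernel(X, lam, gamma=1.0):
    """Column-wise multi-kernel matrix B with B[i, m] = sum_j lam[j] * kappa(x_jm, x_im)."""
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d), entry [j, i, m] = x_jm - x_im
    K = np.exp(-gamma * diff ** 2)                # per-feature kernel values kappa(x_jm, x_im)
    # weighted sum over j with the current coefficient vector lambda^t
    return np.einsum('j,jim->im', lam, K)

# toy usage
X = np.random.randn(5, 3)
lam = np.random.randn(5)                          # current coefficient vector lambda^t
B = column_wise_multikernel(X, lam)
print(B.shape)                                    # (5, 3)
```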

1.3 Proof of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm via ADMM

For the \(\lambda {\text{ - step}}\) EM2NOLC model (7) and its ADMM optimization problem (13), with separable objective and equality constraint functions, the augmented Lagrangian with respect to the scaled dual variable \({\varvec{u}}\) (\({\varvec{u}} \in {\mathbb{R}}^{n}\)) and the penalty parameter \(\rho\) (\(\rho > 0\)) is defined as

$${\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}}) = f({\varvec{\lambda}}) + g({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - \, {\varvec{q}} + {\varvec{u}}} \right\|_{2}^{2} - \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\varvec{u}} \right\|_{2}^{2}.$$

The ADMM updates (17), (18), and (16) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm are obtained, respectively, from the partial derivatives of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}})\) with respect to its three variables \({\varvec{\lambda}}\), \({\varvec{q}}\), and \({\varvec{u}}\). We can express the \(\lambda {\text{ - minimization}}\) as the proximal operator:

$$\begin{gathered} {\varvec{\lambda}}^{k + 1} = \mathop {\arg \min }\limits_{{\lambda \in {\mathbb{R}}^{n} }} {\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{\lambda \in {\mathbb{R}}^{n} }} \left\{ {f({\varvec{\lambda}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Then, setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} )\) with respect to \({\varvec{\lambda}}\) to zero, we analytically obtain the resulting \(\lambda {\text{ - update}}\):

$$\begin{gathered} \, \nabla_{\lambda } {\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} ) = 0 \hfill \\ \Rightarrow \nabla_{\lambda } \left\{ {f({\varvec{\lambda}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow \nabla_{\lambda } \left\{ {({1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-\nulldelimiterspace} 2})\left\| {{\varvec{y}} \odot \left( {\user2{A\lambda } - b_{1} {\mathbf{1}}_{n} } \right) - 1_{n} } \right\|_{2}^{2} + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}}^{k} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow {\varvec{A}}_{y}^{T} {\varvec{A}}_{y} {\varvec{\lambda}} - {\varvec{A}}_{y}^{T} \left( {b_{1} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho \lambda - \rho \left( {{\varvec{q}}^{k} - {\varvec{u}}^{k} } \right) = 0 \hfill \\ \Rightarrow \lambda^{k + 1} = \left( {{\varvec{A}}_{y}^{T} {\varvec{A}}_{y} + \rho I_{n} } \right)^{ - 1} \left\{ {{\varvec{A}}_{y}^{T} \left( {b_{1} y + {\mathbf{1}}_{n} } \right) + \rho \left( {{\varvec{q}}^{k} - {\varvec{u}}^{k} } \right)} \right\}.{ } \hfill \\ \end{gathered}$$
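
As a concrete illustration, the closed-form \(\lambda {\text{ - update}}\) above can be coded directly. The sketch assumes, consistently with the derivation, that \({\varvec{A}}_{y} = {\text{diag}}({\varvec{y}}){\varvec{A}}\); the function name `lambda_update` is ours.

```python
import numpy as np

def lambda_update(A, y, b1, q, u, rho):
    """Closed-form lambda-update of the lambda-step ADMM.

    Assumes A_y = diag(y) @ A, which matches the objective
    (1/2) * || y * (A @ lam - b1) - 1 ||_2^2 used in the derivation.
    """
    n = A.shape[0]
    Ay = y[:, None] * A                                    # diag(y) @ A without forming diag(y)
    lhs = Ay.T @ Ay + rho * np.eye(n)                       # A_y^T A_y + rho * I_n
    rhs = Ay.T @ (b1 * y + np.ones(n)) + rho * (q - u)      # A_y^T (b1*y + 1_n) + rho*(q^k - u^k)
    return np.linalg.solve(lhs, rhs)                        # lambda^{k+1}
```

Since \({\varvec{A}}_{y}^{T} {\varvec{A}}_{y} + \rho {\varvec{I}}_{n}\) does not change across the inner iterations, its factorization could be computed once and reused, a common ADMM implementation choice.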

The \(q{\text{ - minimization}}\) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm has the form

$$\begin{gathered} {\varvec{q}}^{k + 1} = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} {\varvec{L}}_{\rho } (\lambda^{k + 1} , \, {\varvec{q}}, \, {\varvec{u}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} \left\{ {g({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\lambda^{k + 1} - {\varvec{q}} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Because \(S_{0} (C_{\lambda } ) = \{ {\varvec{q}} \in {\mathbb{R}}^{n} |\left\| {\varvec{q}} \right\|_{0} \le C_{\lambda } \}\) is a nonconvex set, the \(q{\text{ - minimization}}\) may not converge to an optimal point. We can approximate the \(q{\text{ - update}}\) procedure with a projection onto \(S_{0} (C_{\lambda } )\). So, we have the \(q{\text{ - update}}\):

$$\begin{gathered} {\varvec{q}}^{k + 1} = \mathop {\arg \min }\limits_{{{\varvec{q}} \in {\mathbb{R}}^{n} }} \left\{ {I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\lambda}}^{k + 1} - {\varvec{q}} + {\varvec{u}}^{k} } \right\|_{2}^{2} } \right\} \hfill \\ \, \approx \mathop {\arg \min }\limits_{{{\varvec{q}} \in S_{0} (C_{\lambda } )}} \left\| {{\varvec{\lambda}}^{k + 1} + {\varvec{u}}^{k} - {\varvec{q}}} \right\|_{2}^{2} \hfill \\ \, = \Pi_{{S_{0} (C_{\lambda } )}} ({\varvec{\lambda}}^{k + 1} + {\varvec{u}}^{k} ), \hfill \\ \end{gathered}$$

where the indicator function satisfies \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = 0\) if \({\varvec{q}} \in S_{0} (C_{\lambda } )\) and \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = + \infty\) otherwise. For any vector \({\varvec{q}}\) (\({\varvec{q}} \in {\mathbb{R}}^{n}\)), the projection operator \(\Pi_{{S_{0} (C_{\lambda } )}} ({\varvec{q}})\) can be implemented by sorting the elements of the vector \({\varvec{q}}\) in descending order of their absolute values and setting all elements to zero except the largest \(C_{\lambda }\) of them.
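
A minimal sketch of the projection \(\Pi_{{S_{0} (C_{\lambda } )}}\) described above, i.e., hard thresholding that keeps the \(C_{\lambda }\) largest-magnitude entries; the function name is ours.

```python
import numpy as np

def project_l0(v, C):
    """Project v onto S_0(C) = { q : ||q||_0 <= C } by keeping the C entries of
    largest absolute value and zeroing the rest (hard thresholding)."""
    q = np.zeros_like(v)
    if C <= 0:
        return q
    keep = np.argsort(np.abs(v))[-C:]     # indices of the C largest |v_i|
    q[keep] = v[keep]
    return q

# q-update from the text: q^{k+1} = Pi_{S_0(C_lambda)}(lambda^{k+1} + u^k)
# q_next = project_l0(lam_next + u, C_lambda)
```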

The \(u{\text{ - update}}\) step accumulates the constraint residuals \({\varvec{\lambda}}^{k + 1} - {\varvec{q}}^{k + 1}\) of the optimization problem (13), which drives the ADMM iterations (17), (18), and (16) toward convergence.
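
In the scaled form used here, the \(u{\text{ - update}}\) is simply \({\varvec{u}}^{k + 1} = {\varvec{u}}^{k} + {\varvec{\lambda}}^{k + 1} - {\varvec{q}}^{k + 1}\), and a common way to monitor convergence is to track the primal and dual residuals. The sketch below shows this; the tolerances are illustrative values, not settings from the paper.

```python
import numpy as np

def u_update_with_residuals(lam, q, q_prev, u, rho):
    """Scaled dual update for the lambda-step plus the usual ADMM residual norms.

    r = lam - q is the primal residual and s = -rho * (q - q_prev) the dual
    residual; both shrinking toward zero is the standard ADMM stopping heuristic.
    """
    u_next = u + lam - q                          # running sum of constraint residuals
    r_norm = np.linalg.norm(lam - q)              # primal residual norm
    s_norm = np.linalg.norm(rho * (q - q_prev))   # dual residual norm
    return u_next, r_norm, s_norm

# e.g., stop the inner iterations once r_norm < 1e-4 and s_norm < 1e-4
```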

1.4 Proof of the \(\mu {\text{ - step}}\) EM2NOLC algorithm via ADMM

Similar to “Proof of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm via ADMM” in Appendix A, for the \(\mu {\text{ - step}}\) EM2NOLC model (9) and its ADMM optimization problem (20), with separable objective functions and equality constraints, the augmented Lagrangian is denoted as

$${\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}}) = h({\varvec{\mu}}) + l({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\varvec{z}} + {\varvec{v}}} \right\|_{2}^{2} - \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {\varvec{v}} \right\|_{2}^{2},$$

with the scaled dual variable \({\varvec{v}}\) (\({\varvec{v}} \in {\mathbb{R}}^{d}\)).

The ADMM updates (24), (25), and (23) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm are generated, respectively, from the KKT optimality conditions of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}})\) with respect to its three variables \({\varvec{\mu}}\), \({\mathbf{z}}\), and \({\varvec{v}}\). The \(\mu {\text{ - minimization}}\) is defined as the proximal operator:

$$\begin{gathered} {\varvec{\mu}}^{k + 1} = \mathop {\arg \min }\limits_{{\mu \in {\mathbb{R}}^{d} }} {\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}^{k} , \, {\varvec{v}}^{k} ) \hfill \\ { } = \mathop {\arg \min }\limits_{{\mu \in {\mathbb{R}}^{d} }} \left\{ {h(\mu ) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\varvec{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}^{k} , \, {\varvec{v}}^{k} )\) with respect to \({\varvec{\mu}}\) to zero, we obtain the analytical solution, that is, the \(\mu {\text{ - update}}\):

$$\begin{gathered} \, \nabla_{{\varvec{\mu}}} {\varvec{L}}_{\rho } {(}{\varvec{\mu}}, \, {\mathbf{z}}^{k} , \, {\varvec{v}}^{k} {)} = 0 \hfill \\ \Rightarrow \nabla_{{\varvec{\mu}}} \left\{ {h({\varvec{\mu}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\mathbf{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\} = 0 \hfill \\ \Rightarrow \nabla_{{\varvec{\mu}}} \left\{ {({1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-\nulldelimiterspace} 2})\left\| {{\varvec{y}} \odot \left( {\user2{B\mu } - b_{2} {\mathbf{1}}_{n} } \right) - {\mathbf{1}}_{n} } \right\|_{2}^{2} + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}} - \, {\mathbf{z}}^{k} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}{ = 0} \hfill \\ \Rightarrow {\varvec{B}}_{y}^{T} {\varvec{B}}_{y} {\varvec{\mu}} - {\varvec{B}}_{y}^{T} \left( {b_{2} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho {\varvec{\mu}} - \rho \left( {{\mathbf{z}}^{k} - {\varvec{v}}^{k} } \right) = 0 \hfill \\ \Rightarrow {\varvec{\mu}}^{k + 1} = \left( {{\varvec{B}}_{y}^{T} {\varvec{B}}_{y} + \rho {\varvec{I}}_{d} } \right)^{ - 1} \left\{ {{\varvec{B}}_{y}^{T} \left( {b_{2} {\varvec{y}} + {\mathbf{1}}_{n} } \right) + \rho \left( {{\mathbf{z}}^{k} - {\varvec{v}}^{k} } \right)} \right\}{. } \hfill \\ \end{gathered}$$

The \(z{\text{ - minimization}}\) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm takes the form

$$\begin{gathered} {\mathbf{z}}^{k + 1} = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} {\varvec{L}}_{\rho } {(}{\varvec{\mu}}^{k + 1} , \, {\mathbf{z}}, \, {\varvec{v}}^{k} {)} \hfill \\ { } = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} \left\{ {l({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}}^{k + 1} - {\mathbf{z}} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\}. \hfill \\ \end{gathered}$$

Similarly, because \(S_{0} (C_{\mu } ) = \{ {\mathbf{z}} \in {\mathbb{R}}^{d} |\left\| {\mathbf{z}} \right\|_{0} \le C_{\mu } \}\) is a nonconvex set, we apply a projection step to the \(z{\text{ - minimization}}\) to obtain the \(z{\text{ - update}}\):

$$\begin{gathered} {\mathbf{z}}^{k + 1} = \mathop {\arg \min }\limits_{{{\mathbf{z}} \in {\mathbb{R}}^{d} }} \left\{ {I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) + \left( {{\rho \mathord{\left/ {\vphantom {\rho 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\left\| {{\varvec{\mu}}^{k + 1} - {\mathbf{z}} + {\varvec{v}}^{k} } \right\|_{2}^{2} } \right\} \hfill \\ \, \approx \mathop {\arg \min }\limits_{{{\mathbf{z}} \in S_{0} (C_{\mu } )}} \left\| {{\varvec{\mu}}^{k + 1} + {\varvec{v}}^{k} - {\mathbf{z}}} \right\|_{2}^{2} \hfill \\ \, = \Pi_{{S_{0} (C_{\mu } )}} ({\varvec{\mu}}^{k + 1} + {\varvec{v}}^{k} ), \hfill \\ \end{gathered}$$

with the indicator function \(I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) = 0\) for \({\mathbf{z}} \in S_{0} (C_{\mu } )\) and \(I_{{S_{0} (C_{\mu } )}} ({\mathbf{z}}) = + \infty\) for \({\mathbf{z}} \notin S_{0} (C_{\mu } )\). For any vector \({\mathbf{z}}\) (\({\mathbf{z}} \in {\mathbb{R}}^{d}\)), the projection operator \(\Pi_{{S_{0} (C_{\mu } )}} ({\mathbf{z}})\) can be carried out by sorting the elements of the vector \({\mathbf{z}}\) in descending order of their absolute values and setting all elements to zero except the largest \(C_{\mu }\) elements.

Finally, the \(v{\text{ - update}}\) step accumulates the constraint residuals \({\varvec{\mu}}^{k + 1} - {\mathbf{z}}^{k + 1}\) of the ADMM problem (20), which drives the ADMM updates (24), (25), and (23) toward convergence.
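
Because the \(\lambda {\text{ - step}}\) and \(\mu {\text{ - step}}\) share the same structure (a least-squares fit under an \(\ell_{0}\) constraint), the whole alternation can be sketched with a single generic ADMM routine. The sketch below is ours, not the authors' released MATLAB implementation; the kernel choice, the biases \(b_{1}\) and \(b_{2}\), sparsity levels, iteration counts, and all function names are illustrative assumptions.

```python
import numpy as np

def featurewise_kernel(X, gamma=1.0):
    """Per-feature Gaussian kernel tensor K[j, i, m] = kappa(x_jm, x_im) (an assumed choice)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-gamma * diff ** 2)

def admm_l0_ls(M, target, C, rho=1.0, iters=50):
    """Generic ADMM for  min 0.5*||M w - target||^2  s.t.  ||w||_0 <= C  (scaled form).

    The lambda-step (M = A_y, target = b1*y + 1) and the mu-step
    (M = B_y, target = b2*y + 1) both reduce to this template."""
    p = M.shape[1]
    w, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    lhs = M.T @ M + rho * np.eye(p)                  # fixed across inner iterations
    for _ in range(iters):
        w = np.linalg.solve(lhs, M.T @ target + rho * (z - u))        # w-update
        v = w + u
        z = np.zeros(p)
        keep = np.argsort(np.abs(v))[-C:]                             # projection onto the l0 ball
        z[keep] = v[keep]
        u = u + w - z                                                 # dual update
    return z                                                          # sparse iterate

def em2nolc_sketch(X, y, b1=0.0, b2=0.0, C_lam=10, C_mu=2, outer=5, gamma=1.0):
    """Outer alternation between the lambda-step and the mu-step (simplified)."""
    n, d = X.shape
    K = featurewise_kernel(X, gamma)                 # (n, n, d) per-feature kernel values
    lam, mu = np.zeros(n), np.full(d, 1.0 / d)
    ones = np.ones(n)
    for _ in range(outer):
        A = K @ mu                                   # row-wise multi-kernel matrix, (n, n)
        lam = admm_l0_ls(y[:, None] * A, b1 * y + ones, C_lam)        # lambda-step
        B = np.einsum('j,jim->im', lam, K)           # column-wise multi-kernel matrix, (n, d)
        mu = admm_l0_ls(y[:, None] * B, b2 * y + ones, C_mu)          # mu-step
    return lam, mu
```

A usage line such as `lam, mu = em2nolc_sketch(X, y)` for labels \(y_{i} \in \{ - 1, + 1\}\) returns the sparse coefficient and kernel weight vectors, whose nonzero entries mark the important instances and features discussed in the paper.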

Appendix B

2.1 KS curves in Fig. 2

See Fig. 2.

Fig. 2

KS curves of four ADMM classifiers on four medical test sets (the upper solid lines for AP and the lower dotted lines for NP)

2.2 ROC curves in Fig. 3

See Fig. 3.

Fig. 3

ROC curves of four ADMM classifiers on four medical test sets

2.3 IIs for EM2NOLC in Fig. 4

See Fig. 4.

Fig. 4

Important instances (IIs) with their percentage values of II (%) given by EM2NOLC on four training sets

2.4 IFs for EM2NOLC in Fig. 5

See Fig. 5.

Fig. 5

Important features (IFs) with their percentage values of FI (%) given by EM2NOLC on four training sets

2.5 Correlation analysis of selected features in Fig. 6

See Fig. 6.

Fig. 6

Correlation analysis of selected features across folds for the GC dataset

2.6 Correlation analysis of selected features in Fig. 7

See Fig. 7.

Fig. 7

Correlation analysis of selected features across folds for the MSJ dataset


About this article


Cite this article

Zhang, Z., He, J., Cao, J. et al. An explainable multi-sparsity multi-kernel nonconvex optimization least-squares classifier method via ADMM. Neural Comput & Applic 34, 16103–16128 (2022). https://doi.org/10.1007/s00521-022-07282-6

