Robust variable structure discovery based on tilted empirical risk minimization

Abstract

Robust group lasso regression plays an important role in high-dimensional regression modeling, such as biological data analysis for disease diagnosis and gene expression. However, most existing methods are optimized with prior knowledge of the variable structure under the traditional empirical risk minimization (ERM) framework, in which the estimators depend excessively on prior structure information and are sensitive to outliers. To address this issue, we propose a new robust variable structure discovery method for group lasso based on a convergent bilevel optimization framework. We adopt tilted empirical risk minimization (TERM) as the target function to improve the robustness of the estimator by assigning less weight to outliers. Moreover, we modify the TERM objective function so that its Fenchel conjugate can be calculated while its robustness is maintained, which is proven both theoretically and empirically. Experimental results on both synthetic and real-world datasets show that the proposed method improves robustness in prediction and variable structure discovery compared with existing techniques.

Notes

  1. The code can be downloaded from https://github.com/zhangxuelincode/demoTERMGL

  2. The data can be downloaded from https://adni.loni.usc.edu/

References

  1. Yu X, Sun Y, Zhou H-J (2021) An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression. Sci Rep 11(1):1–13

  2. Wilms I, Croux C (2016) Robust sparse canonical correlation analysis. BMC Syst Biol 10(1):1–13

  3. Wang Y, Li X, Ruiz R (2018) Weighted general group lasso for gene selection in cancer classification. IEEE Trans Cybern 49(8):2860–2873

  4. He H, Guo X, Yu J, Ai C, Shi S (2022) Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 38(6):1542–1549

  5. Liu X, Goncalves AR, Cao P, Zhao D, Banerjee A (2018) Modeling Alzheimer's disease cognitive scores using multi-task sparse group lasso. Comput Med Imaging Graph 66:100–114

  6. Cao P, Shan X, Zhao D, Huang M, Zaiane O (2017) Sparse shared structure based multi-task learning for MRI based cognitive performance prediction of Alzheimer's disease. Pattern Recogn 72:219–235

  7. Liu X, Cao P, Wang J, Kong J, Zhao D (2019) Fused group lasso regularized multi-task feature learning and its application to the cognitive performance prediction of Alzheimer's disease. Neuroinformatics 17(2):271–294

  8. Oliveira SHG, Gonçalves AR, Von Zuben FJ (2019) Group lasso with asymmetric structure estimation for multi-task learning. In: Proceedings of the 28th international joint conference on artificial intelligence, pp 3202–3208

  9. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67

  10. Ma S, Song X, Huang J (2007) Supervised group lasso with applications to microarray data analysis. BMC Bioinforma 8(1):1–17

  11. Zhang HH, Cheng G, Liu Y (2011) Linear or nonlinear? Automatic structure discovery for partially linear models. J Am Stat Assoc 106(495):1099–1112

  12. Yang G, Sun Y, Cui X (2017) Automatic structure discovery for varying-coefficient partially linear models. Commun Stat-Theory Methods 46(15):7703–7716

  13. Hernández-Lobato D, Hernández-Lobato JM (2013) Learning feature selection dependencies in multi-task learning. Adv Neural Inf Process Syst 26

  14. Pan C, Zhu M (2017) Group additive structure identification for kernel nonparametric regression. Adv Neural Inf Process Syst 30

  15. Frecon J, Salzo S, Pontil M (2018) Bilevel learning of the group lasso structure. Adv Neural Inf Process Syst 31:8301–8311

  16. Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M (2018) Bilevel programming for hyperparameter optimization and meta-learning. In: International conference on machine learning, pp 1568–1577

  17. Shaban A, Cheng C-A, Hatch N, Boots B (2019) Truncated back-propagation for bilevel optimization. In: The 22nd international conference on artificial intelligence and statistics, pp 1723–1732

  18. Feurer M, Hutter F (2019) Hyperparameter optimization. Springer, Cham, pp 3–33

  19. Ji K, Yang J, Liang Y (2021) Bilevel optimization: convergence analysis and enhanced design. In: International conference on machine learning, pp 4882–4892

  20. Sun H, Pu W, Fu X, Chang T-H, Hong M (2022) Learning to continuously optimize wireless resource in a dynamic environment: a bilevel optimization perspective. IEEE Trans Signal Process 70:1900–1917

  21. Liu H, Simonyan K, Yang Y (2019) DARTS: differentiable architecture search. In: International conference on learning representations, pp 1–13

  22. Bertinetto L, Henriques JF, Torr P, Vedaldi A (2018) Meta-learning with differentiable closed-form solvers. In: International conference on learning representations, pp 1–15

  23. Rajeswaran A, Finn C, Kakade SM, Levine S (2019) Meta-learning with implicit gradients. Adv Neural Inf Process Syst 32

  24. Ji K, Lee JD, Liang Y, Poor HV (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. Adv Neural Inf Process Syst 33:11490–11500

  25. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288

  26. Feng Y, Huang X, Shi L, Yang Y, Suykens JA (2015) Learning with the maximum correntropy criterion induced losses for regression. J Mach Learn Res 16(1):993–1034

  27. Li Y, Liang M, Mao L, Wang S (2021) Robust estimation and variable selection for the accelerated failure time model. Stat Med 40(20):4473–4491

  28. Li T, Beirami A, Sanjabi M, Smith V (2020) Tilted empirical risk minimization. In: International conference on learning representations, pp 1–44

  29. Huber PJ (1973) Robust regression: asymptotics, conjectures and Monte Carlo. Ann Stat 799–821

  30. Sun Q, Zhou W-X, Fan J (2020) Adaptive Huber regression. J Am Stat Assoc 115(529):254–265

  31. Feng Y, Wu Q (2022) A statistical learning assessment of Huber regression. J Approx Theory 273:105660

  32. Chen S, Gong C, Yang J, Li X, Wei Y, Li J (2018) Adversarial metric learning. In: Proceedings of the 27th international joint conference on artificial intelligence, pp 2021–2027

  33. Sinha A, Malo P, Deb K (2017) A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Trans Evol Comput 22(2):276–295

  34. Sharma A (2020) Optimistic variants of single-objective bilevel optimization for evolutionary algorithms. Int J Comput Intell Appl 19(03):2050020

  35. Hong T, Zhao D, Zhang Y, Wang Z (2021) A bilevel voltage regulation operation for distribution systems with self-operated microgrids. IEEE Trans Smart Grid 13(2):1238–1248

  36. Condat L (2016) Fast projection onto the simplex and the ℓ1 ball. Math Prog 158(1):575–585

  37. Reddi SJ, Sra S, Poczos B, Smola AJ (2016) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv Neural Inf Process Syst 29:1145–1153

  38. Van Nguyen Q (2017) Forward-backward splitting with Bregman distances. Vietnam J Math 45(3):519–539

  39. Nikolova M, Chan RH (2007) The equivalence of half-quadratic minimization and the gradient linearization iteration. IEEE Trans Image Process 16(6):1623–1627

  40. He R, Zheng W-S, Tan T, Sun Z (2013) Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell 36(2):261–275

  41. Wang X, Chen H, Cai W, Shen D, Huang H (2017) Regularized modal regression with applications in cognitive impairment prediction. Adv Neural Inf Process Syst 30:1448–1458

  42. Huber PJ (1981) Robust statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, New York

  43. Yu Y-L, Aslan Ö, Schuurmans D (2012) A polynomial-time form of robust regression. Adv Neural Inf Process Syst 25

  44. Yao W, Li L (2014) A new regression model: modal linear regression. Scand J Stat 41(3):656–671

  45. Weiner MW, Aisen PS, Jack Jr CR, Jagust WJ, Trojanowski JQ, Shaw L, Saykin AJ, Morris JC, Cairns N, Beckett LA et al (2010) The Alzheimer's disease neuroimaging initiative: progress report and future plans. Alzheim Dement 6(3):202–211

  46. Yang J, Wang S, Wu T (2022) Maximum mutual information for feature extraction from graph-structured data: application to Alzheimer's disease classification. Appl Intell 1–17

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities of China (No. 2662022XXYJ005) and the National Natural Science Foundation of China (No. 12071166).

Author information

Corresponding author

Correspondence to Lingjuan Wu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Robustness analysis

In this section, we analyze the robustness of the modified TERM objective function via its sensitivity curve with respect to the prediction error and via finite sample breakdown point theory.

1.1 A.1: Sensitivity curve and gradient analysis

The modified TERM objective function is defined as \(\phi _{t}(e)=\frac {1}{t} \exp \{t e^{2}\}\) with negative t, where e represents the prediction error \(y_{i} - {x_{i}^{T}} \beta \). Inspired by [26], we draw the sensitivity curve by analyzing its derivative \(2e \cdot \exp \{t e^{2}\}\), as shown in Fig. 12.

Fig. 12 Derivative of ϕt(e) with respect to e. The derivative of the squared loss function is also provided for comparison

Compared with the squared loss, under which samples with larger prediction errors have a greater impact on the weight updates, the modified TERM formulation assigns small weights to atypical samples, making the method more robust to outliers. Moreover, the behavior of the modified TERM is governed by t: the function converges slowly when t takes a sufficiently large negative value (e.g., \(t \rightarrow -\infty \)), whereas outliers still affect the model when t is a small negative value (e.g., \(t \rightarrow 0^{-}\)). The optimal value of t is therefore task-specific, as discussed in the simulation experiments in Section 4.1.3. In general, the influence of outliers can be magnified (t > 0) or suppressed (t < 0) by tuning t. A theoretical analysis via the gradient is given in Lemma 3.
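
As a quick numerical illustration of this suppression effect (not taken from the paper's experiments), the following sketch evaluates the derivative 2e·exp{te²} against the squared-loss derivative 2e for a few error magnitudes:

```python
import numpy as np

# Illustrative comparison (not part of the original experiments): influence of the
# prediction error e under the modified TERM loss phi_t(e) = (1/t) * exp(t * e^2)
# with t < 0, versus the squared loss e^2.
errors = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
t = -0.5

term_derivative = 2 * errors * np.exp(t * errors ** 2)  # d/de phi_t(e) = 2e * exp(t e^2)
squared_derivative = 2 * errors                          # d/de e^2 = 2e

for e, d_term, d_sq in zip(errors, term_derivative, squared_derivative):
    print(f"e = {e:5.1f}   TERM derivative: {d_term:10.4f}   squared-loss derivative: {d_sq:6.1f}")
# For large |e| the TERM derivative vanishes (outliers are suppressed), while the
# squared-loss derivative keeps growing linearly with the error.
```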

The gradient of the modified objective function is a weighted average of the individual loss gradients \(\nabla _{\beta } f\left (x_{i} ; \beta \right )\). In addition, the weight wi(t; β) is determined by the hyperparameter t and by the loss \(f\left (x_{i} ; \beta \right )\) of each sample. Specifically, when t < 0, samples with larger loss values are suppressed by being assigned smaller weights.

Lemma 3

For a smooth loss function \(f\left (x_{i} ; \beta \right )\), define

$$ \tilde{R}(t ; \beta)= \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{t} e^{t f\left( x_{i} ; \beta\right)}. $$

Then

$$ \nabla_{\beta} \tilde{R}(t ; \beta)= \frac{1}{n} \sum\limits_{i=1}^{n} w_{i}(t ; \beta) \nabla_{\beta} f\left( x_{i} ; \beta\right), \quad \text{where } w_{i}(t ; \beta) = e^{t f\left( x_{i} ; \beta\right)}. $$

Proof

Here, we provide the proof:

$$ \begin{array}{@{}rcl@{}} \nabla_{\beta} \tilde{R}(t ; \beta) &=\nabla_{\beta}\left\{\frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{t} e^{t f\left( x_{i} ; \beta\right)}\right\} \\ &=\frac{1}{n} \sum\limits_{i=1}^{n} e^{t f\left( x_{i} ; \beta\right)} \nabla_{\beta} f\left( x_{i} ; \beta\right). \end{array} $$

When t is positive, TERM focuses on samples with relatively large losses (rare instances); thus, a positive t can be adopted when rare instances are important for estimation, which aims to reduce variance, as proven in [28]. In summary, the modified TERM function without the \(\log \) operator can be regarded as an alternative loss that magnifies samples with large losses when t > 0 and suppresses outliers when t < 0. In this paper, we focus on robust performance in nonoverlapping group lasso and variable structure discovery applications, where t < 0.
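
To make Lemma 3 concrete, the following minimal sketch (an illustration, not the authors' released code) computes the gradient of the modified TERM objective for squared per-sample losses and shows how gross outliers are down-weighted when t < 0:

```python
import numpy as np

# Gradient of Lemma 3 for squared per-sample losses f_i = (y_i - x_i^T beta)^2:
# a weighted average of the per-sample gradients with unnormalized weights
# w_i = exp(t * f_i).
def modified_term_gradient(X, y, beta, t):
    residuals = y - X @ beta                           # prediction errors e_i
    losses = residuals ** 2                            # per-sample losses f_i
    weights = np.exp(t * losses)                       # w_i(t; beta) = exp(t * f_i)
    per_sample_grads = -2.0 * residuals[:, None] * X   # grad_beta f_i
    return (weights[:, None] * per_sample_grads).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
y[:5] += 20.0                                          # a few gross outliers

grad = modified_term_gradient(X, y, np.zeros(5), t=-0.5)
# With t < 0 the outliers receive weights exp(t * f_i) close to zero and barely
# influence the gradient; as t -> 0^- the weights approach 1 (ordinary ERM).
```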

1.2 A.2: Finite sample breakdown point analysis

We are interested in studying the behavior of the modified TERM estimator when the dataset is contaminated with outliers. To further discuss the robustness of our estimator, we analyze its finite sample breakdown point.

Various definitions of the sample breakdown point have been proposed. In this work, we analyze the finite sample contamination breakdown point [42], which is used to quantify the proportion of bad data that an estimator can tolerate before returning arbitrary values.

Given original observations \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\), where \(s_{i}=\left (\boldsymbol {x}_{i}, y_{i}\right )\), denote T(S) as the minimizer of the following objective function:

$$ \min_{\beta} \frac{1}{n} \sum\limits_{i=1}^{n} \phi_{t}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right) $$
(13)

and

$$ \phi_{t}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right) = \frac{1}{t} e^{t\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right)^{2}}. $$
(14)

The original samples S are then corrupted by adding m extra arbitrary points \(\boldsymbol {S}^{\prime }=\left (s_{n+1}, \ldots , s_{n+m}\right )\). Thus, the new dataset \(\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\) contains a fraction ε = m/(m + n) of bad samples. The finite sample contamination breakdown point ε* is defined as

$$ \varepsilon^{*}=\min_{1 \leq m \leq n}\left\{\frac{m}{n+m}: \sup_{\boldsymbol{S}^{\prime}}\left\|T\left( \boldsymbol{S} \cup \boldsymbol{S}^{\prime}\right)\right\|=\infty\right\}, $$

where ∥⋅∥ is the Euclidean norm.

Theorem 2

Given original observations \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\), suppose \(T(\boldsymbol{S})=\hat {\beta }\) is the minimizer of (13). Let \(\phi _{t}(v)=\frac {1}{t} e^{t v^{2}}\) with t < 0, and denote \(M= t {\sum }_{i=1}^{n} \phi _{t}\left (y_{i}-\boldsymbol {x}_{i}^{T} \hat {\beta }\right )\). Then, the finite sample contamination breakdown point of the modified TERM framework is

$$ \varepsilon^{*}(\boldsymbol{S}, T)=\frac{m^{*}}{n+m^{*}}, $$
(15)

where m* denotes the minimal number of abnormal samples that can break the estimator and satisfies

$$ \lceil M\rceil \leq m^{*} \leq\lfloor M\rfloor+1, $$

where ⌈M⌉ denotes the smallest integer that is not less than M (the ceiling), and ⌊M⌋ the largest integer that is not greater than M (the floor).

Proof

Given the modified TERM objective ϕt(⋅), the original n samples \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\) and m extra arbitrary samples \(\boldsymbol {S}^{\prime }=\left (s_{n+1}, \ldots , s_{n+m}\right )\), denote T as the estimator. Let ϕ*(⋅) = tϕt(⋅) and \(M={\sum }_{i=1}^{n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} \hat {{\beta }}\right )\), where \(\hat {{\beta }}= T(\boldsymbol {S})\). In addition, ϕ*(⋅) is symmetric about 0, attains its maximum \(\max \limits _{v} \phi ^{*}(v)=\phi ^{*}(0)=1\), decreases monotonically as |v| increases, and satisfies \(\lim _{\vert v\vert \rightarrow \infty } \phi ^{*}(v)=0\).

First, we prove that \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) remains bounded if m < M. Let ξ > 0 be such that m + nξ < M, and let C be such that ϕ*(v) ≤ ξ for |v|≥ C. Let β be any real vector such that \(\vert y-\boldsymbol{x}^{T} \beta \vert \geq C\) for all samples in S. Then, we have

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} \hat{{\beta}} \right) \geq M $$
(16)

and

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} {\beta}\right) \leq m + n \xi. $$
(17)

From (16) and (17), and since t < 0 implies that minimizing (13) over \(\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\) is equivalent to maximizing \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} \beta \right )\), the minimizer satisfies \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \right ) \geq M\) and therefore \( \vert y-\boldsymbol {x}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \vert <C\) for at least one point in S. Thus, \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) is bounded in the case of m < M.

Second, we prove that the estimator may break down if m > M. Similarly, let ξ > 0 be such that m > M + mξ, and let C be such that ϕ*(v) ≤ ξ for |v|≥ C. Assume that all arbitrary points \(\left \{\left (\boldsymbol {x}_{n+1}, y_{n+1}\right ), \ldots ,\left (\boldsymbol {x}_{n+m}, y_{n+m}\right )\right \}\) in \(\boldsymbol {S}^{\prime }\) are identical and satisfy a linear relationship \(y_{n+1} = \boldsymbol{x}_{n+1}^{T} \beta^{*}\) for some \(\beta^{*}\). Let β be any vector such that \(\vert y_{n+1}-\boldsymbol {x}_{n+1}^{T} \beta \vert >C\). Then, we have

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} {\beta}^{*}\right) \geq m $$
(18)

and

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta\right) \leq M+m \xi. $$
(19)

From (18) and (19), and by the same maximization argument, \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) must satisfy \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \right ) \geq m\), and hence \( \vert y_{n+1}-\boldsymbol {x}_{n+1}^{T} T\left (\boldsymbol{S} \cup \boldsymbol{S}^{\prime }\right ) \vert \leq C\). Then \(\left \|T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\right \|\) goes to infinity when we let \(y_{n+1} \rightarrow \infty \) and fix \(\boldsymbol{x}_{n+1}\). Thus, we obtain the breakdown point. □

The above theorem shows that the breakdown point depends not only on the objective function ϕt(⋅) with its tuning parameter t, but also on the sample configuration. However, as pointed out in [42, 44], the breakdown point is empirically quite high when the scale (contained in the hyperparameter t of TERMGL) is determined from the dataset itself.
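
The following sketch illustrates how Theorem 2 can be evaluated in practice; it assumes that an estimate \(\hat{\beta}\) (denoted beta_hat) minimizing (13) has already been obtained by some solver:

```python
import numpy as np

# Illustrative evaluation of Theorem 2: estimate the finite sample contamination
# breakdown point of the modified TERM estimator from fitted residuals.
def contamination_breakdown_point(X, y, beta_hat, t):
    assert t < 0
    residuals = y - X @ beta_hat
    M = np.sum(np.exp(t * residuals ** 2))   # M = t * sum_i phi_t(e_i) = sum_i exp(t * e_i^2)
    m_star = int(np.floor(M)) + 1            # upper end of ceil(M) <= m* <= floor(M) + 1
    n = len(y)
    return m_star / (n + m_star)
```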

Appendix B: Optimization details

1.1 B.1: Details for updating w

Denote the dual variable \({w}=\left ({w}_{1}, \ldots , {w}_{{L}}\right ) \in \mathbb {R}^{{P} \times {L}}\), where \({w}_{l}=\left ({w}_{1 l}, {w}_{2 l}, \ldots , {w}_{P l}\right )^{{T}} \in \mathbb {R}^{{P}}\). According to the Fenchel-Rockafellar duality theorem, the dual problem of (9) can be formulated as

$$ \hat{w}=\underset{w \in \mathbb{R}^{P \times L}}{\text{argmin}}\left\{\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w\right) + {\Omega}^{*}(w)\right\}. $$
(20)

Moreover, this is a smooth constrained convex optimization problem. Here \({\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} w\right )\) is the Fenchel conjugate of \({\mathscr{L}}(\beta )\), and \(\mathcal {T}_{\theta }^{*} w={\sum }_{l=1}^{L} \mathcal {T}_{\theta _{l}} w_{l} \in \mathbb {R}^{P}\). \({\Omega }^{*}(w)={\sum }_{l=1}^{L} \delta \left (w_{l}\right )\) is the Fenchel conjugate of Ω(β), where the indicator function \(\delta \left (w_{l}\right )\) satisfies \(\delta \left (w_{l}\right )=0\) if \(\left \|w_{l}\right \|_{2} \leq \eta \), and \(+\infty \) otherwise.

Since \({\mathscr{L}}(\beta)\) is ε-strongly convex, its conjugate \({\mathscr{L}}^{*}\) is differentiable everywhere with \(\varepsilon^{-1}\)-Lipschitz continuous gradient, and \(\nabla _{{w}}\left [{\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} {~w}\right )\right ]=-\left (\mathcal {T}_{\theta _{l}} \nabla {\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} {~w}\right )\right )_{1 \leq l \leq {L}} \in \mathbb {R}^{{P} \times {L}}\) is \(\left \|\left (\mathcal {T}_{\theta _{l}} \cdot \right )_{1 \leq l \leq {L}}\right \|_{2}^{2} \varepsilon ^{-1}\)-Lipschitz continuous. Using the properties of conjugate functions, we obtain the following primal-dual link after Q iterations:

$$ \begin{array}{@{}rcl@{}} \beta&=&\nabla \mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(Q)}\right)\\ &=&\left( \nabla \mathcal{L}\right)^{-1}\left( -\mathcal{T}_{\theta}^{*} w^{(Q)}\right)\\ &=&\left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(Q)}\right). \end{array} $$
(21)

We next apply the forward-backward splitting method with the Bregman proximity operator \(\text {prox}_{{\Omega }^{*}({w})}^{\varphi }\) to the dual problem in (20), where φ is the separable Hellinger-like function

$$ \begin{array}{@{}rcl@{}} \varphi(w)&=&\sum\limits_{l=1}^{L} \varphi_{l}\left( w_{l}\right)\\ &=&-\sum\limits_{l=1}^{L} \sqrt{\eta^{2} -\left\|w_{l}\right\|_{2}^{2}}, \text { s.t. }\left\|w_{l}\right\|_{2} \leq \eta. \end{array} $$
(22)

Its Fenchel conjugate function is as follows:

$$ \varphi^{*}(w)=\sum\limits_{l=1}^{L} \varphi_{l}^{*}\left( w_{l}\right)={\sum}_{l=1}^{L} \eta \sqrt{1+\left\|w_{l}\right\|_{2}^{2}}. $$
(23)

By computing directly, we obtain

$$ \nabla \varphi({w})=\left( \frac{{w}_{l}}{\sqrt{\eta^{2}-\left\|w_{l}\right\|_{2}^{2}}}\right)_{1 \leq l \leq {L}}, $$
(24)

and

$$ \nabla \varphi^{*}({w})=\left( \frac{\eta {w}_{l}}{\sqrt{1+\left\|{w}_{l}\right\|_{2}^{2}}}\right)_{1 \leq l \leq {L}}. $$
(25)

Let q denote the iteration index and γ the step size. To solve the dual problem in (20), we update w, the dual variable of the coefficient β, by the following iterative steps:

$$ \begin{array}{@{}rcl@{}} w^{(q+1)} &=&\operatorname{prox}_{{\Omega}^{*}(w)}^{\varphi}\left( \nabla \varphi\left( \!w^{(q)}\!\right)-\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( \!-\mathcal{T}_{\theta}^{*} w^{(q)}\!\right)\!\right]\right) \\ &=&\underset{w \in \mathbb{R}^{P \times L}}{\operatorname{argmin}} \varphi(w)+{\Omega}^{*}(w)-\left\langle\nabla \varphi\left( w^{(q)}\right)\right.\\ &&-\left.\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w\right\rangle \\ &=&\underset{\left\|w_{l}\right\|_{2} \leq \eta , l=1, \ldots, L}{\operatorname{argmin}} \varphi(w)-\left\langle\nabla \varphi\left( w^{(q)}\right)\right.\\ &&-\left.\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w\right\rangle \\ &=&\underset{\left\|w_{l}\right\|_{2} \leq \eta , l=1, \ldots, L}{\operatorname{argmin}} {\sum}_{l=1}^{L}\left\{\varphi_{l}\left( w_{l}\right)-\left\langle\nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&-\left.\left.\gamma \nabla_{w_{l}}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w_{l} \right>\right\} \\ &=&\left( \nabla \varphi_{l }^{* } \left( \nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&-\left.\left.\gamma \nabla_{w_{l}}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right]\right)\right)_{1 \leq l \leq L}\quad \in \quad \!\!\mathbb{R}^{P \times L} \\ &=&\left( \nabla \varphi_{l }^{* } \left( \nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&+\!\left.\left.\gamma \mathcal{T}_{\theta}\!\left( \widetilde{{X}}^{T} \widetilde{{X}}\!+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{{X}}^{T} \tilde{y}-\mathcal{T}_{\theta}^{*} w^{(q)}\!\right)\!\right)\!\right)_{\!1 \leq l \leq L} \end{array} $$
Algorithm 1 Updating β with HQ-DFBB.

The last step above for updating the dual variable w is the update employed in HQ-DFBB, as summarized in Algorithm 1, and a minimal sketch of this dual step is given below; the next subsection details the update of 𝜃 for the upper level problem.
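
The sketch below is an illustrative implementation of one such dual step, not the released code; it assumes, as in BiGL [15], that each \(\mathcal{T}_{\theta_l}\) acts as the diagonal matrix formed from the l-th column of θ, so that \(\mathcal{T}_{\theta}^{*} w = {\sum}_{l} \operatorname{diag}(\theta_{l}) w_{l}\):

```python
import numpy as np

# One Bregman forward-backward step on the dual problem (20), with the
# Hellinger-like mirror maps (24)-(25) and the primal-dual link (21).
# X_tilde, y_tilde are the HQ-reweighted data; eta, gamma, eps are hyperparameters.
def dual_update(w, theta, X_tilde, y_tilde, eta, gamma, eps):
    P, L = w.shape
    A_inv = np.linalg.inv(X_tilde.T @ X_tilde + eps * np.eye(P))
    Tw = (theta * w).sum(axis=1)                      # T_theta^* w = sum_l diag(theta_l) w_l
    beta = A_inv @ (X_tilde.T @ y_tilde - Tw)         # primal-dual link (21)
    w_new = np.empty_like(w)
    for l in range(L):
        grad_phi = w[:, l] / np.sqrt(eta ** 2 - np.sum(w[:, l] ** 2))   # nabla phi_l, eq. (24)
        v = grad_phi + gamma * theta[:, l] * beta                       # forward (gradient) step
        w_new[:, l] = eta * v / np.sqrt(1.0 + np.sum(v ** 2))           # nabla phi_l^*, eq. (25)
    return w_new, beta
```

Each step performs the forward (gradient) move in the mirror coordinates ∇φ and maps back through ∇φ*, which automatically keeps \(\|w_{l}\|_{2} < \eta\).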

1.2 B.2: Details for updating 𝜃

Recalling the definitions given in Sections 3 and 4, we summarize the simplified bilevel problem. Upper level problem:

$$ \begin{array}{@{}rcl@{}} &&\hat{\theta} \in \underset{\theta \in {\Theta}}{\arg \min } \frac{1}{K}{\sum}_{k=1}^{K} U\left( \hat{\beta}^{(k)}(\theta)\right) \text { with } \\ &&U\left( \hat{\beta}^{(k)}(\theta)\right) = \frac{1}{t} \log \left( \frac{1}{n} {\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \widehat{\beta}^{(k)}(\theta)\right)}\right). \end{array} $$
(26)

Lower level problem:

$$ \min_{\beta \in \mathbb{R}^{P}}\underbrace{\frac{1}{2}\|\tilde{y}-\tilde{X} \beta\|_{2}^{2}+\frac{\varepsilon}{2}\|\beta\|_{2}^{2}}_{\mathcal{L}(\beta)}+\underbrace{\eta \sum\limits_{l=1}^{L}\left\|\left( \mathcal{T}_{\theta_{l}} \beta\right)\right\|_{2}}_{\Omega\left( \mathcal{T}_{\theta} \beta\right)}. $$

To update 𝜃, we compute the partial derivative defined in Section 3.1 with the following backward gradient method,

$$ \frac{\partial U(\hat{\beta}(\theta))}{\partial \theta}=\left( \frac{d \hat{\beta}(\theta)}{d \theta}\right)^{T} \frac{\partial U(\hat{\beta}(\theta))}{\partial \beta}, $$

where

$$ \begin{array}{@{}rcl@{}} \frac{\partial U(\hat{\beta}(\theta) )}{\partial \beta} &=&\frac{\partial \frac{1}{t} \log \left( \frac{1}{n} {\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \widehat{\beta}^{(k)}(\theta)\right)}\right)}{\partial \beta}\\ &=&\frac{{\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)} \nabla_{\beta} f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)}{{\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)}}. \end{array} $$
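
In contrast to the unnormalized weights of Lemma 3, this upper-level gradient uses normalized (tilted softmax) weights. A minimal sketch, assuming squared per-sample losses:

```python
import numpy as np

# Gradient of the upper-level TERM objective U = (1/t) * log((1/n) * sum_i exp(t * f_i))
# with respect to beta, assuming per-sample squared losses f_i = (y_i - x_i^T beta)^2.
def upper_level_gradient(X, y, beta, t):
    residuals = y - X @ beta
    losses = residuals ** 2                            # f_i
    z = t * losses
    weights = np.exp(z - z.max())                      # numerically stable exp(t * f_i)
    weights /= weights.sum()                           # normalized (tilted softmax) weights
    per_sample_grads = -2.0 * residuals[:, None] * X   # grad_beta f_i
    return weights @ per_sample_grads
```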

We next specify the implementation details of the partial derivative computation. For m = 0,…,M − 1 and q = 0,…,Q − 1, we denote the equations in Algorithm 2 as

$$ \mathcal{A}\left( \beta^{(m)}\right):=\left( -f^{\prime}\left( \left( y_{i}-X_{i}^{T} \beta^{(m)}\right)^{2}\right)\right)_{1 \leq i \leq n}^{T} :=\left( e^{t \left( y_{i}-X_{i}^{T} \beta^{(m)}\right)^{2}}\right)_{1 \leq i \leq n}^{T}, $$
$$ \mathcal{B}\left( b^{(m)}, w^{(m, Q)}, \theta\right):= \left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(m, Q)}\right), $$
$$ \mathcal{C}\left( b^{(m)}, w^{(m, q)}, \theta\right):=\left( \nabla \varphi_{l}^{*}\left( \nabla \varphi_{l}\left( w_{l}^{(m, q)}\right) + \gamma \mathcal{T}_{\theta_{l}}\left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(m+1, q)}\right)\right)\right)_{1 \leq l \leq L}. $$

Thus, \(\mathcal {A}\), \(\mathcal{B}\), and \(\mathcal {C}\) denote the updating functions for the HQ parameter b, the dual variable w, and the coefficient β, respectively. In contrast to BiGL [15], an additional step for updating the HQ parameter b is included. Algorithm 1 (HQ-DFBB) can then be summarized as follows:

$$ \left\{ \begin{array}{l} \text { Set } \beta^{(0)}(\theta) \equiv 0 \in \mathbb{R}^{P} \\ \text { for } m=0,1, \cdots, M-1 \\ \left\{\begin{array}{l} b^{(m+1)}=\mathcal{A}\left( \beta^{(m)}\right) \\ \text { Set } w^{(m+1,0)} \equiv 0 \in \mathbb{R}^{P \times L} \\ \text { for } q=0,1, \cdots, Q-1 \\ \left\{ w^{(m+1, q+1)}=\mathcal{B}\left( b^{(m+1)}, w^{(m+1, q)}, \theta\right) \right. \\ \beta^{(m+1)}=\mathcal{C}\left( b^{(m+1)}, w^{(m+1, Q)}, \theta\right) \end{array}\right.\\ \beta^{(M)}=\mathcal{C}\left( b^{(M)}, w^{(M, Q)}, \theta\right) \end{array} \right. $$
(27)
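
A structural sketch of (27) is given below; it reuses the dual_update routine sketched in B.1 and assumes HQ weights of the form \(b_i = e^{t e_i^{2}}\) (following \(\mathcal{A}\) above), so it illustrates the loop structure rather than the released implementation:

```python
import numpy as np

# Structural sketch of the HQ-DFBB iteration in (27): outer HQ reweighting,
# inner dual forward-backward loop, and the primal-dual link for beta.
# M_outer and Q_inner play the roles of M and Q.
def hq_dfbb(X, y, theta, t, eta, gamma, eps, M_outer=10, Q_inner=20):
    P = X.shape[1]
    L = theta.shape[1]
    beta = np.zeros(P)
    for m in range(M_outer):
        b = np.exp(t * (y - X @ beta) ** 2)      # A: half-quadratic weights
        X_tilde = np.sqrt(b)[:, None] * X        # reweighted design
        y_tilde = np.sqrt(b) * y                 # reweighted response
        w = np.zeros((P, L))
        for q in range(Q_inner):                 # inner dual forward-backward loop
            w, beta = dual_update(w, theta, X_tilde, y_tilde, eta, gamma, eps)
        # in this sketch, beta from the last inner step plays the role of beta^(m+1)
    return beta
```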

We focus on the computation of \(\left (\frac {d \hat {\beta }(\theta )}{d \theta }\right )^{T}\) and illustrate the recursion that links \(\frac{d \beta^{(m+1)}}{d \theta}\) to \(\frac{d \beta^{(m)}}{d \theta}\). We then analyze the backward gradient in detail.

For simplicity, similar to [15], we define the partial derivatives by using the following abbreviations with an additional computation for b:

$$ \frac{d \mathcal{A}^{(m)}}{d\beta}=\partial_{1} \mathcal{A}^{(m)}, $$
$$ \begin{array}{@{}rcl@{}} \frac{\partial \mathcal{B}^{(m, q)}}{\partial b}&:=&\partial_{1} \mathcal{B}^{(m, q)}, \quad\!\!\! \frac{\partial \mathcal{B}^{(m, q)}}{\partial w}:=\partial_{2} \mathcal{B}^{(m, q)}, \quad\!\!\! \frac{\partial \mathcal{B}^{(m, q)}}{\partial \theta}:=\partial_{3} \mathcal{B}^{(m, q)}, \\ \frac{\partial \mathcal{C}^{(m, Q)}}{\partial b}&:=&\partial_{1} \mathcal{C}^{(m, Q)},\!\!\! \quad \frac{\partial \mathcal{C}^{(m, Q)}}{\partial w}:=\partial_{2} \mathcal{C}^{(m, Q)},\!\!\! \quad \frac{\partial \mathcal{C}^{(m, Q)}}{\partial \theta}:=\partial_{3} \mathcal{C}^{(m, Q)}. \end{array} $$

From the simplified algorithms in (27), we can easily find that

$$ \frac{d \beta^{(m+1)}}{d \theta}=D_{m+1}+E_{m+1} \frac{d \beta^{(m)}}{d \theta}. $$
(28)

Here, \(D_{m+1}\) collects the partial derivatives with respect to 𝜃, and \(E_{m+1}\) propagates the dependence on \(\beta^{(m)}\) through the HQ parameter b.

We can further obtain the following equations:

$$ \begin{array}{@{}rcl@{}} D_{m+1}&=&\partial_{3} \mathcal{C}^{(m+1,Q)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{3} \mathcal{C}^{(m+1,Q-1)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} \partial_{3} \mathcal{C}^{(m+1,Q-2)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} \partial_{2} \mathcal{C}^{(m+1,Q-2)} \partial_{3} \mathcal{C}^{(m+1,Q-3)}\\ &&+\cdots\\ &&+ \partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} {\cdots} \partial_{2} \mathcal{C}^{(m+1,1)} \partial_{3} \mathcal{C}^{(m+1,0)}\\ \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} E_{m+1} &=& \partial_{1} \mathcal{C}^{(m+1,Q)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{1} \mathcal{B}^{(m+1,Q-1)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} \partial_{1} \mathcal{B}^{(m+1,Q-2)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} \partial_{2} \mathcal{B}^{(m+1,Q-2)} \partial_{1} \mathcal{B}^{(m+1,Q-3)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\cdots\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} {\cdots} \partial_{1} \mathcal{B}^{(m+1,0)} \partial_{1} \mathcal{A}^{(m)}.\\ \end{array} $$
Algorithm 2 Updating 𝜃 with Prox-SAGA.

Using the above results, we unroll the recursion in (28).

$$ \begin{array}{@{}rcl@{}} \left( \frac{d \hat{\beta}^{(M)}(\theta)}{d \theta}\right)^{T} &=&\underbrace{{{D_{M}^{T}}}}_{F_{M}}+ \left( \frac{d \hat{\beta}^{(M-1)}(\theta)}{d \theta}\right)^{T} \underbrace{{{E_{M}^{T}}}}_{G_{M}} \\ &=&\underbrace{{F_{M}+{D_{M-1}^{T}}G_{M}}}_{F_{M-1}}+ \left( \frac{d \hat{\beta}^{(M-2)}(\theta)}{d \theta}\right)^{T} \underbrace{{E_{M-1}^{T}}G_{M}}_{G_{M-1}}\\ &=&\underbrace{{F_{M-1}+{D_{M-2}^{T}}G_{M-1}}}_{F_{M-2}} + \left( \frac{d \hat{\beta}^{(M-3)}(\theta)}{d \theta}\right)^{T} \underbrace{{E_{M-2}^{T}}G_{M-1}}_{G_{M-2}}\\ &=&{\cdots} \\ &=&\underbrace{{D_{M}^{T}}+D_{M-1}^{T}G_{M}+D_{M-2}^{T}G_{M-1}+\cdots+{D_{1}^{T}}G_{2}}_{F_{1}}. \end{array} $$

Therefore, the hypergradient in (28) can be computed by accumulating the above terms backward, from m = M down to m = 1.
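
A compact sketch of this backward accumulation, taking the (hypothetical) Jacobian blocks \(D_m^{T}\) and \(E_m^{T}\) as given, is:

```python
# Backward accumulation of the recursion (28):
# d beta^(m+1)/d theta = D_{m+1} + E_{m+1} d beta^(m)/d theta, with d beta^(0)/d theta = 0.
# Ds_T and Es_T are lists of the matrices D_m^T and E_m^T for m = 1, ..., M.
def accumulate_hypergradient(Ds_T, Es_T):
    G = None        # running product G_{m+1}; None plays the role of the identity
    total = None
    for D_T, E_T in zip(reversed(Ds_T), reversed(Es_T)):   # m = M, M-1, ..., 1
        term = D_T if G is None else D_T @ G               # D_m^T G_{m+1}
        total = term if total is None else total + term
        G = E_T if G is None else E_T @ G                  # G_m = E_m^T G_{m+1}
    return total    # equals (d beta^(M)/d theta)^T
```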

Appendix C: Convergence analysis

For completeness, we briefly provide the proof of Theorem 1.

Proof

(a) Since the function J(β,b) is strongly convex w.r.t. β, we have

$$ J\left( \bar{\beta}^{(m+1)}, \bar{b}^{(m+1)}\right)<J\left( \bar{\beta}^{(m)}, \bar{b}^{(m+1)}\right) \leq J\left( \bar{\beta}^{(m)}, \bar{b}^{(m)}\right). $$

According to the continuity of J(β,b) and of b with respect to β, \(\left \{J\left (\beta ^{(m)}, b^{(m)}\right ), m=1,2, \ldots \right \}\) is a decreasing sequence (considering the case \(Q\rightarrow +\infty \)). Since J(β,b) is bounded below, \(\left \{J\left (\beta ^{(m)}, b^{(m)}\right ), m=1,2, {\ldots } M\right \}\) converges. Because \(J(\beta )=\min \limits _{b} J(\beta , b)\), we have \(J\left (\beta ^{(m)}\right )=n^{-1} J\left (\beta ^{(m)}, b^{(m+1)}\right )\) from the HQ optimization. Hence, the sequence \(\left \{J\left (\beta ^{(m)}\right ), m=1,2, \ldots \right \}\) also converges.

(b) Since J(β,b) is ε-strongly convex w.r.t. β, we have

$$ \begin{array}{@{}rcl@{}} &&{}J\left( \bar{\beta}^{(m+1)}, \bar{b}^{(m)}\right)-J\left( \bar{\beta}^{(m)}, \bar{b}^{(m)}\right) \geq G^{T}\left( \bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right)\\ &&\qquad\qquad\qquad\quad+\frac{\varepsilon}{2}\left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}, \end{array} $$

where G is the gradient of \(J\left (\bar {\beta }^{(m)}, \bar {b}^{(m)}\right )\) with respect to β and can be taken to be zero since \(\bar {\beta }^{(m)}=\arg \min \limits _{\beta } J\left (\beta , \bar {b}^{(m)}\right )\). Considering the case of \(m \rightarrow +\infty \), we have \(J\left (\bar {\beta }^{(m+1)}, \bar {b}^{(m)}\right )-J\left (\bar {\beta }^{(m)}, \bar {b}^{(m)}\right ) \rightarrow 0\) and \(\left \|\bar {\beta }^{(m+1)}-\bar {\beta }^{(m)}\right \|^{2} \rightarrow 0\). From Section 3, we conclude that

$$ \begin{array}{@{}rcl@{}} &&\left\|\beta^{(Q, m+1)}-\bar{\beta}^{(m+1)}\right\|^{2}+ \left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}\\ &&\quad+\left\|\bar{\beta}^{(m)}-\beta^{(Q, m)}\right\|^{2} \geq\left\|\beta^{(Q, m+1)}-\beta^{(Q, m)}\right\|^{2} \end{array} $$

and

$$ \frac{2 C}{Q}+\left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}\geq \left\|\beta^{(Q, m+1)}-\beta^{(Q, m)}\right\|^{2}. $$

Finally, we have \(\left \|\beta ^{(M)}-\beta ^{(M-1)}\right \|_{2}^{2} \rightarrow 0\) as \(m, Q \rightarrow +\infty \).

(c) Based on Theorem 3.1 in [38], the sequence \(\left \{w^{(m, Q)}\right \}_{Q \in \mathbb {N}}\) generated by the HQ-DFBB algorithm converges to \(\hat {w}^{(m)}\) as \(Q \rightarrow +\infty \).

We denote \(b^{*}\), \(\hat {w}^{*}\), \(\beta^{*}\), \(\widetilde {X}^{*}=\sqrt {b^{*}} \odot X\) and \(\widetilde {y}^{*}=\sqrt {b^{*}} \odot y\) as the limits generated by Algorithm 1 as \(m, Q \rightarrow +\infty \). When \(Q, m \rightarrow +\infty \), b(m) converges because the weight b is continuously differentiable with respect to β. We then analyze the primal-dual link function in the case of \(Q, m \rightarrow +\infty \),

$$ \beta^{*}=\left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{* T} \widetilde{y}^{*}-\mathcal{T}_{\theta}^{*} \hat{w}^{*}\right). $$

We can rewrite the above formula as

$$ \left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\mathcal{T}_{\theta}^{*} \hat{w}^{*}=0. $$
(29)

From the definitions of Ω(β) and its dual function Ω*(w), we have \(\left \|w_{l}^{(m, q)}\right \| \leq \eta \) for m ∈{1,…,M}, l ∈{1,…,L}, q ∈{1,…,Q}. Therefore, in the case of \(Q, m \rightarrow +\infty \), we have \(\left \|\hat{w}_{l}^{*}\right \| \leq \eta \). Let \(w_{l}^{*}=\frac {1}{\eta } \hat {w}_{l}^{*}\); then \(\left \|w_{l}^{*}\right \| \leq 1\). Equation (29) can be rewritten as

$$ \left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\sum\limits_{l=1}^{L} \eta \mathcal{T}_{\theta_{l}} w_{l}^{*}=0. $$

Computing the subdifferential of Ω(β), we have

$$ \partial {\Omega}(\beta)=\eta \sum\limits_{l=1}^{L} \mathcal{T}_{\theta_{l}} \delta_{l}, $$

where δl satisfies \(\left \|\delta _{l}\right \|=1\) if \(\mathcal {T}_{\theta _{l}} \beta \neq 0\), and \(\left \|\delta _{l}\right \| \leq 1\) otherwise (1 ≤ l ≤ L). Hence,

$$ \begin{array}{@{}rcl@{}} 0&=&\left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\sum\limits_{l=1}^{L} \eta \mathcal{T}_{\theta_{l}} w_{l}^{*}\\ &\in& \nabla \mathcal{L}\left( \beta^{*}\right)+\partial {\Omega}\left( \beta^{*}\right). \end{array} $$

In summary, we obtain \(0 \in \partial _{\beta } J\left (\beta ^{*}, b^{*}\right )\). Finally, \(\beta^{*}\) is the unique minimizer of \(J\left (\beta , b^{*}\right )\) since J(β,b) is strongly convex with respect to β. □

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Zhang, X., Wang, Y., Zhu, L. et al. Robust variable structure discovery based on tilted empirical risk minimization. Appl Intell 53, 17865–17886 (2023). https://doi.org/10.1007/s10489-022-04409-z
