Robust variable structure discovery based on tilted empirical risk minimization

Abstract

Robust group lasso regression plays an important role in high-dimensional regression modeling, such as biological data analysis for disease diagnosis and gene expression. However, most existing methods are optimized with prior knowledge of the variable structure under the traditional empirical risk minimization (ERM) framework, in which the estimators depend excessively on prior structure information and are sensitive to outliers. To address this issue, we propose a new robust variable structure discovery method for group lasso based on a convergent bilevel optimization framework. We adopt tilted empirical risk minimization (TERM) as the target function to improve the robustness of the estimator by assigning less weight to outliers. Moreover, we modify the TERM objective function so that its Fenchel conjugate can be calculated while its robustness is maintained, which is proven both theoretically and empirically. Experimental results on both synthetic and real-world datasets show that the proposed method improves robustness in prediction and variable structure discovery compared with existing techniques.

Notes

  1. The code can be downloaded from https://github.com/zhangxuelincode/demoTERMGL

  2. The data can be downloaded from https://adni.loni.usc.edu/

References

  1. Yu X, Sun Y, Zhou H-J (2021) An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression. Sci Rep 11(1):1–13

  2. Wilms I, Croux C (2016) Robust sparse canonical correlation analysis. BMC Syst Biol 10(1):1–13

  3. Wang Y, Li X, Ruiz R (2018) Weighted general group lasso for gene selection in cancer classification. IEEE Trans Cybern 49(8):2860–2873

  4. He H, Guo X, Yu J, Ai C, Shi S (2022) Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 38(6):1542–1549

  5. Liu X, Goncalves AR, Cao P, Zhao D, Banerjee A (2018) Modeling Alzheimer's disease cognitive scores using multi-task sparse group lasso. Comput Med Imaging Graph 66:100–114

  6. Cao P, Shan X, Zhao D, Huang M, Zaiane O (2017) Sparse shared structure based multi-task learning for MRI based cognitive performance prediction of Alzheimer's disease. Pattern Recogn 72:219–235

  7. Liu X, Cao P, Wang J, Kong J, Zhao D (2019) Fused group lasso regularized multi-task feature learning and its application to the cognitive performance prediction of Alzheimer's disease. Neuroinformatics 17(2):271–294

  8. Oliveira SHG, Gonçalves AR, Von Zuben FJ (2019) Group lasso with asymmetric structure estimation for multi-task learning. In: Proceedings of the 28th international joint conference on artificial intelligence, pp 3202–3208

  9. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67

  10. Ma S, Song X, Huang J (2007) Supervised group lasso with applications to microarray data analysis. BMC Bioinforma 8(1):1–17

  11. Zhang HH, Cheng G, Liu Y (2011) Linear or nonlinear? Automatic structure discovery for partially linear models. J Am Stat Assoc 106(495):1099–1112

  12. Yang G, Sun Y, Cui X (2017) Automatic structure discovery for varying-coefficient partially linear models. Commun Stat-Theory Methods 46(15):7703–7716

  13. Hernández-Lobato D, Hernández-Lobato JM (2013) Learning feature selection dependencies in multi-task learning. Adv Neural Inf Process Syst 26

  14. Pan C, Zhu M (2017) Group additive structure identification for kernel nonparametric regression. Adv Neural Inf Process Syst 30

  15. Frecon J, Salzo S, Pontil M (2018) Bilevel learning of the group lasso structure. Adv Neural Inf Process Syst 31:8301–8311

  16. Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M (2018) Bilevel programming for hyperparameter optimization and meta-learning. In: International conference on machine learning, pp 1568–1577

  17. Shaban A, Cheng C-A, Hatch N, Boots B (2019) Truncated back-propagation for bilevel optimization. In: The 22nd international conference on artificial intelligence and statistics, pp 1723–1732

  18. Feurer M, Hutter F (2019) Hyperparameter optimization. Springer, Cham, pp 3–33

  19. Ji K, Yang J, Liang Y (2021) Bilevel optimization: convergence analysis and enhanced design. In: International conference on machine learning, pp 4882–4892

  20. Sun H, Pu W, Fu X, Chang T-H, Hong M (2022) Learning to continuously optimize wireless resource in a dynamic environment: a bilevel optimization perspective. IEEE Trans Signal Process 70:1900–1917

  21. Liu H, Simonyan K, Yang Y (2019) DARTS: differentiable architecture search. In: International conference on learning representations, pp 1–13

  22. Bertinetto L, Henriques JF, Torr P, Vedaldi A (2018) Meta-learning with differentiable closed-form solvers. In: International conference on learning representations, pp 1–15

  23. Rajeswaran A, Finn C, Kakade SM, Levine S (2019) Meta-learning with implicit gradients. Adv Neural Inf Process Syst 32

  24. Ji K, Lee JD, Liang Y, Poor HV (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. Adv Neural Inf Process Syst 33:11490–11500

  25. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288

  26. Feng Y, Huang X, Shi L, Yang Y, Suykens JA (2015) Learning with the maximum correntropy criterion induced losses for regression. J Mach Learn Res 16(1):993–1034

  27. Li Y, Liang M, Mao L, Wang S (2021) Robust estimation and variable selection for the accelerated failure time model. Stat Med 40(20):4473–4491

  28. Li T, Beirami A, Sanjabi M, Smith V (2020) Tilted empirical risk minimization. In: International conference on learning representations, pp 1–44

  29. Huber PJ (1973) Robust regression: asymptotics, conjectures and Monte Carlo. Ann Stat 799–821

  30. Sun Q, Zhou W-X, Fan J (2020) Adaptive Huber regression. J Am Stat Assoc 115(529):254–265

  31. Feng Y, Wu Q (2022) A statistical learning assessment of Huber regression. J Approx Theory 273:105660

  32. Chen S, Gong C, Yang J, Li X, Wei Y, Li J (2018) Adversarial metric learning. In: Proceedings of the 27th international joint conference on artificial intelligence, pp 2021–2027

  33. Sinha A, Malo P, Deb K (2017) A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Trans Evol Comput 22(2):276–295

  34. Sharma A (2020) Optimistic variants of single-objective bilevel optimization for evolutionary algorithms. Int J Comput Intell Appl 19(03):2050020

  35. Hong T, Zhao D, Zhang Y, Wang Z (2021) A bilevel voltage regulation operation for distribution systems with self-operated microgrids. IEEE Trans Smart Grid 13(2):1238–1248

  36. Condat L (2016) Fast projection onto the simplex and the ℓ1 ball. Math Prog 158(1):575–585

  37. Reddi SJ, Sra S, Poczos B, Smola AJ (2016) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv Neural Inf Process Syst 29:1145–1153

  38. Van Nguyen Q (2017) Forward-backward splitting with Bregman distances. Vietnam J Math 45(3):519–539

  39. Nikolova M, Chan RH (2007) The equivalence of half-quadratic minimization and the gradient linearization iteration. IEEE Trans Image Process 16(6):1623–1627

  40. He R, Zheng W-S, Tan T, Sun Z (2013) Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell 36(2):261–275

  41. Wang X, Chen H, Cai W, Shen D, Huang H (2017) Regularized modal regression with applications in cognitive impairment prediction. Adv Neural Inf Process Syst 30:1448–1458

  42. Huber PJ (1981) Robust statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, New York

  43. Yu Y-L, Aslan Ö, Schuurmans D (2012) A polynomial-time form of robust regression. Adv Neural Inf Process Syst 25

  44. Yao W, Li L (2014) A new regression model: modal linear regression. Scand J Stat 41(3):656–671

  45. Weiner MW, Aisen PS, Jack Jr CR, Jagust WJ, Trojanowski JQ, Shaw L, Saykin AJ, Morris JC, Cairns N, Beckett LA et al (2010) The Alzheimer's disease neuroimaging initiative: progress report and future plans. Alzheim Dement 6(3):202–211

  46. Yang J, Wang S, Wu T (2022) Maximum mutual information for feature extraction from graph-structured data: application to Alzheimer's disease classification. Appl Intell 1–17

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities of China (No. 2662022XXYJ005) and the National Natural Science Foundation of China (No. 12071166).

Author information

Corresponding author

Correspondence to Lingjuan Wu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Robustness analysis

In this section, we analyze the robustness of the modified TERM objective function via its sensitivity curve with respect to the prediction error and via finite sample breakdown point theory.

1.1 A.1: Sensitivity curve and gradient analysis

The modified TERM objective function is defined as \(\phi _{t}(e)=\frac {1}{t} \exp \{t e^{2}\}\) with negative t, where e represents the prediction error \(y_{i} - {x_{i}^{T}} \beta \). Inspired by [26], we draw the sensitivity curve by analyzing its derivative \(2e \cdot \exp \{t e^{2}\}\), as shown in Fig. 12.

Fig. 12 Derivative of ϕt(e) with respect to e. The derivative of the squared loss function is also provided for comparison

Compared with the squared loss, under which samples with larger prediction errors have a greater impact on the weight updates, the modified TERM formulation assigns small weights to atypical samples, making the method more robust to outliers. Moreover, the behavior of the modified TERM is governed by t: the function converges slowly when t takes a sufficiently large negative value (e.g., \(t \rightarrow -\infty \)), whereas outliers still affect the model when t is a small negative value (e.g., \(t \rightarrow 0^{-}\)). The optimal value of t is therefore task-specific, as discussed in the simulation experiments in Section 4.1.3. In general, the influence of outliers can be magnified (t > 0) or suppressed (t < 0) by tuning t. A theoretical analysis via the gradient is given in Lemma 3.
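
As a quick numerical illustration of this suppression effect (not taken from the paper's experiments), the following sketch evaluates the derivative 2e·exp{te²} against the squared-loss derivative 2e for a few error magnitudes:

```python
import numpy as np

# Illustrative comparison (not part of the original experiments): influence of the
# prediction error e under the modified TERM loss phi_t(e) = (1/t) * exp(t * e^2)
# with t < 0, versus the squared loss e^2.
errors = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
t = -0.5

term_derivative = 2 * errors * np.exp(t * errors ** 2)  # d/de phi_t(e) = 2e * exp(t e^2)
squared_derivative = 2 * errors                          # d/de e^2 = 2e

for e, d_term, d_sq in zip(errors, term_derivative, squared_derivative):
    print(f"e = {e:5.1f}   TERM derivative: {d_term:10.4f}   squared-loss derivative: {d_sq:6.1f}")
# For large |e| the TERM derivative vanishes (outliers are suppressed), while the
# squared-loss derivative keeps growing linearly with the error.
```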

The gradient of the modified objective function is a weighted average of the individual loss gradients \(\nabla _{\beta } f\left (x_{i} ; \beta \right )\). In addition, the weight wi(t; β) is determined by the hyperparameter t and by the loss \(f\left (x_{i} ; \beta \right )\) of each sample. Specifically, when t < 0, samples with larger loss values are suppressed by being assigned smaller weights.

Lemma 3

For a smooth loss function \(f\left (x_{i} ; \beta \right )\), define

$$ \tilde{R}(t ; \beta)= \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{t} e^{t f\left( x_{i} ; \beta\right)}. $$

Then

$$ \nabla_{\beta} \tilde{R}(t ; \beta)= \frac{1}{n} \sum\limits_{i=1}^{n} w_{i}(t ; \beta) \nabla_{\beta} f\left( x_{i} ; \beta\right), \quad \text{where } w_{i}(t ; \beta) = e^{t f\left( x_{i} ; \beta\right)}. $$

Proof

Here, we provide the proof:

$$ \begin{array}{@{}rcl@{}} \nabla_{\beta} \tilde{R}(t ; \beta) &=\nabla_{\beta}\left\{\frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{t} e^{t f\left( x_{i} ; \beta\right)}\right\} \\ &=\frac{1}{n} \sum\limits_{i=1}^{n} e^{t f\left( x_{i} ; \beta\right)} \nabla_{\beta} f\left( x_{i} ; \beta\right). \end{array} $$

When t is positive, TERM focuses on samples with relatively large losses (rare instances); thus, a positive t can be adopted when rare instances are important for estimation, which aims to reduce variance, as proven in [28]. In summary, the modified TERM function without the \(\log \) operator can be regarded as an alternative loss that magnifies samples with large losses when t > 0 and suppresses outliers when t < 0. In this paper, we focus on robust performance in nonoverlapping group lasso and variable structure discovery applications, where t < 0.
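
To make Lemma 3 concrete, the following minimal sketch (an illustration, not the authors' released code) computes the gradient of the modified TERM objective for squared per-sample losses and shows how gross outliers are down-weighted when t < 0:

```python
import numpy as np

# Gradient of Lemma 3 for squared per-sample losses f_i = (y_i - x_i^T beta)^2:
# a weighted average of the per-sample gradients with unnormalized weights
# w_i = exp(t * f_i).
def modified_term_gradient(X, y, beta, t):
    residuals = y - X @ beta                           # prediction errors e_i
    losses = residuals ** 2                            # per-sample losses f_i
    weights = np.exp(t * losses)                       # w_i(t; beta) = exp(t * f_i)
    per_sample_grads = -2.0 * residuals[:, None] * X   # grad_beta f_i
    return (weights[:, None] * per_sample_grads).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
y[:5] += 20.0                                          # a few gross outliers

grad = modified_term_gradient(X, y, np.zeros(5), t=-0.5)
# With t < 0 the outliers receive weights exp(t * f_i) close to zero and barely
# influence the gradient; as t -> 0^- the weights approach 1 (ordinary ERM).
```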

1.2 A.2: Finite sample breakdown point analysis

We are interested in studying the behavior of the modified TERM estimator when the dataset is contaminated with outliers. To further discuss the robustness of our estimator, we analyze its finite sample breakdown point.

Various definitions of the sample breakdown point have been proposed. In this work, we analyze the finite sample contamination breakdown point [42], which is used to quantify the proportion of bad data that an estimator can tolerate before returning arbitrary values.

Given original observations \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\), where \(s_{i}=\left (\boldsymbol {x}_{i}, y_{i}\right )\), denote T(S) as the minimizer of the following objective function:

$$ \min_{\beta} \frac{1}{n} \sum\limits_{i=1}^{n} \phi_{t}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right) $$
(13)

and

$$ \phi_{t}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right) = \frac{1}{t} e^{t\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta \right)^{2}}. $$
(14)

The original samples S are then corrupted by adding m extra arbitrary points \(\boldsymbol {S}^{\prime }=\left (s_{n+1}, \ldots , s_{n+m}\right )\). Thus, the new dataset \(\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\) contains a fraction ε = m/(m + n) of bad samples. The finite sample contamination breakdown point ε* is defined as

$$ \varepsilon^{*}=\min_{1 \leq m \leq n}\left\{\frac{m}{n+m}: \sup_{\boldsymbol{S}^{\prime}}\left\|T\left( \boldsymbol{S} \cup \boldsymbol{S}^{\prime}\right)\right\|=\infty\right\}, $$

where ∥⋅∥ is the Euclidean norm.

Theorem 2

Given original observations \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\), suppose \(T(\boldsymbol{S})=\hat {\beta }\) is the minimizer of (13). Let \(\phi _{t}(v)=\frac {1}{t} e^{t v^{2}}\) with t < 0, and denote \(M= t {\sum }_{i=1}^{n} \phi _{t}\left (y_{i}-\boldsymbol {x}_{i}^{T} \hat {\beta }\right )\). Then, the finite sample contamination breakdown point of the modified TERM framework is

$$ \varepsilon^{*}(\boldsymbol{S}, T)=\frac{m^{*}}{n+m^{*}}, $$
(15)

where m* denotes the minimal number of abnormal samples that can break the estimator and satisfies

$$ \lceil M\rceil \leq m^{*} \leq\lfloor M\rfloor+1, $$

where ⌈M⌉ denotes the smallest integer that is not less than M (the ceiling), and ⌊M⌋ the largest integer that is not greater than M (the floor).

Proof

Given the modified TERM objective ϕt(⋅), the original n samples \(\boldsymbol {S}=\left (s_{1}, \ldots , s_{n}\right )\) and m extra arbitrary samples \(\boldsymbol {S}^{\prime }=\left (s_{n+1}, \ldots , s_{n+m}\right )\), denote T as the estimator. Let ϕ*(⋅) = tϕt(⋅) and \(M={\sum }_{i=1}^{n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} \hat {{\beta }}\right )\), where \(\hat {{\beta }}= T(\boldsymbol {S})\). In addition, ϕ*(⋅) is symmetric about 0, attains its maximum \(\max \limits _{v} \phi ^{*}(v)=\phi ^{*}(0)=1\), decreases monotonically as |v| increases, and satisfies \(\lim _{\vert v\vert \rightarrow \infty } \phi ^{*}(v)=0\).

First, we prove that \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) remains bounded if m < M. Let ξ > 0 be such that m + nξ < M, and let C be such that ϕ*(v) ≤ ξ for |v|≥ C. Let β be any real vector such that \(\vert y-\boldsymbol{x}^{T} \beta \vert \geq C\) for all samples in S. Then, we have

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} \hat{{\beta}} \right) \geq M $$
(16)

and

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} {\beta}\right) \leq m + n \xi. $$
(17)

From (16) and (17), and since t < 0 implies that minimizing (13) over \(\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\) is equivalent to maximizing \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} \beta \right )\), the minimizer satisfies \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \right ) \geq M\) and therefore \( \vert y-\boldsymbol {x}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \vert <C\) for at least one point in S. Thus, \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) is bounded in the case of m < M.

Second, we prove that the estimator may break down if m > M. Similarly, let ξ > 0 be such that m > M + mξ, and let C be such that ϕ*(v) ≤ ξ for |v|≥ C. Assume that all arbitrary points \(\left \{\left (\boldsymbol {x}_{n+1}, y_{n+1}\right ), \ldots ,\left (\boldsymbol {x}_{n+m}, y_{n+m}\right )\right \}\) in \(\boldsymbol {S}^{\prime }\) are identical and satisfy a linear relationship \(y_{n+1} = \boldsymbol{x}_{n+1}^{T} \beta^{*}\) for some \(\beta^{*}\). Let β be any vector such that \(\vert y_{n+1}-\boldsymbol {x}_{n+1}^{T} \beta \vert >C\). Then, we have

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} {\beta}^{*}\right) \geq m $$
(18)

and

$$ \sum\limits_{i=1}^{m+n} \phi^{*}\left( y_{i}-\boldsymbol{x}_{i}^{T} \beta\right) \leq M+m \xi. $$
(19)

From (18) and (19), and by the same maximization argument, \(T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\) must satisfy \({\sum }_{i=1}^{m+n} \phi ^{*}\left (y_{i}-\boldsymbol {x}_{i}^{T} T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right ) \right ) \geq m\), and hence \( \vert y_{n+1}-\boldsymbol {x}_{n+1}^{T} T\left (\boldsymbol{S} \cup \boldsymbol{S}^{\prime }\right ) \vert \leq C\). Then \(\left \|T\left (\boldsymbol {S} \cup \boldsymbol {S}^{\prime }\right )\right \|\) goes to infinity when we let \(y_{n+1} \rightarrow \infty \) and fix \(\boldsymbol{x}_{n+1}\). Thus, we obtain the breakdown point. □

The above theorem shows that the breakdown point depends not only on the objective function ϕt(⋅) with its tuning parameter t, but also on the sample configuration. However, as pointed out in [42, 44], the breakdown point is empirically quite high when the scale (contained in the hyperparameter t of TERMGL) is determined from the dataset itself.
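
The following sketch illustrates how Theorem 2 can be evaluated in practice; it assumes that an estimate \(\hat{\beta}\) (denoted beta_hat) minimizing (13) has already been obtained by some solver:

```python
import numpy as np

# Illustrative evaluation of Theorem 2: estimate the finite sample contamination
# breakdown point of the modified TERM estimator from fitted residuals.
def contamination_breakdown_point(X, y, beta_hat, t):
    assert t < 0
    residuals = y - X @ beta_hat
    M = np.sum(np.exp(t * residuals ** 2))   # M = t * sum_i phi_t(e_i) = sum_i exp(t * e_i^2)
    m_star = int(np.floor(M)) + 1            # upper end of ceil(M) <= m* <= floor(M) + 1
    n = len(y)
    return m_star / (n + m_star)
```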

Appendix B: Optimization details

1.1 B.1: Details for updating w

Denote the dual variable \({w}=\left ({w}_{1}, \ldots , {w}_{{L}}\right ) \in \mathbb {R}^{{P} \times {L}}\), where \({w}_{l}=\left ({w}_{1 l}, {w}_{2 l}, \ldots , {w}_{P l}\right )^{{T}} \in \mathbb {R}^{{P}}\). According to the Fenchel-Rockafellar duality theorem, the dual problem of (9) can be formulated as

$$ \hat{w}=\underset{w \in \mathbb{R}^{P \times L}}{\text{argmin}}\left\{\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w\right) + {\Omega}^{*}(w)\right\}. $$
(20)

Moreover, this is a smooth constrained convex optimization problem. Here \({\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} w\right )\) is the Fenchel conjugate of \({\mathscr{L}}(\beta )\), and \(\mathcal {T}_{\theta }^{*} w={\sum }_{l=1}^{L} \mathcal {T}_{\theta _{l}} w_{l} \in \mathbb {R}^{P}\). \({\Omega }^{*}(w)={\sum }_{l=1}^{L} \delta \left (w_{l}\right )\) is the Fenchel conjugate of Ω(β), where the indicator function \(\delta \left (w_{l}\right )\) satisfies \(\delta \left (w_{l}\right )=0\) if \(\left \|w_{l}\right \|_{2} \leq \eta \), and \(+\infty \) otherwise.

Since \({\mathscr{L}}(\beta)\) is ε-strongly convex, its conjugate \({\mathscr{L}}^{*}\) is differentiable everywhere with \(\varepsilon^{-1}\)-Lipschitz continuous gradient, and \(\nabla _{{w}}\left [{\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} {~w}\right )\right ]=-\left (\mathcal {T}_{\theta _{l}} \nabla {\mathscr{L}}^{*}\left (-\mathcal {T}_{\theta }^{*} {~w}\right )\right )_{1 \leq l \leq {L}} \in \mathbb {R}^{{P} \times {L}}\) is \(\left \|\left (\mathcal {T}_{\theta _{l}} \cdot \right )_{1 \leq l \leq {L}}\right \|_{2}^{2} \varepsilon ^{-1}\)-Lipschitz continuous. Using the properties of conjugate functions, we obtain the following primal-dual link after Q iterations:

$$ \begin{array}{@{}rcl@{}} \beta&=&\nabla \mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(Q)}\right)\\ &=&\left( \nabla \mathcal{L}\right)^{-1}\left( -\mathcal{T}_{\theta}^{*} w^{(Q)}\right)\\ &=&\left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(Q)}\right). \end{array} $$
(21)

We next apply the forward-backward splitting method with the Bregman proximity operator \(\text {prox}_{{\Omega }^{*}({w})}^{\varphi }\) to the dual problem in (20), where φ is the separable Hellinger-like function

$$ \begin{array}{@{}rcl@{}} \varphi(w)&=&\sum\limits_{l=1}^{L} \varphi_{l}\left( w_{l}\right)\\ &=&-\sum\limits_{l=1}^{L} \sqrt{\eta^{2} -\left\|w_{l}\right\|_{2}^{2}}, \text { s.t. }\left\|w_{l}\right\|_{2} \leq \eta. \end{array} $$
(22)

Its Fenchel conjugate function is as follows:

$$ \varphi^{*}(w)=\sum\limits_{l=1}^{L} \varphi_{l}^{*}\left( w_{l}\right)={\sum}_{l=1}^{L} \eta \sqrt{1+\left\|w_{l}\right\|_{2}^{2}}. $$
(23)

By computing directly, we obtain

$$ \nabla \varphi({w})=\left( \frac{{w}_{l}}{\sqrt{\eta^{2}-\left\|w_{l}\right\|_{2}^{2}}}\right)_{1 \leq l \leq {L}}, $$
(24)

and

$$ \nabla \varphi^{*}({w})=\left( \frac{\eta {w}_{l}}{\sqrt{1+\left\|{w}_{l}\right\|_{2}^{2}}}\right)_{1 \leq l \leq {L}}. $$
(25)

Let q denote the iteration index and γ the step size. To solve the dual problem in (20), we update w, the dual variable of the coefficient β, by the following iterative steps:

$$ \begin{array}{@{}rcl@{}} w^{(q+1)} &=&\operatorname{prox}_{{\Omega}^{*}(w)}^{\varphi}\left( \nabla \varphi\left( \!w^{(q)}\!\right)-\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( \!-\mathcal{T}_{\theta}^{*} w^{(q)}\!\right)\!\right]\right) \\ &=&\underset{w \in \mathbb{R}^{P \times L}}{\operatorname{argmin}} \varphi(w)+{\Omega}^{*}(w)-\left\langle\nabla \varphi\left( w^{(q)}\right)\right.\\ &&-\left.\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w\right\rangle \\ &=&\underset{\left\|w_{l}\right\|_{2} \leq \eta , l=1, \ldots, L}{\operatorname{argmin}} \varphi(w)-\left\langle\nabla \varphi\left( w^{(q)}\right)\right.\\ &&-\left.\gamma \nabla_{w}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w\right\rangle \\ &=&\underset{\left\|w_{l}\right\|_{2} \leq \eta , l=1, \ldots, L}{\operatorname{argmin}} {\sum}_{l=1}^{L}\left\{\varphi_{l}\left( w_{l}\right)-\left\langle\nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&-\left.\left.\gamma \nabla_{w_{l}}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right], w_{l} \right>\right\} \\ &=&\left( \nabla \varphi_{l }^{* } \left( \nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&-\left.\left.\gamma \nabla_{w_{l}}\left[\mathcal{L}^{*}\left( -\mathcal{T}_{\theta}^{*} w^{(q)}\right)\right]\right)\right)_{1 \leq l \leq L}\quad \in \quad \!\!\mathbb{R}^{P \times L} \\ &=&\left( \nabla \varphi_{l }^{* } \left( \nabla \varphi_{l}\left( w_{l}^{(q)}\right)\right.\right.\\ &&+\!\left.\left.\gamma \mathcal{T}_{\theta}\!\left( \widetilde{{X}}^{T} \widetilde{{X}}\!+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{{X}}^{T} \tilde{y}-\mathcal{T}_{\theta}^{*} w^{(q)}\!\right)\!\right)\!\right)_{\!1 \leq l \leq L} \end{array} $$
Algorithm 1 Updating β with HQ-DFBB.

The last step above for updating the dual variable w is the update employed in HQ-DFBB, as summarized in Algorithm 1, and a minimal sketch of this dual step is given below; the next subsection details the update of 𝜃 for the upper level problem.
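
The sketch below is an illustrative implementation of one such dual step, not the released code; it assumes, as in BiGL [15], that each \(\mathcal{T}_{\theta_l}\) acts as the diagonal matrix formed from the l-th column of θ, so that \(\mathcal{T}_{\theta}^{*} w = {\sum}_{l} \operatorname{diag}(\theta_{l}) w_{l}\):

```python
import numpy as np

# One Bregman forward-backward step on the dual problem (20), with the
# Hellinger-like mirror maps (24)-(25) and the primal-dual link (21).
# X_tilde, y_tilde are the HQ-reweighted data; eta, gamma, eps are hyperparameters.
def dual_update(w, theta, X_tilde, y_tilde, eta, gamma, eps):
    P, L = w.shape
    A_inv = np.linalg.inv(X_tilde.T @ X_tilde + eps * np.eye(P))
    Tw = (theta * w).sum(axis=1)                      # T_theta^* w = sum_l diag(theta_l) w_l
    beta = A_inv @ (X_tilde.T @ y_tilde - Tw)         # primal-dual link (21)
    w_new = np.empty_like(w)
    for l in range(L):
        grad_phi = w[:, l] / np.sqrt(eta ** 2 - np.sum(w[:, l] ** 2))   # nabla phi_l, eq. (24)
        v = grad_phi + gamma * theta[:, l] * beta                       # forward (gradient) step
        w_new[:, l] = eta * v / np.sqrt(1.0 + np.sum(v ** 2))           # nabla phi_l^*, eq. (25)
    return w_new, beta
```

Each step performs the forward (gradient) move in the mirror coordinates ∇φ and maps back through ∇φ*, which automatically keeps \(\|w_{l}\|_{2} < \eta\).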

1.2 B.2: Details for updating 𝜃

Recalling the definitions given in Sections 3 and 4, we summarize the simplified bilevel problem. Upper level problem:

$$ \begin{array}{@{}rcl@{}} &&\hat{\theta} \in \underset{\theta \in {\Theta}}{\arg \min } \frac{1}{K}{\sum}_{k=1}^{K} U\left( \hat{\beta}^{(k)}(\theta)\right) \text { with } \\ &&U\left( \hat{\beta}^{(k)}(\theta)\right) = \frac{1}{t} \log \left( \frac{1}{n} {\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \widehat{\beta}^{(k)}(\theta)\right)}\right). \end{array} $$
(26)

Lower level problem:

$$ \min_{\beta \in \mathbb{R}^{P}}\underbrace{\frac{1}{2}\|\tilde{y}-\tilde{X} \beta\|_{2}^{2}+\frac{\varepsilon}{2}\|\beta\|_{2}^{2}}_{\mathcal{L}(\beta)}+\underbrace{\eta \sum\limits_{l=1}^{L}\left\|\left( \mathcal{T}_{\theta_{l}} \beta\right)\right\|_{2}}_{\Omega\left( \mathcal{T}_{\theta} \beta\right)}. $$

To update 𝜃, we compute the partial derivative defined in Section 3.1 with the following backward gradient method,

$$ \frac{\partial U(\hat{\beta}(\theta))}{\partial \theta}=\left( \frac{d \hat{\beta}(\theta)}{d \theta}\right)^{T} \frac{\partial U(\hat{\beta}(\theta))}{\partial \beta}, $$

where

$$ \begin{array}{@{}rcl@{}} \frac{\partial U(\hat{\beta}(\theta) )}{\partial \beta} &=&\frac{\partial \frac{1}{t} \log \left( \frac{1}{n} {\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \widehat{\beta}^{(k)}(\theta)\right)}\right)}{\partial \beta}\\ &=&\frac{{\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)} \nabla_{\beta} f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)}{{\sum}_{i=1}^{n} e^{t f\left( x_{i}^{(k)}, y_{i}^{(k)} ; \hat{\beta}^{(k)}(\theta)\right)}}. \end{array} $$
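
In contrast to the unnormalized weights of Lemma 3, this upper-level gradient uses normalized (tilted softmax) weights. A minimal sketch, assuming squared per-sample losses:

```python
import numpy as np

# Gradient of the upper-level TERM objective U = (1/t) * log((1/n) * sum_i exp(t * f_i))
# with respect to beta, assuming per-sample squared losses f_i = (y_i - x_i^T beta)^2.
def upper_level_gradient(X, y, beta, t):
    residuals = y - X @ beta
    losses = residuals ** 2                            # f_i
    z = t * losses
    weights = np.exp(z - z.max())                      # numerically stable exp(t * f_i)
    weights /= weights.sum()                           # normalized (tilted softmax) weights
    per_sample_grads = -2.0 * residuals[:, None] * X   # grad_beta f_i
    return weights @ per_sample_grads
```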

We next specify the implementation details of the partial derivative computation. For m = 0,…,M − 1 and q = 0,…,Q − 1, we denote the equations in Algorithm 2 as

$$ \mathcal{A}\left( \beta^{(m)}\right):=\left( -f^{\prime}\left( \left( y_{i}-X_{i}^{T} \beta^{(m)}\right)^{2}\right)\right)_{1 \leq i \leq n}^{T} :=\left( e^{t \left( y_{i}-X_{i}^{T} \beta^{(m)}\right)^{2}}\right)_{1 \leq i \leq n}^{T}, $$
$$ \mathcal{B}\left( b^{(m)}, w^{(m, Q)}, \theta\right):= \left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(m, Q)}\right), $$
$$ \mathcal{C}\left( b^{(m)}, w^{(m, q)}, \theta\right):=\left( \nabla \varphi_{l}^{*}\left( \nabla \varphi_{l}\left( w_{l}^{(m, q)}\right) + \gamma \mathcal{T}_{\theta_{l}}\left( \widetilde{X}^{T} \widetilde{X}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{T} \widetilde{y}-\mathcal{T}_{\theta}^{*} w^{(m+1, q)}\right)\right)\right)_{1 \leq l \leq L}. $$

Thus, \(\mathcal {A}\), \(\mathcal{B}\), and \(\mathcal {C}\) denote the updating functions for the HQ parameter b, the dual variable w, and the coefficient β, respectively. In contrast to BiGL [15], an additional step for updating the HQ parameter b is included. Algorithm 1 (HQ-DFBB) can then be summarized as follows:

$$ \left\{ \begin{array}{l} \text { Set } \beta^{(0)}(\theta) \equiv 0 \in \mathbb{R}^{P} \\ \text { for } m=0,1, \cdots, M-1 \\ \left\{\begin{array}{l} b^{(m+1)}=\mathcal{A}\left( \beta^{(m)}\right) \\ \text { Set } w^{(m+1,0)} \equiv 0 \in \mathbb{R}^{P \times L} \\ \text { for } q=0,1, \cdots, Q-1 \\ \left\{ w^{(m+1, q+1)}=\mathcal{B}\left( b^{(m+1)}, w^{(m+1, q)}, \theta\right) \right. \\ \beta^{(m+1)}=\mathcal{C}\left( b^{(m+1)}, w^{(m+1, Q)}, \theta\right) \end{array}\right.\\ \beta^{(M)}=\mathcal{C}\left( b^{(M)}, w^{(M, Q)}, \theta\right) \end{array} \right. $$
(27)
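
A structural sketch of (27) is given below; it reuses the dual_update routine sketched in B.1 and assumes HQ weights of the form \(b_i = e^{t e_i^{2}}\) (following \(\mathcal{A}\) above), so it illustrates the loop structure rather than the released implementation:

```python
import numpy as np

# Structural sketch of the HQ-DFBB iteration in (27): outer HQ reweighting,
# inner dual forward-backward loop, and the primal-dual link for beta.
# M_outer and Q_inner play the roles of M and Q.
def hq_dfbb(X, y, theta, t, eta, gamma, eps, M_outer=10, Q_inner=20):
    P = X.shape[1]
    L = theta.shape[1]
    beta = np.zeros(P)
    for m in range(M_outer):
        b = np.exp(t * (y - X @ beta) ** 2)      # A: half-quadratic weights
        X_tilde = np.sqrt(b)[:, None] * X        # reweighted design
        y_tilde = np.sqrt(b) * y                 # reweighted response
        w = np.zeros((P, L))
        for q in range(Q_inner):                 # inner dual forward-backward loop
            w, beta = dual_update(w, theta, X_tilde, y_tilde, eta, gamma, eps)
        # in this sketch, beta from the last inner step plays the role of beta^(m+1)
    return beta
```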

We focus on the computation of \(\left (\frac {d \hat {\beta }(\theta )}{d \theta }\right )^{T}\) and illustrate the recursion that links \(\frac{d \beta^{(m+1)}}{d \theta}\) to \(\frac{d \beta^{(m)}}{d \theta}\). We then analyze the backward gradient in detail.

For simplicity, similar to [15], we define the partial derivatives by using the following abbreviations with an additional computation for b:

$$ \frac{d \mathcal{A}^{(m)}}{d\beta}=\partial_{1} \mathcal{A}^{(m)}, $$
$$ \begin{array}{@{}rcl@{}} \frac{\partial \mathcal{B}^{(m, q)}}{\partial b}&:=&\partial_{1} \mathcal{B}^{(m, q)}, \quad\!\!\! \frac{\partial \mathcal{B}^{(m, q)}}{\partial w}:=\partial_{2} \mathcal{B}^{(m, q)}, \quad\!\!\! \frac{\partial \mathcal{B}^{(m, q)}}{\partial \theta}:=\partial_{3} \mathcal{B}^{(m, q)}, \\ \frac{\partial \mathcal{C}^{(m, Q)}}{\partial b}&:=&\partial_{1} \mathcal{C}^{(m, Q)},\!\!\! \quad \frac{\partial \mathcal{C}^{(m, Q)}}{\partial w}:=\partial_{2} \mathcal{C}^{(m, Q)},\!\!\! \quad \frac{\partial \mathcal{C}^{(m, Q)}}{\partial \theta}:=\partial_{3} \mathcal{C}^{(m, Q)}. \end{array} $$

From the simplified algorithms in (27), we can easily find that

$$ \frac{d \beta^{(m+1)}}{d \theta}=D_{m+1}+E_{m+1} \frac{d \beta^{(m)}}{d \theta}. $$
(28)

Here, \(D_{m+1}\) collects the partial derivatives with respect to 𝜃, and \(E_{m+1}\) propagates the dependence on \(\beta^{(m)}\) through the HQ parameter b.

We can further obtain the following equations:

$$ \begin{array}{@{}rcl@{}} D_{m+1}&=&\partial_{3} \mathcal{C}^{(m+1,Q)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{3} \mathcal{C}^{(m+1,Q-1)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} \partial_{3} \mathcal{C}^{(m+1,Q-2)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} \partial_{2} \mathcal{C}^{(m+1,Q-2)} \partial_{3} \mathcal{C}^{(m+1,Q-3)}\\ &&+\cdots\\ &&+ \partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{C}^{(m+1,Q-1)} {\cdots} \partial_{2} \mathcal{C}^{(m+1,1)} \partial_{3} \mathcal{C}^{(m+1,0)}\\ \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} E_{m+1} &=& \partial_{1} \mathcal{C}^{(m+1,Q)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{1} \mathcal{B}^{(m+1,Q-1)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} \partial_{1} \mathcal{B}^{(m+1,Q-2)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} \partial_{2} \mathcal{B}^{(m+1,Q-2)} \partial_{1} \mathcal{B}^{(m+1,Q-3)} \partial_{1} \mathcal{A}^{(m)}\\ &&+\cdots\\ &&+\partial_{2} \mathcal{C}^{(m+1,Q)} \partial_{2} \mathcal{B}^{(m+1,Q-1)} {\cdots} \partial_{1} \mathcal{B}^{(m+1,0)} \partial_{1} \mathcal{A}^{(m)}.\\ \end{array} $$
Algorithm 2 Updating 𝜃 with Prox-SAGA.

Using the above results, we unroll the recursion in (28).

$$ \begin{array}{@{}rcl@{}} \left( \frac{d \hat{\beta}^{(M)}(\theta)}{d \theta}\right)^{T} &=&\underbrace{{{D_{M}^{T}}}}_{F_{M}}+ \left( \frac{d \hat{\beta}^{(M-1)}(\theta)}{d \theta}\right)^{T} \underbrace{{{E_{M}^{T}}}}_{G_{M}} \\ &=&\underbrace{{F_{M}+{D_{M-1}^{T}}G_{M}}}_{F_{M-1}}+ \left( \frac{d \hat{\beta}^{(M-2)}(\theta)}{d \theta}\right)^{T} \underbrace{{E_{M-1}^{T}}G_{M}}_{G_{M-1}}\\ &=&\underbrace{{F_{M-1}+{D_{M-2}^{T}}G_{M-1}}}_{F_{M-2}} + \left( \frac{d \hat{\beta}^{(M-3)}(\theta)}{d \theta}\right)^{T} \underbrace{{E_{M-2}^{T}}G_{M-1}}_{G_{M-2}}\\ &=&{\cdots} \\ &=&\underbrace{{D_{M}^{T}}+D_{M-1}^{T}G_{M}+D_{M-2}^{T}G_{M-1}+\cdots+{D_{1}^{T}}G_{2}}_{F_{1}}. \end{array} $$

Therefore, the hypergradient in (28) can be computed by accumulating the above terms backward, from m = M down to m = 1.
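
A compact sketch of this backward accumulation, taking the (hypothetical) Jacobian blocks \(D_m^{T}\) and \(E_m^{T}\) as given, is:

```python
# Backward accumulation of the recursion (28):
# d beta^(m+1)/d theta = D_{m+1} + E_{m+1} d beta^(m)/d theta, with d beta^(0)/d theta = 0.
# Ds_T and Es_T are lists of the matrices D_m^T and E_m^T for m = 1, ..., M.
def accumulate_hypergradient(Ds_T, Es_T):
    G = None        # running product G_{m+1}; None plays the role of the identity
    total = None
    for D_T, E_T in zip(reversed(Ds_T), reversed(Es_T)):   # m = M, M-1, ..., 1
        term = D_T if G is None else D_T @ G               # D_m^T G_{m+1}
        total = term if total is None else total + term
        G = E_T if G is None else E_T @ G                  # G_m = E_m^T G_{m+1}
    return total    # equals (d beta^(M)/d theta)^T
```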

Appendix C: Convergence analysis

For completeness, we briefly provide the proof of Theorem 1.

Proof

(a) Since the function J(β,b) is strongly convex w.r.t. β, we have

$$ J\left( \bar{\beta}^{(m+1)}, \bar{b}^{(m+1)}\right)<J\left( \bar{\beta}^{(m)}, \bar{b}^{(m+1)}\right) \leq J\left( \bar{\beta}^{(m)}, \bar{b}^{(m)}\right). $$

According to the continuity of J(β,b) and of b with respect to β, \(\left \{J\left (\beta ^{(m)}, b^{(m)}\right ), m=1,2, \ldots \right \}\) is a decreasing sequence (considering the case \(Q\rightarrow +\infty \)). Since J(β,b) is bounded below, \(\left \{J\left (\beta ^{(m)}, b^{(m)}\right ), m=1,2, {\ldots } M\right \}\) converges. Because \(J(\beta )=\min \limits _{b} J(\beta , b)\), we have \(J\left (\beta ^{(m)}\right )=n^{-1} J\left (\beta ^{(m)}, b^{(m+1)}\right )\) from the HQ optimization. Hence, the sequence \(\left \{J\left (\beta ^{(m)}\right ), m=1,2, \ldots \right \}\) also converges.

(b) Since J(β,b) is ε-strongly convex w.r.t. β, we have

$$ \begin{array}{@{}rcl@{}} &&{}J\left( \bar{\beta}^{(m+1)}, \bar{b}^{(m)}\right)-J\left( \bar{\beta}^{(m)}, \bar{b}^{(m)}\right) \geq G^{T}\left( \bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right)\\ &&\qquad\qquad\qquad\quad+\frac{\varepsilon}{2}\left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}, \end{array} $$

where G is the gradient of \(J\left (\bar {\beta }^{(m)}, \bar {b}^{(m)}\right )\) with respect to β and can be taken to be zero since \(\bar {\beta }^{(m)}=\arg \min \limits _{\beta } J\left (\beta , \bar {b}^{(m)}\right )\). Considering the case of \(m \rightarrow +\infty \), we have \(J\left (\bar {\beta }^{(m+1)}, \bar {b}^{(m)}\right )-J\left (\bar {\beta }^{(m)}, \bar {b}^{(m)}\right ) \rightarrow 0\) and \(\left \|\bar {\beta }^{(m+1)}-\bar {\beta }^{(m)}\right \|^{2} \rightarrow 0\). From Section 3, we conclude that

$$ \begin{array}{@{}rcl@{}} &&\left\|\beta^{(Q, m+1)}-\bar{\beta}^{(m+1)}\right\|^{2}+ \left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}\\ &&\quad+\left\|\bar{\beta}^{(m)}-\beta^{(Q, m)}\right\|^{2} \geq\left\|\beta^{(Q, m+1)}-\beta^{(Q, m)}\right\|^{2} \end{array} $$

and

$$ \frac{2 C}{Q}+\left\|\bar{\beta}^{(m+1)}-\bar{\beta}^{(m)}\right\|^{2}\geq \left\|\beta^{(Q, m+1)}-\beta^{(Q, m)}\right\|^{2}. $$

Finally, we have \(\left \|\beta ^{(M)}-\beta ^{(M-1)}\right \|_{2}^{2} \rightarrow 0\) as \(m, Q \rightarrow +\infty \).

(c) Based on Theorem 3.1 in [38], the sequence \(\left \{w^{(m, Q)}\right \}_{Q \in \mathbb {N}}\) generated by the HQ-DFBB algorithm converges to \(\hat {w}^{(m)}\) as \(Q \rightarrow +\infty \).

We denote \(b^{*}\), \(\hat {w}^{*}\), \(\beta^{*}\), \(\widetilde {X}^{*}=\sqrt {b^{*}} \odot X\) and \(\widetilde {y}^{*}=\sqrt {b^{*}} \odot y\) as the limits generated by Algorithm 1 as \(m, Q \rightarrow +\infty \). When \(Q, m \rightarrow +\infty \), b(m) converges because the weight b is continuously differentiable with respect to β. We then analyze the primal-dual link function in the case of \(Q, m \rightarrow +\infty \),

$$ \beta^{*}=\left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right)^{-1}\left( \widetilde{X}^{* T} \widetilde{y}^{*}-\mathcal{T}_{\theta}^{*} \hat{w}^{*}\right). $$

We can rewrite the above formula as

$$ \left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\mathcal{T}_{\theta}^{*} \hat{w}^{*}=0. $$
(29)

From the definitions of Ω(β) and its dual function Ω*(w), we have \(\left \|w_{l}^{(m, q)}\right \| \leq \eta \) for m ∈{1,…,M}, l ∈{1,…,L}, q ∈{1,…,Q}. Therefore, in the case of \(Q, m \rightarrow +\infty \), we have \(\left \|\hat{w}_{l}^{*}\right \| \leq \eta \). Let \(w_{l}^{*}=\frac {1}{\eta } \hat {w}_{l}^{*}\); then \(\left \|w_{l}^{*}\right \| \leq 1\). Equation (29) can be rewritten as

$$ \left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\sum\limits_{l=1}^{L} \eta \mathcal{T}_{\theta_{l}} w_{l}^{*}=0. $$

Computing the subdifferential of Ω(β), we have

$$ \partial {\Omega}(\beta)=\eta \sum\limits_{l=1}^{L} \mathcal{T}_{\theta_{l}} \delta_{l}, $$

where δl satisfies \(\left \|\delta _{l}\right \|=1\) if \(\mathcal {T}_{\theta _{l}} \beta \neq 0\), and \(\left \|\delta _{l}\right \| \leq 1\) otherwise (1 ≤ l ≤ L). Hence,

$$ \begin{array}{@{}rcl@{}} 0&=&\left( \widetilde{X}^{* T} \widetilde{X}^{*}+\varepsilon \mathbb{I}\right) \beta^{*}-\widetilde{X}^{* T} \widetilde{y}^{*}+\sum\limits_{l=1}^{L} \eta \mathcal{T}_{\theta_{l}} w_{l}^{*}\\ &\in& \nabla \mathcal{L}\left( \beta^{*}\right)+\partial {\Omega}\left( \beta^{*}\right). \end{array} $$

In summary, we obtain \(0 \in \partial _{\beta } J\left (\beta ^{*}, b^{*}\right )\). Finally, \(\beta^{*}\) is the unique minimizer of \(J\left (\beta , b^{*}\right )\) since J(β,b) is strongly convex with respect to β. □

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Zhang, X., Wang, Y., Zhu, L. et al. Robust variable structure discovery based on tilted empirical risk minimization. Appl Intell 53, 17865–17886 (2023). https://doi.org/10.1007/s10489-022-04409-z
