
Sparse group fused lasso for model segmentation: a hybrid approach


Abstract

This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments, the hybrid method compares favorably to state-of-the-art techniques with respect to computation time and numerical accuracy; benefits are particularly substantial in high dimension. The method’s statistical performance is satisfactory in recovering nonzero regression coefficients and excellent in change point detection. An application to air quality data is presented. The hybrid method is implemented in the R package sparseGFL available on the author’s GitHub page.


References

  • Alaíz CM, Jiménez ÁB, Dorronsoro JR (2013) Group fused lasso. Artif Neural Netw Mach Learn 2013:66–73
  • Alewijnse SPA, Buchin K, Buchin M, Sijben S, Westenberg MA (2018) Model-based segmentation and classification of trajectories. Algorithmica 80(8):2422–2452
  • Bai J (1997) Estimating multiple breaks one at a time. Econom Theory 13(3):315–352
  • Bai J, Perron P (2003) Computation and analysis of multiple structural change models. J Appl Econom 18(1):1–22
  • Barbero A, Sra S (2011) Fast Newton-type methods for total variation regularization. In: Proceedings of the 28th international conference on machine learning, ICML 2011, pp 313–320
  • Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application. Prentice Hall information and system sciences series. Prentice Hall Inc, Englewood Cliffs
  • Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
  • Becker S, Bobin J, Candès EJ (2011) NESTA: a fast and accurate first-order method for sparse recovery. SIAM J Imaging Sci 4(1):1–39
  • Beer JC, Aizenstein HJ, Anderson SJ, Krafty RT (2019) Incorporating prior information with fused sparse group lasso: application to prediction of clinical measures from neuroimages. Biometrics 75(4):1299–1309
  • Bertsekas DP (2015) Convex optimization algorithms. Athena Scientific, Belmont
  • Bleakley K, Vert JP (2011) The group fused lasso for multiple change-point detection. Technical Report hal-00602121. https://hal.archives-ouvertes.fr/hal-00602121. Accessed 15 Oct 2020
  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
  • Bredies K, Lorenz DA (2008) Linear convergence of iterative soft-thresholding. J Fourier Anal Appl 14(5–6):813–837
  • Cao P, Liu X, Liu H, Yang J, Zhao D, Huang M, Zaiane O (2018) Generalized fused group lasso regularized multi-task feature learning for predicting cognitive outcomes in Alzheimer's disease. Comput Methods Programs Biomed 162:19–45
  • Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771
  • Chen X, Lin Q, Kim S, Carbonell JG, Xing EP (2012) Smoothing proximal gradient method for general structured sparse regression. Ann Appl Stat 6(2):719–752
  • Chi EC, Lange K (2015) Splitting methods for convex clustering. J Comput Graph Stat 24(4):994–1013
  • Combettes PL, Pesquet JC (2011) Proximal splitting methods in signal processing. In: Fixed-point algorithms for inverse problems in science and engineering. Springer, New York, pp 185–212
  • Condat L (2013) A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J Optim Theory Appl 158(2):460–479
  • De Vito S, Massera E, Piga M, Martinotto L, Di Francia G (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens Actuators B Chem 129(2):750–757
  • De Vito S, Piga M, Martinotto L, Di Francia G (2009) CO, NO\(_{2}\) and NO\(_{x}\) urban pollution monitoring with on-field calibrated electronic nose by automatic Bayesian regularization. Sens Actuators B Chem 143(1):182–191
  • Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
  • Fryzlewicz P (2014) Wild binary segmentation for multiple change-point detection. Ann Stat 42(6):2243
  • Hadj-Selem F, Löfstedt T, Dohmatob E, Frouin V, Dubois M, Guillemot V, Duchesnay E (2018) Continuation of Nesterov's smoothing for regression with structured sparsity in high-dimensional neuroimaging. IEEE Trans Med Imaging 37(11):2403–2413
  • Hallac D, Nystrup P, Boyd S (2019) Greedy Gaussian segmentation of multivariate time series. Adv Data Anal Classif 13(3):727–751
  • Hocking T, Vert JP, Bach FR, Joulin A (2011) Clusterpath: an algorithm for clustering using convex fusion penalties. In: ICML
  • Hoefling H (2010) A path algorithm for the fused lasso signal approximator. J Comput Graph Stat 19(4):984–1006
  • Kim S, Xing EP (2012) Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann Appl Stat 6(3):1095–1117
  • Kuhn HW (1973) A note on Fermat's problem. Math Program 4:98–107
  • Leonardi F, Bühlmann P (2016) Computationally efficient change point detection for high-dimensional regression
  • Li Y, Osher S (2009) Coordinate descent optimization for \(\ell ^1\) minimization with application to compressed sensing; a greedy algorithm. Inverse Probl Imaging 3(3):487–503
  • Li X, Mo L, Yuan X, Zhang J (2014) Linearized alternating direction method of multipliers for sparse group and fused LASSO models. Comput Stat Data Anal 79:203–221
  • Liu J, Yuan L, Ye J (2010) An efficient algorithm for a class of fused lasso problems. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '10. ACM, pp 323–332
  • Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1, Ser. A):127–152
  • Nesterov Y (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim 22(2):341–362
  • Nystrup P, Madsen H, Lindström E (2017) Long memory of financial time series and hidden Markov models with time-varying parameters. J Forecast 36(8):989–1002
  • Ohlsson H, Ljung L, Boyd S (2010) Segmentation of ARX-models using sum-of-norms regularization. Automatica 46(6):1107–1111
  • Ombao H, von Sachs R, Guo W (2005) SLEX analysis of multivariate nonstationary time series. J Am Stat Assoc 100(470):519–531
  • Price BS, Geyer CJ, Rothman AJ (2019) Automatic response category combination in multinomial logistic regression. J Comput Graph Stat 28(3):758–766
  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 15 Oct 2020
  • Ranalli M, Lagona F, Picone M, Zambianchi E (2018) Segmentation of sea current fields by cylindrical hidden Markov models: a composite likelihood approach. J R Stat Soc Ser C (Appl Stat) 67(3):575–598
  • Rockafellar R (2015) Convex analysis. Princeton landmarks in mathematics and physics. Princeton University Press, Princeton
  • Sanderson C, Curtin R (2016) Armadillo: a template-based C++ library for linear algebra. J Open Source Softw 1:26
  • Saxén JE, Saxén H, Toivonen HT (2016) Identification of switching linear systems using self-organizing models with application to silicon prediction in hot metal. Appl Soft Comput 47:271–280
  • Shor NZ (1985) Minimization methods for nondifferentiable functions. Springer series in computational mathematics, vol 3. Springer, Berlin (translated from the Russian by K. C. Kiwiel and A. Ruszczyński)
  • Songsiri J (2015) Learning multiple Granger graphical models via group fused lasso. In: 2015 10th Asian control conference (ASCC), pp 1–6
  • Tibshirani R, Wang P (2007) Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9(1):18–29
  • Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol 74(2):245–266
  • Truong C, Oudre L, Vayatis N (2018) A review of change point detection methods. arXiv:1801.00718
  • Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 109(3):475–494
  • Vũ BC (2013) A variable metric extension of the forward–backward–forward algorithm for monotone operators. Numer Funct Anal Optim 34(9):1050–1065
  • Wang T, Zhu L (2011) Consistent tuning parameter selection in high dimensional sparse linear regression. J Multivar Anal 102(7):1141–1151
  • Wang J, Fan W, Ye J (2015a) Fused lasso screening rules via the monotonicity of subdifferentials. IEEE Trans Pattern Anal Mach Intell 37(9):1806–1820
  • Wang J, Wonka P, Ye J (2015b) Lasso screening rules via dual polytope projection. J Mach Learn Res 16:1063–1101
  • Wang B, Zhang Y, Sun WW, Fang Y (2018) Sparse convex clustering. J Comput Graph Stat 27(2):393–403
  • Weiszfeld E, Plastria F (2009) On the point for which the sum of the distances to n given points is minimum. Ann Oper Res 167(1):7–41
  • Wytock M, Sra S, Kolter JZ (2014) Fast Newton methods for the group fused lasso. Uncertain Artif Intell 2014:888–897
  • Xu Y, Lindquist M (2015) Dynamic connectivity detection: an algorithm for determining functional connectivity change points in fMRI data. Front Neurosci 9:285
  • Yan M (2018) A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. J Sci Comput 76(3):1698–1717
  • Yao YC (1988) Estimating the number of change-points via Schwarz' criterion. Stat Probab Lett 6(3):181–189
  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68(1):49–67
  • Zhou J, Liu J, Narayan VA, Ye J (2013) Modeling disease progression via multi-task learning. NeuroImage 78:233–248
  • Zhu C, Xu H, Leng C, Yan S (2014) Convex optimization procedure for clustering: theoretical revisit. In: NIPS
  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320


Acknowledgements

The author thanks the reviewers and the associate editor for their suggestions which led to substantial improvements of the paper.

Author information

Corresponding author

Correspondence to David Degras.

Ethics declarations

Conflict of interest

The author declares that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 122 KB)

Appendices

Proof of Theorem 1

Recall the notations of Sect. 3.2:

$$\begin{aligned} \begin{aligned} g(\beta _t)&= \lambda _1 \Vert \beta _t \Vert _1 + \lambda _2 w_{t-1}\Vert \beta _t - \hat{\beta }_{t-1} \Vert _2 + \lambda _2 w_t \Vert \beta _t - \hat{\beta }_{t+1} \Vert _2, \\ g_1(\beta _t)&= \lambda _1 \Vert \beta _t \Vert _1 , \\ g_2(\beta _t)&= \lambda _2 w_{t-1} \Vert \beta _t - \hat{\beta }_{t-1} \Vert _2 +\lambda _2 w_t \Vert \beta _t- \hat{\beta }_{t+1} \Vert _2 + (L_t/2) \Vert \beta _t -z_t \Vert _2^2 ,\\ \bar{g} (\beta _t)&= g(\beta _t) + (L_t/2) \left\| \beta _t - z_t \right\| _2^2 = g_1 (\beta _t)+ g_2 (\beta _t) , \\ \gamma _n&= \big ( L_t + (\lambda _2 w_{t-1} / \Vert \beta _t^{n} - \hat{\beta }_{t-1} \Vert _2 ) + (\lambda _2 w_t / \Vert \beta _{t}^{n} - \hat{\beta }_{t+1} \Vert _2 ) \big )^{-1}, \\ \beta _t^*&= \mathrm {argmin}\, \bar{g} = \mathrm {prox}_{g/L_t}(z_t) , \quad r_n = \bar{g} (\beta _t^n) - \bar{g} (\beta _t^*). \end{aligned} \end{aligned}$$

In view of Remark 2, which frames the iterative soft-thresholding scheme (23)–(24) as a proximal gradient method, we can establish the linear convergence of this scheme to \(\mathrm {prox}_{g/L_t}(z_t)\) by adapting the results of Bredies and Lorenz (2008) to a nonsmooth setting. Essentially, the proof of linear convergence in Bredies and Lorenz (2008) works by first establishing a lower bound on \(\bar{g}(\beta _t^{n})-\bar{g}(\beta _t^{n+1})\), the decrease in the objective function between successive iterations of the proximal gradient method (Lemma 1). This general result shows in particular that, with sufficiently small step sizes, the proximal gradient method is a descent method. Second, under the additional assumptions that \(g_2\) is convex and that \(\Vert \beta _t^n - \beta _t^*\Vert _2^2 \le c r_n\) for some \(c>0\), the lower bound of Lemma 1 is exploited to show the exponential decay of \((r_n)\) and the linear convergence of \((\beta _t^n)\) (Proposition 2). Third, the lower bound of Lemma 1 is decomposed as a Bregman-like distance term involving \(g_1\) plus a Taylor remainder term involving \(g_2\). The specific nature of \(g_1\) (the \(\ell _1\) norm) and possible additional regularity conditions on \(g_2\) (typically, strong convexity) are then used to establish the linear convergence result (Theorem 2). For brevity, we refer the reader to Bredies and Lorenz (2008) for the exact statement of these results.
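To make the scheme concrete, here is a minimal numerical sketch in Python (illustrative names only; it is not a transcription of (23)–(24) or of the sparseGFL code) of the standard proximal gradient form \(\beta _t^{n+1} = \mathrm {prox}_{\gamma _n g_1}\big (\beta _t^n - \gamma _n \nabla g_2(\beta _t^n)\big )\) under which Remark 2 places the scheme, with the step size \(\gamma _n\) defined above and soft-thresholding as the proximal operator of \(g_1\).

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_block(z_t, beta_prev, beta_next, lam1, lam2, w_prev, w_next, L_t,
                    beta0, n_iter=500, tol=1e-10):
    """Proximal gradient (iterative soft-thresholding) sketch for minimizing
    g_bar = g1 + g2 over a single block beta_t, where
      g1(b) = lam1 * ||b||_1
      g2(b) = lam2*w_prev*||b - beta_prev||_2 + lam2*w_next*||b - beta_next||_2
              + (L_t/2) * ||b - z_t||_2^2.
    Assumes the iterates stay away from beta_prev and beta_next, as required
    by the initialization condition of Theorem 1."""
    b = beta0.copy()
    for _ in range(n_iter):
        d_prev = np.linalg.norm(b - beta_prev)
        d_next = np.linalg.norm(b - beta_next)
        # gradient of g2 (well defined while b avoids beta_prev and beta_next)
        grad = (lam2 * w_prev * (b - beta_prev) / d_prev
                + lam2 * w_next * (b - beta_next) / d_next
                + L_t * (b - z_t))
        # step size gamma_n as defined above
        gamma = 1.0 / (L_t + lam2 * w_prev / d_prev + lam2 * w_next / d_next)
        b_new = soft_threshold(b - gamma * grad, gamma * lam1)
        if np.linalg.norm(b_new - b) <= tol * max(1.0, np.linalg.norm(b)):
            return b_new
        b = b_new
    return b
```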

Lemma 1 and Proposition 2 of Bredies and Lorenz (2008) posit, among other things, that the “smooth” part of the objective, \(g_2\) in our notation, is differentiable everywhere and has a Lipschitz-continuous gradient. In the present case, \(g_2\) is not differentiable at \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\); however, it is differentiable everywhere else and its gradient is Lipschitz-continuous in a local sense. The main effort required here is to show that Lemma 1 still holds provided the points of nondifferentiability of \(g_2\) do not lie on the segments joining successive iterates \(\beta _t^n , n\ge 0\). Put differently, the iterative soft-thresholding scheme should not cross \(\hat{\beta }_{t-1}\) or \(\hat{\beta }_{t+1}\) on its path. This is where the requirement in Theorem 1 that \(\bar{g}(\beta _t^0) < \min ( \bar{g}(\hat{\beta }_{t-1}), \bar{g}(\hat{\beta }_{t+1}))\) plays a crucial part. We now proceed to adapt Lemma 1, after which we will establish the premises of Theorem 2 of Bredies and Lorenz (2008).
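For later reference, away from the two points of nondifferentiability the gradient of \(g_2\) follows by a routine computation from the definitions above:

$$\begin{aligned} \nabla g_2(\beta _t) = \lambda _2 w_{t-1}\, \frac{\beta _t - \hat{\beta }_{t-1}}{\Vert \beta _t - \hat{\beta }_{t-1}\Vert _2} + \lambda _2 w_{t}\, \frac{\beta _t - \hat{\beta }_{t+1}}{\Vert \beta _t - \hat{\beta }_{t+1}\Vert _2} + L_t (\beta _t - z_t) , \qquad \beta _t \notin \{ \hat{\beta }_{t-1}, \hat{\beta }_{t+1} \} . \end{aligned}$$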

Adaptation of Lemma 1 of Bredies and Lorenz (2008). The main result we need to prove is that

$$\begin{aligned} \forall n\in \mathbb {N}, \quad \big \{ \hat{\beta }_{t-1} , \hat{\beta }_{t+1} \big \} \cap \left\{ \alpha \beta _{t}^{n} + (1-\alpha ) \beta _{t}^{n+1} : 0 \le \alpha \le 1 \right\} = \emptyset \,. \end{aligned}$$
(40)

Once this is established, we may follow the proof of Proposition 2 without modification. In particular, we will be in a position to state that

$$\begin{aligned} \left\| \nabla g_2(\beta _t^n + \alpha (\beta _t^{n+1} - \beta _t^n ) ) - \nabla g_2(\beta _t^{n})\right\| _2 \le \alpha \tilde{L}_n \left\| \beta _t^{n+1} - \beta _t^{n} \right\| _2 \end{aligned}$$
(41)

for all \(n\in \mathbb {N}\) and \( \alpha \in [0,1]\), where

$$\begin{aligned} \tilde{L}_n = L_t + \frac{2\lambda _2 w_{t-1} }{ \Vert \beta _t^n - \hat{\beta }_{t-1} \Vert _2 } +\frac{2\lambda _2 w_t }{ \Vert \beta _t^n - \hat{\beta }_{t+1} \Vert _2 } . \end{aligned}$$

Note that the left-hand side in (41) is not well defined if (40) does not hold. Combining the local Lipschitz property (41) with the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), we may go on to establish the descent property (3.5) of Bredies and Lorenz (2008):

$$\begin{aligned} \bar{g}(\beta _t^{n+1}) \le \bar{g}(\beta _t^{n}) - \delta D_{\gamma _n}(\beta _t^{n}) \end{aligned}$$
(42)

where

$$\begin{aligned} D_{\gamma _n}(\beta _t^{n}) = g_1(\beta _t^{n}) - g_1(\beta _t^{n+1}) + \nabla g_2(\beta _t^{n})'(\beta _t^{n}-\beta _t^{n+1}) \quad \text { and } \quad \delta = 1 - \frac{\max _{n} \gamma _n \tilde{L}_n }{ 2} . \end{aligned}$$

Lemma 1 shows that \(D_{\gamma _n}(\beta _t^{n}) \ge \Vert \beta _t^{n}-\beta _t^{n+1}\Vert _2^2 / \gamma _n \ge 0\). To show the positivity of \(\delta \), note that

$$\begin{aligned} \gamma _n \tilde{L}_n = 2 - L_t \left( \displaystyle L_t + \frac{\lambda _2 w_{t-1} }{ \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2} + \frac{\lambda _2 w_{t}}{ \Vert \beta _t^{n} - \hat{\beta }_{t+1}\Vert _2 } \right) ^{-1} . \end{aligned}$$
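Indeed, writing \(a_n = \lambda _2 w_{t-1} / \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2\) and \(b_n = \lambda _2 w_{t} / \Vert \beta _t^{n} - \hat{\beta }_{t+1}\Vert _2\) (a shorthand introduced here only for this computation), the definitions of \(\gamma _n\) and \(\tilde{L}_n\) give

$$\begin{aligned} \gamma _n \tilde{L}_n = \frac{L_t + 2a_n + 2b_n}{L_t + a_n + b_n} = 2 - \frac{L_t}{L_t + a_n + b_n} , \qquad \text {so that} \qquad \delta = 1 - \frac{\max _n \gamma _n \tilde{L}_n}{2} = \min _n \frac{L_t}{2\,(L_t + a_n + b_n)} . \end{aligned}$$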

Given the descent property of \((\beta _t^n)\) for \(\bar{g}\), the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), and the convexity of the sublevel sets of \(\bar{g}\), it holds that \( \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2 \ge d(\hat{\beta }_{t-1} , \{ \beta _t : \bar{g}(\beta _t)\le \bar{g}(\beta _t^0) \}) \) for all \(n\in \mathbb {N}\); an analogous inequality holds for \(\hat{\beta }_{t+1}\). Denoting these positive lower bounds by \(m_{t-1} \) and \(m_{t+1}\), we have

$$\begin{aligned} 0 < \frac{1}{2} \left( \displaystyle 1 + \frac{\lambda _2 w_{t-1} }{L_t m_{t-1}} + \frac{\lambda _2 w_{t}}{ L_t m_{t+1} } \right) ^{-1} \le \delta \le \frac{1}{2} . \end{aligned}$$
(43)

Together, the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), the descent property (42), and the lower bound (43) complete the adaptation of Lemma 1 and establish the preconditions of Proposition 2 of Bredies and Lorenz (2008).

It remains to prove (40). We will show a weaker form of (42), namely that \(\bar{g}(\beta _t^{n+1}) \le \bar{g}(\beta _t^{n}) \) for all n. This inequality, combined with the convexity of \(\bar{g}\) and the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), implies that \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\) cannot be on a segment joining \(\beta _t^{n}\) and \(\beta _t^{n+1}\). Otherwise, the convexity of \(\bar{g}\) would imply that, say, \(\bar{g}(\hat{\beta }_{t-1}) \le \max ( \bar{g}(\beta _t^{n}), \bar{g}(\beta _t^{n+1})) \le \bar{g}(\beta _t^{n})\le \cdots \le \bar{g}(\beta _t^{0}) < \bar{g}(\hat{\beta }_{t-1})\), a contradiction.

To prove this descent property, we start with an elementary lemma; a short verification is given after its statement.

Lemma 1

For all \(x,y \in \mathbb {R}^p\) such that \(y \ne 0_p\),

$$\begin{aligned} \Vert x \Vert _2 \le \Vert y \Vert _2 + \frac{ y'(x-y) }{\Vert y \Vert _2 } + \frac{\Vert x - y\Vert _2^2 }{2\Vert y\Vert _2} \,. \end{aligned}$$
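Indeed, multiplying both sides by \(\Vert y \Vert _2 > 0\) and expanding \(\Vert x - y \Vert _2^2\), the claimed inequality is equivalent to

$$\begin{aligned} \Vert x \Vert _2 \Vert y \Vert _2 \le \tfrac{1}{2} \Vert x \Vert _2^2 + \tfrac{1}{2} \Vert y \Vert _2^2 , \end{aligned}$$

which holds because \((\Vert x \Vert _2 - \Vert y \Vert _2)^2 \ge 0\).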

Applying this lemma to \(x= \beta _t - \hat{\beta }_{t \pm 1}\) and \(y = \beta _t^{n} - \hat{\beta }_{t \pm 1}\), we deduce that for all \(\beta _t \in \mathbb {R}^p \),

$$\begin{aligned} \Vert \beta _t - \hat{\beta }_{t - 1}\Vert _2&\le \Vert \beta _t^{n} - \hat{\beta }_{t - 1}\Vert _2 + \frac{ ( \beta _t^n - \hat{\beta }_{t - 1})'(\beta _t - \beta _t^n) }{\Vert \beta _t^n - \hat{\beta }_{t - 1} \Vert _2 } + \frac{\Vert \beta _t - \beta _t^n \Vert _2^2 }{2\Vert \beta _t^n - \hat{\beta }_{t - 1} \Vert _2} , \end{aligned}$$
(44)
$$\begin{aligned} \Vert \beta _t - \hat{\beta }_{t + 1}\Vert _2&\le \Vert \beta _t^{n} - \hat{\beta }_{t + 1}\Vert _2 + \frac{ ( \beta _t^n - \hat{\beta }_{t + 1})'(\beta _t - \beta _t^n) }{\Vert \beta _t^n - \hat{\beta }_{t +1} \Vert _2 } + \frac{\Vert \beta _t - \beta _t^n \Vert _2^2 }{ 2 \Vert \beta _t^n - \hat{\beta }_{t + 1} \Vert _2} . \end{aligned}$$
(45)

In addition, it is immediate that

$$\begin{aligned} \Vert \beta _t - z_t \Vert _2^2&= \Vert \beta _t^n - z_t \Vert _2^2 + 2 (\beta _t^n - z_t)'(\beta _t - \beta _t^n ) + \Vert \beta _t - \beta _t^n \Vert _2^2 . \end{aligned}$$
(46)

Multiplying (44) by \(\lambda _2 w_{t-1}\), (45) by \(\lambda _2 w_{t} \), (46) by \(L_t/2\), summing these relations, and adding \(g_1(\beta _t)\) on each side, we obtain

$$\begin{aligned} (g_1 + g_2)(\beta _t) \le g_1(\beta _t) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t - \beta _t^n \Vert _2^2 . \end{aligned}$$
(47)
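For clarity, the weighted constant terms on the right-hand sides of (44)–(46) sum to \(g_2(\beta _t^n)\), the weighted quadratic terms assemble exactly into the factor \(1/(2\gamma _n)\),

$$\begin{aligned} \frac{\lambda _2 w_{t-1}}{2 \Vert \beta _t^n - \hat{\beta }_{t-1} \Vert _2} + \frac{\lambda _2 w_{t}}{2 \Vert \beta _t^n - \hat{\beta }_{t+1} \Vert _2} + \frac{L_t}{2} = \frac{1}{2\gamma _n} , \end{aligned}$$

and the weighted linear terms sum to \(\nabla g_2(\beta _t^n)'(\beta _t - \beta _t^n)\) in view of the expression for \(\nabla g_2\) recalled earlier.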

The left-hand side of (47) is simply \(\bar{g}(\beta _t)\). Also, in view of Remark 2, the minimizer of the right-hand side of (47) is \(\mathcal {T}(\beta _t^n) = \beta _t^{n+1}\). Evaluating (47) at \( \beta _t^{n+1}\) and exploiting this minimizing property, it follows that

$$\begin{aligned} \begin{aligned} \bar{g}(\beta _t^{n+1} )&\le g_1(\beta _t^{n+1} ) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t^{n+1} - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t^{n+1} - \beta _t^n \Vert _2^2 \\&\le g_1(\beta _t^{n} ) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t^n - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t^{n} - \beta _t^n \Vert _2^2 \\&= \bar{g}(\beta _t^n). \end{aligned} \end{aligned}$$
(48)

This establishes the desired descent property.

Prerequisites of Theorem 2 of Bredies and Lorenz (2008). The distance \(r_n = \bar{g}(\beta _t^n) - \bar{g}(\beta _t^*)\) to the minimum of the objective can be usefully decomposed as

$$\begin{aligned} \begin{aligned} r_n&= R(\beta _t^n) + T(\beta _t^n) \\ R(\beta _t)&= \nabla g_2(\beta _t^*) '( \beta _t - \beta _t^*) + g_1(\beta _t) - g_1(\beta _t^*) \\ T(\beta _t)&= g_2(\beta _t) - g_2(\beta _t^*) - \nabla g_2(\beta _t^*)'( \beta _t - \beta _t^*) \end{aligned} \end{aligned}$$
(49)

where \(R(\beta _t)\) is a Bregman-like distance and \(T(\beta _t)\) is the remainder of the Taylor expansion of \(g_2\) at \(\beta _t^*\).

To obtain the linear convergence of \((\beta _t^n)\) to \(\beta _t^*\) and the exponential decay of \((r_n)\) to 0 with Theorem 2 of Bredies and Lorenz (2008), it suffices to show that

$$\begin{aligned} \Vert \beta _t - \beta _t^*\Vert _2^2 \le c \left( R(\beta _t) + T(\beta _t)\right) \end{aligned}$$
(50)

for some constant \(c>0\) and for all \(\beta _t\in \mathbb {R}^p\).

Invoking the convexity of \(\Vert \cdot \Vert _2\) and strong convexity of \(\Vert \cdot \Vert _2^2\), one sees that \( T(\beta _t) \ge (L_t/2) \Vert \beta _t - \beta _t^*\Vert _2^2\) for all \(\beta _t\). Also, \(R(\beta _t) \ge 0\) for all \(\beta _t\) (Lemma 2 of Bredies and Lorenz 2008) so that c can be taken as \(2/L_t\) in (50). \(\square \)

Proof of Theorem 2

We first observe that, by design, each of the four steps or components of Algorithm 5 is nonincreasing in the objective function F. Indeed, the first three steps (optimization with respect to single blocks, single chains, and descent over fixed chains) are all based on FISTA (Algorithms 3 and 4), which is globally convergent (Beck and Teboulle 2009, Theorem 4.4). As each of these components minimizes F under certain constraints (namely, some blocks or fusion chains are held fixed), the objective value of their output, say \(F(\beta ^{n+1})\), cannot exceed that of their input, \(F(\beta ^n)\). The fourth step, subgradient descent, is also nonincreasing because the subgradient of minimum norm—if it is not zero—provides a direction of (steepest) descent. The line search for the step size in the subgradient step then guarantees that the objective does not increase after this step. As a result, Algorithm 5 as a whole is nonincreasing in F.
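As a purely schematic illustration of this control flow (not the sparseGFL implementation; the step_* callables below are hypothetical placeholders for the four components, and the actual algorithm applies them adaptively rather than in a fixed cycle), the argument only uses the fact that every applied component is nonincreasing in F:

```python
def hybrid_sgfl(F, beta0, steps, tol=1e-8, max_iter=1000):
    """Schematic outer loop: cycle through nonincreasing components until
    the objective F stalls.

    F     : callable returning the SGFL objective value at an iterate
    beta0 : initial coefficient estimate
    steps : tuple of callables, e.g. (step_single_blocks, step_single_chains,
            step_fixed_chains, step_subgradient), each returning an iterate
            whose objective value does not exceed that of its input
    """
    beta, f_val = beta0, F(beta0)
    for _ in range(max_iter):
        f_start = f_val
        for step in steps:
            beta_new = step(beta)
            f_new = F(beta_new)
            # monotonicity property used in the proof (allow for rounding)
            assert f_new <= f_val + 1e-12 * max(1.0, abs(f_val))
            beta, f_val = beta_new, f_new
        if f_start - f_val <= tol * max(1.0, abs(f_start)):
            break  # no sufficient progress over a full cycle: stop
    return beta, f_val
```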

Let us denote a generic segmentation of the set \(\{ 1, \ldots , T\}\) by \(C=(C_1,\ldots , C_K)\) where \(C_k =\{ T_k ,\ldots , T_{k+1} - 1\} \) and \(1= T_1< \cdots< T_K < T_{K+1}=T+1\). There are \(2^{T-1}\) such segmentations. Let \(S_C\) be the associated open set for the parameter \(\beta \):

$$\begin{aligned} S_C = \left\{ \beta \in \mathbb {R}^{pT}: \beta _{T_k} = \cdots = \beta _{T_{k+1}-1},\ \beta _{T_{k+1}-1} \ne \beta _{T_{k+1}} , 1\le k \le K \right\} . \end{aligned}$$

To each segmentation C is associated an infimum value of F: \(\inf _{\beta \in S_C}F(\beta )\). Let \((\beta ^n)_{n\ge 0}\) be the sequence of iterates generated by Algorithm 5 and let \(C^{n}\) denote the associated segmentations of \(\{1,\ldots ,T\}\). By setting the tolerance \(\epsilon \) of Algorithm 5 sufficiently small, each time the third optimization component (descent over fixed chains) is applied, say with \(\beta ^n\) as input and \(\beta ^{n+1}\) as output, the objective \(F(\beta ^{n+1})\) can be made arbitrarily close to \(\min _{ \beta \in S_{C^n}}F(\beta )\), or even fall below this value if a fusion of chains occurs during this optimization. If the segmentation \(C^n\) is optimal, i.e. \( \min _{ \beta \in S_{C^n} } F (\beta ) = \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then Algorithm 5 has converged: for all subsequent iterates \(m\ge n\), \(F(\beta ^m)\) and \(\beta ^m\) will stay arbitrarily close to the minimum of F and to the set of minimizers, respectively, because of the nonincreasing property of Algorithm 5. If the segmentation \(C^n\) is suboptimal, i.e. \( \min _{\beta \in S_{C^n} } F(\beta ) > \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then, provided that \(\epsilon \) is sufficiently small, the (fourth) subgradient step of Algorithm 5 will eventually produce an iterate \(\beta ^m\) (\(m \ge n\)) such that \(F(\beta ^m) < \min _{ \beta \in S_{C^n} }F(\beta ) \). This is because each subgradient step brings the iterates closer to the set of global minimizers of F. Once this has happened, the nonincreasing property of the algorithm guarantees that the segmentation \(C^n\) will not be visited again. Because the segmentations of \(\{ 1,\ldots ,T\}\) are finite in number, Algorithm 5 eventually finds an optimal segmentation C such that \(\min _{\beta \in S_C} F(\beta ) = \min _{\beta \in \mathbb {R}^{pT}} F(\beta )\). Then, through its third level of optimization (descent over fixed chains), it reaches the global minimum of F. We note that the first and second components of Algorithm 5 (block coordinate descent over single blocks and single chains) are not necessary to ensure global convergence; they only serve to speed up computation. \(\square \)

About this article


Cite this article

Degras, D. Sparse group fused lasso for model segmentation: a hybrid approach. Adv Data Anal Classif 15, 625–671 (2021). https://doi.org/10.1007/s11634-020-00424-5

