
Sparse group fused lasso for model segmentation: a hybrid approach


Abstract

This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments, the hybrid method compares favorably to state-of-the-art techniques with respect to computation time and numerical accuracy; benefits are particularly substantial in high dimension. The method’s statistical performance is satisfactory in recovering nonzero regression coefficients and excellent in change point detection. An application to air quality data is presented. The hybrid method is implemented in the R package sparseGFL available on the author’s GitHub page.


References

  • Alaíz CM, Jiménez ÁB, Dorronsoro JR (2013) Group fused lasso. Artif Neural Netw Mach Learn 2013:66–73
  • Alewijnse SPA, Buchin K, Buchin M, Sijben S, Westenberg MA (2018) Model-based segmentation and classification of trajectories. Algorithmica 80(8):2422–2452
  • Bai J (1997) Estimating multiple breaks one at a time. Econom Theory 13(3):315–352
  • Bai J, Perron P (2003) Computation and analysis of multiple structural change models. J Appl Econom 18(1):1–22
  • Barbero A, Sra S (2011) Fast Newton-type methods for total variation regularization. In: Proceedings of the 28th international conference on machine learning, ICML 2011, pp 313–320
  • Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application. Prentice Hall information and system sciences series. Prentice Hall Inc, Englewood Cliffs
  • Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
  • Becker S, Bobin J, Candès EJ (2011) NESTA: a fast and accurate first-order method for sparse recovery. SIAM J Imaging Sci 4(1):1–39
  • Beer JC, Aizenstein HJ, Anderson SJ, Krafty RT (2019) Incorporating prior information with fused sparse group lasso: application to prediction of clinical measures from neuroimages. Biometrics 75(4):1299–1309
  • Bertsekas DP (2015) Convex optimization algorithms. Athena Scientific, Belmont
  • Bleakley K, Vert JP (2011) The group fused lasso for multiple change-point detection. Technical Report hal-00602121. https://hal.archives-ouvertes.fr/hal-00602121. Accessed 15 Oct 2020
  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
  • Bredies K, Lorenz DA (2008) Linear convergence of iterative soft-thresholding. J Fourier Anal Appl 14(5–6):813–837
  • Cao P, Liu X, Liu H, Yang J, Zhao D, Huang M, Zaiane O (2018) Generalized fused group lasso regularized multi-task feature learning for predicting cognitive outcomes in Alzheimer's disease. Comput Methods Programs Biomed 162:19–45
  • Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771
  • Chen X, Lin Q, Kim S, Carbonell JG, Xing EP (2012) Smoothing proximal gradient method for general structured sparse regression. Ann Appl Stat 6(2):719–752
  • Chi EC, Lange K (2015) Splitting methods for convex clustering. J Comput Graph Stat 24(4):994–1013
  • Combettes PL, Pesquet JC (2011) Proximal splitting methods in signal processing. In: Fixed-point algorithms for inverse problems in science and engineering. Springer, New York, pp 185–212
  • Condat L (2013) A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J Optim Theory Appl 158(2):460–479
  • De Vito S, Massera E, Piga M, Martinotto L, Di Francia G (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens Actuators B Chem 129(2):750–757
  • De Vito S, Piga M, Martinotto L, Di Francia G (2009) CO, NO\(_{2}\) and NO\(_{x}\) urban pollution monitoring with on-field calibrated electronic nose by automatic Bayesian regularization. Sens Actuators B Chem 143(1):182–191
  • Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
  • Fryzlewicz P (2014) Wild binary segmentation for multiple change-point detection. Ann Stat 42(6):2243
  • Hadj-Selem F, Löfstedt T, Dohmatob E, Frouin V, Dubois M, Guillemot V, Duchesnay E (2018) Continuation of Nesterov's smoothing for regression with structured sparsity in high-dimensional neuroimaging. IEEE Trans Med Imaging 37(11):2403–2413
  • Hallac D, Nystrup P, Boyd S (2019) Greedy Gaussian segmentation of multivariate time series. Adv Data Anal Classif 13(3):727–751
  • Hocking T, Vert JP, Bach FR, Joulin A (2011) Clusterpath: an algorithm for clustering using convex fusion penalties. In: ICML
  • Hoefling H (2010) A path algorithm for the fused lasso signal approximator. J Comput Graph Stat 19(4):984–1006
  • Kim S, Xing EP (2012) Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann Appl Stat 6(3):1095–1117
  • Kuhn HW (1973) A note on Fermat's problem. Math Program 4:98–107
  • Leonardi F, Bühlmann P (2016) Computationally efficient change point detection for high-dimensional regression
  • Li Y, Osher S (2009) Coordinate descent optimization for \(\ell ^1\) minimization with application to compressed sensing; a greedy algorithm. Inverse Probl Imaging 3(3):487–503
  • Li X, Mo L, Yuan X, Zhang J (2014) Linearized alternating direction method of multipliers for sparse group and fused LASSO models. Comput Stat Data Anal 79:203–221
  • Liu J, Yuan L, Ye J (2010) An efficient algorithm for a class of fused lasso problems. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '10. ACM, pp 323–332
  • Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1, Ser. A):127–152
  • Nesterov Y (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim 22(2):341–362
  • Nystrup P, Madsen H, Lindström E (2017) Long memory of financial time series and hidden Markov models with time-varying parameters. J Forecast 36(8):989–1002
  • Ohlsson H, Ljung L, Boyd S (2010) Segmentation of ARX-models using sum-of-norms regularization. Automatica 46(6):1107–1111
  • Ombao H, von Sachs R, Guo W (2005) SLEX analysis of multivariate nonstationary time series. J Am Stat Assoc 100(470):519–531
  • Price BS, Geyer CJ, Rothman AJ (2019) Automatic response category combination in multinomial logistic regression. J Comput Graph Stat 28(3):758–766
  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 15 Oct 2020
  • Ranalli M, Lagona F, Picone M, Zambianchi E (2018) Segmentation of sea current fields by cylindrical hidden Markov models: a composite likelihood approach. J R Stat Soc Ser C (Appl Stat) 67(3):575–598
  • Rockafellar R (2015) Convex analysis. Princeton landmarks in mathematics and physics. Princeton University Press, Princeton
  • Sanderson C, Curtin R (2016) Armadillo: a template-based C++ library for linear algebra. J Open Source Softw 1:26
  • Saxén JE, Saxén H, Toivonen HT (2016) Identification of switching linear systems using self-organizing models with application to silicon prediction in hot metal. Appl Soft Comput 47:271–280
  • Shor NZ (1985) Minimization methods for nondifferentiable functions. Springer series in computational mathematics, vol 3. Springer, Berlin (translated from the Russian by K. C. Kiwiel and A. Ruszczyński)
  • Songsiri J (2015) Learning multiple Granger graphical models via group fused lasso. In: 2015 10th Asian control conference (ASCC), pp 1–6
  • Tibshirani R, Wang P (2007) Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9(1):18–29
  • Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol 74(2):245–266
  • Truong C, Oudre L, Vayatis N (2018) A review of change point detection methods. arXiv:1801.00718
  • Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 109(3):475–494
  • Vũ BC (2013) A variable metric extension of the forward–backward–forward algorithm for monotone operators. Numer Funct Anal Optim 34(9):1050–1065
  • Wang T, Zhu L (2011) Consistent tuning parameter selection in high dimensional sparse linear regression. J Multivar Anal 102(7):1141–1151
  • Wang J, Fan W, Ye J (2015a) Fused lasso screening rules via the monotonicity of subdifferentials. IEEE Trans Pattern Anal Mach Intell 37(9):1806–1820
  • Wang J, Wonka P, Ye J (2015b) Lasso screening rules via dual polytope projection. J Mach Learn Res 16:1063–1101
  • Wang B, Zhang Y, Sun WW, Fang Y (2018) Sparse convex clustering. J Comput Graph Stat 27(2):393–403
  • Weiszfeld E, Plastria F (2009) On the point for which the sum of the distances to n given points is minimum. Ann Oper Res 167(1):7–41
  • Wytock M, Sra S, Kolter JZ (2014) Fast Newton methods for the group fused lasso. Uncertain Artif Intell 2014:888–897
  • Xu Y, Lindquist M (2015) Dynamic connectivity detection: an algorithm for determining functional connectivity change points in fMRI data. Front Neurosci 9:285
  • Yan M (2018) A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. J Sci Comput 76(3):1698–1717
  • Yao YC (1988) Estimating the number of change-points via Schwarz' criterion. Stat Probab Lett 6(3):181–189
  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68(1):49–67
  • Zhou J, Liu J, Narayan VA, Ye J (2013) Modeling disease progression via multi-task learning. NeuroImage 78:233–248
  • Zhu C, Xu H, Leng C, Yan S (2014) Convex optimization procedure for clustering: theoretical revisit. In: NIPS
  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320


Acknowledgements

The author thanks the reviewers and the associate editor for their suggestions which led to substantial improvements of the paper.

Author information

Corresponding author

Correspondence to David Degras.

Ethics declarations

Conflict of interest

The author declares that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 122 KB)

Appendices

Proof of Theorem 1

Recall the notations of Sect. 3.2:

$$\begin{aligned} \begin{aligned} g(\beta _t)&= \lambda _1 \Vert \beta _t \Vert _1 + \lambda _2 w_{t-1}\Vert \beta _t - \hat{\beta }_{t-1} \Vert _2 + \lambda _2 w_t \Vert \beta _t - \hat{\beta }_{t+1} \Vert _2, \\ g_1(\beta _t)&= \lambda _1 \Vert \beta _t \Vert _1 , \\ g_2(\beta _t)&= \lambda _2 w_{t-1} \Vert \beta _t - \hat{\beta }_{t-1} \Vert _2 +\lambda _2 w_t \Vert \beta _t- \hat{\beta }_{t+1} \Vert _2 + (L_t/2) \Vert \beta _t -z_t \Vert _2^2 ,\\ \bar{g} (\beta _t)&= g(\beta _t) + (L_t/2) \left\| \beta _t - z_t \right\| _2^2 = g_1 (\beta _t)+ g_2 (\beta _t) , \\ \gamma _n&= \big ( L_t + (\lambda _2 w_{t-1} / \Vert \beta _t^{n} - \hat{\beta }_{t-1} \Vert _2 ) + (\lambda _2 w_t / \Vert \beta _{t}^{n} - \hat{\beta }_{t+1} \Vert _2 ) \big )^{-1}, \\ \beta _t^*&= \mathrm {argmin}\, \bar{g} = \mathrm {prox}_{g/L_t}(z_t) , \quad r_n = \bar{g} (\beta _t^n) - \bar{g} (\beta _t^*). \end{aligned} \end{aligned}$$

In view of Remark 2, which frames the iterative soft-thresholding scheme (23)–(24) as a proximal gradient method, we can establish the linear convergence of this scheme to \(\mathrm {prox}_{g/L_t}(z_t)\) by adapting the results of Bredies and Lorenz (2008) to a nonsmooth setting. Essentially, the proof of linear convergence in Bredies and Lorenz (2008) works by first establishing a lower bound on \(\bar{g}(\beta _t^{n})-\bar{g}(\beta _t^{n+1})\), the decrease in the objective function between successive iterations of the proximal gradient method (Lemma 1). This general result shows in particular that, with sufficiently small step sizes, the proximal gradient method is a descent method. Second, under the additional assumptions that \(g_2\) is convex and that \(\Vert \beta _t^n - \beta _t^*\Vert _2^2 \le c r_n\) for some \(c>0\), the lower bound of Lemma 1 is exploited to show the exponential decay of \((r_n)\) and the linear convergence of \((\beta _t^n)\) (Proposition 2). Third, the lower bound of Lemma 1 is decomposed as a Bregman-like distance term involving \(g_1\) plus a Taylor remainder term involving \(g_2\). The specific nature of \(g_1\) (the \(\ell _1\) norm) and possible additional regularity conditions on \(g_2\) (typically, strong convexity) are then used to establish the linear convergence result (Theorem 2). For brevity, we refer the reader to Bredies and Lorenz (2008) for the exact statement of these results.
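To make the scheme concrete, here is a minimal numerical sketch in Python (illustrative names only; it is not a transcription of (23)–(24) or of the sparseGFL code) of the standard proximal gradient form \(\beta _t^{n+1} = \mathrm {prox}_{\gamma _n g_1}\big (\beta _t^n - \gamma _n \nabla g_2(\beta _t^n)\big )\) under which Remark 2 places the scheme, with the step size \(\gamma _n\) defined above and soft-thresholding as the proximal operator of \(g_1\).

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_block(z_t, beta_prev, beta_next, lam1, lam2, w_prev, w_next, L_t,
                    beta0, n_iter=500, tol=1e-10):
    """Proximal gradient (iterative soft-thresholding) sketch for minimizing
    g_bar = g1 + g2 over a single block beta_t, where
      g1(b) = lam1 * ||b||_1
      g2(b) = lam2*w_prev*||b - beta_prev||_2 + lam2*w_next*||b - beta_next||_2
              + (L_t/2) * ||b - z_t||_2^2.
    Assumes the iterates stay away from beta_prev and beta_next, as required
    by the initialization condition of Theorem 1."""
    b = beta0.copy()
    for _ in range(n_iter):
        d_prev = np.linalg.norm(b - beta_prev)
        d_next = np.linalg.norm(b - beta_next)
        # gradient of g2 (well defined while b avoids beta_prev and beta_next)
        grad = (lam2 * w_prev * (b - beta_prev) / d_prev
                + lam2 * w_next * (b - beta_next) / d_next
                + L_t * (b - z_t))
        # step size gamma_n as defined above
        gamma = 1.0 / (L_t + lam2 * w_prev / d_prev + lam2 * w_next / d_next)
        b_new = soft_threshold(b - gamma * grad, gamma * lam1)
        if np.linalg.norm(b_new - b) <= tol * max(1.0, np.linalg.norm(b)):
            return b_new
        b = b_new
    return b
```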

Lemma 1 and Proposition 2 of Bredies and Lorenz (2008) posit, among other things, that the “smooth” part of the objective, \(g_2\) in our notation, is differentiable everywhere and has a Lipschitz-continuous gradient. In the present case, \(g_2\) is not differentiable at \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\); however, it is differentiable everywhere else and its gradient is Lipschitz-continuous in a local sense. The main effort required here is to show that Lemma 1 still holds provided the points of nondifferentiability of \(g_2\) do not lie on the segments joining successive iterates \(\beta _t^n , n\ge 0\). Put differently, the iterative soft-thresholding scheme should not cross \(\hat{\beta }_{t-1}\) or \(\hat{\beta }_{t+1}\) on its path. This is where the requirement in Theorem 1 that \(\bar{g}(\beta _t^0) < \min ( \bar{g}(\hat{\beta }_{t-1}), \bar{g}(\hat{\beta }_{t+1}))\) plays a crucial part. We now proceed to adapt Lemma 1, after which we will establish the premises of Theorem 2 of Bredies and Lorenz (2008).
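For later reference, away from the two points of nondifferentiability the gradient of \(g_2\) follows by a routine computation from the definitions above:

$$\begin{aligned} \nabla g_2(\beta _t) = \lambda _2 w_{t-1}\, \frac{\beta _t - \hat{\beta }_{t-1}}{\Vert \beta _t - \hat{\beta }_{t-1}\Vert _2} + \lambda _2 w_{t}\, \frac{\beta _t - \hat{\beta }_{t+1}}{\Vert \beta _t - \hat{\beta }_{t+1}\Vert _2} + L_t (\beta _t - z_t) , \qquad \beta _t \notin \{ \hat{\beta }_{t-1}, \hat{\beta }_{t+1} \} . \end{aligned}$$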

Adaptation of Lemma 1 of Bredies and Lorenz (2008). The main result we need to prove is that

$$\begin{aligned} \forall n\in \mathbb {N}, \quad \big \{ \hat{\beta }_{t-1} , \hat{\beta }_{t+1} \big \} \cap \left\{ \alpha \beta _{t}^{n} + (1-\alpha ) \beta _{t}^{n+1} : 0 \le \alpha \le 1 \right\} = \emptyset \,. \end{aligned}$$
(40)

Once this is established, we may follow the proof of Proposition 2 without modification. In particular, we will be in a position to state that

$$\begin{aligned} \left\| \nabla g_2(\beta _t^n + \alpha (\beta _t^{n+1} - \beta _t^n ) ) - \nabla g_2(\beta _t^{n})\right\| _2 \le \alpha \tilde{L}_n \left\| \beta _t^{n+1} - \beta _t^{n} \right\| _2 \end{aligned}$$
(41)

for all \(n\in \mathbb {N}\) and \( \alpha \in [0,1]\), where

$$\begin{aligned} \tilde{L}_n = L_t + \frac{2\lambda _2 w_{t-1} }{ \Vert \beta _t^n - \hat{\beta }_{t-1} \Vert _2 } +\frac{2\lambda _2 w_t }{ \Vert \beta _t^n - \hat{\beta }_{t+1} \Vert _2 } . \end{aligned}$$

Note that the left-hand side in (41) is not well defined if (40) does not hold. Combining the local Lipschitz property (41) with the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), we may go on to establish the descent property (3.5) of Bredies and Lorenz (2008):

$$\begin{aligned} \bar{g}(\beta _t^{n+1}) \le \bar{g}(\beta _t^{n}) - \delta D_{\gamma _n}(\beta _t^{n}) \end{aligned}$$
(42)

where

$$\begin{aligned} D_{\gamma _n}(\beta _t^{n}) = g_1(\beta _t^{n}) - g_1(\beta _t^{n+1}) + \nabla g_2(\beta _t^{n})'(\beta _t^{n}-\beta _t^{n+1}) \quad \text { and } \quad \delta = 1 - \frac{\max _{n} \gamma _n \tilde{L}_n }{ 2} . \end{aligned}$$

Lemma 1 shows that \(D_{\gamma _n}(\beta _t^{n}) \ge \Vert \beta _t^{n}-\beta _t^{n+1}\Vert _2^2 / \gamma _n \ge 0\). To show the positivity of \(\delta \), note that

$$\begin{aligned} \gamma _n \tilde{L}_n = 2 - L_t \left( \displaystyle L_t + \frac{\lambda _2 w_{t-1} }{ \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2} + \frac{\lambda _2 w_{t}}{ \Vert \beta _t^{n} - \hat{\beta }_{t+1}\Vert _2 } \right) ^{-1} . \end{aligned}$$
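Indeed, writing \(a_n = \lambda _2 w_{t-1} / \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2\) and \(b_n = \lambda _2 w_{t} / \Vert \beta _t^{n} - \hat{\beta }_{t+1}\Vert _2\) (a shorthand introduced here only for this computation), the definitions of \(\gamma _n\) and \(\tilde{L}_n\) give

$$\begin{aligned} \gamma _n \tilde{L}_n = \frac{L_t + 2a_n + 2b_n}{L_t + a_n + b_n} = 2 - \frac{L_t}{L_t + a_n + b_n} , \qquad \text {so that} \qquad \delta = 1 - \frac{\max _n \gamma _n \tilde{L}_n}{2} = \min _n \frac{L_t}{2\,(L_t + a_n + b_n)} . \end{aligned}$$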

Given the descent property of \((\beta _t^n)\) for \(\bar{g}\), the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), and the convexity of the sublevel sets of \(\bar{g}\), it holds that \( \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2 \ge d(\hat{\beta }_{t-1} , \{ \beta _t : \bar{g}(\beta _t)\le \bar{g}(\beta _t^0) \}) \) for all \(n\in \mathbb {N}\); an analogous inequality holds for \(\hat{\beta }_{t+1}\). Denoting these positive lower bounds by \(m_{t-1} \) and \(m_{t+1}\), we have

$$\begin{aligned} 0 < \frac{1}{2} \left( \displaystyle 1 + \frac{\lambda _2 w_{t-1} }{L_t m_{t-1}} + \frac{\lambda _2 w_{t}}{ L_t m_{t+1} } \right) ^{-1} \le \delta \le \frac{1}{2} . \end{aligned}$$
(43)

Together, the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), the descent property (42), and the lower bound (43) complete the adaptation of Lemma 1 and establish the preconditions of Proposition 2 of Bredies and Lorenz (2008).

It remains to prove (40). We will show a weaker form of (42), namely that \(\bar{g}(\beta _t^{n+1}) \le \bar{g}(\beta _t^{n}) \) for all n. This inequality, combined with the convexity of \(\bar{g}\) and the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), implies that \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\) cannot be on a segment joining \(\beta _t^{n}\) and \(\beta _t^{n+1}\). Otherwise, the convexity of \(\bar{g}\) would imply that, say, \(\bar{g}(\hat{\beta }_{t-1}) \le \max ( \bar{g}(\beta _t^{n}), \bar{g}(\beta _t^{n+1})) \le \bar{g}(\beta _t^{n})\le \cdots \le \bar{g}(\beta _t^{0}) < \bar{g}(\hat{\beta }_{t-1})\), a contradiction.

To prove this descent property, we start with an elementary lemma; a short verification is given after its statement.

Lemma 1

For all \(x,y \in \mathbb {R}^p\) such that \(y \ne 0_p\),

$$\begin{aligned} \Vert x \Vert _2 \le \Vert y \Vert _2 + \frac{ y'(x-y) }{\Vert y \Vert _2 } + \frac{\Vert x - y\Vert _2^2 }{2\Vert y\Vert _2} \,. \end{aligned}$$
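Indeed, multiplying both sides by \(\Vert y \Vert _2 > 0\) and expanding \(\Vert x - y \Vert _2^2\), the claimed inequality is equivalent to

$$\begin{aligned} \Vert x \Vert _2 \Vert y \Vert _2 \le \tfrac{1}{2} \Vert x \Vert _2^2 + \tfrac{1}{2} \Vert y \Vert _2^2 , \end{aligned}$$

which holds because \((\Vert x \Vert _2 - \Vert y \Vert _2)^2 \ge 0\).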

Applying this lemma to \(x= \beta _t - \hat{\beta }_{t \pm 1}\) and \(y = \beta _t^{n} - \hat{\beta }_{t \pm 1}\), we deduce that for all \(\beta _t \in \mathbb {R}^p \),

$$\begin{aligned} \Vert \beta _t - \hat{\beta }_{t - 1}\Vert _2&\le \Vert \beta _t^{n} - \hat{\beta }_{t - 1}\Vert _2 + \frac{ ( \beta _t^n - \hat{\beta }_{t - 1})'(\beta _t - \beta _t^n) }{\Vert \beta _t^n - \hat{\beta }_{t - 1} \Vert _2 } + \frac{\Vert \beta _t - \beta _t^n \Vert _2^2 }{2\Vert \beta _t^n - \hat{\beta }_{t - 1} \Vert _2} , \end{aligned}$$
(44)
$$\begin{aligned} \Vert \beta _t - \hat{\beta }_{t + 1}\Vert _2&\le \Vert \beta _t^{n} - \hat{\beta }_{t + 1}\Vert _2 + \frac{ ( \beta _t^n - \hat{\beta }_{t + 1})'(\beta _t - \beta _t^n) }{\Vert \beta _t^n - \hat{\beta }_{t +1} \Vert _2 } + \frac{\Vert \beta _t - \beta _t^n \Vert _2^2 }{ 2 \Vert \beta _t^n - \hat{\beta }_{t + 1} \Vert _2} . \end{aligned}$$
(45)

In addition, it is immediate that

$$\begin{aligned} \Vert \beta _t - z_t \Vert _2^2&= \Vert \beta _t^n - z_t \Vert _2^2 + 2 (\beta _t^n - z_t)'(\beta _t - \beta _t^n ) + \Vert \beta _t - \beta _t^n \Vert _2^2 . \end{aligned}$$
(46)

Multiplying (44) by \(\lambda _2 w_{t-1}\), (45) by \(\lambda _2 w_{t} \), (46) by \(L_t/2\), summing these relations, and adding \(g_1(\beta _t)\) on each side, we obtain

$$\begin{aligned} (g_1 + g_2)(\beta _t) \le g_1(\beta _t) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t - \beta _t^n \Vert _2^2 . \end{aligned}$$
(47)
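For clarity, the weighted constant terms on the right-hand sides of (44)–(46) sum to \(g_2(\beta _t^n)\), the weighted quadratic terms assemble exactly into the factor \(1/(2\gamma _n)\),

$$\begin{aligned} \frac{\lambda _2 w_{t-1}}{2 \Vert \beta _t^n - \hat{\beta }_{t-1} \Vert _2} + \frac{\lambda _2 w_{t}}{2 \Vert \beta _t^n - \hat{\beta }_{t+1} \Vert _2} + \frac{L_t}{2} = \frac{1}{2\gamma _n} , \end{aligned}$$

and the weighted linear terms sum to \(\nabla g_2(\beta _t^n)'(\beta _t - \beta _t^n)\) in view of the expression for \(\nabla g_2\) recalled earlier.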

The left-hand side of (47) is simply \(\bar{g}(\beta _t)\). Also, in view of Remark 2, the minimizer of the right-hand side of (47) is \(\mathcal {T}(\beta _t^n) = \beta _t^{n+1}\). Evaluating (47) at \( \beta _t^{n+1}\) and exploiting this minimizing property, it follows that

$$\begin{aligned} \begin{aligned} \bar{g}(\beta _t^{n+1} )&\le g_1(\beta _t^{n+1} ) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t^{n+1} - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t^{n+1} - \beta _t^n \Vert _2^2 \\&\le g_1(\beta _t^{n} ) + g_2(\beta _t^n) + \nabla g_2(\beta _t^n) ' (\beta _t^n - \beta _t^n) + \frac{1}{2\gamma _n} \Vert \beta _t^{n} - \beta _t^n \Vert _2^2 \\&= \bar{g}(\beta _t^n). \end{aligned} \end{aligned}$$
(48)

This establishes the desired descent property.

Prerequisites of Theorem 2 of Bredies and Lorenz (2008). The distance \(r_n = \bar{g}(\beta _t^n) - \bar{g}(\beta _t^*)\) to the minimum of the objective can be usefully decomposed as

$$\begin{aligned} \begin{aligned} r_n&= R(\beta _t^n) + T(\beta _t^n) \\ R(\beta _t)&= \nabla g_2(\beta _t^*) '( \beta _t - \beta _t^*) + g_1(\beta _t) - g_1(\beta _t^*) \\ T(\beta _t)&= g_2(\beta _t) - g_2(\beta _t^*) - \nabla g_2(\beta _t^*)'( \beta _t - \beta _t^*) \end{aligned} \end{aligned}$$
(49)

where \(R(\beta _t)\) is a Bregman-like distance and \(T(\beta _t)\) is the remainder of the Taylor expansion of \(g_2\) at \(\beta _t^*\).

To obtain the linear convergence of \((\beta _t^n)\) to \(\beta _t^*\) and the exponential decay of \((r_n)\) to 0 with Theorem 2 of Bredies and Lorenz (2008), it suffices to show that

$$\begin{aligned} \Vert \beta _t - \beta _t^*\Vert _2^2 \le c \left( R(\beta _t) + T(\beta _t)\right) \end{aligned}$$
(50)

for some constant \(c>0\) and for all \(\beta _t\in \mathbb {R}^p\).

Invoking the convexity of \(\Vert \cdot \Vert _2\) and strong convexity of \(\Vert \cdot \Vert _2^2\), one sees that \( T(\beta _t) \ge (L_t/2) \Vert \beta _t - \beta _t^*\Vert _2^2\) for all \(\beta _t\). Also, \(R(\beta _t) \ge 0\) for all \(\beta _t\) (Lemma 2 of Bredies and Lorenz 2008) so that c can be taken as \(2/L_t\) in (50). \(\square \)

Proof of Theorem 2

We first observe that, by design, each of the four steps or components of Algorithm 5 is nonincreasing in the objective function F. Indeed, the first three steps (optimization with respect to single blocks, single chains, and descent over fixed chains) are all based on FISTA (Algorithms 3 and 4), which is globally convergent (Beck and Teboulle 2009, Theorem 4.4). As each of these components minimizes F under certain constraints (namely, some blocks or fusion chains are held fixed), the objective value of their output, say \(F(\beta ^{n+1})\), cannot exceed that of their input, \(F(\beta ^n)\). The fourth step, subgradient descent, is also nonincreasing because the subgradient of minimum norm—if it is not zero—provides a direction of (steepest) descent. The line search for the step size in the subgradient step then guarantees that the objective does not increase after this step. As a result, Algorithm 5 as a whole is nonincreasing in F.
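As a purely schematic illustration of this control flow (not the sparseGFL implementation; the step_* callables below are hypothetical placeholders for the four components, and the actual algorithm applies them adaptively rather than in a fixed cycle), the argument only uses the fact that every applied component is nonincreasing in F:

```python
def hybrid_sgfl(F, beta0, steps, tol=1e-8, max_iter=1000):
    """Schematic outer loop: cycle through nonincreasing components until
    the objective F stalls.

    F     : callable returning the SGFL objective value at an iterate
    beta0 : initial coefficient estimate
    steps : tuple of callables, e.g. (step_single_blocks, step_single_chains,
            step_fixed_chains, step_subgradient), each returning an iterate
            whose objective value does not exceed that of its input
    """
    beta, f_val = beta0, F(beta0)
    for _ in range(max_iter):
        f_start = f_val
        for step in steps:
            beta_new = step(beta)
            f_new = F(beta_new)
            # monotonicity property used in the proof (allow for rounding)
            assert f_new <= f_val + 1e-12 * max(1.0, abs(f_val))
            beta, f_val = beta_new, f_new
        if f_start - f_val <= tol * max(1.0, abs(f_start)):
            break  # no sufficient progress over a full cycle: stop
    return beta, f_val
```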

Let us denote a generic segmentation of the set \(\{ 1, \ldots , T\}\) by \(C=(C_1,\ldots , C_K)\) where \(C_k =\{ T_k ,\ldots , T_{k+1} - 1\} \) and \(1= T_1< \cdots< T_K < T_{K+1}=T+1\). There are \(2^{T-1}\) such segmentations. Let \(S_C\) be the associated open set for the parameter \(\beta \):

$$\begin{aligned} S_C = \left\{ \beta \in \mathbb {R}^{pT}: \beta _{T_k} = \cdots = \beta _{T_{k+1}-1},\ \beta _{T_{k+1}-1} \ne \beta _{T_{k+1}} , 1\le k \le K \right\} . \end{aligned}$$

To each segmentation C is associated an infimum value of F: \(\inf _{\beta \in S_C}F(\beta )\). Let \((\beta ^n)_{n\ge 0}\) be the sequence of iterates generated by Algorithm 5 and let \(C^{n}\) denote the associated segmentations of \(\{1,\ldots ,T\}\). By setting the tolerance \(\epsilon \) of Algorithm 5 sufficiently small, each time the third optimization component (descent over fixed chains) is applied, say with \(\beta ^n\) as input and \(\beta ^{n+1}\) as output, the objective \(F(\beta ^{n+1})\) can be made arbitrarily close to \(\min _{ \beta \in S_{C^n}}F(\beta )\), or even fall below this value if a fusion of chains occurs during this optimization. If the segmentation \(C^n\) is optimal, i.e. \( \min _{ \beta \in S_{C^n} } F (\beta ) = \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then Algorithm 5 has converged: for all subsequent iterates \(m\ge n\), \(F(\beta ^m)\) and \(\beta ^m\) will stay arbitrarily close to the minimum of F and to the set of minimizers, respectively, because of the nonincreasing property of Algorithm 5. If the segmentation \(C^n\) is suboptimal, i.e. \( \min _{\beta \in S_{C^n} } F(\beta ) > \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then, provided that \(\epsilon \) is sufficiently small, the (fourth) subgradient step of Algorithm 5 will eventually produce an iterate \(\beta ^m\) (\(m \ge n\)) such that \(F(\beta ^m) < \min _{ \beta \in S_{C^n} }F(\beta ) \). This is because each subgradient step brings the iterates closer to the set of global minimizers of F. Once this has happened, the nonincreasing property of the algorithm guarantees that the segmentation \(C^n\) will not be visited again. Because the segmentations of \(\{ 1,\ldots ,T\}\) are finite in number, Algorithm 5 eventually finds an optimal segmentation C such that \(\min _{\beta \in S_C} F(\beta ) = \min _{\beta \in \mathbb {R}^{pT}} F(\beta )\). Then, through its third level of optimization (descent over fixed chains), it reaches the global minimum of F. We note that the first and second components of Algorithm 5 (block coordinate descent over single blocks and single chains) are not necessary to ensure global convergence; they only serve to speed up computation. \(\square \)

About this article


Cite this article

Degras, D. Sparse group fused lasso for model segmentation: a hybrid approach. Adv Data Anal Classif 15, 625–671 (2021). https://doi.org/10.1007/s11634-020-00424-5

