Abstract
This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments, the hybrid method compares favorably to state-of-the-art techniques with respect to computation time and numerical accuracy; benefits are particularly substantial in high dimension. The method’s statistical performance is satisfactory in recovering nonzero regression coefficients and excellent in change point detection. An application to air quality data is presented. The hybrid method is implemented in the R package sparseGFL available on the author’s GitHub page.
References
Alaíz CM, Jiménez ÁB, Dorronsoro JR (2013) Group fused lasso. Artif Neural Netw Mach Learn 2013:66–73
Alewijnse SPA, Buchin K, Buchin M, Sijben S, Westenberg MA (2018) Model-based segmentation and classification of trajectories. Algorithmica 80(8):2422–2452
Bai J (1997) Estimating multiple breaks one at a time. Econom Theory 13(3):315–352
Bai J, Perron P (2003) Computation and analysis of multiple structural change models. J Appl Econom 18(1):1–22
Barbero A, Sra S (2011) Fast Newton-type methods for total variation regularization. In: Proceedings of the 28th international conference on machine learning, ICML 2011, pp 313–320
Basseville M, Nikiforov IV (1993) Detection of abrupt changes: theory and application. Prentice Hall information and system sciences series. Prentice Hall Inc, Englewood Cliffs
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
Becker S, Bobin J, Candès EJ (2011) NESTA: a fast and accurate first-order method for sparse recovery. SIAM J Imaging Sci 4(1):1–39
Beer JC, Aizenstein HJ, Anderson SJ, Krafty RT (2019) Incorporating prior information with fused sparse group lasso: application to prediction of clinical measures from neuroimages. Biometrics 75(4):1299–1309
Bertsekas DP (2015) Convex optimization algorithms. Athena Scientific, Belmont
Bleakley K, Vert JP (2011) The group fused lasso for multiple change-point detection. Technical Report hal-00602121. https://hal.archives-ouvertes.fr/hal-00602121. Accessed 15 Oct 2020
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Bredies K, Lorenz DA (2008) Linear convergence of iterative soft-thresholding. J Fourier Anal Appl 14(5–6):813–837
Cao P, Liu X, Liu H, Yang J, Zhao D, Huang M, Zaiane O (2018) Generalized fused group lasso regularized multi-task feature learning for predicting cognitive outcomes in Alzheimer's disease. Comput Methods Programs Biomed 162:19–45
Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771
Chen X, Lin Q, Kim S, Carbonell JG, Xing EP (2012) Smoothing proximal gradient method for general structured sparse regression. Ann Appl Stat 6(2):719–752
Chi EC, Lange K (2015) Splitting methods for convex clustering. J Comput Graph Stat 24(4):994–1013
Combettes PL, Pesquet JC (2011) Proximal splitting methods in signal processing. In: Fixed-point algorithms for inverse problems in science and engineering. Springer, New York, pp 185–212
Condat L (2013) A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J Optim Theory Appl 158(2):460–479
De Vito S, Massera E, Piga M, Martinotto L, Di Francia G (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens Actuators B Chem 129(2):750–757
De Vito S, Piga M, Martinotto L, Di Francia G (2009) CO, NO\(_{2}\) and NO\(_{x}\) urban pollution monitoring with on-field calibrated electronic nose by automatic Bayesian regularization. Sens Actuators B Chem 143(1):182–191
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
Fryzlewicz P (2014) Wild binary segmentation for multiple change-point detection. Ann Stat 42(6):2243
Hadj-Selem F, Löfstedt T, Dohmatob E, Frouin V, Dubois M, Guillemot V, Duchesnay E (2018) Continuation of Nesterov’s smoothing for regression with structured sparsity in high-dimensional neuroimaging. IEEE Trans Med Imaging 37(11):2403–2413
Hallac D, Nystrup P, Boyd S (2019) Greedy Gaussian segmentation of multivariate time series. Adv Data Anal Classif 13(3):727–751
Hocking T, Vert JP, Bach FR, Joulin A (2011) Clusterpath: an algorithm for clustering using convex fusion penalties. In: ICML
Hoefling H (2010) A path algorithm for the fused lasso signal approximator. J Comput Graph Stat 19(4):984–1006
Kim S, Xing EP (2012) Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann Appl Stat 6(3):1095–1117
Kuhn HW (1973) A note on Fermat’s problem. Math Program 4:98–107
Leonardi F, Bühlmann P (2016) Computationally efficient change point detection for high-dimensional regression
Li Y, Osher S (2009) Coordinate descent optimization for \(\ell ^1\) minimization with application to compressed sensing; a greedy algorithm. Inverse Probl Imaging 3(3):487–503
Li X, Mo L, Yuan X, Zhang J (2014) Linearized alternating direction method of multipliers for sparse group and fused LASSO models. Comput Stat Data Anal 79:203–221
Liu J, Yuan L, Ye J (2010) An efficient algorithm for a class of fused lasso problems. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’10. ACM, pp 323–332
Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1, Ser. A):127–152
Nesterov Y (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim 22(2):341–362
Nystrup P, Madsen H, Lindström E (2017) Long memory of financial time series and hidden Markov models with time-varying parameters. J Forecast 36(8):989–1002
Ohlsson H, Ljung L, Boyd S (2010) Segmentation of ARX-models using sum-of-norms regularization. Automatica 46(6):1107–1111
Ombao H, von Sachs R, Guo W (2005) SLEX analysis of multivariate nonstationary time series. J Am Stat Assoc 100(470):519–531
Price BS, Geyer CJ, Rothman AJ (2019) Automatic response category combination in multinomial logistic regression. J Comput Graph Stat 28(3):758–766
R Core Team (2019) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 15 Oct 2020
Ranalli M, Lagona F, Picone M, Zambianchi E (2018) Segmentation of sea current fields by cylindrical hidden Markov models: a composite likelihood approach. J R Stat Soc Ser C (Appl Stat) 67(3):575–598
Rockafellar R (2015) Convex analysis. Princeton landmarks in mathematics and physics. Princeton University Press, Princeton
Sanderson C, Curtin R (2016) Armadillo: a template-based C++ library for linear algebra. J Open Source Softw 1:26
Saxén JE, Saxén H, Toivonen HT (2016) Identification of switching linear systems using self-organizing models with application to silicon prediction in hot metal. Appl Soft Comput 47:271–280
Shor NZ (1985) Minimization methods for nondifferentiable functions, Springer series in computational mathematics, vol 3. Springer, Berlin (Translated from the Russian by K. C. Kiwiel and A. Ruszczyński)
Songsiri J (2015) Learning multiple granger graphical models via group fused lasso. In: 2015 10th Asian control conference (ASCC), pp 1–6
Tibshirani R, Wang P (2007) Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9(1):18–29
Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol 74(2):245–266
Truong C, Oudre L, Vayatis N (2018) A review of change point detection methods. arXiv:1801.00718
Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 109(3):475–494
Vũ BC (2013) A variable metric extension of the forward–backward–forward algorithm for monotone operators. Numer Funct Anal Optim 34(9):1050–1065
Wang T, Zhu L (2011) Consistent tuning parameter selection in high dimensional sparse linear regression. J Multivar Anal 102(7):1141–1151
Wang J, Fan W, Ye J (2015a) Fused lasso screening rules via the monotonicity of subdifferentials. IEEE Trans Pattern Anal Mach Intell 37(9):1806–1820
Wang J, Wonka P, Ye J (2015b) Lasso screening rules via dual polytope projection. J Mach Learn Res 16:1063–1101
Wang B, Zhang Y, Sun WW, Fang Y (2018) Sparse convex clustering. J Comput Graph Stat 27(2):393–403
Weiszfeld E, Plastria F (2009) On the point for which the sum of the distances to n given points is minimum. Ann Oper Res 167(1):7–41
Wytock M, Sra S, Kolter JZ (2014) Fast Newton methods for the group fused lasso. Uncertain Artif Intell 2014:888–897
Xu Y, Lindquist M (2015) Dynamic connectivity detection: an algorithm for determining functional connectivity change points in fMRI data. Front Neurosci 9:285
Yan M (2018) A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. J Sci Comput 76(3):1698–1717
Yao YC (1988) Estimating the number of change-points via Schwarz’ criterion. Stat Probab Lett 6(3):181–189
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol 68(1):49–67
Zhou J, Liu J, Narayan VA, Ye J (2013) Modeling disease progression via multi-task learning. NeuroImage 78:233–248
Zhu C, Xu H, Leng C, Yan S (2014) Convex optimization procedure for clustering: theoretical revisit. In: NIPS
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320
Acknowledgements
The author thanks the reviewers and the associate editor for their suggestions which led to substantial improvements of the paper.
Ethics declarations
Conflict of interest
The author declares that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Proof of Theorem 1
Recall the notation of Sect. 3.2:
In view of Remark 2 framing the iterative soft-thresholding scheme (23)–(24) as a proximal gradient method, we can establish the linear convergence of this scheme to \(\mathrm {prox}_{g/L_t}(z_t)\) by adapting the results of Bredies and Lorenz (2008) to a nonsmooth setting. Essentially, the proof of linear convergence in Bredies and Lorenz (2008) works by first establishing a lower bound on \(\bar{g}(\beta _t^{n})-\bar{g}(\beta _t^{n+1})\), the decrease in the objective function between successive iterations of the proximal gradient method (Lemma 1). This general result shows in particular that when using sufficiently small step sizes, the proximal gradient method is a descent method. After that, under the additional assumptions that \(g_2\) is convex and that \(\Vert \beta _t^n - \beta _t^*\Vert _2^2 \le c r_n\) for some \(c>0\), the lower bound of Lemma 1 is exploited to show the exponential decay of \((r_n)\) and the linear convergence of \((\beta _t^n)\) (Proposition 2). In a third step, the lower bound of Lemma 1 is decomposed as a Bregman-like distance term involving \(g_1\) plus a Taylor remainder term involving \(g_2\). The specific nature of \(g_1\) (the \(\ell _1\) norm) and possible additional regularity conditions on \(g_2\) (typically, strong convexity) are then used to establish the linear convergence result (Theorem 2). For brevity, we refer the reader to Bredies and Lorenz (2008) for the exact statements of these results.
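For readers less familiar with the proximal gradient (iterative soft-thresholding) recursion invoked above, the following generic Python sketch illustrates it on a plain lasso objective \(\tfrac12\Vert Ax-b\Vert_2^2 + \lambda \Vert x\Vert_1\). It is an illustration of the general scheme and its descent property, not the paper's specific \(\mathrm{prox}_{g/L_t}\) computation:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 (componentwise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, n_iter=200):
    # Minimize 0.5*||A x - b||_2^2 + lam*||x||_1 by proximal gradient
    # descent with the fixed step size 1/L, L the gradient's Lipschitz
    # constant (squared spectral norm of A).
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    obj = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                  # gradient of smooth part
        x = soft_threshold(x - grad / L, lam / L)  # proximal gradient step
        obj.append(0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)))
    return x, obj
```

With a step size below \(2/L\), the objective sequence `obj` is nonincreasing, which is exactly the descent property the proof establishes for the scheme (23)–(24).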
Lemma 1 and Proposition 2 of Bredies and Lorenz (2008) posit, among other things, that the “smooth” part of the objective, \(g_2\) in our notation, is differentiable everywhere and has a Lipschitz-continuous gradient. In the present case, \(g_2\) is not differentiable at \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\); however, it is differentiable everywhere else and its gradient is Lipschitz-continuous in a local sense. The main effort required here is to show that Lemma 1 still holds if the points of nondifferentiability of \(g_2\) do not lie on segments joining the iterates \(\beta _t^n , n\ge 0\). Put differently, the iterative soft-thresholding scheme should not cross \(\hat{\beta }_{t-1}\) or \(\hat{\beta }_{t+1}\) on its path. This is where the requirement in Theorem 1 that \(\bar{g}(\beta _t^0) < \min ( \bar{g}(\hat{\beta }_{t-1}), \bar{g}(\hat{\beta }_{t+1}))\) plays a crucial part. We now proceed to adapt Lemma 1, after which we will establish the premises of Theorem 2 of Bredies and Lorenz (2008).
Adaptation of Lemma 1 of Bredies and Lorenz (2008). The main result we need to prove is that
Once this is established, we may follow the proof of Proposition 2 without modification. In particular, we will be in a position to state that
for all \(n\in \mathbb {N}\) and \( \alpha \in [0,1]\), where
Note that the left-hand side in (41) is not well defined if (40) does not hold. Combining the local Lipschitz property (41) with the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), we may go on to establish the descent property (3.5) of Bredies and Lorenz (2008):
where
Lemma 1 shows that \(D_{\gamma _n}(\beta _t^{n}) \ge \Vert \beta _t^{n}-\beta _t^{n+1}\Vert _2^2 / \gamma _n \ge 0\). To show the positivity of \(\delta \), note that
Given the descent property of \((\beta _n)\) for \(\bar{g}\), the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), and the convexity of the sublevel sets of \(\bar{g}\), it holds that \( \Vert \beta _t^{n} - \hat{\beta }_{t-1}\Vert _2 \ge d(\hat{\beta }_{t-1} , \{ \beta _t : \bar{g}(\beta _t)\le \bar{g}(\beta _t^0) \}) \) for all \(n\in \mathbb {N}\); an analogous inequality holds for \(\hat{\beta }_{t+1}\). Denoting these positive lower bounds by \(m_{t-1} \) and \(m_{t+1}\), we have
Together, the step size condition \(\gamma _{n} < 2 / \tilde{L}_n\), descent property (42), and lower bound (43) complete the proof of Lemma 1 and establish the precondition of Proposition 2 of Bredies and Lorenz (2008).
It remains to prove (40). We will show a weaker form of (42), namely that \(\bar{g}(\beta _t^{n+1}) \le \bar{g}(\beta _t^{n}) \) for all n. This inequality, combined with the convexity of \(\bar{g}\) and the assumption \(\bar{g}(\beta _t^0) < \min (\bar{g}(\hat{\beta }_{t-1}),\bar{g}(\hat{\beta }_{t+1})) \), implies that \(\hat{\beta }_{t-1}\) and \(\hat{\beta }_{t+1}\) cannot be on a segment joining \(\beta _t^{n}\) and \(\beta _t^{n+1}\). Otherwise, the convexity of \(\bar{g}\) would imply that, say, \(\bar{g}(\hat{\beta }_{t-1}) \le \max ( \bar{g}(\beta _t^{n}), \bar{g}(\beta _t^{n+1})) \le \bar{g}(\beta _t^{n})\le \cdots \le \bar{g}(\beta _t^{0}) < \bar{g}(\hat{\beta }_{t-1})\), a contradiction.
To prove the simple descent property, we start with an easy lemma stated without proof.
Lemma 1
For all \(x,y \in \mathbb {R}^p\) such that \(y \ne 0_p\),
Applying this lemma to \(x= \beta _t - \hat{\beta }_{t \pm 1}\) and \(y = \beta _t^{n} - \hat{\beta }_{t \pm 1}\), we deduce that for all \(\beta _t \in \mathbb {R}^p \),
In addition, it is immediate that
Multiplying (44) by \(\lambda _2 w_{t-1}\), (45) by \(\lambda _2 w_{t} \), (46) by \(L_t/2\), summing these relations, and adding \(g_1(\beta _t)\) on each side, we obtain
The left-hand side of (47) is simply \(\bar{g}(\beta _t)\). Also, in view of Remark 2, the minimizer of the right-hand side of (47) is \(\mathcal {T}(\beta _t^n) = \beta _t^{n+1}\). Evaluating (47) at \( \beta _t^{n+1}\) and exploiting this minimizing property, it follows that
This establishes the desired descent property.
Prerequisites of Theorem 2 of Bredies and Lorenz (2008). The distance \(r_n = \bar{g}(\beta _t^n) - \bar{g}(\beta _t^*)\) to the minimum of the objective can be usefully decomposed as
where \(R(\beta _t)\) is a Bregman-like distance and \(T(\beta _t)\) is the remainder of the Taylor expansion of \(g_2\) at \(\beta _t^*\).
To obtain the linear convergence of \((\beta _t^n)\) to \(\beta _t^*\) and the exponential decay of \((r_n)\) to 0 with Theorem 2 of Bredies and Lorenz (2008), it suffices to show that
for some constant \(c>0\) and for all \(\beta _t\in \mathbb {R}^p\).
Invoking the convexity of \(\Vert \cdot \Vert _2\) and the strong convexity of \(\Vert \cdot \Vert _2^2\), one sees that \( T(\beta _t) \ge (L_t/2) \Vert \beta _t - \beta _t^*\Vert _2^2\) for all \(\beta _t\). Also, \(R(\beta _t) \ge 0\) for all \(\beta _t\) (Lemma 2 of Bredies and Lorenz 2008), so that c can be taken as \(2/L_t\) in (50). \(\square \)
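Explicitly, the two bounds combine as

```latex
R(\beta_t) \;\ge\; 0,
\qquad
T(\beta_t) \;\ge\; \frac{L_t}{2}\,\lVert \beta_t - \beta_t^* \rVert_2^2
\quad\Longrightarrow\quad
\lVert \beta_t - \beta_t^* \rVert_2^2
\;\le\; \frac{2}{L_t}\,\bigl( R(\beta_t) + T(\beta_t) \bigr),
```

and, evaluated at \(\beta_t = \beta_t^n\), the right-hand side equals \((2/L_t)\, r_n\) by the decomposition of \(r_n\) above.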
Proof of Theorem 2
We first observe that, by design, each of the four steps or components of Algorithm 5 is nonincreasing in the objective function F. Indeed, the first three steps (optimization with respect to single blocks, single chains, and descent over fixed chains) are all based on FISTA (Algorithms 3 and 4), which is globally convergent (Beck and Teboulle 2009, Theorem 4.4). As each of these components minimizes F under certain constraints (namely, some blocks or fusion chains are fixed), the objective value of their output, say \(F(\beta ^{n+1})\), cannot exceed that of their input, \(F(\beta ^n)\). The fourth step, subgradient descent, is also nonincreasing because the subgradient of minimum norm (if it is not zero) provides a direction of (steepest) descent. The line search for the step size in the subgradient step then guarantees that the objective does not increase after this step. As a result, Algorithm 5 as a whole is nonincreasing in F.
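The monotonicity of the subgradient step can be sketched generically in Python. This is not the paper's Algorithm 5; it is an illustrative single step under the assumption that `subgrad` returns the minimum-norm subgradient (so that its negative is a descent direction), with a backtracking line search that accepts a step only if the objective does not increase:

```python
import numpy as np

def descent_step(f, subgrad, x, step0=1.0, shrink=0.5, max_tries=30):
    # One monotone subgradient step: move along the negative (normalized)
    # minimum-norm subgradient and backtrack on the step size until the
    # objective does not increase. Falls back to x itself, so the step
    # is nonincreasing in f by construction.
    g = subgrad(x)
    if np.linalg.norm(g) == 0:      # zero subgradient: x is a minimizer
        return x
    d = -g / np.linalg.norm(g)      # normalized descent direction
    step = step0
    for _ in range(max_tries):
        x_new = x + step * d
        if f(x_new) <= f(x):        # accept: objective has not increased
            return x_new
        step *= shrink              # otherwise shrink the step size
    return x                        # no acceptable step found: stay put
```

Because every branch returns a point with objective value at most \(f(x)\), chaining such steps yields a nonincreasing objective sequence, which is the property used in the proof.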
Let us denote a generic segmentation of the set \(\{ 1, \ldots , T\}\) by \(C=(C_1,\ldots , C_K)\), where \(C_k =\{ T_k ,\ldots , T_{k+1} - 1\} \) and \(1= T_1< \cdots< T_K < T_{K+1}=T+1\). There are \(2^{T-1}\) such segmentations. Let \(S_C\) be the associated open set for the parameter \(\beta \):
To each segmentation C is associated an infimum value of F: \(\inf _{\beta \in S_C}F(\beta )\). Let \((\beta ^n)_{n\ge 0}\) be the sequence of iterates generated by Algorithm 5 and let \(C^{n}\) be the associated segmentations of \(\{1,\ldots ,T\}\). By setting the tolerance \(\epsilon \) of Algorithm 5 sufficiently small, each time the third optimization component (descent over fixed chains) is applied, say with \(\beta ^n\) as input and \(\beta ^{n+1}\) as output, the objective \(F(\beta ^{n+1})\) can be made arbitrarily close to \(\min _{ \beta \in S_{C^n}}F(\beta )\), or can even fall below this value if a fusion of chains occurs during this optimization. If the segmentation \(C^n\) is optimal, i.e. \( \min _{\beta \in S_{C^n} } F (\beta ) = \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then Algorithm 5 has converged: for all subsequent iterates \(m\ge n\), \(F(\beta ^m)\) and \(\beta ^m\) will stay arbitrarily close to the minimum of F and to the set of minimizers, respectively, because of the nonincreasing property of Algorithm 5. If the segmentation \(C^n\) is suboptimal, i.e. \( \min _{\beta \in S_{C^n} } F(\beta ) > \min _{\beta \in \mathbb {R}^{pT}}F (\beta )\), then, provided that \(\epsilon \) is sufficiently small, the (fourth) subgradient step of Algorithm 5 will eventually produce an iterate \(\beta ^m\) (\(m \ge n\)) such that \(F(\beta ^m) < \min _{ \beta \in S_{C^n} }F(\beta ) \). This is because each subgradient step brings the iterates closer to the set of global minimizers of F. Once this has happened, the nonincreasing property of the algorithm guarantees that the segmentation \(C^n\) will not be visited again. Because there are finitely many segmentations of \(\{ 1,\ldots ,T\}\), Algorithm 5 eventually finds an optimal segmentation C such that \(\min _{\beta \in S_C} F(\beta ) = \min _{\beta \in \mathbb {R}^{pT}} F(\beta )\). Then, through its third level of optimization (descent over fixed chains), it reaches the global minimum of F.
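The finiteness argument rests on the count of \(2^{T-1}\) segmentations. A small self-contained Python check (not part of the paper) enumerates segmentations of \(\{1,\ldots,T\}\) by their strictly increasing change points \(T_2 < \cdots < T_K\) and confirms this count:

```python
from itertools import combinations

def segmentations(T):
    # Enumerate all partitions of {1,...,T} into contiguous segments
    # C_1,...,C_K, each segmentation identified by its change points
    # chosen among {2,...,T}; hence 2^(T-1) segmentations in total.
    segs = []
    for k in range(T):  # k = number of change points, 0 .. T-1
        for cps in combinations(range(2, T + 1), k):
            bounds = (1,) + cps + (T + 1,)
            segs.append([list(range(bounds[j], bounds[j + 1]))
                         for j in range(len(bounds) - 1)])
    return segs
```

For instance, `segmentations(4)` yields \(2^3 = 8\) segmentations, each of whose segments concatenate back to \(\{1,2,3,4\}\).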
We note that the first and second components of Algorithm 5 (block coordinate descent over single blocks and single chains) are not necessary to ensure global convergence; they only serve to speed up computation. \(\square \)
Cite this article
Degras, D. Sparse group fused lasso for model segmentation: a hybrid approach. Adv Data Anal Classif 15, 625–671 (2021). https://doi.org/10.1007/s11634-020-00424-5
Keywords
- Multivariate time series
- Model segmentation
- High-dimensional regression
- Convex optimization
- Hybrid algorithm