Selective linearization for multi-block statistical learning
Introduction
The big data challenge is common to many data-driven models in computer vision, bioinformatics, and e-commerce (Goldfarb, Ma, & Scheinberg, 2013; Lin, Pham, & Ruszczyński, 2014; Olafsson, Li, & Wu, 2008; Wong & Hsu, 2006). Data sets of interest may have millions of records to be processed. For video streaming services and online retailers such as YouTube and Amazon, the data are updated in a matter of seconds. On the other hand, data sets can be high-dimensional, with the number of samples much smaller than the dimension. With classical regression models, it is not possible to estimate the desired features from such a limited number of samples.
In recent years, extensive research efforts have been devoted to modeling high-dimensional statistical learning problems by incorporating complex structural regularization penalties into the model. The model parameters are estimated by solving an optimization problem consisting of a sum of loss functions $\ell_i$, which measure the goodness of fit to the data, and convex regularization functions (or penalties) $h_j$:
$$
\min_{x}\; \sum_{i} \ell_i(x) \;+\; \sum_{j=1}^{N} \lambda_j h_j(x),
\tag{1}
$$
with regularization weights $\lambda_j > 0$.
All the component functions in the model are convex, but not necessarily smooth, and possibly non-separable. Problem (1) covers many important applications. For example, variable selection with the group lasso penalty in linear and logistic regression not only encourages the model to select meaningful features together in groups, but also induces sparsity at the group level (Yuan, Liu, & Ye, 2011). In classification models, the elastic net method, which combines $\ell_1$-norm and $\ell_2$-norm regularization, has been applied to large-scale support vector machine problems in which the number of features exceeds the number of objects (Zou & Hastie, 2005). In the compressed sensing literature, the Magnetic Resonance Imaging recovery model uses a linear combination of the total variation (TV) norm and $\ell_1$-norm penalties. Several models in recommendation systems combine the nuclear norm and the $\ell_1$-norm to induce a sparsity structure in the estimated coefficients (Zhou, Gong, Wang, & Ye, 2015). It is also convenient to formulate generic constrained convex problems in this way, by choosing appropriate convex penalty functions $h_j$ (for example, indicator functions of convex feasible sets). The common feature of these instances of problem (1) in statistical learning is that the functions $h_j$ are nonsmooth and nonseparable, which makes the problems difficult to solve.
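For concreteness, two such models can be written in the composite form (1); the notation below is generic and is not taken from this paper:
$$
\min_{x}\; \ell(x) + \lambda_1 \|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2 \qquad \text{(elastic net)},
$$
$$
\min_{x}\; \frac{1}{2}\|Ax-b\|_2^2 + \lambda_1 \|x\|_{\mathrm{TV}} + \lambda_2 \|x\|_1 \qquad \text{(TV plus $\ell_1$ image recovery)},
$$
where each term is convex and the regularizers are nonsmooth.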
In the last two decades, many methods have been proposed to solve problem (1) in large-scale settings. The most notable and popular are first order methods. Yuan et al. (2011) proposed a variation of Nesterov's composite gradient method to solve the group lasso model. Its efficiency relies on the ability to compute the proximal operator of the group lasso penalty; when the number of group lasso regularizers is large, the method becomes very slow. Motivated by this limitation, Yu (2013) proposed the Accelerated Proximal Average Gradient (APG) method. The idea of APG is to approximate the proximal operator of the full penalty function by the average of the proximal operators of its component functions. This substantially reduces the complexity of each iteration, but it slows the progress of the method, measured by the objective value, from one iteration to the next. As observed by Cheung and Lou (2017) and Shen, Liu, Yuan, and Ma (2017), and confirmed by our comparison in Section 4, APG requires many iterations to reach a reasonable accuracy. Another disadvantage of many first order methods, such as the block coordinate descent methods of Xu and Yin (2013) and Wang, Banerjee, and Luo (2014), is that they require the loss function to be differentiable and the nondifferentiable part to be separable. Therefore, they are not directly applicable to problems with non-smooth loss functions, such as the Square Root Group Lasso (Bunea, Lederer, & She, 2014), the Least Absolute Deviation loss, and the hinge loss in SVM. Another popular class of methods is the Alternating Direction Method of Multipliers (ADMM) (Boyd, Parikh, Chu, Peleato, & Eckstein, 2010; Wang, Zhao, & Wu; Wang, Hong, Ma, & Luo, 2013; Wang, Yin, & Zeng, 2019b). However, it has been shown empirically in He and Yuan (2012) and Lin et al. (2014) that ADMM methods have slow tail convergence.
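To illustrate the proximal-average idea behind APG described above, the following minimal Python sketch approximates the proximal operator of a sum of penalties by the average of the individual proximal operators; the function names and interfaces are illustrative only and are not APG's actual implementation.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l2(v, t):
    """Proximal operator of t*||.||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def prox_average(v, prox_ops, t):
    """Proximal-average approximation: replace the (hard-to-compute) prox of a
    sum of penalties by the average of their individual prox operators."""
    return sum(p(v, t) for p in prox_ops) / len(prox_ops)

# Toy usage: approximate the prox of ||.||_1 + ||.||_2 at a point.
v = np.array([3.0, -0.2, 1.5])
print(prox_average(v, [prox_l1, prox_l2], t=0.5))
```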
In Du, Lin, and Ruszczyński (2017), we proposed a fast operator-splitting method, SLIN, which is globally convergent for an arbitrary number of operators (the subdifferentials of the component functions). The method generalizes the earlier two-block approaches of Kiwiel, Rosa, and Ruszczyński (1999) and Lin, Pham, and Ruszczyński (2014). SLIN combines the advantages of the aforementioned first order methods: fast convergence, simplicity of implementation, and low iteration complexity. Moreover, unlike splitting methods based on variable duplication, such as ADMM, SLIN does not replicate the decision variables. At each iteration, it linearizes all but one of the component functions and uses a proximal term penalizing the distance to the last iterate. The order of processing the functions is not fixed; the method uses precise criteria for selecting the function to be treated exactly at the current step. It also employs special rules for updating the proximal center, which adapt ideas from bundle methods for nonsmooth optimization (Du & Ruszczyński, 2017). These two rules distinguish our approach from the first order methods mentioned above. The theoretical advantage of the method stems from determining the splitting order on-line, depending on the values of the functions being minimized, and from accepting the result of a splitting step only when it decreases the overall objective. Furthermore, we proved a sublinear convergence rate of the method under mild assumptions.
In this paper, we propose efficient and novel solution methods for the sub-problems arising in the overlapping group lasso model and the doubly regularized SVM. Extensive numerical experiments using simulated data, cancer research data, and Amazon review data demonstrate the strengths of SLIN in solving complex, high-dimensional, multi-block, non-smooth, and non-separable statistical learning problems.
The paper is organized as follows. Section 2 describes the idea of the selective linearization method for solving multi-block non-smooth optimization problems and summarizes its convergence properties. In Section 3, we study the operation of SLIN on the overlapping group lasso and the doubly regularized SVM. In Section 4, we provide comprehensive experiments showing that SLIN is a highly efficient and reliable general-purpose method for multi-block optimization of convex non-smooth functions. Conclusions and future research directions are discussed in Section 5.
The SLIN method
For notational uniformity, we write the problem of interest as
$$
\min_{x \in \mathbb{R}^n} \; F(x) = \sum_{i=1}^{N} f_i(x),
$$
where all functions $f_i \colon \mathbb{R}^n \to \mathbb{R}$ are convex.
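To convey the structure of the method, the following Python sketch implements one possible selective-linearization loop for this problem. It is only a schematic rendering: funcs, subgrads, and prox_solvers are hypothetical callables, and the actual SLIN method of Du, Lin, and Ruszczyński (2017) uses more refined selection, acceptance, and stopping tests than shown here.

```python
import numpy as np

def slin_sketch(funcs, subgrads, prox_solvers, x0, rho=1.0, gamma=0.2,
                max_iter=200, tol=1e-8):
    """Schematic selective-linearization loop for min_x F(x) = sum_i f_i(x).

    Assumed (hypothetical) interfaces:
      funcs[i](x)                -> f_i(x)
      subgrads[i](x)             -> a subgradient of f_i at x
      prox_solvers[i](a, z, rho) -> argmin_x f_i(x) + a @ x + (rho/2)*||x - z||^2
    """
    N = len(funcs)
    F = lambda x: sum(f(x) for f in funcs)
    z = np.asarray(x0, dtype=float).copy()   # proximal center
    j = 0                                    # index of the function treated exactly
    for _ in range(max_iter):
        g = [subgrads[i](z) for i in range(N)]           # linearize each f_i at z
        slope = sum(g[i] for i in range(N) if i != j)
        x = prox_solvers[j](slope, z, rho)               # solve the subproblem
        # Model value: exact f_j plus the linearizations of the other functions.
        model = funcs[j](x) + sum(funcs[i](z) + g[i] @ (x - z)
                                  for i in range(N) if i != j)
        predicted = F(z) - model                         # predicted decrease
        if predicted <= tol:
            break                                        # model cannot improve further
        if F(z) - F(x) >= gamma * predicted:
            z = x                                        # serious step: move the center
        # Select the function whose linearization is worst at the trial point.
        errors = [funcs[i](x) - (funcs[i](z) + g[i] @ (x - z)) for i in range(N)]
        j = int(np.argmax(errors))
    return z
```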
Application to structured regularized regression problems
The SLIN algorithm can be directly applied to solve the general problem (1). It can be further specialized to take advantage of additional features of the functions involved. In the following subsections, we discuss the overlapping group lasso problem and the regularized support vector machine problem. At the same time, we introduce new ways to solve sub-problems arising in the application of SLIN.
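For reference, standard textbook forms of these two models can be written in the format (1); the notation below is generic and may differ from the paper's problems (13) and (22):
$$
\min_{\beta}\; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2 \qquad \text{(overlapping group lasso)},
$$
$$
\min_{\beta,\, b}\; \sum_{i=1}^{m} \bigl[1 - y_i (x_i^\top \beta + b)\bigr]_+ + \lambda_1 \|\beta\|_1 + \frac{\lambda_2}{2}\|\beta\|_2^2 \qquad \text{(doubly regularized SVM)},
$$
where the groups $g \in \mathcal{G}$ may share variables. Both objectives are sums of convex functions; the group penalties and the hinge loss are nonsmooth, and the overlapping groups make the penalty non-separable.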
Numerical study
In this section, we present experimental results for problems (13) and (22). All experiments are performed in MATLAB on a computer with a 2.3 GHz processor and 16 GB of RAM.
Conclusion
We consider the problem of minimizing a sum of several convex non-smooth functions. In this paper, we study a new algorithm, the selective linearization (SLIN) method, which iteratively linearizes all but one of the functions and employs simple proximal steps. The algorithm is a form of multiple-operator splitting in which the order of processing the partial functions is not fixed, but rather determined in the course of the calculations. Our proposed method is fast and globally convergent.
References (31)
- Olafsson, Li, & Wu (2008). Operations research and data mining. European Journal of Operational Research.
- Wong & Hsu (2006). Application of SVM and ANN for image retrieval. European Journal of Operational Research.
- Boyd, Parikh, Chu, Peleato, & Eckstein (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
- Bunea, Lederer, & She (2014). The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory.
- et al. (2015). An augmented Lagrangian method for distributed optimization. Mathematical Programming.
- Chawla et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
- Chen et al. (2012). Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics.
- Cheung & Lou (2017). Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization. Machine Learning.
- et al. (2017). Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing.
- Du, Lin, & Ruszczyński (2017). A selective linearization method for multiblock convex optimization. SIAM Journal on Optimization.
- Du & Ruszczyński (2017). Rate of convergence of the bundle method. Journal of Optimization Theory and Applications.
- Goldfarb, Ma, & Scheinberg (2013). Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming.
- He & Yuan (2012). On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis.
- Kiwiel, Rosa, & Ruszczyński (1999). Proximal decomposition via alternating linearization. SIAM Journal on Optimization.