Stochastics and Statistics
Selective linearization for multi-block statistical learning

https://doi.org/10.1016/j.ejor.2020.12.010

Highlights

  • A Selective Linearization (SLIN) algorithm is introduced to optimize a sum of several convex non-smooth functions.

  • SLIN is a fast operator-splitting method that guarantees global convergence with a provable convergence rate, without artificial duplication of variables.

  • Novel and efficient methods for solving the sub-problems of the overlapping group Lasso and the doubly regularized support vector machine are introduced.

  • Numerical experiments using data from simulation, cancer research, and Amazon reviews demonstrate the efficacy and accuracy of the method.

Abstract

We consider the problem of minimizing a sum of several convex non-smooth functions and discuss the selective linearization method (SLIN), which iteratively linearizes all but one of the functions and employs simple proximal steps. The algorithm is a form of multiple operator splitting in which the order of processing the partial functions is not fixed, but rather determined in the course of the calculations. SLIN is globally convergent for an arbitrary number of component functions without artificial duplication of variables. We report results from extensive numerical experiments in two statistical learning settings: large-scale overlapping group Lasso and the doubly regularized support vector machine. In each setting, we introduce novel and efficient methods for solving the sub-problems. The numerical results demonstrate the efficacy and accuracy of SLIN.

Introduction

The big data challenge is common to many data-driven models in computer vision, bioinformatics, and e-commerce (Goldfarb, Ma, & Scheinberg, 2013; Lin, Pham, & Ruszczyński, 2014; Olafsson, Li, & Wu, 2008; Wong & Hsu, 2006). Data sets of interest may have millions of records to be processed. In video streaming services and online retailers, such as YouTube and Amazon, data are updated in a matter of seconds. On the other hand, data sets can be high-dimensional, with the number of samples much smaller than the dimension. Classical regression models cannot estimate the desired features from such a limited number of samples.

In recent years, extensive research efforts have been devoted to modeling high-dimensional statistical learning problems by incorporating complex structural regularization penalties into the model. Model parameters are estimated by solving an optimization problem consisting of a sum of M loss functions f_i, which measure the goodness-of-fit of the data, and N convex regularization functions (or penalties) h_j:

$$\min_{x \in \mathbb{R}^n} \; F(x) = \sum_{i=1}^{M} f_i(x) + \sum_{j=1}^{N} h_j(x). \tag{1}$$

All the component functions in the model are convex but not necessarily smooth, and possibly non-separable. Problem (1) covers many important applications. For example, variable selection using the group Lasso penalty in linear regression and logistic regression not only encourages the model to select meaningful features together in groups, but also induces sparsity at the group level (Yuan, Liu, & Ye, 2011). In classification models, the elastic net method, which combines ℓ1-norm and ℓ2-norm regularization, has been applied to large-scale support vector machine problems where the number of features is larger than the number of objects (Zou & Hastie, 2005). In the compressed sensing literature, the Magnetic Resonance Imaging recovery model uses a linear combination of the total variation (TV) norm and the ℓ1-norm as penalties. Several models in recommendation systems combine the nuclear norm and the ℓ1-norm to induce a sparse structure in the estimated coefficients (Zhou, Gong, Wang, & Ye, 2015). It is also convenient to formulate generic constrained convex problems by choosing appropriate convex penalty functions h_j. The common feature of these examples of problem (1) in statistical learning is that the functions are non-smooth and non-separable, which makes the problems difficult to solve.
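To make the structure of problem (1) concrete, the toy snippet below builds one such composite objective: a smooth least-squares loss plus two overlapping group-Lasso penalties. This is only an illustrative sketch; the data, group definitions, and parameter values are invented for the example and are not taken from the paper.

```python
import numpy as np

# Toy instance of problem (1): F(x) = f_1(x) + h_1(x) + h_2(x),
# with a least-squares loss and two overlapping group-Lasso penalties.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
groups = [np.arange(0, 6), np.arange(4, 10)]   # overlapping groups (illustrative)
lam = 0.1

def F(x):
    loss = 0.5 * np.sum((A @ x - b) ** 2)                        # smooth loss f_1
    penalty = sum(lam * np.linalg.norm(x[G]) for G in groups)    # nonsmooth h_j
    return loss + penalty
```

Because the groups overlap, the penalty is non-separable, which is exactly the feature that makes standard proximal gradient methods hard to apply directly.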

In the last two decades, many methods have been proposed to solve problem (1) in large-scale settings. The most notable and popular are first order methods. Yuan et al. (2011) proposed a variation of Nesterov’s composite gradient method to solve the group Lasso model. This method’s efficiency relies on the ability to compute the proximal operator of the group Lasso penalty; when the number of group Lasso regularizers is large, the method becomes very slow. Motivated by this setback, Yu (2013) proposed the Accelerated Proximal Average Gradient (APG) method. APG’s idea is to use the average of the proximal operators of the component functions to approximate the proximal operator of the full penalty function. This substantially reduces the complexity of each iteration, but it slows the progress of the method, measured by the objective function value, from one iteration to the next. As observed by Cheung and Lou (2017) and Shen, Liu, Yuan, and Ma (2017), and confirmed by our comparison results in Section 4, APG requires many iterations to reach a reasonable accuracy. Another disadvantage of many first order methods, such as the block coordinate descent methods of Xu and Yin (2013) and Wang, Banerjee, and Luo (2014), is that they require the loss function to be differentiable and the non-differentiable part to be separable. Therefore, they are not directly applicable to problems with non-smooth loss functions, such as the Square Root Group Lasso (Bunea, Lederer, & She, 2014), the Least Absolute Deviation loss, and the hinge loss in SVM. Another popular class of methods is the Alternating Direction Method of Multipliers (ADMM) (Boyd, Parikh, Chu, Peleato, & Eckstein, 2010; Wang, Hong, Ma, & Luo, 2013; Wang, Yin, & Zeng, 2019b; Wang, Zhao, & Wu). It has been empirically shown in He and Yuan (2012) and Lin et al. (2014) that ADMM methods have slow tail convergence.
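As a rough illustration of the proximal-average idea mentioned above (our own simplified sketch in Python, not Yu’s (2013) APG implementation), the proximal step of the combined penalty is replaced by an average of the individual proximal operators:

```python
import numpy as np

# Sketch of the proximal-average idea: approximate the proximal step of the
# combined penalty by averaging the individual prox operators.  This is a
# simplification for illustration only, not the APG algorithm itself.
def approx_prox(v, t, prox_list):
    return np.mean([prox(v, t) for prox in prox_list], axis=0)
```

Each averaged prox is cheap, which explains the low per-iteration cost of APG, but the averaging also blurs the effect of the penalty, which is consistent with the slow per-iteration progress noted above.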

In Du, Lin, and Ruszczyński (2017), we proposed a fast operator-splitting method, SLIN, which is globally convergent for an arbitrary number of operators (the subdifferentials of the component functions). The method generalizes the earlier two-block approaches of Kiwiel, Rosa, and Ruszczyński (1999) and Lin, Pham, and Ruszczyński (2014). SLIN combines the advantages of the aforementioned first order methods: fast convergence, simplicity of implementation, and low iteration complexity. At the same time, SLIN does not replicate the decision variables. At each iteration, it linearizes all but one of the component functions and uses a proximal term penalizing the distance to the last iterate. The order of processing the functions is not fixed; the method uses precise criteria for selecting the function to be treated exactly at the current step. It also employs special rules for updating the proximal center, which adapt ideas employed in bundle methods for non-smooth optimization (Du & Ruszczyński, 2017). These two rules distinguish our approach from the first order methods mentioned above. The theoretical advantage of the method stems from determining the splitting order on-line, depending on the values of the functions being minimized, and accepting the result of a splitting step only when it leads to a decrease of the overall objective. Further, we proved a sublinear convergence rate of the method under mild assumptions.
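The following Python sketch conveys the flavor of one selective-linearization iteration, under the assumption that every block exposes a function value, a subgradient, and a proximal operator. The block-selection and acceptance tests used here are crude stand-ins for the precise rules derived in Du, Lin, and Ruszczyński (2017); only the structure (linearize all blocks but one, take a proximal step, move the proximal center only on sufficient descent) reflects the description above.

```python
import numpy as np

def slin_sketch(blocks, x0, rho=1.0, gamma=0.2, max_iter=200, tol=1e-8):
    """Simplified selective-linearization sketch (not the exact SLIN method).

    Each block is a dict with 'value', 'grad' (a subgradient), and 'prox',
    where prox(v, t) solves  argmin_x f(x) + (1/(2*t)) * ||x - v||^2.
    """
    z = x0.copy()                                   # proximal center

    def F(x):
        return sum(b['value'](x) for b in blocks)

    for _ in range(max_iter):
        best_x, best_model = None, np.inf
        for j, bj in enumerate(blocks):
            # Linearize every block except j at the proximal center z ...
            g = sum(b['grad'](z) for i, b in enumerate(blocks) if i != j)
            # ... so the subproblem  min_x f_j(x) + <g, x> + (rho/2)||x - z||^2
            # reduces to one prox step of block j at a shifted point.
            x = bj['prox'](z - g / rho, 1.0 / rho)
            model = bj['value'](x) + sum(
                b['value'](z) + b['grad'](z) @ (x - z)
                for i, b in enumerate(blocks) if i != j)
            if model < best_model:                  # heuristic block selection
                best_x, best_model = x, model
        predicted = F(z) - best_model               # model (predicted) decrease
        if predicted <= tol:
            break
        if F(best_x) <= F(z) - gamma * predicted:   # bundle-style descent test
            z = best_x                              # serious step: move the center
        # otherwise keep z (null step); a full implementation would refine
        # the linearizations or adjust rho here.
    return z
```

The on-line choice of which block to keep non-linearized, and the descent test guarding the update of the proximal center, are the two features that distinguish this scheme from fixed-order splitting methods.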

In this paper, we propose efficient and novel solution methods for the sub-problems of the overlapping group Lasso model and the doubly regularized SVM. Extensive numerical experiments using data from simulation, cancer research, and Amazon reviews demonstrate the strengths of SLIN in solving complex, high-dimensional, multi-block, non-smooth, and non-separable statistical learning problems.

The paper is organized as follows. Section 2 describes the idea of the selective linearization method for solving multi-block non-smooth optimization problems and summarizes its convergence properties. In Section 3, we study SLIN’s operation on the overlapping group Lasso and the doubly regularized SVM. In Section 4, we provide comprehensive experiments showing that SLIN is a highly efficient and reliable general-purpose method for multi-block optimization of convex non-smooth functions. Finally, conclusions and future research plans are discussed in Section 5.

Section snippets

The SLIN method

For notational uniformity, we write the problem of interest as

$$\min_x \Big\{ F(x) = \sum_{i=1}^{N} f_i(x) \Big\},$$

where all functions f_i(·) are convex.

Application to structured regularized regression problems

The SLIN algorithm can be directly applied to solve the general problem (1). It can be further specialized to take advantage of additional features of the functions involved. In the following subsections, we discuss the overlapping group Lasso problem and the doubly regularized support vector machine problem. At the same time, we introduce new ways to solve the sub-problems arising in the application of SLIN.
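As one simple building block (and not the new sub-problem methods introduced in this section), the proximal operator of a single non-overlapping group penalty λ‖x_G‖₂ has the familiar group soft-thresholding form sketched below; overlapping groups are precisely the case that requires the specialized treatment developed in the following subsections.

```python
import numpy as np

# Group soft-thresholding: the prox of  t * lam * ||x_G||_2  applied group by
# group.  This standard closed form is valid for NON-overlapping groups only;
# it is shown here for illustration.
def prox_group_l2(v, lam, t, groups):
    x = v.copy()
    for G in groups:                          # G: index array of one group
        nrm = np.linalg.norm(v[G])
        shrink = max(0.0, 1.0 - t * lam / nrm) if nrm > 0 else 0.0
        x[G] = shrink * v[G]
    return x
```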

Numerical study

In this section, we present experimental results for problems (13) and (22). All experiments are performed in MATLAB on a computer with a 2.3 GHz processor and 16 GB of RAM.

Conclusion

We consider the problem of minimizing a sum of several convex non-smooth functions. In this paper, we study a new algorithm, the selective linearization (SLIN) method, which iteratively linearizes all but one of the functions and employs simple proximal steps. The algorithm is a form of multiple operator splitting in which the order of processing the partial functions is not fixed, but rather determined in the course of the calculations. Our proposed method is fast and globally convergent.

References (31)

  • S. Olafsson et al., Operations research and data mining, European Journal of Operational Research (2008)
  • W. Wong et al., Application of SVM and ANN for image retrieval, European Journal of Operational Research (2006)
  • S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning (2010)
  • F. Bunea et al., The group square-root lasso: Theoretical properties and fast algorithms, IEEE Transactions on Information Theory (2014)
  • N. Chatzipanagiotis et al., An augmented Lagrangian method for distributed optimization, Mathematical Programming (2015)
  • N.V. Chawla et al., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • X. Chen et al., Smoothing proximal gradient method for general structured sparse regression, The Annals of Applied Statistics (2012)
  • Y. Cheung et al., Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization, Machine Learning (2017)
  • W. Deng et al., Parallel multi-block ADMM with O(1/k) convergence, Journal of Scientific Computing (2017)
  • Y. Du et al., A selective linearization method for multiblock convex optimization, SIAM Journal on Optimization (2017)
  • Y. Du et al., Rate of convergence of the bundle method, Journal of Optimization Theory and Applications (2017)
  • D. Dua & C. Graff (2017). UCI machine learning repository....
  • D. Goldfarb et al., Fast alternating linearization methods for minimizing the sum of two convex functions, Mathematical Programming (2013)
  • B. He et al., On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method, SIAM Journal on Numerical Analysis (2012)
  • K.C. Kiwiel et al., Proximal decomposition via alternating linearization, SIAM Journal on Optimization (1999)