ℓ2,0-norm based selection and estimation for multivariate generalized linear models

https://doi.org/10.1016/j.jmva.2021.104782

Abstract

Group sparse regression has been well studied in multivariate linear models with appropriate relaxation schemes for the involved ℓ2,0-norm penalty. Since such research has not been extended to multivariate generalized linear models (GLMs), this paper targets the original discontinuous and nonconvex ℓ2,0-norm based selection and estimation for multivariate GLMs. Under mild conditions, we give a necessary condition for selection consistency based on the notion of degree of separation, and establish feature selection consistency as well as optimal coefficient estimation for the resulting ℓ2,0-likelihood estimators in terms of the Hellinger risk. Numerical studies on synthetic data and a real data set in chemometrics confirm the superior performance of the ℓ2,0-likelihood methods.

Introduction

Consider multivariate generalized linear models (GLMs) with $n$ observations $\{(X_i, Y_i)\}_{i=1}^{n}$, where $X_i$ is a $p$-dimensional predictor vector and $Y_i$ is a $q$-dimensional response vector. The observations $Y_i$, $i \in \{1, \ldots, n\}$, are i.i.d. drawn from the exponential family with the canonical parameter vector $\theta \in \mathbb{R}^q$:
$$ f(y \mid x, \theta) \propto h(y) \exp\{\theta^{\top} T(y) - \phi(\theta)\}, $$
where $h(y)$ is the underlying measure, $T(y)$ is the sufficient statistic and $\phi(\theta)$ is a known link function. Multivariate GLMs assume that $T(y) = y$ and $\theta = (B^{*})^{\top} x$, with $B^{*} \in \mathbb{R}^{p \times q}$ the true underlying coefficient matrix. The density of $Y_i$ can then be written as
$$ f(y \mid x, B^{*}) \propto h(y) \exp\{((B^{*})^{\top} x)^{\top} y - \phi((B^{*})^{\top} x)\}. $$
Note that setting $q = 1$ leads to the traditional GLMs introduced by McCullagh and Nelder [20]. The maximum likelihood estimator of the multivariate GLM, termed $\hat{B}^{\mathrm{ML}}$, is the optimal solution of the following minimization problem:
$$ \min_{B \in \mathbb{R}^{p \times q}} \; l(B) := \sum_{i=1}^{n} \left\{ -(B^{\top} x_i)^{\top} y_i + \phi(B^{\top} x_i) \right\}. $$
More about multivariate GLMs can be found in the monographs [7] by Fahrmeir and Tutz, and [10] by Hardin and Hilbe.
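To make the loss above concrete, the following minimal Python sketch (not part of the paper; the function names and the specific choices of φ are illustrative assumptions) evaluates l(B) for a given coefficient matrix under two common exponential-family specifications.

```python
import numpy as np

def neg_log_likelihood(B, X, Y, phi):
    """l(B) = sum_i { -(B^T x_i)^T y_i + phi(B^T x_i) } for a multivariate GLM.

    B   : (p, q) coefficient matrix
    X   : (n, p) matrix whose rows are the predictor vectors x_i
    Y   : (n, q) matrix whose rows are the response vectors y_i
    phi : log-partition function mapping a length-q vector to a scalar
    """
    Theta = X @ B  # row i holds the canonical parameter theta_i = B^T x_i
    return float(-np.sum(Theta * Y) + sum(phi(t) for t in Theta))

# Two illustrative log-partition functions (assumed forms, not from the paper):
def phi_poisson(theta):
    # q conditionally independent Poisson responses
    return float(np.sum(np.exp(theta)))

def phi_multinomial(theta):
    # multinomial logistic regression with q non-reference categories
    return float(np.log1p(np.sum(np.exp(theta))))
```

With q = 1 and phi_multinomial, this reduces to the usual logistic-regression negative log-likelihood.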

In high-dimensional data analysis, feature selection becomes increasingly crucial since true underlying models often admit a sparse representation. Meanwhile, it is worth noting that the inherent sparsity may possess group structures, such as the row sparsity of the coefficient matrix $B^{*}$ in the setting of multivariate GLMs, resulting from the fact that multiple responses are possibly related to a group of common features. For example, in joint sparse signal recovery problems [22], multiple signals share a common sparse support set, and the sparsity is defined as the number of nonzero rows in the source matrix. In multiple classification problems [24], multiple tasks are related to a common sparse set of covariates, and the corresponding covariate matrix also possesses row sparsity. Our goal is to explicitly take this predefined group sparse structure into account for multivariate GLMs.

Compared with classical sparse regression using the ℓ0-norm (i.e., the number of nonzero entries) [29], [30], [34] and its relaxation surrogates, see, e.g., [8], [16], [32], [41], [43], there are several compelling reasons to consider row sparsity for multivariate GLMs. It is more reasonable since the group structure is an important piece of prior knowledge in many applications, such as integrative genomics [27], visual classification [40] and the problems mentioned above [22], [24]. In addition, by taking the dependence among responses into account, the resulting estimator performs better and is more interpretable [39]. Moreover, the use of group structure avoids the dramatic increase in dimension caused by directly vectorizing the problem, especially for large-scale problems.

Write the coefficient matrix $B^{*}$ as $B^{*} = (b^{*}_1, \ldots, b^{*}_p)^{\top} = ((B^{*}_{A^{*}})^{\top}, 0^{\top}_{A^{*c}})^{\top}$, where the row support set $A^{*} = \{j : \|b^{*}_j\|_2 \neq 0\}$ is the index set of nonzero rows in $B^{*}$ with the row sparsity $p_0$, and $0_{A^{*c}}$ is a matrix of 0's with $A^{*c}$ the complementary set of $A^{*}$. The ℓ2,0-norm can then be defined as $\|B^{*}\|_{2,0} = |A^{*}| = p_0$, with $|\cdot|$ the size of a set. It is the most direct and accurate depiction of the row sparse structure. In addition, numerical results reported in [30] have shown that the ℓ0-norm constrained likelihood method outperforms the lasso under the univariate regression framework. In this sense, it is natural to formulate group sparse regression for multivariate GLMs by solving
$$ \min_{B \in \mathbb{R}^{p \times q}} \; l(B) + \lambda \|B\|_{2,0}, \tag{3} $$
where $\|B\|_{2,0}$ is the ℓ2,0-norm of $B$ that counts the number of nonzero rows, and $\lambda > 0$ is a regularization parameter which balances the magnitude of the loss function and the row sparsity of the estimator. We refer to this method as the regularized ℓ2,0-likelihood. Note that the ℓ2,0-norm has already been employed for group selection in multivariate linear models [25], in which the least squares loss $l(B) = \frac{1}{2}\|Y - XB\|_F^2$ has been adopted, where $\|\cdot\|_F$ denotes the Frobenius norm with $\|A\|_F = \sqrt{\sum_{i,j} a_{i,j}^2}$ for a matrix $A$. Furthermore, if the true row sparsity is known a priori, the constrained counterpart of (3) would be a good alternative for variable selection and estimation, namely
$$ \min_{B \in \mathbb{R}^{p \times q}} \; l(B), \quad \text{s.t.} \quad \|B\|_{2,0} \leq K, \tag{4} $$
where $K > 0$ is a prescribed parameter that controls the row sparsity. Analogously, the constrained counterpart (4) is referred to as the constrained ℓ2,0-likelihood. Although the ℓ2,0-norm is discontinuous and nonconvex, Wang et al. [35] have proved that a global minimizer exists for a typical instance of (3), namely the optimization problem corresponding to multinomial logistic regression.
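As a quick illustration of the quantities just defined (a small sketch with assumed helper names, not code from the paper), the row support set and the ℓ2,0-norm of a coefficient matrix can be computed as follows.

```python
import numpy as np

def row_support(B, tol=0.0):
    """Row support set A = {j : ||b_j||_2 != 0}; tol allows for numerical zeros."""
    return np.flatnonzero(np.linalg.norm(B, axis=1) > tol)

def l20_norm(B, tol=0.0):
    """||B||_{2,0}: the number of nonzero rows of B."""
    return row_support(B, tol).size

B = np.array([[0.0, 0.0],
              [1.5, -0.2],
              [0.0, 0.0],
              [0.3, 0.0]])
print(row_support(B))  # [1 3]  (0-based row indices)
print(l20_norm(B))     # 2
```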

There is a huge body of work on group selection through relaxation schemes for the involved ℓ2,0-norm regularization under the multivariate regression (or multitask learning) framework. The two most popular methods are ℓ2,1-norm (i.e., the sum of all Euclidean norms of rows) regularization [2], [3], [9], [18], [25] and ℓ∞,1-norm (i.e., the sum of all maximum absolute elements of rows) regularization [19], [23], [33]. Due to the hardness of the ℓ2,0-norm, little is known about the statistical properties of the estimators resulting from the regularized and the constrained ℓ2,0-likelihoods, even though these approximation methods have been studied extensively, particularly in the setting of multivariate linear regression. Examples include feature selection consistency for the group lasso [25], oracle bounds on the estimated order of sparsity, prediction error and estimation error for the sparse group lasso [15], and oracle properties [8] for a three-level variable selection approach based on the group smoothly clipped absolute deviation (SCAD) penalty [11] and the generalized adaptive elastic-net [39].
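For comparison with these relaxation schemes, the two surrogate norms admit equally simple expressions; the sketch below (illustrative only) computes them row-wise.

```python
import numpy as np

def l21_norm(B):
    """||B||_{2,1}: sum of the Euclidean norms of the rows of B (group-lasso penalty)."""
    return float(np.sum(np.linalg.norm(B, axis=1)))

def linf1_norm(B):
    """||B||_{inf,1}: sum of the maximum absolute entries of the rows of B."""
    return float(np.sum(np.max(np.abs(B), axis=1)))
```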

In addition, establishing necessary or sufficient conditions for feature selection consistency has been shown to be imperative for any sparse regression method; examples include the strong irrepresentable condition, which is nearly necessary for the lasso [21], [42], the sparse Riesz condition (SRC) for the minimax concave penalty (MCP) [41] and the group lasso [36], [37], and necessary and sufficient conditions for the multivariate group lasso [25]. Moreover, the necessary conditions in [13], [29], [30] suggest that exponentially many candidate features in the sample size are possible for some methods to remain selection consistent.

In this paper, we focus on the previously uninvestigated group sparse regression for multivariate GLMs. The original ℓ2,0-norm penalty is used directly for feature selection instead of the existing relaxation schemes. We further provide a statistical analysis of the regularized ℓ2,0-likelihood method defined in (3), including a necessary condition for selection consistency and two asymptotic results: feature selection consistency and optimal coefficient estimation. Meanwhile, all the results apply to the constrained ℓ2,0-likelihood method defined in (4). In addition, we propose an ℓ2,0-proximal gradient (ℓ2,0-PG) algorithm and an ℓ2,0-improved iterative hard thresholding (ℓ2,0-IIHT) algorithm to solve the regularized model (3) and the constrained model (4), respectively. Numerical results on both synthetic and real data confirm the superiority of the proposed ℓ2,0-likelihood methods, owing to the use of group sparsity and of the original ℓ2,0-norm.

The remainder of the article is organized as follows. In Section 2, we derive a necessary condition for selection consistency of any sparse regression. In Section 3, we give the theoretical results of mis-estimation error bounds and asymptotic properties for both the regularized and the constrained ℓ2,0-likelihoods. Section 4 reports numerical studies comparing our ℓ2,0-likelihood methods with several classical and popular sparse regression methods. Concluding remarks are drawn in Section 5. Proofs of the main results are presented in Appendix A.

Necessary condition for selection consistency

This section establishes a necessary condition for feature selection consistency. Assume that minimizing (3) or (4) gives a global minimizer leading to the ℓ2,0-likelihood estimator $\hat{B} = (\hat{B}_{\hat{A}}^{\top}, 0^{\top}_{\hat{A}^c})^{\top}$ with $\hat{A} = \{j : \|\hat{b}_j\|_2 \neq 0\}$ the estimate of $A^{*}$. Selection consistency requires that $\Pr(\hat{A} \neq A^{*}) \to 0$ as $n \to \infty$ for an estimated $\hat{A}$. Write $B^{*} = ((B^{*}_{A^{*}})^{\top}, 0^{\top}_{A^{*c}})^{\top}$ with the row support set $A^{*} = \{j : \|b^{*}_j\|_2 \neq 0\} \subseteq \{1, \ldots, p\}$. The following definition of the degree of the separation between the true index set $A^{*}$ and a least favorable…

ℓ2,0-likelihood methods

This section is devoted to feature selection consistency and optimal coefficient estimation of the regularized ℓ2,0-likelihood method by accurate reconstruction of the oracle estimator $B^{\mathrm{ML}} = ((\hat{B}^{\mathrm{ML}}_{A^{*}})^{\top}, 0^{\top}_{A^{*c}})^{\top}$ given $A^{*}$. In addition, a parallel theory for the constrained ℓ2,0-likelihood method is established.

Numerical experiments

In this section, we propose an ℓ2,0-proximal gradient (ℓ2,0-PG) algorithm and an ℓ2,0-improved iterative hard thresholding (ℓ2,0-IIHT) algorithm to handle the regularized model (3) and the constrained model (4), respectively. According to [5] and [26], under some standard assumptions, a stationary point and even the optimal solution of (3) or (4) can be obtained by the proposed algorithms. To examine the effectiveness of our proposed ℓ2,0-likelihoods for multivariate GLMs, numerical…
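The ℓ2,0-PG and ℓ2,0-IIHT iterations themselves are detailed in the paper; as a rough orientation only, the sketch below (with assumed function names, a generic mean function mean_fn standing in for the gradient of φ, and no claim of matching the paper's exact updates) shows the row-wise proximal map of λ‖·‖2,0 and the row-wise hard thresholding that such schemes build on.

```python
import numpy as np

def prox_l20(V, lam):
    """Row-wise proximal map of lam * ||B||_{2,0}:
    a row v_j survives iff ||v_j||_2 > sqrt(2 * lam), otherwise it is set to zero."""
    out = V.copy()
    out[np.linalg.norm(V, axis=1) <= np.sqrt(2.0 * lam)] = 0.0
    return out

def hard_threshold_rows(V, K):
    """Keep the K rows of V with the largest Euclidean norms, zero out the rest."""
    out = np.zeros_like(V)
    keep = np.argsort(-np.linalg.norm(V, axis=1))[:K]
    out[keep] = V[keep]
    return out

def pg_step(B, X, Y, mean_fn, lam, step):
    """One hypothetical proximal-gradient step for the regularized problem (3).
    The gradient of l(B) is X^T (mu(XB) - Y), where mu = mean_fn applies the
    gradient of phi row-wise; the penalty enters the prox scaled by the step size."""
    grad = X.T @ (mean_fn(X @ B) - Y)
    return prox_l20(B - step * grad, lam * step)
```

For the constrained model (4), hard_threshold_rows acts as the Euclidean projection onto {B : ‖B‖2,0 ≤ K}, which is the kind of step an IIHT-type iteration repeats.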

Conclusions

In this article we have considered the ℓ2,0-likelihood methods for multivariate GLMs in high-dimensional settings. Under mild conditions, a necessary condition for selection consistency has been derived, which permits up to exponentially many candidate features in the regression. Moreover, selection consistency and optimal coefficient estimation have been established for both the regularized and constrained ℓ2,0-likelihood methods, and numerical experiments have verified the superiority of the ℓ2,0…

CRediT authorship contribution statement

Yang Chen: Methodology, Software, Data curation, Writing - original draft, Writing - review & editing. Ziyan Luo: Conceptualization, Validation, Investigation, Writing - review & editing, Supervision. Lingchen Kong: Validation, Investigation, Visualization.

Acknowledgments

The authors sincerely thank Prof. Dietrich von Rosen, the Editor-in-Chief, and the anonymous reviewers for their helpful comments and suggestions to improve the quality of the paper. This research was supported by the National Natural Science Foundation of China [Grant Numbers 11771038, 12071022] and Beijing Natural Science Foundation [Grant Number Z190002].

References (43)

  • Fan, J., et al., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. (2001)
  • Gong, P., Ye, J., Zhang, C., Robust multi-task feature learning, in: Proceedings of the 18th ACM SIGKDD International...
  • Hardin, J.W., et al., Generalized Linear Models and Extensions (2018)
  • Ibragimov, I.A., et al.
  • Kim, Y., et al., Smoothly clipped absolute deviation on high dimensions, J. Amer. Statist. Assoc. (2008)
  • Kolmogorov, A.N., et al., ε-entropy and ε-capacity of sets in functional space, Amer. Math. Soc. Transl. (1961)
  • Li, Y., et al., Multivariate sparse group Lasso for the multivariate multiple linear regression with an arbitrary group structure, Biometrics (2015)
  • Li, Z., et al., Variable selection and estimation in generalized linear models with the seamless ℓ0 penalty, Can. J. Stat. (2012)
  • Lin, M., et al., Efficient algorithms for multivariate shape-constrained convex regression problems (2020)
  • Liu, J., et al., Multi-task feature learning via efficient ℓ2,1-norm minimization (2012)
  • Liu, H., Palatucci, M., Zhang, J., Blockwise coordinate descent procedures for the multi-task Lasso, with applications to...