$\ell_{2,0}$-norm based selection and estimation for multivariate generalized linear models
Introduction
Consider multivariate generalized linear models (GLMs) with observations $\{(x_i, y_i)\}_{i=1}^n$, where $x_i$ is a $p$-dimensional predictor vector and $y_i$ is a $q$-dimensional response vector. The observations are i.i.d. drawn from the exponential family with the canonical parameter vector $\theta \in \mathbb{R}^q$: $$p(y; \theta) = \nu(y) \exp\{\langle T(y), \theta \rangle - \psi(\theta)\}, \tag{1}$$ where $\nu$ is the underlying measure, $T$ is the sufficient statistic and $\psi$ is a known link function. Multivariate GLMs assume that $\theta = \Theta^{*\top} x$, with $\Theta^* \in \mathbb{R}^{p \times q}$ the true underlying coefficient matrix. The density of $y$ given $x$ can then be written as $p(y \mid x; \Theta^*) = \nu(y) \exp\{\langle T(y), \Theta^{*\top} x \rangle - \psi(\Theta^{*\top} x)\}$. Note that setting $q = 1$ leads to traditional GLMs introduced by McCullagh and Nelder [20]. The maximum likelihood estimator of the multivariate GLM, termed $\hat{\Theta}_{\mathrm{ML}}$, is the optimal solution of the following minimization problem: $$\min_{\Theta \in \mathbb{R}^{p \times q}} \; \ell_n(\Theta) := \frac{1}{n} \sum_{i=1}^n \big[ \psi(\Theta^\top x_i) - \langle T(y_i), \Theta^\top x_i \rangle \big]. \tag{2}$$ More about multivariate GLMs can be found in the monographs [7] by Fahrmeir and Tutz and [10] by Hardin and Hilbe.
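To make the setup concrete, the following is a minimal numerical sketch of the negative log-likelihood loss for one standard instance of the maximum likelihood problem above: independent Poisson responses with the canonical log link, so that $T(y) = y$ and $\psi(\theta) = \sum_k e^{\theta_k}$. The function name and the synthetic data are our own illustration, not code from the paper.

```python
import numpy as np

def neg_log_likelihood(Theta, X, Y):
    """Negative log-likelihood (up to constants) of a multivariate GLM with
    independent Poisson responses and the canonical log link, i.e.
    T(y) = y and psi(theta) = sum_k exp(theta_k)."""
    Eta = X @ Theta                      # n x q matrix of canonical parameters Theta^T x_i
    return np.mean(np.sum(np.exp(Eta) - Y * Eta, axis=1))

# Tiny example: n = 50 observations, p = 3 predictors, q = 2 responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Theta_true = np.array([[0.5, -0.2],
                       [0.0,  0.0],    # an irrelevant predictor (zero row)
                       [0.3,  0.4]])
Y = rng.poisson(np.exp(X @ Theta_true))

loss_true = neg_log_likelihood(Theta_true, X, Y)
loss_zero = neg_log_likelihood(np.zeros((3, 2)), X, Y)
print(loss_true, loss_zero)  # the true coefficients should give the smaller loss
```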
In high-dimensional data analysis, feature selection becomes increasingly crucial since true underlying models often admit a sparse representation. Meanwhile, it is worth noting that the inherent sparsity may possess group structures, such as the row sparsity of the coefficient matrix in the setting of multivariate GLMs, resulting from the fact that multiple responses are possibly related to a group of common features. For example, in joint sparse signal recovery problems [22], multiple signals share a common sparse support set, and the sparsity is defined as the number of nonzero rows in the source matrix. In multiple classification problems [24], multiple tasks are related to a common sparse set of covariates, and the corresponding covariate matrix also possesses row sparsity. Our goal is to explicitly take this predefined group sparse structure into account for multivariate GLMs.
Compared with classical sparse regression using the $\ell_0$-norm (i.e., the number of nonzero entries) [29], [30], [34] and its relaxation surrogates (see, e.g., [8], [16], [32], [41], [43]), there are several compelling reasons to consider row sparsity for multivariate GLMs. It is more reasonable since the group structure is an important piece of prior knowledge in many applications, such as integrative genomics [27], visual classification [40] and the applications mentioned above [22], [24]. In addition, by taking the dependence among responses into account, the resulting estimator performs better and is more interpretable [39]. Moreover, the use of group structure avoids the dramatic increase in dimension caused by direct use of the vectorization technique, especially for large-scale problems.
Write the coefficient matrix as $\Theta^* = [\theta^{*1}, \ldots, \theta^{*p}]^\top$ with rows $\theta^{*j} \in \mathbb{R}^q$, where the row support set $S^* = \{j : \theta^{*j} \neq 0\}$ is the index set of nonzero rows in $\Theta^*$ with the row sparsity $s^* = |S^*|$, and $\Theta^*_{(S^*)^c}$ is a matrix of $0$'s, with $(S^*)^c$ the complementary set of $S^*$. Then the $\ell_{2,0}$-norm can be defined as $\|\Theta\|_{2,0} = |\{j : \theta^j \neq 0\}|$, with $|\cdot|$ the size of a set. It is the most direct and accurate depiction of the row sparse structure. In addition, numerical results reported in [30] have shown that the $\ell_0$-norm constrained likelihood method outperforms the lasso under the univariate regression framework. In this sense, it is natural to formulate the group sparse regression for multivariate GLMs by solving $$\min_{\Theta \in \mathbb{R}^{p \times q}} \; \ell_n(\Theta) + \lambda \|\Theta\|_{2,0}, \tag{3}$$ where $\ell_n$ is the negative log-likelihood loss, $\|\Theta\|_{2,0}$ is the $\ell_{2,0}$-norm of $\Theta$ that counts the number of nonzero rows, and $\lambda > 0$ is a regularization parameter which balances the magnitude of the loss function and the row sparsity of the estimator. We refer to this method as the regularized $\ell_{2,0}$-likelihood. Note that the $\ell_{2,0}$-norm has already been employed for group selection in multivariate linear models [25], in which the least squares loss $\frac{1}{2n}\|Y - X\Theta\|_F^2$ has been adopted, where $\|\cdot\|_F$ denotes the Frobenius norm with $\|A\|_F = (\sum_{i,j} A_{ij}^2)^{1/2}$ for a matrix $A$. Furthermore, if the true row sparsity is known a priori, the constrained counterpart of (3) would be a good alternative for variable selection and estimation, which is $$\min_{\Theta \in \mathbb{R}^{p \times q}} \; \ell_n(\Theta) \quad \text{subject to} \quad \|\Theta\|_{2,0} \le s, \tag{4}$$ where $s$ is a prescribed parameter that controls the row sparsity. Analogously, the constrained counterpart (4) is referred to as the constrained $\ell_{2,0}$-likelihood. Although the $\ell_{2,0}$-norm is discontinuous and nonconvex, Wang et al. [35] have proved that a global minimizer exists for the typical example of (3), the optimization problem corresponding to multinomial logistic regression.
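As a small numerical illustration of the machinery above (our own sketch, with hypothetical function names): the $\ell_{2,0}$-norm simply counts the nonzero rows of a matrix, and the feasible set of (4) admits an easy projection that keeps the $s$ rows of largest Euclidean norm and zeroes out the rest.

```python
import numpy as np

def l20_norm(Theta, tol=0.0):
    """Number of nonzero rows of Theta, i.e. the l_{2,0}-norm."""
    return int(np.sum(np.linalg.norm(Theta, axis=1) > tol))

def hard_threshold_rows(Theta, s):
    """Projection onto {Theta : ||Theta||_{2,0} <= s}: keep the s rows
    with largest Euclidean norm, zero out the rest."""
    row_norms = np.linalg.norm(Theta, axis=1)
    keep = np.argsort(row_norms)[-s:]        # indices of the s largest rows
    out = np.zeros_like(Theta)
    out[keep] = Theta[keep]
    return out

Theta = np.array([[3.0, 4.0],   # row norm 5
                  [0.0, 0.0],   # row norm 0
                  [1.0, 0.0]])  # row norm 1
print(l20_norm(Theta))                           # 2 nonzero rows
print(l20_norm(hard_threshold_rows(Theta, 1)))   # only the largest row survives
```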
There is a huge body of work on group selection through relaxation schemes for the involved $\ell_{2,0}$-norm regularization under the multivariate regression (or multitask learning) framework. The two most popular methods are $\ell_{2,1}$-norm (i.e., the sum of all Euclidean norms of rows) regularization [2], [3], [9], [18], [25] and $\ell_{1,\infty}$-norm (i.e., the sum of all maximum absolute elements of rows) regularization [19], [23], [33]. Due to the hardness of the $\ell_{2,0}$-norm, little is known about statistical properties of the estimators resulting from the regularized as well as the constrained $\ell_{2,0}$-likelihood, although these approximation methods have been extensively studied, particularly in the setting of multivariate linear regression. Examples include feature selection consistency for the group lasso [25]; oracle bounds on the estimated order of sparsity, prediction error and estimation error for the sparse group lasso [15]; and oracle properties [8] for a three-level variable selection approach based on the group smoothly clipped absolute deviation (SCAD) penalty [11] and for the generalized adaptive elastic-net [39].
In addition, identifying necessary or sufficient conditions for feature selection consistency has been shown to be imperative for any sparse regression method; examples include a strong irrepresentable condition which is nearly necessary for the lasso [21], [42], a sparse Riesz condition (SRC) for the minimax concave penalty (MCP) [41] and the group lasso [36], [37], and necessary and sufficient conditions for the multivariate group lasso [25]. Moreover, necessary conditions in [13], [29], [30] suggest that exponentially many candidate features in the sample size are possible for some methods to be selection consistent.
In this paper, we focus on the heretofore uninvestigated group sparse regression for multivariate GLMs. The original $\ell_{2,0}$-norm penalty is used directly for feature selection instead of the existing relaxation schemes. We further provide a statistical analysis of the regularized $\ell_{2,0}$-likelihood method defined in (3), including a necessary condition for selection consistency and two asymptotic results on feature selection consistency and optimal coefficient estimation. All of these results also apply to the constrained $\ell_{2,0}$-likelihood method defined in (4). In addition, we propose an $\ell_{2,0}$-proximal gradient ($\ell_{2,0}$-PG) algorithm and an $\ell_{2,0}$-improved iterative hard thresholding ($\ell_{2,0}$-IIHT) algorithm to solve the regularized model (3) and the constrained model (4), respectively. Numerical results on both synthetic and real data confirm the superiority of our proposed $\ell_{2,0}$-likelihood methods arising from the group sparsity and the use of the original $\ell_{2,0}$-norm.
The remainder of the article is organized as follows. In Section 2, we derive a necessary condition for selection consistency of any sparse regression method. In Section 3, we give theoretical results on the mis-estimation error bound and asymptotic properties for both the regularized and the constrained $\ell_{2,0}$-likelihoods. Section 4 reports numerical studies comparing our $\ell_{2,0}$-likelihood methods with several classical and popular sparse regression methods. Concluding remarks are drawn in Section 5. Proofs of the main results are presented in Appendix A.
Necessary condition for selection consistency
This section establishes a necessary condition for feature selection consistency. Assume that minimizing (3) or (4) gives a global minimizer leading to the $\ell_{2,0}$-likelihood estimator $\hat{\Theta}$ with $\hat{S}$ the estimation of $S^*$. Selection consistency requires that $P(\hat{S} = S^*) \to 1$ as $n \to \infty$ for an estimated $\hat{S}$. Write $\hat{\Theta} = [\hat{\theta}^1, \ldots, \hat{\theta}^p]^\top$ with the row support set $\hat{S} = \{j : \hat{\theta}^j \neq 0\}$. The following definition of the degree of the separation between the true index set and a least favorable
$\ell_{2,0}$-likelihood methods
This section is devoted to feature selection consistency and optimal coefficient estimation of the regularized $\ell_{2,0}$-likelihood method via accurate reconstruction of the oracle estimator given $S^*$. In addition, a parallel theory for the constrained $\ell_{2,0}$-likelihood method is established.
Numerical experiments
In this section, we propose an $\ell_{2,0}$-proximal gradient ($\ell_{2,0}$-PG) algorithm and an $\ell_{2,0}$-improved iterative hard thresholding ($\ell_{2,0}$-IIHT) algorithm to handle the regularized model (3) and the constrained model (4), respectively. According to [5] and [26], under some standard assumptions, a stationary point and even the optimal solution of (3) or (4) can be obtained by the proposed algorithms. To examine the effectiveness of our proposed $\ell_{2,0}$-likelihoods for multivariate GLMs, numerical
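Such algorithms follow a gradient-then-row-threshold pattern. The following is a rough sketch of that pattern for the constrained model (4), not the paper's $\ell_{2,0}$-IIHT itself: we substitute a multivariate least-squares loss for the GLM negative log-likelihood, and the function name, step size choice, and data are our own illustration.

```python
import numpy as np

def l20_iht(X, Y, s, step=None, iters=200):
    """Basic iterative hard thresholding for
        min (1/2n) ||Y - X Theta||_F^2  s.t.  ||Theta||_{2,0} <= s,
    a least-squares stand-in for the constrained GLM likelihood: take a
    gradient step, then keep only the s rows of largest Euclidean norm."""
    n, p = X.shape
    q = Y.shape[1]
    if step is None:
        step = n / (np.linalg.norm(X, 2) ** 2)   # 1/L for the quadratic loss
    Theta = np.zeros((p, q))
    for _ in range(iters):
        grad = X.T @ (X @ Theta - Y) / n
        Z = Theta - step * grad
        keep = np.argsort(np.linalg.norm(Z, axis=1))[-s:]
        Theta = np.zeros((p, q))
        Theta[keep] = Z[keep]                    # row hard thresholding
    return Theta

# Synthetic row-sparse problem: only rows 1 and 4 of Theta_true are nonzero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
Theta_true = np.zeros((8, 3))
Theta_true[[1, 4]] = rng.normal(size=(2, 3))
Y = X @ Theta_true + 0.01 * rng.normal(size=(100, 3))

Theta_hat = l20_iht(X, Y, s=2)
print(sorted(np.flatnonzero(np.linalg.norm(Theta_hat, axis=1))))  # estimated row support
```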
Conclusions
In this article we have considered the $\ell_{2,0}$-likelihood methods for multivariate GLMs in high-dimensional settings. Under mild conditions, a necessary condition for selection consistency has been derived, which permits up to exponentially many candidate features in regression. Moreover, selection consistency and optimal coefficient estimation have been established for both the regularized and constrained $\ell_{2,0}$-likelihood methods, and numerical experiments have verified the superiority of the
CRediT authorship contribution statement
Yang Chen: Methodology, Software, Data curation, Writing - original draft, Writing - review & editing. Ziyan Luo: Conceptualization, Validation, Investigation, Writing - review & editing, Supervision. Lingchen Kong: Validation, Investigation, Visualization.
Acknowledgments
The authors sincerely thank Prof. Dietrich von Rosen, the Editor-in-Chief, and the anonymous reviewers for their helpful comments and suggestions to improve the quality of the paper. This research was supported by the National Natural Science Foundation of China [Grant Numbers 11771038, 12071022] and Beijing Natural Science Foundation [Grant Number Z190002].
References (43)
- et al., A rich family of generalized Poisson regression models with applications, Math. Comput. Simulation (2005).
- et al., Model determination and estimation for the growth curve model via group SCAD penalty, J. Multivariate Anal. (2014).
- et al., Multivariate data analysis applied to low-density polyethylene reactors, Chemometr. Intell. Lab. Syst. (1992).
- et al., On the oracle property of a generalized adaptive elastic-net for multivariate linear regression with a diverging number of parameters, J. Multivariate Anal. (2017).
- G.I. Allen, Z. Liu, A log-linear graphical model for inferring genetic networks from high-throughput sequencing data, ...
- et al., Multi-task feature learning, Adv. Neural Inf. Process. Syst. (2007).
- et al., Convex multi-task feature learning, Mach. Learn. (2008).
- First-Order Methods in Optimization (2017).
- et al., Efficient online and batch learning using forward backward splitting, J. Mach. Learn. Res. (2009).
- L. Fahrmeir, G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models (2001).