Homogeneity detection for the high-dimensional generalized linear model
Introduction
For years, penalized estimation has been extended to reflect prior knowledge of relationships among covariates (Tibshirani et al., 2005, Yuan and Lin, 2006). A more challenging and interesting problem, however, is to group covariates without any prior knowledge, which we call homogeneity detection in this paper. Homogeneity detection means identifying a model structure in which the regression coefficients are grouped so that they take exactly the same value within each group; it is conceptually more general than variable selection (Shen and Huang, 2010, Ke et al., 2015).
Many researchers have developed homogeneity detection techniques for the linear regression model. For example, Bondell and Reich (2008) proposed an octagonal shrinkage and clustering algorithm for regression by using two convex penalties; Petry et al. (2011) and Jang et al. (2013) developed similar techniques, called the pairwise least absolute shrinkage and selection operator and the hexagonal operator for regression with shrinkage and equality selection, respectively. Shen and Huang (2010) proposed to use the capped-ℓ1 penalty and proved that the resulting estimator has an oracle property (Shen and Huang, 2010, Shen et al., 2012, Zhu et al., 2013). Similar work can be found in Ke et al. (2015) and in Petry et al. (2011) and Masarotto and Varin (2012), who used the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the adaptive version (Zou, 2006) of the least absolute shrinkage and selection operator (LASSO) penalty, respectively.
Homogeneity detection is especially useful when there are categorical covariates in the model. In this case, regression coefficients are interpreted as relative effects on the response with respect to a predefined reference level. Through homogeneity detection we can produce a simpler model by collapsing multiple levels that have the same effect (Gertheiss and Tutz, 2010). Moreover, homogeneity detection gives better prediction accuracy than other conventional sparse estimators when the true model has the homogeneity structure (Ke et al., 2015).
In this paper, we propose a penalized estimator for homogeneity detection in the high-dimensional generalized linear model (GLM) that is composed of two non-convex penalties: one inducing individual sparsity and one inducing sparsity of pairwise differences. We consider a class of non-convex penalties that includes most of the existing non-convex penalties considered by previous researchers.
First, we extend homogeneity detection from the linear regression model (Ke et al., 2015, Shen and Huang, 2010) to GLMs. The main challenges are investigating asymptotic properties that support the use of the proposed estimator and developing a computational algorithm when the model is high-dimensional. We prove that the proposed estimator satisfies a weak oracle property (Fan and Lv, 2011, Kwon and Kim, 2012, Kim et al., 2016), which is new and covers the results in Ke et al. (2015) and Shen and Huang (2010) under mild conditions. We develop an algorithm by combining the concave–convex procedure (Yuille and Rangarajan, 2003, Kim et al., 2008) with the alternating direction method of multipliers (Boyd et al., 2011).
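The concave–convex procedure mentioned above rests on splitting a non-convex penalty into a convex ℓ1 part plus a differentiable concave part; each iteration linearizes the concave part at the current iterate, leaving a convex weighted-ℓ1 problem. A minimal sketch of this decomposition for the SCAD penalty (illustrative only; the paper's penalty class is broader, and the function names here are our own):

```python
import numpy as np

def scad(t, lam, a=3.7):
    # SCAD penalty (Fan and Li, 2001), applied elementwise to |t|
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))

def concave_part_grad(t, lam, a=3.7):
    # Gradient (w.r.t. |t|) of the concave part scad(t) - lam*|t|.
    # CCCP replaces the concave part by its tangent at the current
    # iterate, so each step solves a convex lasso-type problem.
    t = np.abs(np.asarray(t, dtype=float))
    d_scad = np.where(t <= lam, lam,
                      np.where(t <= a * lam, (a * lam - t) / (a - 1), 0.0))
    return d_scad - lam
```

Near zero the concave part is flat (gradient 0), so small coefficients feel the full ℓ1 shrinkage, while for large coefficients the gradient approaches −λ, cancelling the ℓ1 term and leaving them essentially unpenalized.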
Second, we prove that the homogeneity structure constructed by the proposed estimator does not depend on the choice of reference levels in the presence of categorical covariates. This offers a new and critical justification for the proposed method of homogeneity detection: the proposed estimator is invariant to the choice of reference levels, a property that does not hold for conventional sparse estimators such as the LASSO and SCAD.
We organize the paper as follows. Section 2 introduces a penalized MLE and presents a computational algorithm for homogeneity detection in GLMs. Section 3 proves the weak oracle property under regularity conditions. Section 4 studies the invariance property when the model includes categorical covariates. Section 5 presents the results of numerical studies, and concluding remarks follow in Section 6.
Section snippets
Penalized estimation for homogeneity detection
Let (x_i, y_i), i = 1, …, n, be a random sample of response–predictor pairs from a GLM in which the conditional density function of y_i given x_i = x is f(y | x; β*) = exp{y x^T β* − b(x^T β*) + c(y)} for a link function b′, and the marginal distribution of x_i does not depend on β*. Given λ1 > 0 and λ2 > 0, we consider a penalized MLE that is defined as β̂ = argmin_β Q(β), where Q(β) = −(1/n) Σ_{i=1}^n log f(y_i | x_i; β) + Σ_j J_{λ1}(|β_j|) + Σ_{j<k} J_{λ2}(|β_j − β_k|) and J_λ is a non-convex penalty with tuning parameter λ. The penalized negative log-likelihood
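As a concrete illustration, the penalized objective can be evaluated for a logistic GLM (canonical logit link) with the capped-ℓ1 penalty standing in for the generic non-convex penalty J_λ. This is a minimal sketch under those assumptions; the function names and the choice of penalty are ours, not the paper's:

```python
import numpy as np
from itertools import combinations

def capped_l1(t, lam, tau):
    # capped-l1 penalty J_lam(t) = lam * min(|t|, tau): one member of the
    # class of non-convex penalties (tau caps the amount of shrinkage)
    return lam * np.minimum(np.abs(t), tau)

def penalized_nll(beta, X, y, lam1, lam2, tau=1.0):
    # negative log-likelihood of a Bernoulli GLM with logit link,
    # plus an individual-sparsity penalty and a pairwise-difference penalty
    eta = X @ beta
    nll = np.mean(np.log1p(np.exp(eta)) - y * eta)
    pen1 = capped_l1(beta, lam1, tau).sum()             # individual sparsity
    pen2 = sum(capped_l1(beta[j] - beta[k], lam2, tau)  # pairwise differences
               for j, k in combinations(range(beta.size), 2))
    return nll + pen1 + pen2
```

At β = 0 both penalty terms vanish and the objective reduces to the null Bernoulli deviance term log 2, which gives a quick sanity check of an implementation.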
Asymptotic property
In this section, we provide sufficient conditions for local optimality of the oracle MLE and then prove that the oracle MLE becomes a local minimizer of the penalized negative log-likelihood with probability converging to 1.
An application: homogeneity detection in the presence of categorical covariates
In this section we focus on the case where all covariates are categorical and consider the problem of reducing the number of levels of each categorical variable using the proposed method. For example, continuous variables are often categorized to accommodate nonlinear effects, as in a locally constant nonparametric regression model, and deciding how many levels to create for a given continuous covariate is an important problem.
Unlike the usual MLE, the penalized MLE can produce a different predictive model depending on
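Once the dummy coefficients of a categorical covariate are estimated, collapsing its levels amounts to grouping levels whose coefficients coincide, with the reference level entering at coefficient 0. A minimal sketch of this post-processing step (the function name and the tolerance-based merging are illustrative; the penalized fit itself produces exact ties):

```python
import numpy as np

def collapse_levels(coefs, tol=1e-8):
    # Group levels of a categorical covariate whose estimated dummy
    # coefficients coincide (up to tol). The reference level enters
    # with coefficient 0, so it may be merged with other levels too.
    full = np.concatenate(([0.0], np.asarray(coefs, dtype=float)))
    order = np.argsort(full)             # scan levels in coefficient order
    groups, current = [], [int(order[0])]
    for prev, idx in zip(order[:-1], order[1:]):
        if abs(full[idx] - full[prev]) <= tol:
            current.append(int(idx))     # same effect: extend current group
        else:
            groups.append(sorted(current))
            current = [int(idx)]
    groups.append(sorted(current))
    return groups  # lists of level indices; index 0 is the reference level
```

For a five-level covariate with estimated coefficients (0, 1.2, 1.2, −0.5) relative to the reference, this yields three collapsed levels: {reference, level 1}, {levels 2, 3}, and {level 4}.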
Numerical studies
This section presents the results of numerical studies, including two simulated examples and two real data analyses.
Concluding remarks
In this paper we extend the sparse regularization method for the regression model to a group pursuit method in the GLM with various non-convex penalties. This extension is based on a generalization of local optimality conditions for the log-likelihood function with a large class of non-convex grouping penalties and on an investigation of the asymptotic properties of the oracle estimator under regularity conditions. The regularity conditions require that clustered coefficients are sufficiently far
Acknowledgments
Jeon’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF-2016R1C1B1010545) funded by the Ministry of Science, ICT and Future Planning. Kwon’s research was supported by Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (No. 2014R1A1A1002995). Choi’s research was supported by Basic Science Research Program through the National Research Foundation
References (46)
- Split Bregman method for large scale fused LASSO, Comput. Statist. Data Anal. (2011)
- The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert Syst. Appl. (2009)
- Learning to detect phishing webpages, J. Internet Serv. Inf. Secur. (2014)
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imag. Sci. (2009)
- Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics (2008)
- Simultaneous factor selection and collapsing levels in ANOVA, Biometrics (2009)
- Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. (2011)
- Fused least absolute shrinkage and selection operator for credit scoring, J. Stat. Comput. Simul. (2015)
- Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. (2001)
- Nonconcave penalized likelihood with NP-dimensionality, IEEE Trans. Inform. Theory (2011)
- Nonconcave penalized likelihood with a diverging number of parameters, Ann. Statist.
- Tuning parameter selection in high dimensional penalized likelihood, J. R. Stat. Soc. Ser. B Stat. Methodol.
- A statistical view of some chemometrics regression tools, Technometrics
- Sparse modeling of categorial explanatory variables, Ann. Appl. Stat.
- Fast alternating direction optimization methods, SIAM J. Imag. Sci.
- A tutorial on MM algorithms, Amer. Statist.
- Homogeneity pursuit, J. Amer. Statist. Assoc.
- Smoothly clipped absolute deviation on high dimensions, J. Amer. Statist. Assoc.
- A necessary condition for the strong oracle property, Scand. J. Statist.
- Global optimality of non-convex penalized estimators, Biometrika
- Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions, Statist. Sinica