Abstract
Variable selection is a well-studied problem in linear regression, but existing work mostly deals with continuous responses. In many applications, however, the response is categorical. In the classical (frequentist) setting, there exist penalized regression methods (e.g. the logistic Lasso) that can be used for variable selection when we have a categorical response and a large number of predictors. In this paper, we compare the performance of three alternative approaches for handling data with a single categorical response and multiple continuous (or count) predictors: the well-known logistic Lasso, a model-based Bayesian approach, and a model-free approach. We consider both a binary response and a response with three categories. Through extensive simulation studies we compare the performance of these three competing methods. We observe that the model-based methods can often accurately identify the important predictors, but sometimes fail to detect the unimportant ones. Moreover, the model-based approaches are computationally expensive, whereas the model-free approach is extremely fast. For misspecified models, the model-free method clearly outperforms the others in prediction. However, when the predictors are moderately or substantially correlated, the model-based methods perform better than the model-free method. We analyse the well-known Pima Indian Diabetes dataset to illustrate the effectiveness of the three competing methods under consideration.
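To make the first of the three approaches concrete, the following is a minimal, illustrative sketch (not taken from the paper) of logistic-Lasso variable selection on synthetic data with a binary response. It uses scikit-learn's L1-penalized logistic regression; the data-generating setup (20 predictors, of which only the first 3 are informative) and the penalty strength `C=0.1` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: n = 500 observations, p = 20 continuous predictors,
# binary response generated from a logistic model in which only the
# first 3 predictors have nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # the important predictors
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# Logistic Lasso: the L1 penalty shrinks coefficients of unimportant
# predictors to exactly zero, performing variable selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])
print("selected predictors:", selected)
```

In practice the penalty strength would be chosen by cross-validation (e.g. via `LogisticRegressionCV`) rather than fixed in advance; a multinomial extension of the same idea handles the three-category response considered in the paper.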



Cite this article
Sen, S., Kundu, D. & Das, K. Variable selection for categorical response: a comparative study. Comput Stat 38, 809–826 (2023). https://doi.org/10.1007/s00180-022-01260-1