Abstract
Variable selection is a well-studied problem in linear regression, but existing work mostly deals with continuous responses. In many applications, however, the response is categorical. In the classical (frequentist) setting, there exist penalized regression methods (e.g. the logistic Lasso) that can be used for variable selection when we have a categorical response and a large number of predictors. In this paper, we compare the performance of three alternative approaches for handling data with a single categorical response and multiple continuous (or count) predictors: the well-known logistic Lasso, a model-based Bayesian approach, and a model-free approach. We consider both a binary response and a response with three categories. Through extensive simulation studies we compare the performance of these three competing methods. We observe that the model-based methods can often accurately identify the important predictors, but sometimes fail to detect the unimportant ones. Moreover, the model-based approaches are computationally expensive, whereas the model-free approach is extremely fast. For misspecified models, the model-free method clearly outperforms the others in prediction. However, when the predictors are moderately or substantially correlated, the model-based methods perform better than the model-free method. We analyse the well-known Pima Indian Diabetes dataset to illustrate the effectiveness of the three competing methods under consideration.
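To make the first of the three approaches concrete, the following is a minimal, illustrative sketch (not taken from the paper) of logistic-Lasso variable selection on synthetic data with a binary response. It uses scikit-learn's L1-penalized logistic regression; the data-generating setup (20 predictors, of which only the first 3 are informative) and the penalty strength `C=0.1` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: n = 500 observations, p = 20 continuous predictors,
# binary response generated from a logistic model in which only the
# first 3 predictors have nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # the important predictors
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# Logistic Lasso: the L1 penalty shrinks coefficients of unimportant
# predictors to exactly zero, performing variable selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])
print("selected predictors:", selected)
```

In practice the penalty strength would be chosen by cross-validation (e.g. via `LogisticRegressionCV`) rather than fixed in advance; a multinomial extension of the same idea handles the three-category response considered in the paper.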



Cite this article
Sen, S., Kundu, D. & Das, K. Variable selection for categorical response: a comparative study. Comput Stat 38, 809–826 (2023). https://doi.org/10.1007/s00180-022-01260-1