Visualization and assessment of model selection uncertainty

https://doi.org/10.1016/j.csda.2022.107598

Abstract

Although model selection is ubiquitous in scientific discovery, the stability and uncertainty of the selected model are often hard to evaluate. Characterizing the random behavior of the model selection procedure is key to understanding and quantifying model selection uncertainty. To this end, several graphical tools, the G-plots and H-plots, are first proposed to visualize the distribution of the selected model. The concept of model selection deviation is then introduced to quantify model selection uncertainty. Analogous to the standard error of an estimator, the model selection deviation measures the stability of the model chosen by a selection procedure. For this measure, a bootstrap estimation procedure is discussed, and its desirable performance is demonstrated through simulation studies and real data analysis.

Introduction

Model selection plays an important role in modern scientific discovery. There has been exciting work on penalized model selection methods for linear regression, such as the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), and others. With many methods at our disposal, it is crucial to understand their stability and uncertainty before committing to one or a few of them. No matter which selection method is applied, selection uncertainty is a ubiquitous issue and has a complex impact on subsequent inference. In this article, we therefore propose several graphical and numerical tools to understand and quantify model selection uncertainty.

Similar to a point estimate of a population parameter in the classical statistical framework, the model chosen by a selection method can be viewed as a random "point estimate" of the true or optimal model. To study its random behavior, it is natural to investigate its distribution, i.e., the distribution of the selected model, or the model selection distribution. Unfortunately, such a distribution is complex and sometimes mathematically intractable. Many classical tools for parameter estimation are no longer applicable in the model selection setting. This is partly because the support of the distribution, i.e., the set of all possible models, is discrete. Moreover, these models cannot be ordered or compared easily due to their complex relationships. In addition, the size of the support grows exponentially with the data dimension: with $p$ covariates, there are $2^p$ possible regression models.
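
As a minimal illustration of this discrete support (our own sketch, not part of the original article), a candidate model can be encoded as a binary inclusion vector over the $p$ covariates, and the number of candidate models grows as $2^p$:

```r
# Encode a selected model as a binary inclusion vector over p covariates.
p <- 10
model <- c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # covariates 1, 3, and 6 selected
which(model == 1)                         # the selected support: 1 3 6
2^p                                       # 1024 candidate models when p = 10
```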

Such a distribution, if available, would give analysts a comprehensive understanding of the random behavior of the selected model. Among all aspects of this random behavior, stability is one of the most important, because it reflects the selection uncertainty and the trustworthiness of the selected model. The issue of model selection uncertainty is twofold. First, given different samples drawn from a common population, the same selection method may identify different models. Second, different selection methods, when applied to the same data set, can result in different models. Although many methods claim to achieve optimal performance under specific settings, their selection results are often quite different.
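
The first source of uncertainty is easy to demonstrate empirically. The sketch below is our own illustration (not the paper's procedure), using the glmnet package cited later in the article; the design, coefficients, and number of resamples are arbitrary assumptions. It refits the cross-validated Lasso on bootstrap resamples of one simulated data set and tabulates the distinct supports it selects:

```r
library(glmnet)
set.seed(1)
n <- 100; p <- 8
X <- scale(matrix(rnorm(n * p), n, p))  # standardized design
beta <- c(2, -1.5, 1, rep(0, p - 3))    # sparse true coefficients
y <- drop(X %*% beta + rnorm(n))

# Refit the cross-validated Lasso on bootstrap resamples;
# the selected support varies from resample to resample.
selected <- replicate(200, {
  idx <- sample(n, replace = TRUE)
  fit <- cv.glmnet(X[idx, ], y[idx])
  paste(which(as.vector(coef(fit, s = "lambda.min"))[-1] != 0), collapse = ",")
})
head(sort(table(selected), decreasing = TRUE))  # most frequently selected models
```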

Model selection uncertainty has long been an active area of research (Chatfield, 1995). Hansen et al. (2011) propose the model confidence set (MCS) to convey the uncertainty surrounding model selection, and it has been frequently used to measure estimation uncertainty (Bayer, 2018; Seri et al., 2020). Nan and Yang (2014) propose a variable selection deviation measure to evaluate the reliability and trustworthiness of the selected model based on a model averaging approach (Yang and Yang, 2017; Ye et al., 2018). To derive a suitable model in multivariate regression, Sauerbrei et al. (2015) adopt bootstrap resampling to assess variable selection stability. Hennig and Sauerbrei (2019) propose a measure that explores variable selection instability by analyzing dissimilarities among the results from different bootstrap samples. Yu et al. (2022) propose estimating F- and G-measures to compare different variable selection methods. Ferrari and Yang (2015) propose constructing a variable selection confidence set via a sequence of F-tests for linear regression models and likelihood ratio tests for generalized linear models (Zheng et al., 2019a, Zheng et al., 2019b). To improve the interpretability of the confidence set, Li et al. (2019b) propose model confidence bounds, a pair of nested models that trap the true model with a specified probability. Wang et al. (2021) further extend these confidence bounds to graphical models and introduce accompanying visualization tools. Alternatively, Liu et al. (2020) propose two simple measures of uncertainty for a model selection procedure: one is similar in spirit to a confidence set, and the other focuses on the error in model selection. The aforementioned methods all focus on measuring model selection uncertainty. Meanwhile, to reduce selection uncertainty and improve selection results, Meinshausen and Bühlmann (2010) propose stability selection, which improves upon existing selection methods via subsampling, and Lim and Yu (2016) propose a model-free criterion for selecting the tuning parameter based on a new estimation stability metric. For a comprehensive review of model selection, see Ding et al. (2018).

On the other hand, the distribution of the selected model, a broader topic that subsumes selection uncertainty, is relatively less studied and has only recently begun to gain attention. Knight and Fu (2000) first derive the asymptotic distribution of the Lasso estimator in the low-dimensional setting. The distributions of parameter estimates from the Lasso, SCAD, and thresholding are further investigated by Pötscher and Leeb (2009), both in finite samples and in the large-sample limit. Zhou (2014) proposes a Monte Carlo simulation-based approach to estimate such distributions. Finally, Ewald et al. (2020) completely characterize the finite-sample distribution of the Lasso estimator in linear regression and study the model selection properties of the Lasso. These existing works mostly focus on the distribution of the parameter estimate rather than that of the selected model. Although some theoretical results have been established for the distribution of the model selected by the Lasso, much less work has been done on visualizing such a distribution, which is difficult but useful in practice. To fill this gap, we propose visualizing the distribution of the selected model and using the visualization to measure selection uncertainty.

In this article, we introduce new tools for visualizing the distribution of the selected model. By grouping models with similar structures, we can visualize the distribution more efficiently and clearly, revealing patterns that are not accessible through other types of analysis. The proposed visualization is useful for graphically comparing different selection methods, giving analysts a good sense of the level of randomness each method carries. Building on the visualization, we further introduce the concept of model selection deviation (MSD), which can be regarded as the standard deviation of the distribution of the selected model. This measure allows numerical comparison of model selection methods in terms of their stability. Under an appropriate transformation, we further develop a fast bootstrap estimation procedure for the model selection deviation and demonstrate its desirable performance in simulations and real data analysis. Throughout the article, we focus on linear regression; however, the methodology developed here extends to more complex settings, such as generalized linear models and graphical models, with minimal modification.
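
The paper's exact MSD definition is not reproduced in this excerpt. As a rough stand-in of the same flavor (entirely our assumption, not the authors' formula), one could summarize the dispersion of bootstrapped supports by their mean Hamming distance from the most frequently selected model, reusing n, X, and y from the sketch above:

```r
# Encode each bootstrap selection as a binary p-vector (rows of supp_mat).
supp_mat <- t(replicate(200, {
  idx <- sample(n, replace = TRUE)
  fit <- cv.glmnet(X[idx, ], y[idx])
  as.numeric(as.vector(coef(fit, s = "lambda.min"))[-1] != 0)
}))
# Modal (most frequently selected) model.
keys <- apply(supp_mat, 1, paste, collapse = "")
mode_vec <- as.numeric(strsplit(names(which.max(table(keys))), "")[[1]])
# A plausible dispersion summary (an assumption, not the paper's MSD):
mean(apply(supp_mat, 1, function(s) sum(s != mode_vec)))
```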

Note that, under a consistent model selection procedure, the probability that the selected model equals the true model converges to one, so the model selection uncertainty vanishes asymptotically for a fixed number of covariates. With a finite sample size, however, the model selection uncertainty is nonnegligible. Most of the analysis presented in this article focuses on moderate sample sizes.

This article contributes to the literature in the following respects. To the best of our knowledge, it presents the first attempt to visualize the distribution of the selected model. Using the proposed visualization techniques, the random behavior of different model selection procedures can be compared and studied. The visualizations also allow us to define attributes of the distribution, such as the mode and the skewness, that characterize its various aspects. One of the most important numerical attributes is the model selection deviation, which extends the traditional standard deviation of a univariate distribution to the distribution of the selected model. The tools provided in this article therefore allow analysts to compare selection procedures both quantitatively and graphically in terms of their stability and other properties.

The rest of the paper is organized as follows. In Section 2, we introduce the framework and notation. In Section 3, we propose several new graphical tools to visualize the distribution of the selected model and discuss their connections and distinctions. Building on these visualizations, in Section 4 we introduce a new numerical measure, the model selection deviation, to quantify model selection uncertainty, and we discuss a bootstrap estimation procedure for it. We demonstrate the desirable performance of the proposed visualization and uncertainty measure through simulations in Section 5 and real data analysis in Section 6. We conclude in Section 7 and provide additional results in the supplementary materials.

Preliminaries

In this article, we focus on linear regression models. Let $Y=(y_1,\dots,y_n)^T$ be an $n\times 1$ response vector. Suppose $Y=X\beta+\epsilon$, where $X=(x_1,\dots,x_n)^T$ is an $n\times p$ design matrix of $p$ predictors and $x_i=(x_{i1},\dots,x_{ip})^T\in\mathbb{R}^p$. Without loss of generality, we assume the columns of $X$ are standardized to have zero mean and unit variance. Let $\beta=(\beta_1,\dots,\beta_p)^T\in\mathbb{R}^p$ be the parameter vector and $\epsilon=(\epsilon_1,\dots,\epsilon_n)^T\sim N_n(0,\sigma^2 I_n)$. We further assume that some elements of $\beta$ are zero, but we do not know which ones. Let $\beta^0$ denote the true coefficient vector and $m_0$ the true model.
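
For concreteness, the following sketch (our own, with arbitrary choices of $n$, $p$, $\beta^0$, and $\sigma$) instantiates this setup:

```r
# Simulate from Y = X beta + eps with a column-standardized design,
# sparse true coefficients beta0, and Gaussian noise (illustrative values).
set.seed(2)
n <- 100; p <- 8; sigma <- 1
X <- scale(matrix(rnorm(n * p), n, p))  # zero-mean, unit-variance columns
beta0 <- c(2, -1.5, 1, rep(0, p - 3))   # some coefficients are exactly zero
m0 <- which(beta0 != 0)                 # the true model: covariates 1, 2, 3
Y <- drop(X %*% beta0 + rnorm(n, sd = sigma))
```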

Model selection visualization

The model chosen by a selection procedure can be considered a random "point estimate" of the true model. It is therefore important to understand its random behavior through its distribution, i.e., the distribution of the selected model (or model selection distribution). However, this distribution is often complex and mathematically intractable in many settings. As a first step, we introduce several graphical tools to visualize such a distribution. Note that these tools are generic and can be applied to a broad range of model selection procedures.
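
The constructions of the G-plots and H-plots are not included in this excerpt. As a generic stand-in (our own sketch, reusing supp_mat from the earlier bootstrap example), one can plot the empirical frequencies of the selected models, grouped by model size:

```r
# Empirical model selection distribution: frequency of each distinct
# support, ordered by model size (number of selected covariates).
freq <- table(apply(supp_mat, 1, paste, collapse = ""))
size <- nchar(gsub("0", "", names(freq)))  # model size of each support
ord <- order(size)
barplot(as.numeric(freq)[ord], names.arg = size[ord],
        xlab = "model size", ylab = "bootstrap frequency")
```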

Model selection uncertainty assessment

Based on the visualization proposed above, we now introduce a new measure of model selection uncertainty. This section focuses on penalized regression methods, such as Lasso-type estimators, and we use the Lasso as a running example.
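
To see why the selected model is random, note that the Lasso support depends on the penalty level, and a data-driven choice of the penalty (e.g., by cross-validation) inherits sampling variability. A brief sketch of the first point (our own, reusing the simulated X and y from the earlier examples):

```r
# The Lasso support changes along the regularization path: the number of
# selected covariates grows as lambda decreases.
fit <- glmnet(X, y)
colSums(as.matrix(coef(fit))[-1, ] != 0)  # support size at each lambda
```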

Simulations

In this section, we demonstrate the advantages of the proposed methods using simulations. All simulated covariates are standardized to have zero mean and unit variance. We use the R packages glmnet (Friedman et al., 2010) and ncvreg (Breheny and Huang, 2011) to perform the Lasso, SCAD, adaptive Lasso, MCP, and elastic net.
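
For reference, a minimal sketch of how the SCAD and MCP fits might be obtained with ncvreg (cross-validated tuning; our own illustration, reusing X and y from the earlier sketches):

```r
library(ncvreg)
# SCAD and MCP with cross-validated penalty level; coef() on a cv.ncvreg
# object returns the coefficients at the CV-selected lambda.
cv_scad <- cv.ncvreg(X, y, penalty = "SCAD")
which(coef(cv_scad)[-1] != 0)  # covariates selected by SCAD
cv_mcp <- cv.ncvreg(X, y, penalty = "MCP")
which(coef(cv_mcp)[-1] != 0)   # covariates selected by MCP
```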

Real data example

In this section, we illustrate the proposed method using two real data sets with different dimensionalities.

We first analyze a yeast cell-cycle gene expression data set collected in the experiment of Spellman et al. (1998). To understand the cell-cycle process, biologists are interested in identifying the transcription factors (TFs) that regulate the expression levels of cell cycle-regulated genes. We therefore analyze a data set in which the expression levels of $n=1132$ yeast genes serve as the response variables.

Discussion

In this article, we have proposed several new graphical tools to visualize the distribution of the selected model under various model selection procedures. The visualization helps us understand the behavior of a model selection procedure; to the best of our knowledge, this is the first attempt to visualize such a complex distribution. We further propose several numerical attributes of the distribution to quantify its central tendency, dispersion, and skewness. Among them, the model selection deviation extends the traditional standard deviation of a univariate distribution to the distribution of the selected model.

Acknowledgements

Y. Li's work is supported by the National Natural Science Foundation of China (72271237) and the Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China.

References (43)

  • D. Das et al., Interacting models of cooperative gene regulation, Proc. Natl. Acad. Sci. USA (2004)
  • J. Ding et al., Model selection techniques: an overview, IEEE Signal Process. Mag. (2018)
  • B. Efron et al., Least angle regression, Ann. Stat. (2004)
  • K. Ewald et al., On the distribution, model selection properties and uniqueness of the lasso estimator in low and high dimensions, Electron. J. Stat. (2020)
  • J. Fan et al., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. (2001)
  • D. Ferrari et al., Confidence sets for model selection by F-testing, Stat. Sin. (2015)
  • J. Friedman et al., Regularization paths for generalized linear models via coordinate descent, J. Stat. Softw. (2010)
  • P.R. Hansen et al., The model confidence set, Econometrica (2011)
  • C. Hennig et al., Exploration of the variability of variable selection based on distances between bootstrap sample results, Adv. Data Anal. Classif. (2019)
  • K. Knight et al., Asymptotics for lasso-type estimators, Ann. Stat. (2000)
  • T.I. Lee et al., Transcriptional regulatory networks in Saccharomyces cerevisiae, Science (2002)