Fused variable screening for massive imbalanced data

https://doi.org/10.1016/j.csda.2019.06.013

Abstract

Imbalanced data, in which the observations exhibit an unequal or highly skewed distribution across classes/categories, are pervasive in many scientific fields, with applications ranging from bioinformatics and text classification to face recognition and fraud detection. Imbalanced data in modern science are often of massive size and high dimensionality, for example, gene expression data for diagnosing rare diseases. To address this issue, a fused screening procedure is proposed for dimension reduction with large-scale, high dimensional imbalanced data under repeated case-control samplings. The proposed method has several advantages: it is model-free, requiring no specification of the underlying distribution; it is computationally inexpensive owing to the subsampling technique; and it is robust to outliers in the predictors. The theoretical properties are established under regularity conditions. Numerical studies, including extensive simulations and a real data example, confirm that the proposed method performs well in practical settings.

Introduction

Imbalanced data have become increasingly common in the big data era, with applications in many modern scientific fields including face and speech recognition, text classification, oil detection in satellite images, etc. Any dataset exhibiting an unequal or highly skewed distribution across its classes/categories can be regarded as imbalanced. Typically, one of the classes/categories is rather rare; examples include data for diagnosing rare diseases and fraud detection, where the number of fraudulent instances is far smaller than the number of non-fraudulent instances in credit card transactions. The between-class imbalance can be on the order of 1:10², 1:10³ or 1:10⁴. Novel technologies and solutions have been proposed for learning from imbalanced data in the machine learning literature; see Chawla et al. (2002), He and Garcia (2009), Liu and Chen (2005), Mazurowski et al. (2008), Yu et al. (2013), Pio et al. (2014), among many others. Most of those research efforts target specific case studies and algorithms. Recently, for analyzing imbalanced data, Fithian and Hastie (2014) introduced a novel subsampling method for logistic regression that adjusts the class imbalance locally, so as to obtain a consistent and more efficient estimator of the regression coefficients.

High dimensional data, where the number of candidate predictors may be much larger than the sample size, pose unprecedented challenges for statistical analysis. With or without censoring, numerous state-of-the-art variable selection and feature screening methods have appeared in the literature, including the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the group Lasso (Yuan and Lin, 2006), the adaptive Lasso (Zou, 2006) and their variants. For moderate or large dimensionality, the optimization problems associated with the penalized approaches can be solved effectively and quickly. However, when the dimensionality grows exponentially fast with the sample size, penalized methods face heavy computational burdens in handling such ultrahigh dimensional data. Feature screening methods are specifically designed to reduce the high dimensionality to a moderate scale. Popular model-based feature screening methods include sure independence screening (SIS) and its variants; see Fan and Lv (2008), Fan and Song (2010), Fan et al. (2011), Chang et al. (2013), etc. Recently, important findings on model-free feature screening have been reported in the literature. Zhu et al. (2011) introduced sure independence ranking and screening (SIRS) to identify significant predictors. Li et al. (2012a) proposed to use Kendall’s tau correlation, rather than Pearson’s correlation, as a robust ranking utility. Li et al. (2012b) developed a sure screening procedure based on the distance correlation (DCS). A quantile-adaptive model-free variable screening (QA) was studied by He et al. (2013) for high dimensional heterogeneous data. The Kolmogorov–Smirnov distance was developed by Mai and Zou (2013) to deal with binary classification problems and was extended to handle continuous responses in Mai and Zou (2015). Cui et al. (2015) proposed a model-free feature screening index named MV for ultrahigh dimensional discriminant analysis. Recently, novel feature screening procedures have been studied for analyzing ultrahigh dimensional censored data; see Zhao and Li (2012), Hong et al. (2018), Song et al. (2014), Wu and Yin (2015), Zhou and Zhu (2017), Hong and Li (2017), Zhang et al. (2017), Zhang et al. (2018), etc. These methods are elegant and have been shown to be effective for dimension reduction with prospective or i.i.d. samples from the underlying population. However, directly applying existing methods to ultrahigh dimensional imbalanced data without accounting for the imbalanced nature may yield inaccurate results.

In addition, with the availability of enormous high dimensional imbalanced data in various disciplines, computational cost becomes a major concern, as one may run out of computing resources before running out of data. A direct way to reduce the computational cost is to subsample the original full data set before doing anything else. Case-control sampling, a special case of response-selective sampling, is a popular scheme that samples uniformly within each class/category but adjusts the mixture of the classes to enrich the rare class and save computational cost. Statistical analysis of case-control sampling and other biased samplings has been extensively studied in the literature; see Anderson (1972), Manski and Lerman (1977), Prentice and Pyke (1979), Breslow and Day (1980), Cosslet (1981), Scott and Wild (1986, 1997), Manski (1993), Chen (2001), Chen et al. (2017) and Xie et al. (2019). Moreover, novel approaches to analyzing length-biased data and general biased sampling data with semiparametric transformation and accelerated failure time models have been developed by Shen et al. (2009), Ning et al. (2010), Kim et al. (2013), Wang and Wang (2014), Kim et al. (2016), Xu et al. (2017), Qin (2017) and Sun et al. (2018). Generally speaking, case-control samples and other biased samples are likely to contain more information relevant to one’s interest. To be specific, let (Y, X) and (Y*, X*) represent the pair of response and covariates in the population and in the sample, respectively. As defined in Lawless (1997), sampling designs that depend on the value of Y are called response-selective or response-biased sampling. It is known that the joint distribution of the samples obtained by response-selective sampling is typically not the same as the population distribution. However, response-selective sampling assumes that, for any y, the conditional distribution of X* given Y* = y is the same as that of X given Y = y. Therefore, in case-control sampling, the conditional distribution of X* given Y* = 1 (0) is the population distribution of the covariates for all cases (controls), which is the same as that of the covariates of cases (controls) in the case-control sample.
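For concreteness, a minimal sketch of drawing one case-control subsample from a massive imbalanced dataset is given below; the function name, the NumPy arrays X and y, and the per-class sizes n_cases and n_controls are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

def case_control_subsample(X, y, n_cases, n_controls, rng=None):
    """Draw one case-control subsample: sample uniformly within each class,
    so the conditional law of X given the class is preserved while the class
    mixture is chosen by design to enrich the rare class."""
    rng = np.random.default_rng(rng)
    cases = np.flatnonzero(y == 1)      # the rare class
    controls = np.flatnonzero(y == 0)   # the abundant class
    keep = np.concatenate([
        rng.choice(cases, size=n_cases, replace=False),
        rng.choice(controls, size=n_controls, replace=False),
    ])
    return X[keep], y[keep]
```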

In this paper, we propose a new variable screening procedure for ultrahigh dimensional imbalanced data. The proposed method is based on Kendall’s tau correlation under case-control sampling. The motivation of this work is that case-control sampling does not change the positive rank correlation between the responses and the predictors. Hence, the rank correlation can be used to rank the candidate variables with case-control sampled data. Moreover, to pursue a ranking index less sensitive to the particular case-control sampling design, we consider a fused ranking utility obtained by repeating the case-control sampling several times. Our proposed method enjoys several merits. First, it is model-free, with no need to specify a model for the original full data. Second, the ranking statistic has a very simple form and its computation is fast and straightforward; in contrast to a direct analysis of the full dataset, which may consume vast computing resources, our method substantially reduces the computational cost with the help of multiple case-control samplings. Third, our method inherits the robustness of Kendall’s tau correlation and is therefore robust to outliers in the predictors.
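To make the fused procedure concrete, the sketch below repeats the case-control subsampling L times, computes Kendall’s tau between the response and each candidate predictor within every subsample, and combines the L utilities by averaging their absolute values before ranking. The averaging rule is only one plausible fusion (the paper’s exact utility in (6) is not reproduced in this excerpt), and case_control_subsample is the illustrative helper sketched above.

```python
import numpy as np
from scipy.stats import kendalltau

def fused_kendall_screening(X, y, n_cases, n_controls, L=10, top_d=50, seed=0):
    """Model-free screening sketch: fuse Kendall's tau correlations computed on
    L independent case-control subsamples, then keep the top_d ranked predictors."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    tau = np.zeros((L, p))
    for l in range(L):
        X_sub, y_sub = case_control_subsample(X, y, n_cases, n_controls, rng)
        for k in range(p):
            tau[l, k], _ = kendalltau(X_sub[:, k], y_sub)
    fused = np.abs(tau).mean(axis=0)          # one possible way to fuse the L utilities
    return np.argsort(fused)[::-1][:top_d]    # indices of the retained predictors
```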

The rest of the paper is organized as follows. We present the methodology and its theoretical properties for binary and multi-category cases under regularity conditions in Section 2. We evaluate the performance of the proposed procedure through extensive simulation studies in Section 3 and a real data example in Section 4. A few closing remarks are given in Section 5. All technical details are given in the Appendix.

Methodologies and main results

In many classification problems, the response variable of interest is often categorical. For example, the response to a medical treatment might be categorized as Good, Satisfactory, Average and Poor, 4 outcomes in total. Note that the outcomes are ordered here. Let Ỹ be an unobserved variable that characterizes the outcomes in an ad hoc fashion, so that Ỹ ∈ I_k leads to the patient being categorized into class k. There are K classes in total. Here, I_1, …, I_K are K ordered but unknown
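Under the standard latent-variable formulation described above, with cut points delimiting the K ordered intervals (the cut-point notation c_0, …, c_K below is illustrative, not taken from the paper), the categorization can be written as:

```latex
% Latent variable \tilde{Y} and K ordered, unknown intervals I_1, \dots, I_K:
% a subject is assigned to class k exactly when \tilde{Y} falls in I_k.
\[
  Y = k \;\Longleftrightarrow\; \tilde{Y} \in I_k, \qquad
  I_k = (c_{k-1}, c_k], \quad
  -\infty = c_0 < c_1 < \cdots < c_K = \infty,
\]
\[
  \Pr(Y = k) = \Pr\!\left(\tilde{Y} \in I_k\right), \qquad k = 1, \dots, K.
\]
```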

Simulation studies

We conduct extensive simulations to examine the finite-sample performance of our proposed procedure and compare it with some existing methods. In each simulation example, we report the performance of different methods via the minimum model size S needed to include all the important variables, an index that measures the effectiveness of a screening method. Clearly, the closer it is to the true model size, the better the screening procedure performs. We present the median and the
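As an illustration of this metric, the minimum model size for a single replication can be computed from the screening scores as sketched below; the names scores and true_active are hypothetical and not objects defined in the paper.

```python
import numpy as np

def minimum_model_size(scores, true_active):
    """Smallest number of top-ranked predictors needed to include every truly
    important one: rank predictors by the screening utility (largest first)
    and return the worst rank attained by an active predictor."""
    order = np.argsort(scores)[::-1]              # predictor indices, best first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank of each predictor
    return int(ranks[list(true_active)].max())

# Example: if the truly active predictors are {0, 3, 7}, the returned value is
# the minimum model size S for that replication.
# minimum_model_size(fused_scores, [0, 3, 7])
```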

Applications

We apply the proposed method to analyze the p53 mutants dataset (Danziger et al., 2006, Danziger et al., 2007, Danziger et al., 2009), which is available at https://archive.ics.uci.edu/ml/datasets/p53+Mutants. In this study, the goal is to detect the mutant p53 transcriptional activity (active or inactive) with 16,772 samples in total. The dataset contains 5408 features (p_n = 5408), in which the first 4826 features represent 2D electrostatic and surface-based features while the rest represent 3D

Closing remarks

This paper proposes the fused case-control screening for large-scale, high dimensional imbalanced data. The main point of the paper is to advocate such a procedure, which may have broader applications in medical studies (as shown in the real data example of this paper), text classification, face or speech recognition, etc. The fused case-control screening that we adopt in (6) is not necessarily the unique choice. There are variations such as τ̃_k = sup_{1 ≤ l ≤ L} τ_k^(l). For such variations, the sure

Acknowledgments

The authors are indebted to the Editor, the Associate Editor and two anonymous reviewers for their professional review and insightful comments that led to substantial improvements in the paper. Meiling Hao's research is supported by the Fundamental Research Funds for the Central Universities in UIBE, China (No. CXTD10-09). Yuanyuan Lin's research is supported by the Hong Kong Research Grants Council (Grant No. 509413 and 14311916), the National Natural Science Foundation of China (Grant No.

References (56)

  • Chen, K., et al. Case-cohort and case-control analysis with Cox's model. Biometrika (1999)
  • Cosslet, S.R. Maximum likelihood estimate for choice-based samples. Econometrica (1981)
  • Cui, H., et al. Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Amer. Statist. Assoc. (2015)
  • Danziger, S.A., et al. Predicting positive p53 cancer rescue regions using most informative positive (MIP) active learning. PLoS Comput. Biol. (2009)
  • Danziger, S.A., et al. Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants. IEEE/ACM Trans. Comput. Biol. Bioinform. (2006)
  • Danziger, S.A., et al. Choosing where to look next in a mutation sequence space: active learning of informative p53 cancer rescue mutants. Bioinformatics (2007)
  • Fan, J., et al. Nonparametric independence screening in sparse ultrahigh-dimensional additive models. J. Amer. Statist. Assoc. (2011)
  • Fan, J., et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
  • Fan, J., et al. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)
  • Fan, J., et al. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. (2010)
  • Fithian, W., et al. Local case-control sampling: efficient subsampling in imbalanced data sets. Ann. Statist. (2014)
  • He, H., et al. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. (2009)
  • He, X., et al. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. (2013)
  • Hong, H.G., et al. Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal. (2018)
  • Hong, H.G., et al. Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review. Appl. Math. Ser. B (2017)
  • Kim, J.P., et al. A unified approach to semiparametric transformation models under general biased sampling schemes. J. Amer. Statist. Assoc. (2013)
  • Kim, J.P., et al. Accelerated failure time model under general biased sampling scheme. Biostatistics (2016)
  • Lawless, J.F. Likelihood and pseudo likelihood estimation based on response-biased observation. Lect. Notes Monogr. Ser. (1997)