Fused variable screening for massive imbalanced data
Introduction
Imbalanced data have become pronounced in the big data era, arising in many modern scientific fields including face and speech recognition, text classification, oil-spill detection in satellite images, and so on. Any dataset exhibiting a markedly skewed distribution across its classes/categories can be regarded as imbalanced. Typically, one of the classes/categories is rather rare, for example, data for diagnosing rare diseases or for fraud detection: the number of fraudulent instances is much smaller than that of non-fraudulent instances in credit card transactions. The between-class imbalance can span several orders of magnitude. Novel technologies and solutions have been proposed for learning from imbalanced data in the machine learning literature; see Chawla et al. (2002), He and Garcia (2009), Liu and Chen (2005), Mazurowski et al. (2008), Yu et al. (2013), Pio et al. (2014), among many others. Most of those research efforts target specific case studies and algorithms. Recently, for analyzing imbalanced data, Fithian and Hastie (2014) introduced a novel subsampling method for logistic regression that adjusts the class imbalance locally, so as to obtain a consistent and more efficient estimator of the regression coefficients.
High dimensional data, where the number of candidate predictors may be much larger than the sample size, pose unprecedented challenges for statistical analysis. With or without censoring, there have been numerous state-of-the-art variable selection and feature screening methods in the literature, including the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the group Lasso (Yuan and Lin, 2006), the adaptive Lasso (Zou, 2006) and their variants. For moderate or large dimensionality, the optimization problems associated with these penalized approaches can be solved effectively and quickly. However, when the dimensionality grows exponentially fast with the sample size, penalized methods face prohibitive computational complexity on such ultrahigh dimensional data. Feature screening methods are designed precisely to reduce the ultrahigh dimensionality to a moderate scale. Popular model-based feature screening methods include sure independence screening (SIS) and its variants; see Fan and Lv (2008), Fan and Song (2010), Fan et al. (2011), Chang et al. (2013), etc. Recently, important findings on model-free feature screening have been reported in the literature. Zhu et al. (2011) introduced sure independence ranking and screening (SIRS) to identify significant predictors. Li et al. (2012a) proposed to use Kendall's tau correlation, rather than Pearson's correlation, as a robust ranking utility. Li et al. (2012b) developed a sure screening procedure based on the distance correlation (DCS). A quantile-adaptive model-free variable screening (QA) was studied by He et al. (2013) for high dimensional heterogeneous data. A Kolmogorov–Smirnov based screening statistic was developed by Mai and Zou (2013) for binary classification problems and was extended to handle continuous responses in Mai and Zou (2015). Cui et al. (2015) proposed a model-free feature screening index named MV for ultrahigh dimensional discriminant analysis.
Recently, novel feature screening methods have been studied for ultrahigh dimensional censored data; see Zhao and Li (2012), Hong et al. (2018), Song et al. (2014), Wu and Yin (2015), Zhou and Zhu (2017), Hong and Li (2017), Zhang et al. (2017, 2018), etc. These methods are elegant and have been shown to be effective for dimension reduction with prospective samples or i.i.d. samples from the underlying population. However, directly applying existing methods to ultrahigh dimensional imbalanced data without accounting for the imbalanced nature may yield inaccurate results.
In addition, with the availability of enormous high dimensional imbalanced datasets in various disciplines, computational cost becomes one of the major concerns, as one may run out of computing resources before running out of data. A direct way to reduce the computational cost is to subsample the original full dataset before doing anything else. Case-control sampling, a special case of response-selective sampling, is a popular scheme that samples uniformly from each class/category but adjusts the mixture of the classes, so as to enrich the rare class and save computational cost. Statistical analysis of case-control sampling and other biased sampling schemes has been extensively studied in the literature; see Anderson (1972), Manski and Lerman (1977), Prentice and Pyke (1979), Breslow and Day (1980), Cosslett (1981), Scott and Wild (1986, 1997), Manski (1993), Chen (2001), Chen et al. (2017) and Xie et al. (2019). Moreover, novel approaches to analyzing length-biased data and general biased sampling data with semiparametric transformation and accelerated failure time models have been developed by Shen et al. (2009), Ning et al. (2010), Kim et al. (2013), Wang and Wang (2014), Kim et al. (2016), Xu et al. (2017), Qin (2017) and Sun et al. (2018). Generally speaking, case-control samples and other biased samples are likely to contain more information relevant to one's interest. As defined in Lawless (1997), sampling designs that depend on the value of the response are called response-selective or response-biased sampling. It is known that the joint distribution of samples obtained by response-selective sampling is typically not the same as the population distribution.
Nevertheless, response-selective sampling assumes that, for any value of the response, the conditional distribution of the covariates given the response in the sample is the same as that in the population. Therefore, under case-control sampling, the conditional distribution of the covariates given case (control) status in the case-control sample is the same as the population distribution of the covariates among cases (controls).
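As a concrete illustration of this scheme, the following is a minimal sketch of drawing one case-control subsample in Python; the function name and parameters (case_control_subsample, n_cases, n_controls) are our own illustrative choices, not notation from the paper:

```python
import numpy as np

def case_control_subsample(y, X, n_cases, n_controls, rng=None):
    """Draw a case-control subsample: sample uniformly within each class,
    enriching the rare class relative to its population share."""
    rng = np.random.default_rng(rng)
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    idx = np.concatenate([
        rng.choice(cases, size=n_cases, replace=False),
        rng.choice(controls, size=n_controls, replace=False),
    ])
    return y[idx], X[idx]
```

Because sampling is uniform within each class, the within-class covariate distributions are left untouched, which is exactly the property discussed above.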
In this paper, we propose a new variable screening procedure for ultrahigh dimensional imbalanced data. The proposed method is based on Kendall's tau correlation under case-control sampling. The motivation of this work is that case-control sampling does not change the positive correlation between the ranks of the responses and predictors. Hence, the rank correlation can be used to rank the candidate variables with case-control sampled data. Moreover, to pursue a ranking index less sensitive to any particular case-control draw, we consider a fused ranking utility obtained by repeating the case-control sampling several times. Our proposed method enjoys several merits. First, it is model-free and requires no specification of a working model for the original full data. Second, the ranking statistic has a very simple form, and its computation is fast and straightforward. In contrast to a direct analysis of the full dataset, which may cost vast computing resources, our method saves substantial computation with the help of multiple case-control samplings. Third, our method inherits the robustness of Kendall's tau correlation and is robust to outliers in the predictors.
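To make the fused ranking utility concrete, here is a rough sketch under simplifying assumptions: we fuse by averaging the absolute Kendall's tau over B case-control subsamples, which need not coincide with the exact estimator defined in the paper; all names (fused_screening, kendall_tau, B) are illustrative:

```python
import numpy as np

def kendall_tau(x, y):
    # O(n^2) pairwise Kendall's tau-a; fine for small case-control subsamples
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    return (sx * sy).sum() / (n * (n - 1))

def fused_screening(y, X, B=10, n_cases=50, n_controls=50, rng=None):
    """Fused ranking utility: average |tau| between each predictor and the
    response over B independent case-control subsamples."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    util = np.zeros(p)
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    for _ in range(B):
        idx = np.concatenate([
            rng.choice(cases, size=n_cases, replace=False),
            rng.choice(controls, size=n_controls, replace=False),
        ])
        ys, Xs = y[idx], X[idx]
        for j in range(p):
            util[j] += abs(kendall_tau(Xs[:, j], ys))
    return util / B  # rank predictors by this fused utility, keep the top ones
```

Averaging over several subsamples reduces the Monte Carlo variability of a single draw, which is the intuition behind fusing.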
The rest of the paper is organized as follows. We present the methodology and its theoretical properties for binary and multi-category cases under regularity conditions in Section 2. We evaluate the performance of the proposed procedure through extensive simulation studies in Section 3 and a real data example in Section 4. A few closing remarks are given in Section 5. All technical details are given in the Appendix.
Methodologies and main results
In many classification problems, the response variable of interest is categorical. For example, the response to a medical treatment might be categorized as Good, Satisfactory, Average and Poor, four outcomes in total. Note that the outcomes are ordered here. Let Z be an unobserved latent variable that characterizes, in an ad hoc fashion, the ordering of the outcomes, with thresholds -∞ = c_0 < c_1 < ⋯ < c_K = ∞, so that c_{k-1} < Z ≤ c_k leads to the patient being categorized into class k. There are K classes in total. Here, the thresholds c_1, …, c_{K-1} are ordered but unknown.
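The latent-threshold mechanism above can be sketched in a few lines; the threshold values and class labels in the example are hypothetical, chosen only to show how a latent score maps to ordered categories:

```python
import numpy as np

def categorize(z, cuts):
    """Map latent scores z into ordered classes 1..K via thresholds
    c_1 < ... < c_{K-1}: class k iff c_{k-1} < z <= c_k."""
    # np.digitize returns the bin index 0..K-1; shift to 1-based classes
    return np.digitize(z, cuts) + 1
```

For instance, with hypothetical cut points (-1, 1), latent scores -2, 0 and 2 fall into classes 1, 2 and 3, respectively.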
Simulation studies
We conduct extensive simulations to examine the finite-sample performance of the proposed procedure and compare it with some existing methods. In each simulation example, we report the performance of different methods via the minimum model size needed to include all the important variables, an index that measures the effectiveness of a screening method. Clearly, the closer it is to the true model size, the better the screening procedure performs. We present the median and the interquartile range of the minimum model size across replications.
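As an illustration of this evaluation metric, the following is a small sketch of computing the minimum model size from a vector of screening utilities; the function and argument names are our own:

```python
import numpy as np

def minimum_model_size(utility, active_set):
    """Smallest number of top-ranked predictors needed to cover all truly
    important variables (the active set)."""
    order = np.argsort(-np.asarray(utility))      # indices in descending utility
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)   # rank 1 = largest utility
    return int(max(ranks[j] for j in active_set))
```

With utilities (0.9, 0.1, 0.5, 0.8) and active set {0, 2}, predictor 2 is ranked third, so the minimum model size is 3.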
Applications
We apply the proposed method to analyze the p53 mutants dataset (Danziger et al., 2006, 2007, 2009), which is available at https://archive.ics.uci.edu/ml/datasets/p53+Mutants. In this study, the goal is to detect mutant p53 transcriptional activity (active or inactive) from a total of 16,772 samples. The dataset contains thousands of features, in which the first group represents 2D electrostatic and surface-based features while the rest represent 3D distance-based features.
Closing remarks
This paper proposes the fused case-control screening for large-scale, high dimensional imbalanced data. The main point of the paper is to advocate such a procedure, which may have broader applications in medical studies, as shown in the real example of this paper, as well as in text classification and face or speech recognition. The fused case-control screening that we adopt in (6) is not necessarily the unique choice; there are variations, and for such variations the sure screening property can be established analogously.
Acknowledgments
The authors are indebted to the Editor, the Associate Editor and two anonymous reviewers for their professional review and insightful comments that led to substantial improvements in the paper. Meiling Hao's research is supported by the Fundamental Research Funds for the Central Universities in UIBE, China (No. CXTD10-09). Yuanyuan Lin's research is supported by the Hong Kong Research Grants Council (Grant Nos. 509413 and 14311916) and the National Natural Science Foundation of China.
References (56)
An introduction to ROC analysis. Pattern Recognit. Lett. (2006)
The selection problem in econometrics and statistics. Handbook of Statist. (1993)
Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Netw. (2008)
Correlation rank screening for ultrahigh-dimensional survival data. Comput. Statist. Data Anal. (2017)
Separate sample logistic discrimination. Biometrika (1972)
The Analysis of Case-Control Studies (1980)
Marginal empirical likelihood and sure independence feature screening. Ann. Statist. (2013)
SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)
Parametric models for response-biased sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. (2001)
Regression analysis with response-biased sampling. Statist. Sinica (2017)
Case-cohort and case-control analysis with Cox's model. Biometrika (1999)
Maximum likelihood estimate for choice-based samples. Econometrica (1977)
Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Amer. Statist. Assoc. (2015)
Predicting positive p53 cancer rescue regions using most informative positive (MIP) active learning. PLoS Comput. Biol. (2009)
Functional census of mutation sequence spaces: The example of p53 cancer rescue mutants. IEEE/ACM Trans. Comput. Biol. Bioinform. (2006)
Choosing where to look next in a mutation sequence space: Active learning of informative p53 cancer rescue mutants. Bioinformatics (2007)
Nonparametric independence screening in sparse ultrahigh-dimensional additive models. J. Amer. Statist. Assoc. (2011)
Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)
Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. (2010)
Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. (2014)
Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. (2009)
Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. (2013)
Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal. (2018)
Feature selection of ultrahigh-dimensional covariates with survival outcomes: A selective review. Appl. Math. Ser. B (2017)
A unified approach to semiparametric transformation models under general biased sampling schemes. J. Amer. Statist. Assoc. (2016)
Accelerated failure time model under general biased sampling scheme. Biostatistics (2013)
Likelihood and pseudo likelihood estimation based on response-biased observation. Lect. Notes Monogr. Ser. (1997)