A simulation study on classic and robust variable selection in linear regression
Introduction
Variable selection is a key component of any statistical analysis. The choice of the final variables rests both on subject-matter knowledge and on formal selection criteria. Classical variable selection procedures are built on classical estimators and tests.
For instance, Mallows’ Cp [10] is a powerful selection procedure in regression. Since the Cp statistic is based on least squares estimation, it is very sensitive to outliers and to other departures from the normality assumption on the error distribution. Another method often used in practice is the Akaike information criterion (AIC) [1]. Much of the work on AIC is based on the case where the data are normally distributed and the sample size is large. Hurvich and Tsai [7] have shown that AIC can be quite biased, and have proposed a corrected version (AICC); AICC is unbiased and tends to select much better subsets than AIC. Since the AIC statistic for regression models is a direct consequence of the normality assumption on the error distribution, it cannot be used in this form with robust estimators and robust tests. Like AIC, the Cp criterion is based on least squares (LS) estimation and is therefore highly sensitive to the presence of outliers and leverage points. The need for robust selection procedures is thus clear: one cannot estimate the parameters robustly and then apply unmodified classical selection procedures. A very extensive literature exists on variable selection criteria in regression. One of the main goals of robust statistics is to find statistical procedures that are not influenced too much by small deviations from the distributional assumptions of the model. Many robust estimators exist in the literature; the most widely used were proposed by Huber, Hampel and Andrews, and their well-known Ψ functions are as follows [6]:

$$\psi_k(x) = \begin{cases} x, & |x| \le k, \\ k\,\mathrm{sign}(x), & |x| > k, \end{cases} \qquad k = 1.345 \quad \text{(Huber)},$$

$$\psi_{a,b,c}(x) = \begin{cases} x, & 0 \le |x| \le a, \\ a\,\mathrm{sign}(x), & a < |x| \le b, \\ a\,\dfrac{c - |x|}{c - b}\,\mathrm{sign}(x), & b < |x| \le c, \\ 0, & |x| > c, \end{cases} \qquad a = 2,\; b = 4,\; c = 8 \quad \text{(Hampel)},$$

$$\psi_k(x) = \begin{cases} \sin(x/k), & |x| \le k\pi, \\ 0, & |x| > k\pi, \end{cases} \qquad k = 1 \quad \text{(Andrews)}. \tag{1.1}$$
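As a concrete illustration, here is a minimal NumPy sketch of these three Ψ functions with the tuning constants quoted above; the function names are ours, not the paper's.

```python
import numpy as np

def psi_huber(x, k=1.345):
    """Huber psi: linear near zero, clipped at +/- k."""
    return np.clip(x, -k, k)

def psi_hampel(x, a=2.0, b=4.0, c=8.0):
    """Hampel three-part redescending psi."""
    ax, s = np.abs(x), np.sign(x)
    out = np.where(ax <= a, x, a * s)                             # linear, then flat
    out = np.where((ax > b) & (ax <= c),
                   a * (c - ax) / (c - b) * s, out)               # redescending part
    return np.where(ax > c, 0.0, out)                             # zero beyond c

def psi_andrews(x, k=1.0):
    """Andrews sine (wave) psi: redescends to zero beyond k*pi."""
    return np.where(np.abs(x) <= k * np.pi, np.sin(x / k), 0.0)
```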
In this study, the aim is to compare classical and robust variable selection with Huber, Hampel and Andrews M-estimators.
The outline of the paper is as follows. In Section 2, we introduce robust variable selection criteria. In Section 3, we propose a new corrected Akaike information criterion that uses Fisher information as the measure of discrepancy between the true and approximating models. In Section 4, we present a simulation study comparing robust and classical variable selection criteria.
Section snippets
Robust variable selection criteria
There have been a few attempts in the literature to robustify classical variable selection procedures. Ronchetti [11], [12] proposed and investigated the properties of a robust version of Akaike’s information criterion, and Hampel [4] suggested a modified version of it. Hurvich and Tsai [8] compared some variable selection procedures for L1 regression, Ronchetti and Staudte [13] proposed a robust version of Mallows’ Cp for regression models, and Sommer and Huggins [16] suggested a new selection
Akaike information criterion with Fisher information
In this section, we propose a bias correction for AIC as used in variable selection, taking Fisher information, in place of the Kullback–Leibler information [3], as the measure of discrepancy between the true and approximating models. Suppose the data are generated by the true model
$$y_i = x_i^{\top}\beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where the $x_i$ are the regressor vectors and the $\varepsilon_i$ are independent identically distributed.
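The derivation itself is truncated in this excerpt; for reference, the sketch below computes the classical AIC and the Hurvich–Tsai small-sample correction AICC for a least-squares fit under normal errors, in one common parameterisation. It shows only the baseline that the proposed Fisher-information criterion corrects, not the paper's new statistic.

```python
import numpy as np

def aic_aicc(y, X):
    """Classical AIC and small-sample AICc for a Gaussian linear model fit by LS."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 1                                      # coefficients plus error variance
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)     # Hurvich-Tsai correction
    return aic, aicc
```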
Simulation study
In order to compare the robust and classical variable selection criteria given in Section 2, a simulation study is presented. In previous studies, Mallows-type estimators were used to obtain robust parameter estimates when robust variable selection criteria were computed (see [13]). We use the Huber, Hampel and Andrews estimators given in Eq. (1.1) to calculate the robust selection criteria. Here the robust Cp criterion used by Ronchetti and Staudte [13] is given by RCp
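The RCp formula itself is cut off in this excerpt. As background for how such criteria are computed, here is a minimal sketch of the underlying M-estimation step: an iteratively reweighted least squares (IRLS) fit with the Huber ψ of Eq. (1.1). The MAD scale estimate and the stopping rule are our choices for illustration, not necessarily those of the paper.

```python
import numpy as np

def huber_m_estimate(y, X, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimator of regression coefficients via IRLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)             # LS starting values
    for _ in range(max_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        u = r / scale
        au = np.abs(u)
        w = np.where(au <= k, 1.0, k / np.maximum(au, 1e-12))  # Huber weights psi(u)/u
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

The Hampel and Andrews variants follow by replacing the weight function with w(u) = ψ(u)/u for the corresponding ψ.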
Acknowledgements
We would like to thank Prof. Dr. İsmihan Bairamov and Assoc. Prof. Olcay Arslan for helpful comments that significantly improved the presentation of this paper.
References (16)
- C.M. Hurvich, C.-L. Tsai, Model selection for least absolute deviations regression in small samples, Stat. Probab. Lett. (1990)
- E. Ronchetti, Robust model selection in regression, Stat. Probab. Lett. (1985)
- H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control (1974)
- R.J. Bhansali, D.Y. Downham, Some properties of the order of an autoregressive model selected by a generalization of Akaike’s FPE criterion, Biometrika (1977)
- M. Çetin, Variable selection criteria in robust regression, unpublished Ph.D. thesis, University of Hacettepe,...
- F.R. Hampel, Some aspects of model choice in robust statistics, in: Proceedings of the 44th Session of the ISI, Madrid,...
- F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions (1986)
- D.C. Hoaglin, F. Mosteller, J.W. Tukey, Understanding Robust and Exploratory Data Analysis (1983)