A simulation study on classic and robust variable selection in linear regression
Introduction
Variable selection is a key component of any statistical analysis. The choice of the final variables rests both on subject-matter knowledge and on formal selection criteria. Classical variable selection procedures are built on classical estimators and tests.
For instance, Mallows’ Cp [10] is a powerful selection procedure in regression. Since the Cp statistic is based on least squares estimation, it is very sensitive to outliers and to other departures from the normality assumption on the error distribution. Another method often used in practice is the Akaike information criterion (AIC) [1]. Much of the work on AIC is based on the case where the data are normally distributed and the sample size is large. Hurvich and Tsai [7] have shown that AIC can be quite biased, and have proposed a corrected version (AICC); AICC is unbiased and tends to select much better subsets than AIC. Since the AIC statistic for regression models is a direct consequence of the normality assumption on the error distribution, it cannot be used in this form with robust estimators and robust tests. Like AIC, the Cp criterion is based on least squares (LS) estimation and is therefore highly sensitive to the presence of outliers and leverage points. The need for robust selection procedures is thus clear: one cannot estimate the parameters robustly and then apply unmodified classical selection procedures. A very extensive literature exists on variable selection criteria in regression. One of the main goals of robust statistics is to find statistical procedures that are not influenced too much by small deviations from the distributional assumptions of the model. Many robust estimators exist in the literature; the most widely used were proposed by Huber, Hampel and Andrews, and their well-known Ψ functions are as follows [6]:

$$\psi_k(x) = \begin{cases} x, & |x| \le k, \\ k\,\mathrm{sign}(x), & |x| > k, \end{cases} \qquad k = 1.345 \quad \text{(Huber)},$$

$$\psi_{a,b,c}(x) = \begin{cases} x, & 0 \le |x| \le a, \\ a\,\mathrm{sign}(x), & a < |x| \le b, \\ a\,\dfrac{c - |x|}{c - b}\,\mathrm{sign}(x), & b < |x| \le c, \\ 0, & |x| > c, \end{cases} \qquad a = 2,\; b = 4,\; c = 8 \quad \text{(Hampel)},$$

$$\psi_k(x) = \begin{cases} \sin(x/k), & |x| \le k\pi, \\ 0, & |x| > k\pi, \end{cases} \qquad k = 1 \quad \text{(Andrews)}. \tag{1.1}$$
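As a concrete illustration, here is a minimal NumPy sketch of these three Ψ functions with the tuning constants quoted above; the function names are ours, not the paper's.

```python
import numpy as np

def psi_huber(x, k=1.345):
    """Huber psi: linear near zero, clipped at +/- k."""
    return np.clip(x, -k, k)

def psi_hampel(x, a=2.0, b=4.0, c=8.0):
    """Hampel three-part redescending psi."""
    ax, s = np.abs(x), np.sign(x)
    out = np.where(ax <= a, x, a * s)                             # linear, then flat
    out = np.where((ax > b) & (ax <= c),
                   a * (c - ax) / (c - b) * s, out)               # redescending part
    return np.where(ax > c, 0.0, out)                             # zero beyond c

def psi_andrews(x, k=1.0):
    """Andrews sine (wave) psi: redescends to zero beyond k*pi."""
    return np.where(np.abs(x) <= k * np.pi, np.sin(x / k), 0.0)
```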
In this study, the aim is to compare classical and robust variable selection with Huber, Hampel and Andrews M-estimators.
The outline of the paper is as follows. In Section 2, we introduce robust variable selection criteria. In Section 3, we propose a new corrected Akaike information criterion that uses Fisher information as the measure of discrepancy between the true and approximating models. In Section 4, we present a simulation study comparing robust and classical variable selection criteria.
Section snippets
Robust variable selection criteria
There have been a few attempts in the literature to robustify classical variable selection procedures. Ronchetti [11], [12] proposed and investigated the properties of a robust version of Akaike’s information criterion, and Hampel [4] suggested a modified version of it. Hurvich and Tsai [8] compared some variable selection procedures for L1 regression, Ronchetti and Staudte [13] proposed a robust version of Mallows’ Cp for regression models, and Sommer and Huggins [16] suggested a new selection
Akaike information criterion with Fisher information
In this section, we propose a bias correction for AIC as used in variable selection, taking Fisher information, in place of the Kullback–Leibler information [3], as the measure of discrepancy between the true and approximating models. Suppose the data are generated by the true model
$$y_i = x_i^{\top}\beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where the $x_i$ are the regressor vectors and the $\varepsilon_i$ are independent identically distributed.
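The derivation itself is truncated in this excerpt; for reference, the sketch below computes the classical AIC and the Hurvich–Tsai small-sample correction AICC for a least-squares fit under normal errors, in one common parameterisation. It shows only the baseline that the proposed Fisher-information criterion corrects, not the paper's new statistic.

```python
import numpy as np

def aic_aicc(y, X):
    """Classical AIC and small-sample AICc for a Gaussian linear model fit by LS."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 1                                      # coefficients plus error variance
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)     # Hurvich-Tsai correction
    return aic, aicc
```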
Simulation study
In order to compare the robust and classical variable selection criteria given in Section 2, a simulation study is presented. In previous studies, Mallows-type estimators were used to obtain robust parameter estimates when robust variable selection criteria were computed (see [13]). We use the Huber, Hampel and Andrews estimators given in Eq. (1.1) to calculate the robust selection criteria. Here the robust Cp criterion used by Ronchetti and Staudte [13] is given by RCp
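The RCp formula itself is cut off in this excerpt. As background for how such criteria are computed, here is a minimal sketch of the underlying M-estimation step: an iteratively reweighted least squares (IRLS) fit with the Huber ψ of Eq. (1.1). The MAD scale estimate and the stopping rule are our choices for illustration, not necessarily those of the paper.

```python
import numpy as np

def huber_m_estimate(y, X, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimator of regression coefficients via IRLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)             # LS starting values
    for _ in range(max_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        u = r / scale
        au = np.abs(u)
        w = np.where(au <= k, 1.0, k / np.maximum(au, 1e-12))  # Huber weights psi(u)/u
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

The Hampel and Andrews variants follow by replacing the weight function with w(u) = ψ(u)/u for the corresponding ψ.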
Acknowledgements
We would like to thank Prof. Dr. İsmihan Bairamov and Assoc. Prof. Olcay Arslan for helpful comments that significantly improved the presentation of this paper.
References (16)
- C.M. Hurvich, C.-L. Tsai, Model selection for least absolute deviations regression in small samples, Stat. Probab. Lett. (1990)
- E. Ronchetti, Robust model selection in regression, Stat. Probab. Lett. (1985)
- H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control (1974)
- R.J. Bhansali, D.Y. Downham, Some properties of the order of an autoregressive model selected by a generalization of Akaike’s FPE criterion, Biometrika (1977)
- M. Çetin, Variable selection criteria in robust regression, unpublished Ph.D. thesis, University of Hacettepe,...
- F.R. Hampel, Some aspects of model choice in robust statistics, in: Proceedings of the 44th Session of the ISI, Madrid,...
- F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions (1986)
- D.C. Hoaglin, F. Mosteller, J.W. Tukey, Understanding Robust and Exploratory Data Analysis (1983)