Robust regression with application to symbolic interval data

https://doi.org/10.1016/j.engappai.2012.05.004

Abstract

This paper presents a robust regression model for input data sets that contain interval-valued outliers. Each interval of the input data is represented by its range and midpoint, and the fit to interval-valued data is not sensitive to midpoint and/or range outliers in the interval response. The lower and upper bounds of new intervals are predicted, and simulation studies are carried out to validate these predictions. Two applications with real-life interval data sets are considered. Prediction quality is assessed by a mean magnitude of relative error calculated from a test data set.

Highlights

► This study aims to propose a robust regression model for interval data sets.
► These data sets contain interval-valued outliers.
► Each interval is represented by its midpoint and range.
► Experiments with real-life and simulated interval data sets are considered.
► The robustness of this model is shown.

Introduction

Regression analysis is one of the most widely used techniques in engineering, management and many other fields. The widespread availability of regression software has greatly expanded its application in recent years. A problem frequently encountered in the application of regression is the presence of one or more outliers in the data. Outliers are unusual observations in a data set that differ substantially from the rest. They give valuable information about data quality and the fit of the model, and they can indicate atypical phenomena. Such data may have a strong influence on the statistical analysis, particularly in regression models based on least squares estimators. In view of this potential impact on the fitted model, identifying outlying observations is an important concern of the regression model building process. That is, certain observations will occasionally have a disproportionate effect on the precision of the parameter estimates and/or the overall predictive ability of the model.

Robust regression is an important technique for analyzing data that are contaminated with outliers. A robust estimation technique is essentially a method that tolerates the presence of atypical data points. It has been developed as an alternative to least squares estimation in the presence of outliers. The primary purpose of robust regression techniques is to fit a model that describes the information in the majority of the data. This general definition implies that the technique should perform well both on messy data (with outliers) and on clean data (without outliers).
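
To make the contrast with least squares concrete, the following is a minimal sketch of one classical robust technique, iteratively reweighted least squares with Huber weights, for a single predictor. It is a generic illustration and not the estimator proposed in this paper; the function names, the tuning constant c = 1.345 and the toy data are assumptions chosen for the example.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_line_fit(x, y, c=1.345, iterations=50):
    """Huber-type fit by iteratively reweighted least squares (assumed settings)."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_line_fit(x, y, w)
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0:
            break
        # Small residuals keep full weight; large ones are down-weighted.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in residuals]
    return a, b


# Toy data: points on y = 1 + 2x, with one gross outlier in the last response.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[9] = 100.0

ols_a, ols_b = weighted_line_fit(x, y, [1.0] * len(x))  # ordinary least squares
rob_a, rob_b = huber_line_fit(x, y)
print("least squares slope:", ols_b)
print("Huber-type slope:", rob_b)
```

On this toy data the least squares slope is pulled far from 2 by the single outlier, while the reweighted fit stays close to the line followed by the majority of the points.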

The statistical treatment of interval data has been considered in the context of Symbolic Data Analysis (SDA), a domain in the area of knowledge discovery and data management related to multivariate analysis, pattern recognition and artificial intelligence. SDA has two aims: to summarize data sets by means of symbolic data, yielding a smaller and more manageable data set that preserves the essential information, and to analyze that data set by generalizing exploratory data analysis and data mining techniques to symbolic data. Symbolic data allow multiple values for each variable. These new variables (set-valued, interval-valued and histogram-valued) make it possible to retain the intrinsic variability and/or uncertainty of the original data set, as shown in Diday and Noirhomme-Fraiture (2008).

The process of obtaining symbolic data starts with the extraction of knowledge from data sets, as in the data mining process, in order to provide symbolic descriptions. In practice, symbolic descriptions are mathematically modeled by a generalization process applied to a set of individuals described by classical data (categorical or quantitative values). According to Diday and Noirhomme-Fraiture (2008), overgeneralization problems can arise when extreme values present in the classical descriptions are in fact outliers, or when the set of individuals to generalize is in fact composed of subsets with different distributions. In classical data analysis, specialists sometimes prefer to discard point outliers once they are identified, before computing the line that best fits the data under investigation. In symbolic data analysis, a single interval outlier may represent an aggregation of a group of measurements that contain valuable information about the process being analyzed. Therefore, discarding interval outliers is not recommended, because doing so can cause a great loss of information.

This paper introduces a robust regression method for estimation and prediction in the presence of atypical interval data. The outline of this work is as follows: Section 2 presents the motivation and related work on linear regression models for interval symbolic data. Section 3 describes the robust regression for interval data proposed in this paper. Section 4 carries out a simulation study and an analysis with two real-life interval data sets to show the performance of the introduced approach in comparison with a linear regression method for interval data from the symbolic data analysis literature. Section 5 concludes the work.

Section snippets

Motivation and related works

Given that an interval can be represented by its center (midpoint) and range, interval outliers can be identified by investigating whether there are point outliers in the respective midpoint and range data sets. Fig. 1 displays the mushroom and football data sets, in which interval-valued outliers are present. The mushroom data set (Fig. 1(a)) consists of 23 species described by two interval predictor variables, stipe length and stipe thickness, and the interval response variable that
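
The idea of looking for point outliers separately in the midpoints and in the ranges can be sketched as follows. The detection rule used here (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) is an assumption made for illustration, as are the function names and the toy intervals; the snippet above does not say which rule the paper applies.

```python
from statistics import median


def to_midpoint_range(intervals):
    """Split (lower, upper) intervals into a list of midpoints and a list of ranges."""
    midpoints = [(lo + hi) / 2 for lo, hi in intervals]
    ranges = [hi - lo for lo, hi in intervals]
    return midpoints, ranges


def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score exceeds the threshold (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # the rule is undefined when more than half the values coincide
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Toy intervals: index 5 has an unusual midpoint, index 6 an unusual range.
intervals = [(1.0, 3.0), (2.0, 4.5), (3.0, 6.0), (2.0, 5.5),
             (1.0, 5.0), (20.0, 23.0), (-6.0, 14.0)]

midpoints, ranges = to_midpoint_range(intervals)
midpoint_outliers = mad_outliers(midpoints)
range_outliers = mad_outliers(ranges)
print(midpoint_outliers, range_outliers)  # [5] [6]
```

An interval can therefore be atypical in its midpoint, in its range, or in both, which is why the two derived data sets are inspected separately.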

Constructing weighted regression for interval data

The importance of taking into account both the midpoint and the range information in a linear regression model for predicting interval-valued data was demonstrated in Lima Neto and De Carvalho (2008). In this model, the estimation procedure is based on the least squares method, which makes no probabilistic assumptions about the error variable. However, this model may also be strongly influenced by interval outliers.
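
A minimal sketch of the midpoint-and-range idea, assuming a single interval predictor and two independent least squares fits (one on midpoints, one on ranges), is given below. The function names and toy data are hypothetical, and the sketch does not guard against a negative predicted range.

```python
def line_fit(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def fit_centre_range(x_intervals, y_intervals):
    """Fit one line on the midpoints and another on the ranges."""
    x_mid = [(lo + hi) / 2 for lo, hi in x_intervals]
    x_rng = [hi - lo for lo, hi in x_intervals]
    y_mid = [(lo + hi) / 2 for lo, hi in y_intervals]
    y_rng = [hi - lo for lo, hi in y_intervals]
    return line_fit(x_mid, y_mid), line_fit(x_rng, y_rng)


def predict_interval(model, x_interval):
    """Rebuild the predicted bounds from the predicted midpoint and range."""
    (a_mid, b_mid), (a_rng, b_rng) = model
    lo, hi = x_interval
    mid = a_mid + b_mid * (lo + hi) / 2
    rng = a_rng + b_rng * (hi - lo)
    return mid - rng / 2, mid + rng / 2


# Toy data built so that midpoint_y = 1 + 2*midpoint_x and range_y = 0.5 + 1.5*range_x.
x_intervals = [(0.0, 2.0), (1.0, 5.0), (4.0, 5.0), (6.0, 9.0)]
y_intervals = [(1.25, 4.75), (3.75, 10.25), (9.0, 11.0), (13.5, 18.5)]

model = fit_centre_range(x_intervals, y_intervals)
prediction = predict_interval(model, (2.0, 4.0))
print(prediction)
```

Because both fits are ordinary least squares, a single unusual midpoint or range in the response can shift the corresponding line, which is the sensitivity the paragraph above refers to.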

This section presents a robust regression model for interval-valued data

Experiment results

To show the usefulness of the robust regression method proposed in this paper (here called the IRR method), experiments with simulated interval-valued data sets in ℝ³, each of size 375, are now presented. Our aim is to compare this method with the linear regression method introduced in Lima Neto and De Carvalho (2008) (here called IR), which has been widely used to predict interval data. The performance assessment of these approaches will be measured in terms of the mean

Application with real-life interval-valued data sets

The IRR and IR models are applied to the mushroom and football interval data sets. For each interval data set, the MMRE is estimated using the leave-one-out method, and the Wilcoxon test for paired samples at a significance level of 5% is then performed to compare the models.
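
The leave-one-out estimate described above can be sketched as follows. The relative error for intervals used here (the average of the relative errors of the lower and upper bounds) is an assumed form, since the paper's exact MMRE definition is not reproduced in this snippet. The predictor `mean_interval_baseline` is a hypothetical placeholder standing in for IRR or IR, and the toy data are invented for the example.

```python
def interval_relative_error(actual, predicted):
    """Assumed form: mean relative error of the two bounds (bounds must be nonzero)."""
    (lo, hi), (pred_lo, pred_hi) = actual, predicted
    return 0.5 * (abs((lo - pred_lo) / lo) + abs((hi - pred_hi) / hi))


def leave_one_out_errors(xs, ys, fit_predict):
    """Hold out each observation in turn, fit on the rest, and score the prediction."""
    errors = []
    for k in range(len(xs)):
        train_x = xs[:k] + xs[k + 1:]
        train_y = ys[:k] + ys[k + 1:]
        predicted = fit_predict(train_x, train_y, xs[k])
        errors.append(interval_relative_error(ys[k], predicted))
    return errors


def mean_interval_baseline(train_x, train_y, new_x):
    """Placeholder model: predicts the mean training interval and ignores new_x."""
    n = len(train_y)
    return (sum(lo for lo, _ in train_y) / n, sum(hi for _, hi in train_y) / n)


xs = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
ys = [(1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]

errors = leave_one_out_errors(xs, ys, mean_interval_baseline)
mmre = sum(errors) / len(errors)
print(errors, mmre)
```

Running the same loop with two different models yields two paired lists of per-observation errors, which is the kind of input a paired test such as `scipy.stats.wilcoxon` takes.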

Conclusion

In this paper, a robust regression method for interval-valued data sets is introduced. Different types of interval outliers are defined according to the presence of unusual midpoints and/or ranges of the intervals. The performance of the method is evaluated through a mean magnitude of relative error for intervals proposed in this paper. Experiments in the framework of Monte Carlo simulation regarding several scenarios of simulated data sets containing interval outliers and applications with
