Elsevier

Information Sciences

Volumes 454–455, July 2018, Pages 419-442
An exponential-type kernel robust regression model for interval-valued variables

https://doi.org/10.1016/j.ins.2018.05.008

Highlights

  • The paper provides a robust regression method for interval-valued variables.

  • The sum of squared errors is computed using exponential-type kernel functions.

  • Outlying observations receive a small weight in the parameter estimates.

  • Applications on synthetic and real data sets corroborate the proposed method.

Abstract

Outliers are very common in regression problems, and robust regression methods are strongly recommended so that poorly fitted observations do not affect the parameter estimates of the model. Interval-valued variables are becoming common in data analysis problems since this type of data represents either the uncertainty arising from measurement error or the natural variability present in the data. Few robust regression methods have been proposed in the literature for interval-valued data sets containing outliers. This paper introduces a new robust regression method for interval-valued variables that penalizes the presence of outliers in the midpoints and/or in the ranges of interval-valued observations through the use of exponential-type kernel functions. Thus, the weight given to the midpoint and range of each interval-valued observation is updated at each iteration in order to optimize a suitable objective function. The convergence of the parameter estimation algorithm is guaranteed with a low computational cost. A comparative study between the proposed method and some previous robust regression approaches for interval-valued variables is also considered. The performance of these methods is evaluated based on the bias and mean squared error (MSE) of the parameter estimates for the midpoints and ranges of the intervals, considering synthetic data sets with X-space outliers, Y-space outliers and leverage points, different sample sizes and percentages of outliers in a Monte Carlo framework. The results suggest that the proposed approach performs competitively with, or better than, the previous approaches in interval-valued outlier scenarios comparable to those found in practice. Applications to real interval-valued data sets corroborate the usefulness of the proposed method.

Introduction

Understanding the relationship between a set of variables is a common task when solving practical problems in many areas. Regression analysis is a useful tool for this purpose and an important research field in which new methods are in constant development.

Nowadays, information systems make it possible to collect and store data faster and at low cost. Moreover, massive data sets are easily generated in many areas such as Economics, Meteorology, Business, Telecommunications and Biology, among others. These data sets tend to be released in an aggregated format for confidentiality reasons or because the interest of the study is not the individual unit but a group of units. Thus, the researcher does not analyze a classical data set with single values on the real line, but a complex aggregated data set with new types of data, such as interval-valued data, that offer information on the lower and upper bounds of the variable of interest.

Therefore, interval data sets are becoming common in data analysis problems. This type of data can represent the imprecision and/or uncertainty arising from measurement error, but can also represent the natural variability present in the data. Some examples of interval variables are technical specifications, temperatures at meteorological stations and daily stock prices. In this context, statistical tools to analyze interval variables are much needed.

Interval data representing imprecision or uncertainty have mainly been dealt with by means of fuzzy-valued data, with various studies developing regression models for fuzzy-valued variables. In this framework, two main approaches are present in the literature: fuzzy models using linear and non-linear programming [8], [22], [33], [34], [36], [41], [50] and fuzzy models using the least squares method [9], [10], [12], [15], [16], [17], [18], [46].

This paper is concerned with interval-valued data representing natural variability in the data, which have mainly been treated in the Symbolic Data Analysis (SDA) field [3], [4], with various studies addressing regression models for interval-valued variables that take into account parametric and nonparametric regression algorithms as well as linear and nonlinear relationships.

Regarding the SDA field, many approaches have been proposed for regression models with interval-valued variables. Some of these approaches represent extensions or modifications of regression models for real-valued data. A seminal paper by Billard and Diday [2] considered a linear regression model for interval-valued variables. Other works followed in the same direction, most of them taking into account a parametric linear relationship between a response interval-valued variable Z and a set of explanatory interval-valued variables W1, …, Wp, represented in terms of the midpoints (centers) and half-ranges (radii) ([38], [49], [52] and the references therein). Some authors also considered the use of constraints in order to guarantee that the radius is greater than or equal to zero and, consequently, that the lower bound is less than or equal to the upper bound ([23], [24], [25], [39] and the references therein). However, the use of constraints limits the domain of the objective function, penalizing the parameter estimates obtained in the optimization process. Other regression methods for interval-valued data have taken into account a probabilistic support for the response interval-valued variable Z, allowing the use of inference techniques on the parameter estimates [1], [5], [37], [40]. More recently, semi-parametric and nonparametric regression models for interval-valued variables were proposed in Refs. [20], [29], [30] and [51]. In addition, a model in which the intervals are represented by quantile functions and that assumes a Uniform or Symmetric Triangular distribution within the intervals has been proposed in Ref. [13].
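The midpoint and half-range representation mentioned above can be illustrated with a minimal sketch. The function names and the sample bounds are hypothetical and are not taken from the paper; the sketch only shows the change of coordinates between (lower, upper) and (midpoint, half-range).

```python
import numpy as np

def to_mid_half_range(lower, upper):
    """Convert interval bounds to midpoints (centers) and half-ranges (radii)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(lower > upper):
        raise ValueError("each lower bound must not exceed its upper bound")
    return (lower + upper) / 2.0, (upper - lower) / 2.0

def to_bounds(mid, half_range):
    """Recover interval bounds from midpoints and half-ranges."""
    mid = np.asarray(mid, dtype=float)
    half_range = np.asarray(half_range, dtype=float)
    return mid - half_range, mid + half_range

# Hypothetical interval-valued observations (e.g. daily minimum and maximum).
lower = [10.0, 12.5, 9.0]
upper = [14.0, 13.5, 15.0]
mid, half = to_mid_half_range(lower, upper)
```

A non-negative half-range is equivalent to the lower bound not exceeding the upper bound, which is the condition the constrained models above are designed to preserve in the fitted intervals.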

Robust regression attempts to cope with outliers and leverage points [26], [27], [31], [44], [45], [53]. Regression outliers are observations that depart from the linear pattern followed by the majority of the data, and the use of non-robust techniques typically leads to biased inferences. Concerning interval-valued data, some robust regression models for interval-valued variables have been proposed. Ref. [14] presented the symbolic symmetrical linear regression model for interval variables, which takes into account a Student-t distribution for the midpoints of the intervals and a Gaussian distribution for the ranges. Ref. [19] considers two independent classical robust regressions over the midpoints and ranges of the intervals. The regression outliers (in the midpoints and ranges) are penalized according to Tukey’s bi-weight function. Ref. [21] adapted the technique of quantile regression to interval-valued variables.
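For reference, Tukey’s bi-weight function is commonly written as the following weight on a scaled residual. This is the standard textbook form and is not copied from Ref. [19]; the symbols r (residual), \hat{\sigma} (a robust scale estimate) and c (a tuning constant, often taken as about 4.685) are notation assumed here.

```latex
w(u) =
\begin{cases}
\left[ 1 - \left( u / c \right)^{2} \right]^{2}, & |u| \le c, \\
0, & |u| > c,
\end{cases}
\qquad u = \frac{r}{\hat{\sigma}} .
```

Observations whose scaled residual exceeds c in absolute value receive zero weight, while well-fitted observations receive a weight close to one.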

The use of positive definite kernels has become popular in the computational intelligence community. The idea of using exponential-type kernel functions to measure the similarity between two objects has been successfully applied in computer vision, signal processing, clustering and pattern classification, among others, with a large literature on the family of kernel-based algorithms ([11], [47], [48] and the references contained therein). Recently, Ref. [7] proposed a robust regression method for real-valued data based on exponential-type kernel functions (called the ETKRR method), which performed competitively with, or better than, well-established classical robust linear regression models such as L1-regression, MM-Estimator regression and weighted least squares, among others.

This paper introduces a robust regression method for interval-valued variables, hereafter named iETKRR (Exponential-type kernel robust regression for interval-valued variables). It extends to interval-valued variables the robust regression model for real-valued variables proposed by Ref. [7]. Its main contributions are as follows:

  • The iETKRR provides a new objective function with two terms that take into account the information provided either by the center and the radius of the intervals or by the lower and upper boundaries of the intervals. Therefore, the new objective function is suitable for handling interval-valued data.

  • Besides, the proposed method allows different hyper-parameter estimators to be combined, one for the center and one for the radius (or one for the lower bound and one for the upper bound), and thus provides more flexibility and robustness in handling the different types of outliers present in interval-valued data sets.

The iETKRR method re-weights the interval observations based on exponential-type kernel functions, in such a way that the weight assigned to an outlying interval observation is as small as possible, using an iterative process to minimize a suitable objective function. The weights are obtained by calculating the similarities between the observed and predicted values of the midpoint and range, respectively, of the response variable, and are updated at each iteration in order to optimize the objective function. The convergence of the estimation algorithm is guaranteed with a low computational cost.
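The re-weighting idea described above can be sketched as an iteratively re-weighted least squares loop. This is only an illustration under stated assumptions and not the authors' algorithm: it assumes a Gaussian kernel exp(-(y - ŷ)² / gamma2) as the exponential-type kernel, a fixed width gamma2 chosen by hand rather than by the paper's hyper-parameter estimators, and separate fits for the midpoints and the half-ranges. The function name kernel_weighted_fit and the synthetic data are hypothetical.

```python
import numpy as np

def kernel_weighted_fit(X, y, gamma2, n_iter=100, tol=1e-10):
    """Weighted least squares with Gaussian-kernel weights, iterated to a fixed point."""
    X1 = np.column_stack([np.ones(len(y)), X])    # add an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        # Similarity between observed and predicted responses (assumed kernel).
        w = np.exp(-(y - X1 @ beta) ** 2 / gamma2)
        XtW = X1.T * w                            # X^T W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X1, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    w = np.exp(-(y - X1 @ beta) ** 2 / gamma2)    # weights at the final estimate
    return beta, w

# Synthetic interval data expressed as midpoints and half-ranges (hypothetical).
rng = np.random.default_rng(0)
x_mid = np.linspace(0.0, 10.0, 30)
y_mid = 2.0 + 3.0 * x_mid + rng.normal(0.0, 0.2, size=30)
x_half = np.linspace(1.0, 3.0, 30)
y_half = 0.5 + 1.0 * x_half + rng.normal(0.0, 0.05, size=30)
y_mid[-1] += 40.0    # one outlier in the midpoints
y_half[0] += 10.0    # one outlier in the half-ranges

# One kernel width per component (hand-picked values, an assumption).
beta_mid, w_mid = kernel_weighted_fit(x_mid, y_mid, gamma2=10.0)
beta_half, w_half = kernel_weighted_fit(x_half, y_half, gamma2=1.0)
```

In this sketch the two contaminated observations end up with weights near zero, so they barely influence the final estimates, while well-fitted observations keep weights close to one. Using a separate width for each component mirrors, in a simplified way, the idea of combining different hyper-parameter estimators for the center and the radius.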

A comparative study between the iETKRR method and the robust regression approaches for interval-valued variables available in the literature [14], [19], [21] is considered. These methods are evaluated in terms of the bias and mean squared error (MSE) of the parameter estimates, taking into account X-space outliers, Y-space outliers, leverage points, different sample sizes and percentages of outliers in the sample, for a total of 138 different configurations in a Monte Carlo simulation framework with 10,000 replications. Applications to real interval-valued data sets also illustrate the usefulness of the iETKRR method.
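The evaluation criteria above, bias and MSE of the parameter estimates over Monte Carlo replications, can be computed as in the following sketch. The data-generating process, the contamination scheme, the replication count and the use of ordinary least squares as the estimator under test are all assumptions chosen for illustration; they are not the paper's 138 configurations.

```python
import numpy as np

def ols(x, y):
    """Ordinary least squares with intercept (stand-in for any estimator)."""
    X1 = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def monte_carlo_bias_mse(estimator, beta_true, n=50, n_rep=200,
                         outlier_frac=0.1, seed=0):
    """Estimate bias and MSE of an estimator under Y-space contamination."""
    rng = np.random.default_rng(seed)
    beta_true = np.asarray(beta_true, dtype=float)
    estimates = np.empty((n_rep, beta_true.size))
    n_out = int(round(outlier_frac * n))
    for r in range(n_rep):
        x = rng.uniform(0.0, 10.0, size=n)
        y = beta_true[0] + beta_true[1] * x + rng.normal(0.0, 1.0, size=n)
        y[:n_out] += 30.0                 # shift a fraction of the responses
        estimates[r] = estimator(x, y)
    bias = estimates.mean(axis=0) - beta_true
    mse = ((estimates - beta_true) ** 2).mean(axis=0)
    return bias, mse, estimates

bias, mse, estimates = monte_carlo_bias_mse(ols, beta_true=[2.0, 3.0])
```

For each parameter, the MSE computed this way equals the squared bias plus the variance of the estimates across replications, which is why the two criteria are usually reported together. A robust estimator can be passed in place of ols to compare methods under the same contamination.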

The paper is organized as follows: Section 2 reviews some concepts about exponential-type kernel functions and presents the iETKRR method for interval-valued variables as well as the parameter estimation algorithm. Section 3 presents the Monte Carlo experiments that evaluate the convergence of the parameter estimation algorithm, compares the iETKRR method with the existing robust regression methods for interval-valued variables and discusses the results obtained in the numerical analysis. Section 4 presents the applications to real interval-valued data sets and Section 5 gives the concluding remarks.

Section snippets

Exponential-type kernel robust regression for interval-valued variables (iETKRR)

This section reviews some concepts about exponential-type kernel functions, presents the iETKRR method for interval-valued variables and provides the corresponding parameter estimation algorithm.

Monte Carlo experiments

A Monte Carlo simulation study is performed to compare the iETKRR method against the robust interval regression methods IRR [19], IQR [21] and SSLR [14]. All experiments were implemented in the R language [43] and performed on the same machine (OS: Windows 7 Professional 64-bit, Memory: 16 GiB, Processor: Intel Core i7-X990 CPU @ 3.47 GHz). The code for the iETKRR parameter estimation algorithm can be requested from the authors.

Applications on real data sets

This section evaluates the proposed robust regression method iETKRR in applications to real interval-valued data sets and presents a comparative study with other robust regression methods. The aim is to illustrate the usefulness of the iETKRR method in comparison with the other robust methods SSLR [14], IRR [19] and IQR [21]. We considered seven real data sets containing outlying interval-valued observations. The non-robust method CRM [38] is also considered

Concluding remarks

A robust linear regression method for interval-valued variables based on exponential-type kernel functions was proposed in this paper. The proposed method provides a new objective function that is suitable for handling interval-valued data because it is able to take into account the information provided either by the center and the radius of the intervals or by the lower and upper boundaries of the intervals. Moreover, it allows different hyper-parameter estimators to be combined either on the center

Acknowledgments

The authors are grateful to the anonymous referees and the Associate Editor for their careful review, valuable suggestions and comments, which improved this paper. The authors would like to thank CAPES (National Foundation for Post-Graduated Programs, Brazil) and CNPq (National Council for Scientific and Technological Development, Brazil) for their financial support. The second author would also like to thank FACEPE (Research Agency from the State of Pernambuco, Brazil).

References (53)

  • M. Modarres et al. Fuzzy linear regression models with least square errors. Appl. Math. Comput. (2005)
  • G. Peters. Fuzzy linear regression with fuzzy intervals. Fuzzy Sets Syst. (1994)
  • B.A. Pimentel et al. A weighted multivariate fuzzy c-means method in interval-valued scientific production data. Expert Syst. Appl. (2014)
  • J. Ahn et al. A resampling approach for interval-valued data regression. Stat. Anal. Data Min. (2012)
  • L. Billard et al. Regression analysis for interval-valued data. Proceedings of the Seventh Conference of the International Federation of Classification Societies on Data Analysis, Classification and Related Methods (2000)
  • L. Billard et al. From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Am. Stat. Assoc. (2003)
  • H.H. Bock, E. Diday, editors. Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information...
  • P. Brito et al. Modeling interval data with normal and skew-normal distributions. J. Appl. Stat. (2012)
  • B. Caputo et al. Appearance-based object recognition using SVMs: which kernel should I use? Proceedings of NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision (2002)
  • F.A.T.D. Carvalho et al. A robust regression method based on exponential-type kernel functions. Neurocomputing (2017)
  • E.P.-T. Chang et al. A generalized fuzzy weighted least-squares regression. Fuzzy Sets Syst. (1996)
  • S.H. Choi et al. Least absolute deviation estimator in fuzzy regression. Soft Comput. (2008)
  • N. Cristianini et al. Introduction to Support Vector Machines (2000)
  • P. D’Urso et al. Weighted least squares and least median squares estimation for the fuzzy linear regression analysis. Metron (2013)
  • R.A.A. Fagundes et al. Robust regression with application to symbolic interval data. Eng. Appl. Artif. Intell. (2013)
  • R.A.A. Fagundes et al. Quantile regression of interval-valued data. Proceedings of 23rd International Conference on Pattern Recognition (2016)