An exponential-type kernel robust regression model for interval-valued variables
Introduction
Understanding the relationship between a set of variables is a common task when solving practical problems in many areas. Regression analysis is a useful tool for this purpose and remains an active research field in which new methods are constantly being developed.
Nowadays, information systems make it possible to collect and store data faster and at low cost. Moreover, massive data sets are easily generated in many areas, such as Economics, Meteorology, Business, Telecommunications and Biology, among others. These data sets tend to be released in an aggregated format, either for confidentiality reasons or because the object of study is not the individual unit but a group of units. Thus, the researcher does not analyze a classical data set of single real values, but a complex aggregated data set with new types of data, such as interval-valued data, which provide the lower and upper bounds of the variable of interest.
Therefore, interval data sets are becoming common in data analysis problems. This type of data can represent the imprecision and/or uncertainty due to measurement error, but it can also represent the natural variability present in the data. Some examples of interval variables are technical specifications, temperatures at meteorological stations and daily stock prices. In this context, statistical tools for analyzing interval variables are much needed.
Interval data representing imprecision or uncertainty have mainly been handled by means of fuzzy-valued data, and several studies have developed regression models for fuzzy-valued variables. In this framework, two main approaches are present in the literature: fuzzy models using linear and non-linear programming [8], [22], [33], [34], [36], [41], [50] and fuzzy models using the least squares method [9], [10], [12], [15], [16], [17], [18], [46].
This paper is concerned with interval-valued data representing natural variability in the data. Such data have mainly been treated in the field of Symbolic Data Analysis (SDA) [3], [4], where several studies have addressed regression models for interval-valued variables, covering parametric and nonparametric regression algorithms as well as linear and nonlinear relationships.
Within the SDA field, many regression models for interval-valued variables have been proposed. Some of these approaches are extensions or modifications of regression models for real-valued data. In a seminal paper, Billard and Diday [2] proposed a linear regression model for interval-valued variables. Other works followed in the same direction, most of them assuming a parametric linear relationship between an interval-valued response variable Z and a set of interval-valued explanatory variables represented in terms of their midpoints (centers) and half-ranges (radii) ([38], [49], [52] and the references therein). Some authors also introduced constraints to guarantee that the radius is greater than or equal to zero and, consequently, that the lower bound is less than or equal to the upper bound ([23], [24], [25], [39] and the references therein). However, constraints restrict the domain of the objective function, which penalizes the parameter estimates obtained in the optimization process. Other regression methods for interval-valued data assume a probabilistic model for the interval-valued response variable Z, which allows inference techniques to be applied to the parameter estimates [1], [5], [37], [40]. More recently, semi-parametric and nonparametric regression models for interval-valued variables were proposed in Refs. [20], [29], [30] and [51]. In addition, Ref. [13] proposed a model in which the intervals are represented by quantile functions and a Uniform or Symmetric Triangular distribution is assumed within the intervals.
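The center and radius representation mentioned above can be illustrated with a minimal Python sketch. The function names `to_center_radius` and `to_bounds` are hypothetical and chosen only for this illustration; the checks mirror the requirement that the radius be non-negative, which is equivalent to the lower bound not exceeding the upper bound.

```python
def to_center_radius(lower, upper):
    """Map an interval [lower, upper] to its midpoint (center) and half-range (radius)."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    return center, radius


def to_bounds(center, radius):
    """Map a (center, radius) pair back to the interval bounds."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return center - radius, center + radius


# Three illustrative intervals, including a degenerate one (a single point).
intervals = [(1.0, 3.0), (2.5, 2.5), (-4.0, 0.0)]
pairs = [to_center_radius(lo, hi) for lo, hi in intervals]
```

A fitted model that predicts a negative radius would fail the check in `to_bounds`, which is the situation the constrained approaches cited above are designed to avoid.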
Robust regression attempts to cope with outliers and leverage points [26], [27], [31], [44], [45], [53]. Regression outliers are observations that depart from the linear pattern followed by the majority of the data, and the use of non-robust techniques in their presence typically leads to biased inferences. Some contributions on robust regression models for interval-valued variables have been proposed. Ref. [14] presented the symbolic symmetrical linear regression model for interval variables, which assumes a Student-t distribution for the midpoints of the intervals and a Gaussian distribution for the ranges. Ref. [19] considers two independent classical robust regressions over the midpoints and ranges of the intervals, in which the regression outliers (in the midpoints and ranges) are penalized according to Tukey's bi-weight function. Ref. [21] adapted the quantile regression technique to interval-valued variables.
The use of positive definite kernels has become popular in the computational intelligence community. The idea of using exponential-type kernel functions to measure the similarity between two objects has been successfully applied in computer vision, signal processing, clustering and pattern classification, among others, and there is a large literature on the family of kernel-based algorithms ([11], [47], [48] and the references therein). Recently, Ref. [7] proposed a robust regression method for real-valued data based on exponential-type kernel functions (called the ETKRR method), which showed competitive or better performance in comparison with well-established classical robust linear regression models such as L1-regression, MM-estimator regression and weighted least squares, among others.
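As a rough illustration of how an exponential-type kernel measures similarity, the sketch below uses a Gaussian kernel between two real values. The exact kernel form and the way its width is estimated are not given in this section, so the parameterization `exp(-(a - b)^2 / gamma2)` and the name `gamma2` are assumptions made only for this example.

```python
import math


def gaussian_kernel(a, b, gamma2):
    """Similarity in (0, 1] between two real values (assumed Gaussian form).

    gamma2 is a hypothetical width hyper-parameter; larger values make the
    kernel more tolerant of large differences between a and b.
    """
    if gamma2 <= 0:
        raise ValueError("gamma2 must be positive")
    return math.exp(-((a - b) ** 2) / gamma2)


same = gaussian_kernel(1.0, 1.0, 2.0)   # identical values: maximal similarity
near = gaussian_kernel(1.0, 1.5, 2.0)   # close values: similarity below 1
far = gaussian_kernel(0.0, 10.0, 2.0)   # distant values: similarity near 0
```

Because the similarity decays quickly with the squared difference, an observation whose response lies far from its predicted value contributes very little, which is the property that makes such kernels attractive for robust fitting.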
This paper introduces a robust regression method for interval-valued variables, hereafter named iETKRR (Exponential-type kernel robust regression for interval-valued variables). It extends to interval-valued variables the robust regression model for real-valued variables proposed by Ref. [7]. Its main contributions are as follows:
- The iETKRR method provides a new objective function with two terms, which take into account the information provided either by the center and the radius of the intervals or by their lower and upper bounds. The new objective function is therefore suitable for handling interval-valued data.
- The proposed method also allows different hyper-parameter estimators to be combined, one for the center and one for the radius (or one for the lower bound and one for the upper bound), which provides more flexibility and robustness in dealing with the different types of outliers present in interval-valued data sets.
The iETKRR method re-weights the interval observations using exponential-type kernel functions, in such a way that the weight assigned to an interval outlier observation is as small as possible, within an iterative process that minimizes a suitable objective function. The weights are obtained by computing the similarities between the observed and predicted values of the midpoint and the range of the response variable, and they are updated at each iteration in order to optimize the objective function. The estimation algorithm is guaranteed to converge, at a low computational cost.
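The re-weighting idea described above can be sketched as an iteratively re-weighted least squares loop. This is a simplified illustration and not the authors' estimation algorithm: it assumes a single explanatory variable, a Gaussian kernel of the form `exp(-r^2 / gamma2)`, and fixed, user-supplied widths `gamma2` (hypothetical values), whereas the paper relies on hyper-parameter estimators. The center and the radius are fitted separately, each with its own width.

```python
import math


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def kernel_reweighted_fit(x, y, gamma2, tol=1e-10, max_iter=100):
    """Re-weight observations by kernel similarity between observed and fitted values."""
    w = [1.0] * len(x)                      # start from ordinary least squares
    intercept, slope = weighted_line_fit(x, y, w)
    for _ in range(max_iter):
        # Weight = assumed Gaussian kernel between observed and predicted response.
        w = [math.exp(-((yi - (intercept + slope * xi)) ** 2) / gamma2)
             for xi, yi in zip(x, y)]
        new_intercept, new_slope = weighted_line_fit(x, y, w)
        change = max(abs(new_intercept - intercept), abs(new_slope - slope))
        intercept, slope = new_intercept, new_slope
        if change < tol:
            break
    return intercept, slope, w


# Synthetic centers and radii with one outlying response at index 4.
x_center = [float(i) for i in range(10)]
y_center = [1.0 + 2.0 * xi for xi in x_center]
y_center[4] += 50.0
x_radius = [0.5 + 0.1 * i for i in range(10)]
y_radius = [0.2 + xr for xr in x_radius]
y_radius[4] += 5.0

center_fit = kernel_reweighted_fit(x_center, y_center, gamma2=50.0)
radius_fit = kernel_reweighted_fit(x_radius, y_radius, gamma2=1.0)
```

In this toy data the outlying observation ends up with a weight close to zero in both fits, so the recovered lines stay close to the ones that generated the clean points. The outcome depends on the chosen widths, which is why a data-driven choice of hyper-parameters matters in practice.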
A comparative study between the iETKRR method and the robust regression approaches for interval-valued variables available in the literature [14], [19], [21] is carried out. The methods are evaluated in terms of the bias and mean squared error (MSE) of the parameter estimates, taking into account X-space outliers, Y-space outliers, leverage points, different sample sizes and different percentages of outliers in the sample, for a total of 138 different configurations in a Monte Carlo simulation framework with 10,000 replications. Applications to real interval-valued data sets also illustrate the usefulness of the iETKRR method.
The paper is organized as follows: Section 2 reviews some concepts about exponential-type kernel functions and presents the iETKRR method for interval-valued variables as well as the parameter estimation algorithm. Section 3 presents the Monte Carlo experiments that evaluate the convergence of the parameter estimation algorithm, compares the iETKRR method with the existing robust regression methods for interval-valued variables and discusses the results obtained in the numerical analysis. Section 4 presents the applications to real interval-valued data sets and Section 5 gives the concluding remarks.
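The two evaluation criteria, bias and MSE of a parameter estimate over Monte Carlo replications, can be computed as in the following sketch. It is only an illustration of the metrics: the estimator is plain ordinary least squares on real-valued data (not one of the compared interval methods), the data-generating model is an assumption, and 200 replications are used instead of 10,000 to keep the example short.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope for a single explanatory variable."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


def bias_and_mse(estimates, true_value):
    """Bias (mean estimate minus truth) and mean squared error over replications."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, mse


rng = random.Random(0)          # fixed seed so the run is reproducible
true_slope = 2.0                # assumed true parameter of the simulated model
x = [i / 10.0 for i in range(30)]

estimates = []
for _ in range(200):            # one estimate per Monte Carlo replication
    y = [1.0 + true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    estimates.append(ols_slope(x, y))

bias, mse = bias_and_mse(estimates, true_slope)
```

Repeating this computation for each method and each simulated configuration (outlier type, sample size, outlier percentage) gives the kind of comparison table that a study of this design would report.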
Section snippets
Exponential-type kernel robust regression for interval-valued variables (iETKRR)
This section reviews some concepts about exponential-type kernel functions, presents the iETKRR method for interval-valued variables and provides the corresponding parameter estimation algorithm.
Monte Carlo experiments
A Monte Carlo simulation study is performed to compare the iETKRR method against the robust interval regression methods IRR [19], IQR [21] and SSLR [14]. All experiments were implemented in the R language [43] and performed on the same machine (OS: Windows 7 Professional 64-bit, Memory: 16 GiB, Processor: Intel Core i7-X990 CPU @ 3.47 GHz). The code for the iETKRR parameter estimation algorithm can be requested from the authors.
Applications on real data sets
This section evaluates the proposed robust regression method iETKRR in applications to real interval-valued data sets and presents a comparative study with other robust regression methods. The aim is to illustrate the usefulness of the iETKRR method in comparison with the other robust methods SSLR [14], IRR [19] and IQR [21]. We considered seven real data sets containing interval-valued outlier observations. The non-robust method CRM [38] is also considered
Concluding remarks
A robust linear regression method for interval-valued variables based on exponential-type kernel functions was proposed in this paper. The proposed method provides a new objective function that is suitable for handling interval-valued data because it takes into account the information provided either by the center and the radius of the intervals or by their lower and upper bounds. Moreover, it allows different hyper-parameter estimators to be combined either on the center
Acknowledgments
The authors are grateful to the anonymous referees and the Associate Editor for their careful revision, valuable suggestions, and comments, which improved this paper. The authors would like to thank CAPES (National Foundation for Post-Graduated Programs, Brazil) and CNPq (National Council for Scientific and Technological Development, Brazil) for their financial support. The second author would also like to thank FACEPE (Research Agency from the State of Pernambuco, Brazil).
References (53)
- et al., Least squares estimation of a linear regression model with LR fuzzy response, Comput. Stat. Data Anal. (2006)
- Fuzzy least squares, Inf. Sci. (1988)
- et al., Off the beaten track: a new linear model for interval data, Eur. J. Oper. Res. (2017)
- et al., A robust method for linear regression of symbolic interval data, Pattern Recognit. Lett. (2010)
- Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data, Comput. Stat. Data Anal. (2003)
- et al., A least-squares approach to fuzzy linear regression analysis, Comput. Stat. Data Anal. (2000)
- et al., Robust fuzzy regression analysis, Inf. Sci. (2011)
- et al., Interval kernel regression, Neurocomputing (2014)
- et al., Dependency between degree of fit and input noise in fuzzy linear regression using non-symmetric fuzzy triangular coefficients, Fuzzy Sets Syst. (2007)
- et al., Constrained center and range joint model for interval-valued symbolic data regression, Comput. Stat. Data Anal. (2017)