A parametrized approach for linear regression of interval data
Introduction
Linear regression is related to the construction of models that explore linear dependency between variables. Two types of variables are involved: the response (or dependent) variable and the regressor (or independent) variables. The main goal is to define a linear model which explains the response based on regressors. It can be used to predict unknown or unobservable values of the response variable based on the regressor variables’ values [1], [2], [3], [4].
Symbolic Data Analysis (SDA) [5], [6] defines a way of extracting knowledge from complex data types, called symbolic data, which represent higher level units, such as classes or concepts. In order to take into account the variability within each unit member, they can be described by intervals, distributions, sets of categories or numbers, which can sometimes be weighted. The first step in SDA is to build a symbolic data table where the rows are higher level units and the columns can take symbolic values. The second step is to study and extract new knowledge from these new kinds of data using extensions of Computer Statistics and Data Mining to symbolic data. SDA gives answers to big data and complex data challenges, because big data can be reduced and summarized by classes [7].
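The first step described above, building a symbolic data table, can be illustrated with a minimal sketch. It assumes that each higher level unit is summarized by the interval between the smallest and largest value of its members; the function name `to_interval_table` and the city temperature records are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def to_interval_table(records):
    """Aggregate (unit, value) pairs into one (lower, upper) interval per unit."""
    grouped = defaultdict(list)
    for unit, value in records:
        grouped[unit].append(value)
    # Each higher level unit is summarized by the range of its members.
    return {unit: (min(vals), max(vals)) for unit, vals in grouped.items()}


# Hypothetical classical data: temperature readings taken in two cities.
records = [
    ("city_a", 12.0), ("city_a", 19.5), ("city_a", 15.2),
    ("city_b", 3.1), ("city_b", 8.4),
]
table = to_interval_table(records)
```

Here the five individual readings are reduced to two interval-valued rows, one per city, which is the kind of summarization by classes that the paragraph refers to.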
In the literature of linear regression applied to interval data, there are many works which do not assume any distribution for the residuals. Billard and Diday [8] proposed the Center Method (CM), which builds a linear regression model using the midpoints of the response and regressor intervals and predicts the response bounds by applying this model to the regressor variables’ bounds. Later, Billard and Diday [9] discussed the MinMax Method (MinMax), which defines two models, one for each response bound, with the response lower bounds depending on the regressor variables’ lower bounds and the response upper bounds depending on the regressor variables’ upper bounds. Lima Neto and De Carvalho [10] introduced the Center and Range Method (CRM), which also proposes two linear models: one for the midpoints and another for the ranges. Lima Neto and De Carvalho [11] later extended CRM by imposing positivity constraints on the coefficients of the range model, introducing the Constrained Center and Range Method (CCRM). CCRM guarantees the mathematical coherence of predicted values, that is, the predicted upper bounds are greater than or equal to their respective lower bounds. Wang et al. [12] proposed a linear model which uses all interval points, named Complete Information Method (CIM). By using Moore’s linear combination [13], CIM also guarantees the mathematical coherence of its predictions.
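To make the differences between CM, MinMax and CRM concrete, the following is a minimal sketch based only on the descriptions above, not on the original papers' code. It assumes ordinary least squares with an intercept, stores the regressor bounds in arrays `XL` and `XU` of shape (n, p), and takes the range as the full interval width; all function and variable names are hypothetical.

```python
import numpy as np


def _ols(X, y):
    """Least squares coefficients (intercept first) for y ~ [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def _predict(beta, X):
    return beta[0] + X @ beta[1:]


def fit_predict_cm(XL, XU, yL, yU):
    """CM: one model on midpoints, applied to each regressor bound."""
    beta = _ols((XL + XU) / 2, (yL + yU) / 2)
    return _predict(beta, XL), _predict(beta, XU)


def fit_predict_minmax(XL, XU, yL, yU):
    """MinMax: separate models for lower and upper bounds."""
    beta_l, beta_u = _ols(XL, yL), _ols(XU, yU)
    return _predict(beta_l, XL), _predict(beta_u, XU)


def fit_predict_crm(XL, XU, yL, yU):
    """CRM: one model for midpoints, one for ranges (full widths assumed)."""
    beta_c = _ols((XL + XU) / 2, (yL + yU) / 2)
    beta_r = _ols(XU - XL, yU - yL)
    center = _predict(beta_c, (XL + XU) / 2)
    width = _predict(beta_r, XU - XL)
    return center - width / 2, center + width / 2


# Hypothetical noise-free data in which every bound follows y = 1 + 2x.
rng = np.random.default_rng(0)
XL = rng.uniform(0.0, 10.0, size=(30, 1))
XU = XL + rng.uniform(0.5, 2.0, size=(30, 1))
yL, yU = 1 + 2 * XL[:, 0], 1 + 2 * XU[:, 0]

cm_l, cm_u = fit_predict_cm(XL, XU, yL, yU)
mm_l, mm_u = fit_predict_minmax(XL, XU, yL, yU)
crm_l, crm_u = fit_predict_crm(XL, XU, yL, yU)
```

On this idealized dataset all three methods reproduce the response bounds; they differ once the bounds, midpoints and ranges follow different relationships. None of these three fits constrains the predicted lower bound to stay below the upper one, which is the gap that CCRM and CIM address.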
Other methods assume certain properties for the datasets. Domingues et al. [14] introduced an interval linear regression which is robust against outliers and builds two models, which have symmetric errors for centers and ranges. In another work, Lima Neto et al. [15] suggested the representation of intervals as bivariate vectors. They use a bivariate symbolic regression for interval data which is based on generalized linear models. Souza et al. [16] proposed multi-class logistic regression models that employ different interval representations. Fagundes et al. [17] proposed a robust regression model for intervals which is resistant to outliers. A method based on kernel functions was introduced by Fagundes et al. [18]. They suggested the use of non-parametric functions for interval centers and ranges. Giordani [19] proposed a Lasso-IR method using one regression model for the interval centers and another for the ranges. The models’ coefficients are related by a degree of diversity which is a parameter of the method. The sum of the squared errors is minimized by Least Absolute Shrinkage and Selection Operator (Lasso) [20], which also includes a limit for the sum of absolute coefficient values. Lima Neto and De Carvalho [21] extended classical nonlinear regression to build an interval nonlinear regression model. They use some optimization algorithms to build models with the best accuracy and prediction precision.
Due to the complex nature of interval data, the formulation of regression models is not trivial. Usually, the methods proposed in the literature and discussed above choose certain reference points or parameters from intervals to build the models, such as midpoint and range or lower bound and upper bound. The problem with this kind of approach is that there are differences in the information contained in intervals from different datasets.
If a method always fixes the same reference points to build its models, it may perform poorly on datasets that would be better modeled by another set of reference points. Therefore, this paper proposes the novel Parametrized Method (PM) for interval linear regression modeling. In this new approach, the intervals of the regressor variables are parametrized through the parametric equation of the straight line, and two models are built, one for each response bound. PM automatically discovers the set of regressor reference points used to build the regression models, which generalizes existing methods that fix these points in advance.
In addition to choosing the reference points that best represent the intervals, a criterion is proposed to verify the mathematical coherence of the model’s predictions before the regression is built. A novel strategy, based on interval transformations applied to the response variable, guarantees this coherence.
The rest of the paper is organized as follows: Section 2 reviews regression methods available for SDA; Section 3 shows the construction of PM, explaining the model fitting and the least squares solution; Section 4 introduces a procedure for PM models which guarantees the mathematical coherence; Section 5 provides an experimental evaluation of regression methods, using synthetic and real interval data; and Section 6 presents some concluding remarks.
Section snippets
Interval linear regression methods
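An interval γ is defined by its bounds, $\gamma = [\gamma^L, \gamma^U]$, with $\gamma^L, \gamma^U \in \mathbb{R}$ and $\gamma^L \le \gamma^U$. The values $\gamma^L$ and $\gamma^U$ are, respectively, the lower and upper bounds of interval γ [6]. This paper uses the following notation for interval variables: Y is an interval response variable with n observations, such as $y_i = [y_i^L, y_i^U]$ with $i = 1, \ldots, n$; there are p regressor variables {X1, X2, ⋅⋅⋅, Xp}, each one with n interval observations such as $x_{ij} = [x_{ij}^L, x_{ij}^U]$ and $j = 1, \ldots, p$. Let xϕ be a multivariate interval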
The parametrized method
This section describes how PM builds interval linear regressions using the least squares method. PM builds two models, one for each response bound. For the regressors, the parametric equation of the straight line is used. Unlike existing methods, which fix reference points on the regressors, PM automatically selects the regressor reference points when fitting the models.
Analysis of prediction coherence
A desirable feature for interval regression models is to maintain mathematical coherence for predicted bounds. In this section, we investigate PM’s behavior regarding the mathematical coherence of interval predictions and propose an approach based on transformations to provide it. To this end, we define the Box-Cox transformation for interval data.
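The idea of selecting reference points automatically can be sketched as follows. The equations of PM are not given in this excerpt, so the sketch rests on an assumption: each regressor interval is parametrized as x(λ) = x^L + λ(x^U − x^L) with λ in [0, 1], so that λ = 0, 0.5 and 1 give the lower bound, the midpoint and the upper bound. Substituting x(λ_j) into a linear model for one response bound gives β0 + Σ_j [β_j(1 − λ_j) x_j^L + β_j λ_j x_j^U], which is linear in both regressor bounds and can be fitted by ordinary least squares. The names `reference_point` and `fit_pm_bound` are hypothetical.

```python
import numpy as np


def reference_point(xl, xu, lam):
    """Point on the segment from xl to xu given by parameter lam."""
    return xl + lam * (xu - xl)


def fit_pm_bound(XL, XU, y):
    """Fit one response bound on both regressor bounds by least squares.

    Returns the intercept, the slopes beta_j and the implied reference
    parameters lam_j. The unconstrained fit does not force lam_j into
    [0, 1], and lam_j is undefined when beta_j is zero.
    """
    n, p = XL.shape
    A = np.column_stack([np.ones(n), XL, XU])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    a, b = coef[1:1 + p], coef[1 + p:]  # weights of lower and upper bounds
    beta = a + b                        # a_j = beta_j (1 - lam_j), b_j = beta_j lam_j
    lam = b / beta
    return coef[0], beta, lam


# Hypothetical data: the response bound depends on the point lam = 0.25.
rng = np.random.default_rng(1)
XL = rng.uniform(0.0, 10.0, size=(40, 1))
XU = XL + rng.uniform(0.5, 3.0, size=(40, 1))
y = 1 + 2 * reference_point(XL[:, 0], XU[:, 0], 0.25)

b0, beta, lam = fit_pm_bound(XL, XU, y)
```

In this noise-free example the fit recovers the intercept 1, the slope 2 and the reference parameter 0.25, a point that neither a bounds-only nor a midpoint-only method would use. Fitting the same function once with the response lower bounds and once with the upper bounds gives the two models mentioned above.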
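As a small illustration of what mathematical coherence means in practice, the sketch below checks whether predicted lower bounds stay below predicted upper bounds. It also shows the classical Box-Cox transformation applied bound by bound. This is an assumption made for illustration: the interval version defined in the paper is not given in this excerpt and may differ. The function names are hypothetical.

```python
import numpy as np


def is_coherent(pred_lower, pred_upper):
    """True when every predicted lower bound is <= its upper bound."""
    return bool(np.all(np.asarray(pred_lower) <= np.asarray(pred_upper)))


def incoherent_indices(pred_lower, pred_upper):
    """Indices of predictions whose lower bound exceeds the upper bound."""
    return np.flatnonzero(np.asarray(pred_lower) > np.asarray(pred_upper))


def box_cox(y, lam):
    """Classical Box-Cox transformation for strictly positive values."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam


# Hypothetical predictions: the second interval is incoherent.
bad = incoherent_indices([1.0, 4.0], [2.0, 3.0])

# Box-Cox is strictly increasing on positive values, so transforming the
# two bounds of a positive interval keeps them in the same order.
t_lower = box_cox([1.0, 2.0], 0.5)
t_upper = box_cox([3.0, 4.0], 0.5)
```

Because the transformation is monotone, a coherent interval with positive bounds remains coherent after it is applied, and the inverse transformation preserves the order as well.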
Experimental evaluation
This section compares PM’s performance against the methods proposed in the literature: CM, MinMax, CRM, CCRM and CIM. Synthetic datasets are generated to analyse the fit of these methods under different configurations for the dependency between regressor and response variables. Some real datasets are used to fit regression models, confirming the adaptability and the better fit of PM’s models.
Conclusion
This paper proposed PM, a new linear regression method for interval data. Two different models are built: one for response lower bounds and another for response upper bounds. Both models use automatically chosen reference points for regressors. An advantage of PM is the use of the least squares method, with no assumption for the probability distribution of errors. This allows the computation of the models using the classic matrix approach for multidimensional regression.
PM has the
References (30)
- et al., A robust method for linear regression of symbolic interval data, Pattern Recognit. Lett. (2010)
- et al., Robust regression with application to symbolic interval data, Eng. Appl. Artif. Intell. (2013)
- et al., Interval kernel regression, Neurocomputing (2014)
- et al., Linear Models in Statistics (2008)
- et al., Introduction to Linear Regression Analysis, 3rd Edition, Wiley Series in Probability and Statistics (2001)
- et al., Applied Regression Analysis, 2nd Edition (1981)
- Linear Regression Analysis, Wiley Series in Probability and Statistics (1977)
- et al., Symbolic Data Analysis and the SODAS Software (2008)
- et al., Symbolic data analysis: conceptual statistics and data mining, Wiley Series in Computational Statistics (2006)
- Thinking by classes in data science: the symbolic data analysis paradigm, Wiley Interdiscip. Rev. Comput. Stat. (2016)
- Regression analysis for interval-valued data
- Symbolic regression analysis
- Centre and range method for fitting a linear regression model to symbolic interval data
- Constrained linear regression models for symbolic interval-valued variables
- Linear regression of interval-valued data based on complete information in hypercubes
SP Systems Engineering Society of China
Cited by (40)
Generalized linear models for symbolic polygonal data
2024, Knowledge-Based Systems
Parametrized linear regression for boxplot-multivalued data applied to the Brazilian Electric Sector
2024, Information Sciences
Two-dimensional Gaussian hierarchical priority fuzzy modeling for interval-valued data
2023, Information Sciences
Nonparametric regression for interval-valued data based on local linear smoothing approach
2022, Neurocomputing
Citation Excerpt: In Domigues et al. [7], they established a robust regression model with interval-valued data by applying the symmetrical linear regression methodology. Souza et al. [30] proposed a parametrized method and built two linear regressions for the lower and upper bounds of the response variable. Some other works on this topic can be found in Xu [34], Fagundes et al. [8], Blanco-Fernández et al. [3], Giordani [23], Dias and Brito [6], García-Bárzana [14], etc.
Fixed effects panel interval-valued data models and applications
2022, Knowledge-Based Systems
Citation Excerpt: In SDA, Billard and Diday [7] introduced dispersion measures and central tendency of interval-valued data. Interval-valued linear regression models were also built based on certain predefined criterion [8–14]. Billard and Diday [8] presented the first algorithm for fitting interval-valued linear regression, and this algorithm consisted of fitting a linear regression model to the midpoints of the interval values and the parameters were obtained by minimization of the mid-point error.