Elsevier

Knowledge-Based Systems

Volume 131, 1 September 2017, Pages 149-159
A parametrized approach for linear regression of interval data

https://doi.org/10.1016/j.knosys.2017.06.012

Abstract

Interval symbolic data is a complex data type that can often be obtained by summarizing large datasets. All existing linear regression approaches for interval data use certain fixed reference points to model intervals, such as midpoints, ranges and lower and upper bounds. This is a limitation, because different datasets might be better represented by different reference points. In this paper, we propose a new method for extracting knowledge from interval data. Our parametrized approach automatically extracts the best reference points from the regressor variables. These reference points are then used to build two linear regressions: one for the lower bounds of the response variable and another for its upper bounds. Before the regressions are applied, we compute a criterion to verify the mathematical coherence of predicted values. Mathematical coherence means that the upper bounds are greater than the lower bounds. If the criterion shows that the coherence is not guaranteed, we suggest the use of a novel interval Box-Cox transformation of the response variable. Experimental evaluations with synthetic and real interval datasets illustrate the advantages and the usefulness of the proposed method to perform interval linear regression.

Introduction

Linear regression is related to the construction of models that explore linear dependency between variables. Two types of variables are involved: the response (or dependent) variable and the regressor (or independent) variables. The main goal is to define a linear model which explains the response based on regressors. It can be used to predict unknown or unobservable values of the response variable based on the regressor variables’ values [1], [2], [3], [4].

Symbolic Data Analysis (SDA) [5], [6] defines a way of extracting knowledge from complex data types, called symbolic data, which represent higher level units, such as classes or concepts. In order to take into account the variability within each unit member, they can be described by intervals, distributions, sets of categories or numbers, which can sometimes be weighted. The first step in SDA is to build a symbolic data table where the rows are higher level units and the columns can take symbolic values. The second step is to study and extract new knowledge from these new kinds of data using extensions of Computer Statistics and Data Mining to symbolic data. SDA gives answers to big data and complex data challenges, because big data can be reduced and summarized by classes [7].

In the literature of linear regression applied to interval data, there are many works which do not assume any distribution for the residuals. Billard and Diday [8] proposed the Center Method (CM), which builds a linear regression model using the midpoints of response and regressor intervals and predicts response bounds using the regressor variable bounds. Later, Billard and Diday [9] discussed the MinMax Method (MinMax), which defines two models, one for each response bound, with lower bounds depending on the regressor variables’ lower bounds and the response upper bounds depending on the regressor variables’ upper bounds. Lima Neto and De Carvalho [10] introduced the Center and Range Method (CRM), which also proposes two linear models: one for the midpoints and another for the ranges. Lima Neto and De Carvalho [11] later extended CRM to include positive constraints to the coefficients of interval ranges, introducing the Constrained Center and Range Method (CCRM) which guarantees the mathematical coherence of predicted values, where the predicted upper bounds are greater than or equal to their respective lower bounds. Wang et al. [12] proposed a linear model which uses all interval points, named Complete Information Method (CIM). By using Moore’s linear combination [13], CIM also guarantees the mathematical coherence of its predictions.
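The Center Method described above is simple enough to sketch directly. The following is a minimal single-regressor illustration (function names and the use of NumPy's least squares are choices made here, not from the paper): CM fits one ordinary least squares model on the interval midpoints and then predicts the response bounds by applying that same model to the regressor's lower and upper bounds.

```python
import numpy as np

def fit_center_method(x_lo, x_hi, y_lo, y_hi):
    # CM fits a single OLS model on the interval midpoints.
    xc = (x_lo + x_hi) / 2.0
    yc = (y_lo + y_hi) / 2.0
    A = np.column_stack([np.ones_like(xc), xc])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return beta  # [intercept, slope]

def predict_center_method(beta, x_lo, x_hi):
    # CM predicts the response bounds by plugging the regressor's
    # lower and upper bounds into the midpoint model.
    b0, b1 = beta
    return b0 + b1 * x_lo, b0 + b1 * x_hi
```

Note that when the fitted slope is negative, applying the midpoint model to the raw bounds can invert them, which is one motivation for the coherence-constrained variants (CCRM, CIM) discussed above.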

Other methods assume certain properties for the datasets. Domingues et al. [14] introduced an interval linear regression which is robust against outliers and builds two models, which have symmetric errors for centers and ranges. In another work, Lima Neto et al. [15] suggested the representation of intervals as bivariate vectors. They use a bivariate symbolic regression for interval data which is based on generalized linear models. Souza et al. [16] proposed multi-class logistic regression models that employ different interval representations. Fagundes et al. [17] proposed a robust regression model for intervals which is resistant to outliers. A method based on kernel functions was introduced by Fagundes et al. [18]. They suggested the use of non-parametric functions for interval centers and ranges. Giordani [19] proposed a Lasso-IR method using one regression model for the interval centers and another for the ranges. The models’ coefficients are related by a degree of diversity which is a parameter of the method. The sum of the squared errors is minimized by Least Absolute Shrinkage and Selection Operator (Lasso) [20], which also includes a limit for the sum of absolute coefficient values. Lima Neto and De Carvalho [21] extended classical nonlinear regression to build an interval nonlinear regression model. They use some optimization algorithms to build models with the best accuracy and prediction precision.

Due to the complex nature of interval data, the formulation of regression models is not trivial. The methods proposed in the literature and discussed above typically fix certain reference points or parameters of the intervals to build their models, such as midpoint and range, or lower and upper bounds. The problem with this kind of approach is that the information carried by intervals differs across datasets, so no single fixed choice of reference points suits them all.

If a method always fixes the same reference points to build its models, it may perform poorly on datasets that would be better modeled by a different set of reference points. Therefore, this paper proposes the novel Parametrized Method (PM) for interval linear regression modeling. In this new approach, the intervals of the regressor variables are parametrized through the parametric equation of the straight line. Two models are proposed for the estimation of the response bounds. PM automatically discovers the set of reference points of the regressor variables used to build the regression models, which makes it both an improvement on and a generalization of existing methods.
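The parametric equation of the straight line through an interval's bounds can be written as x(λ) = x̲ + λ(x̄ − x̲) with λ ∈ [0, 1], so λ = 0 recovers lower bounds, λ = 1 upper bounds, and λ = 0.5 midpoints. The sketch below illustrates the idea of selecting a reference point automatically; the grid search over λ is only an illustrative stand-in (the paper derives its solution with least squares, and the function names here are assumptions):

```python
import numpy as np

def reference_points(x_lo, x_hi, lam):
    # Parametric equation of the segment joining the bounds:
    # lam = 0 gives the lower bound, lam = 1 the upper bound,
    # lam = 0.5 the midpoint.
    return x_lo + lam * (x_hi - x_lo)

def fit_bound_model(x_lo, x_hi, y_bound, lams=np.linspace(0.0, 1.0, 21)):
    # Illustrative stand-in for PM's selection step: pick the lambda
    # whose reference points give the smallest least-squares residual
    # for one response bound (lower or upper).
    best = None
    for lam in lams:
        x_ref = reference_points(x_lo, x_hi, lam)
        A = np.column_stack([np.ones_like(x_ref), x_ref])
        beta, *_ = np.linalg.lstsq(A, y_bound, rcond=None)
        sse = float(np.sum((A @ beta - y_bound) ** 2))
        if best is None or sse < best[0]:
            best = (sse, lam, beta)
    return best[1], best[2]  # (lambda, [intercept, slope])
```

Running this twice, once with the response lower bounds and once with the upper bounds, yields the two models PM proposes; a dataset whose response tracks the regressor upper bounds will be assigned λ near 1, which a method with fixed reference points could not discover.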

In addition to being able to choose the reference points that better represent the intervals, a criterion is proposed to verify the mathematical coherence of the model’s predictions, before building the regression. A novel strategy, through interval transformations applied to the response variable, guarantees the mathematical coherence.

The rest of the paper is organized as follows: Section 2 reviews regression methods available for SDA; Section 3 shows the construction of PM, explaining the model fitting and the least squares solution; Section 4 introduces a procedure for PM models which guarantees mathematical coherence; Section 5 provides an experimental evaluation of regression methods using synthetic and real interval data; and Section 6 presents some concluding remarks.

Section snippets

Interval linear regression methods

An interval $\gamma$ is defined by its bounds, $\gamma = [\underline{\gamma}, \overline{\gamma}]$, with $\underline{\gamma} \in \mathbb{R}$, $\overline{\gamma} \in \mathbb{R}$ and $\underline{\gamma} \le \overline{\gamma}$. The values $\underline{\gamma}$ and $\overline{\gamma}$ are, respectively, the lower and upper bounds of the interval $\gamma$ [6]. This paper uses the following notation for interval variables: $Y$ is an interval response variable with $n$ observations, $Y = \{y_1, y_2, \ldots, y_n\}$, with $y_i = [\underline{y}_i, \overline{y}_i]$; there are $p$ regressor variables $\{X_1, X_2, \ldots, X_p\}$, each with $n$ interval observations, $X_j = \{x_{j1}, x_{j2}, \ldots, x_{jn}\}$ and $x_{ji} = [\underline{x}_{ji}, \overline{x}_{ji}]$. Let $x_\phi$ be a multivariate interval
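The notation above maps directly onto a simple array representation. A minimal sketch (the representation as a pair of NumPy arrays and the helper name are assumptions, not from the paper) stores an interval variable as aligned arrays of lower and upper bounds and validates the defining condition $\underline{\gamma} \le \overline{\gamma}$:

```python
import numpy as np

def interval_variable(lower, upper):
    # Store an interval variable as aligned arrays of bounds and
    # enforce the defining condition: lower <= upper everywhere.
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    if lo.shape != hi.shape:
        raise ValueError("bound arrays must have the same shape")
    if np.any(lo > hi):
        raise ValueError("each lower bound must not exceed its upper bound")
    return lo, hi
```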

The parametrized method

This section describes how PM builds interval linear regressions using the least squares method. PM builds two different models, one for each response bound. For the regressors, the parametric equation of the straight line is used. Unlike existing methods, which fix reference points on the regressors, PM automatically selects the regressor reference points used to fit the models.

Analysis of prediction coherence

A desirable feature of interval regression models is maintaining the mathematical coherence of the predicted bounds. In this section, we investigate PM's behavior regarding the mathematical coherence of interval predictions and propose an approach based on transformations to provide it. To this end, we define the Box-Cox transformation for interval data.
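The excerpt does not give the paper's exact interval Box-Cox formulas, but the core idea can be sketched: because the classical Box-Cox map is strictly increasing on the positive reals, applying it bound-wise to a strictly positive interval response preserves the order of the bounds. The following is a sketch under that assumption, not the paper's definition:

```python
import numpy as np

def boxcox_interval(y_lo, y_hi, lam):
    # Bound-wise Box-Cox transform of a positive interval response.
    # The map is strictly increasing on (0, inf), so the transformed
    # bounds keep the order y_lo <= y_hi.
    if np.any(y_lo <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0.0:
        return np.log(y_lo), np.log(y_hi)
    return (y_lo ** lam - 1.0) / lam, (y_hi ** lam - 1.0) / lam
```

Fitting the two bound models on the transformed response and inverting the transform on the predictions is one way such a monotone transformation can help keep predicted upper bounds above lower bounds.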

Experimental evaluation

This section compares PM’s performance against the methods proposed in the literature: CM, MinMax, CRM, CCRM and CIM. Synthetic datasets are generated to analyse the fit of these methods under different configurations for the dependency between regressor and response variables. Some real datasets are used to fit regression models, confirming the adaptability and the better fit of PM’s models.
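A common way to score such comparisons, used here as an assumed example since the excerpt does not name the paper's exact criterion, is to compute root mean squared errors separately for the lower and upper bounds:

```python
import numpy as np

def interval_rmse(y_lo, y_hi, pred_lo, pred_hi):
    # RMSE computed separately on the lower and upper bounds,
    # a standard pair of fit measures for interval regression.
    rmse_lo = np.sqrt(np.mean((y_lo - pred_lo) ** 2))
    rmse_hi = np.sqrt(np.mean((y_hi - pred_hi) ** 2))
    return rmse_lo, rmse_hi
```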

Conclusion

This paper proposed the PM method, a new linear regression method for interval data. Two different models are built: one for response lower bounds and another for response upper bounds. Both models use automatically chosen reference points for regressors. An advantage of PM is the use of the least squares method, with no assumption for the probability distribution of errors. This allows the computation of the models using the classic matrix approach for multidimensional regression.

PM has the

References (30)

  • M.A.O. Domingues et al., A robust method for linear regression of symbolic interval data, Pattern Recognit. Lett. (2010)
  • R.A.A. Fagundes et al., Robust regression with application to symbolic interval data, Eng. Appl. Artif. Intell. (2013)
  • R.A.A. Fagundes et al., Interval kernel regression, Neurocomputing (2014)
  • A. Rencher et al., Linear Models in Statistics (2008)
  • D. Montgomery et al., Introduction to Linear Regression Analysis, 3rd Edition, Wiley Series in Probability and Statistics (2001)
  • N. Draper et al., Applied Regression Analysis, 2nd Edition (1981)
  • G. Seber, Linear Regression Analysis, Wiley Series in Probability and Statistics (1977)
  • E. Diday et al., Symbolic Data Analysis and the SODAS Software (2008)
  • L. Billard et al., Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley Series in Computational Statistics (2006)
  • E. Diday, Thinking by classes in data science: the symbolic data analysis paradigm, Wiley Interdiscip. Rev. Comput. Stat. (2016)
  • L. Billard et al., Regression analysis for interval-valued data
  • L. Billard et al., Symbolic regression analysis
  • E.A.L. Neto et al., Centre and range method for fitting a linear regression model to symbolic interval data (2008)
  • E.A.L. Neto et al., Constrained linear regression models for symbolic interval-valued variables (2010)
  • H. Wang et al., Linear regression of interval-valued data based on complete information in hypercubes, Systems Engineering Society of China (2012)