Elsevier

Knowledge-Based Systems

Volume 131, 1 September 2017, Pages 149-159
A parametrized approach for linear regression of interval data

https://doi.org/10.1016/j.knosys.2017.06.012

Abstract

Interval symbolic data is a complex data type that can often be obtained by summarizing large datasets. All existing linear regression approaches for interval data use certain fixed reference points to model intervals, such as midpoints, ranges and lower and upper bounds. This is a limitation, because different datasets might be better represented by different reference points. In this paper, we propose a new method for extracting knowledge from interval data. Our parametrized approach automatically extracts the best reference points from the regressor variables. These reference points are then used to build two linear regressions: one for the lower bounds of the response variable and another for its upper bounds. Before the regressions are applied, we compute a criterion to verify the mathematical coherence of predicted values. Mathematical coherence means that the upper bounds are greater than the lower bounds. If the criterion shows that the coherence is not guaranteed, we suggest the use of a novel interval Box-Cox transformation of the response variable. Experimental evaluations with synthetic and real interval datasets illustrate the advantages and the usefulness of the proposed method to perform interval linear regression.

Introduction

Linear regression is related to the construction of models that explore linear dependency between variables. Two types of variables are involved: the response (or dependent) variable and the regressor (or independent) variables. The main goal is to define a linear model which explains the response based on regressors. It can be used to predict unknown or unobservable values of the response variable based on the regressor variables’ values [1], [2], [3], [4].

Symbolic Data Analysis (SDA) [5], [6] defines a way of extracting knowledge from complex data types, called symbolic data, which represent higher level units, such as classes or concepts. In order to take into account the variability within each unit member, they can be described by intervals, distributions, sets of categories or numbers, which can sometimes be weighted. The first step in SDA is to build a symbolic data table where the rows are higher level units and the columns can take symbolic values. The second step is to study and extract new knowledge from these new kinds of data using extensions of Computer Statistics and Data Mining to symbolic data. SDA gives answers to big data and complex data challenges, because big data can be reduced and summarized by classes [7].

In the literature of linear regression applied to interval data, there are many works which do not assume any distribution for the residuals. Billard and Diday [8] proposed the Center Method (CM), which builds a linear regression model using the midpoints of response and regressor intervals and predicts response bounds using the regressor variable bounds. Later, Billard and Diday [9] discussed the MinMax Method (MinMax), which defines two models, one for each response bound, with lower bounds depending on the regressor variables’ lower bounds and the response upper bounds depending on the regressor variables’ upper bounds. Lima Neto and De Carvalho [10] introduced the Center and Range Method (CRM), which also proposes two linear models: one for the midpoints and another for the ranges. Lima Neto and De Carvalho [11] later extended CRM to include positive constraints to the coefficients of interval ranges, introducing the Constrained Center and Range Method (CCRM) which guarantees the mathematical coherence of predicted values, where the predicted upper bounds are greater than or equal to their respective lower bounds. Wang et al. [12] proposed a linear model which uses all interval points, named Complete Information Method (CIM). By using Moore’s linear combination [13], CIM also guarantees the mathematical coherence of its predictions.
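The Center Method described above is simple enough to sketch directly. The following is a minimal single-regressor illustration (function names and the use of NumPy's least squares are choices made here, not from the paper): CM fits one ordinary least squares model on the interval midpoints and then predicts the response bounds by applying that same model to the regressor's lower and upper bounds.

```python
import numpy as np

def fit_center_method(x_lo, x_hi, y_lo, y_hi):
    # CM fits a single OLS model on the interval midpoints.
    xc = (x_lo + x_hi) / 2.0
    yc = (y_lo + y_hi) / 2.0
    A = np.column_stack([np.ones_like(xc), xc])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return beta  # [intercept, slope]

def predict_center_method(beta, x_lo, x_hi):
    # CM predicts the response bounds by plugging the regressor's
    # lower and upper bounds into the midpoint model.
    b0, b1 = beta
    return b0 + b1 * x_lo, b0 + b1 * x_hi
```

Note that when the fitted slope is negative, applying the midpoint model to the raw bounds can invert them, which is one motivation for the coherence-constrained variants (CCRM, CIM) discussed above.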

Other methods assume certain properties for the datasets. Domingues et al. [14] introduced an interval linear regression which is robust against outliers and builds two models, which have symmetric errors for centers and ranges. In another work, Lima Neto et al. [15] suggested the representation of intervals as bivariate vectors. They use a bivariate symbolic regression for interval data which is based on generalized linear models. Souza et al. [16] proposed multi-class logistic regression models that employ different interval representations. Fagundes et al. [17] proposed a robust regression model for intervals which is resistant to outliers. A method based on kernel functions was introduced by Fagundes et al. [18]. They suggested the use of non-parametric functions for interval centers and ranges. Giordani [19] proposed a Lasso-IR method using one regression model for the interval centers and another for the ranges. The models’ coefficients are related by a degree of diversity which is a parameter of the method. The sum of the squared errors is minimized by Least Absolute Shrinkage and Selection Operator (Lasso) [20], which also includes a limit for the sum of absolute coefficient values. Lima Neto and De Carvalho [21] extended classical nonlinear regression to build an interval nonlinear regression model. They use some optimization algorithms to build models with the best accuracy and prediction precision.

Due to the complex nature of interval data, the formulation of regression models is not trivial. The methods proposed in the literature and discussed above typically fix certain reference points or parameters of the intervals to build their models, such as midpoint and range, or lower and upper bounds. The problem with this kind of approach is that the information carried by intervals differs across datasets, so no single fixed choice of reference points suits them all.

If a method always fixes the same reference points to build its models, it may perform poorly on datasets that would be better modeled by a different set of reference points. Therefore, this paper proposes the novel Parametrized Method (PM) for interval linear regression modeling. In this new approach, the intervals of the regressor variables are parametrized through the parametric equation of the straight line. Two models are proposed for the estimation of the response bounds. PM automatically discovers the set of reference points of the regressor variables used to build the regression models, which makes it both an improvement on and a generalization of existing methods.
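The parametric equation of the straight line through an interval's bounds can be written as x(λ) = x̲ + λ(x̄ − x̲) with λ ∈ [0, 1], so λ = 0 recovers lower bounds, λ = 1 upper bounds, and λ = 0.5 midpoints. The sketch below illustrates the idea of selecting a reference point automatically; the grid search over λ is only an illustrative stand-in (the paper derives its solution with least squares, and the function names here are assumptions):

```python
import numpy as np

def reference_points(x_lo, x_hi, lam):
    # Parametric equation of the segment joining the bounds:
    # lam = 0 gives the lower bound, lam = 1 the upper bound,
    # lam = 0.5 the midpoint.
    return x_lo + lam * (x_hi - x_lo)

def fit_bound_model(x_lo, x_hi, y_bound, lams=np.linspace(0.0, 1.0, 21)):
    # Illustrative stand-in for PM's selection step: pick the lambda
    # whose reference points give the smallest least-squares residual
    # for one response bound (lower or upper).
    best = None
    for lam in lams:
        x_ref = reference_points(x_lo, x_hi, lam)
        A = np.column_stack([np.ones_like(x_ref), x_ref])
        beta, *_ = np.linalg.lstsq(A, y_bound, rcond=None)
        sse = float(np.sum((A @ beta - y_bound) ** 2))
        if best is None or sse < best[0]:
            best = (sse, lam, beta)
    return best[1], best[2]  # (lambda, [intercept, slope])
```

Running this twice, once with the response lower bounds and once with the upper bounds, yields the two models PM proposes; a dataset whose response tracks the regressor upper bounds will be assigned λ near 1, which a method with fixed reference points could not discover.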

In addition to being able to choose the reference points that better represent the intervals, a criterion is proposed to verify the mathematical coherence of the model’s predictions, before building the regression. A novel strategy, through interval transformations applied to the response variable, guarantees the mathematical coherence.

The rest of the paper is organized as follows: Section 2 reviews regression methods available for SDA; Section 3 shows the construction of PM, explaining the model fitting and the least squares solution; Section 4 introduces a procedure for PM models which guarantees mathematical coherence; Section 5 provides an experimental evaluation of regression methods using synthetic and real interval data; and Section 6 presents some concluding remarks.

Section snippets

Interval linear regression methods

An interval $\gamma$ is defined by its bounds, $\gamma = [\underline{\gamma}, \overline{\gamma}]$, with $\underline{\gamma} \in \mathbb{R}$, $\overline{\gamma} \in \mathbb{R}$ and $\underline{\gamma} \le \overline{\gamma}$. The values $\underline{\gamma}$ and $\overline{\gamma}$ are, respectively, the lower and upper bounds of the interval $\gamma$ [6]. This paper uses the following notation for interval variables: $Y$ is an interval response variable with $n$ observations, $Y = \{y_1, y_2, \ldots, y_n\}$, with $y_i = [\underline{y}_i, \overline{y}_i]$; there are $p$ regressor variables $\{X_1, X_2, \ldots, X_p\}$, each with $n$ interval observations, $X_j = \{x_{j1}, x_{j2}, \ldots, x_{jn}\}$ and $x_{ji} = [\underline{x}_{ji}, \overline{x}_{ji}]$. Let $x_\phi$ be a multivariate interval
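The notation above maps directly onto a simple array representation. A minimal sketch (the representation as a pair of NumPy arrays and the helper name are assumptions, not from the paper) stores an interval variable as aligned arrays of lower and upper bounds and validates the defining condition $\underline{\gamma} \le \overline{\gamma}$:

```python
import numpy as np

def interval_variable(lower, upper):
    # Store an interval variable as aligned arrays of bounds and
    # enforce the defining condition: lower <= upper everywhere.
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    if lo.shape != hi.shape:
        raise ValueError("bound arrays must have the same shape")
    if np.any(lo > hi):
        raise ValueError("each lower bound must not exceed its upper bound")
    return lo, hi
```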

The parametrized method

This section describes how PM builds interval linear regressions using the least squares method. PM builds two different models, one for each response bound. For the regressors, the parametric equation of the straight line is used. Unlike existing methods, which fix reference points on the regressors, PM automatically selects the regressor reference points used to fit the models.

Analysis of prediction coherence

A desirable feature of interval regression models is maintaining the mathematical coherence of the predicted bounds. In this section, we investigate PM's behavior regarding the mathematical coherence of interval predictions and propose an approach based on transformations to provide it. To this end, we define the Box-Cox transformation for interval data.
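The excerpt does not give the paper's exact interval Box-Cox formulas, but the core idea can be sketched: because the classical Box-Cox map is strictly increasing on the positive reals, applying it bound-wise to a strictly positive interval response preserves the order of the bounds. The following is a sketch under that assumption, not the paper's definition:

```python
import numpy as np

def boxcox_interval(y_lo, y_hi, lam):
    # Bound-wise Box-Cox transform of a positive interval response.
    # The map is strictly increasing on (0, inf), so the transformed
    # bounds keep the order y_lo <= y_hi.
    if np.any(y_lo <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0.0:
        return np.log(y_lo), np.log(y_hi)
    return (y_lo ** lam - 1.0) / lam, (y_hi ** lam - 1.0) / lam
```

Fitting the two bound models on the transformed response and inverting the transform on the predictions is one way such a monotone transformation can help keep predicted upper bounds above lower bounds.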

Experimental evaluation

This section compares PM’s performance against the methods proposed in the literature: CM, MinMax, CRM, CCRM and CIM. Synthetic datasets are generated to analyse the fit of these methods under different configurations for the dependency between regressor and response variables. Some real datasets are used to fit regression models, confirming the adaptability and the better fit of PM’s models.
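A common way to score such comparisons, used here as an assumed example since the excerpt does not name the paper's exact criterion, is to compute root mean squared errors separately for the lower and upper bounds:

```python
import numpy as np

def interval_rmse(y_lo, y_hi, pred_lo, pred_hi):
    # RMSE computed separately on the lower and upper bounds,
    # a standard pair of fit measures for interval regression.
    rmse_lo = np.sqrt(np.mean((y_lo - pred_lo) ** 2))
    rmse_hi = np.sqrt(np.mean((y_hi - pred_hi) ** 2))
    return rmse_lo, rmse_hi
```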

Conclusion

This paper proposed the PM method, a new linear regression method for interval data. Two different models are built: one for response lower bounds and another for response upper bounds. Both models use automatically chosen reference points for regressors. An advantage of PM is the use of the least squares method, with no assumption for the probability distribution of errors. This allows the computation of the models using the classic matrix approach for multidimensional regression.

PM has the

References (30)

  • M.A.O. Domingues et al., A robust method for linear regression of symbolic interval data, Pattern Recognit. Lett. (2010)
  • R.A.A. Fagundes et al., Robust regression with application to symbolic interval data, Eng. Appl. Artif. Intell. (2013)
  • R.A.A. Fagundes et al., Interval kernel regression, Neurocomputing (2014)
  • A. Rencher et al., Linear Models in Statistics (2008)
  • D. Montgomery et al., Introduction to Linear Regression Analysis, 3rd Edition, Wiley Series in Probability and Statistics (2001)
  • N. Draper et al., Applied Regression Analysis, 2nd Edition (1981)
  • G. Seber, Linear Regression Analysis, Wiley Series in Probability and Statistics (1977)
  • E. Diday et al., Symbolic Data Analysis and the SODAS Software (2008)
  • L. Billard et al., Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley Series in Computational Statistics (2006)
  • E. Diday, Thinking by classes in data science: the symbolic data analysis paradigm, Wiley Interdiscip. Rev. Comput. Stat. (2016)
  • L. Billard et al., Regression analysis for interval-valued data
  • L. Billard et al., Symbolic regression analysis
  • E.A.L. Neto et al., Centre and range method for fitting a linear regression model to symbolic interval data (2008)
  • E.A.L. Neto et al., Constrained linear regression models for symbolic interval-valued variables (2010)
  • H. Wang et al., Linear regression of interval-valued data based on complete information in hypercubes, Systems Engineering Society of China (2012)