Centre and Range method for fitting a linear regression model to symbolic interval data

https://doi.org/10.1016/j.csda.2007.04.014

Abstract

This paper introduces a new approach to fitting a linear regression model to symbolic interval data. Each example of the learning set is described by a feature vector, for which each feature value is an interval. The new method fits a linear regression model on the mid-points and ranges of the interval values assumed by the variables in the learning set. The prediction of the lower and upper bounds of the interval value of the dependent variable is accomplished from its mid-point and range, which are estimated from the fitted linear regression model applied to the mid-point and range of each interval value of the independent variables. The assessment of the proposed prediction method is based on the estimation of the average behaviour of both the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo experiment. Finally, the approaches presented in this paper are applied to a real data set and their performance is compared.

Introduction

Predicting the behaviour of a (dependent) variable in relation to other (independent) variables that are thought to be responsible for the variability of the former is an important task in data analysis, pattern recognition, data mining, machine learning, etc. The classical regression model is used to predict the values of a dependent quantitative variable in relation to the values of independent quantitative variables. However, to fit this model to the data, it is necessary to estimate a vector of parameters β from the data vector Y and the model matrix X, assumed to have full rank p. The estimation using the least squares method does not require any probabilistic hypothesis on the variable Y. This method consists of minimising the sum of the squared residuals. A detailed study on linear regression models for usual quantitative data can be found in Draper and Smith (1981), Montgomery and Peck (1982) and Scheffé (1959), among others.
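As a minimal sketch of the least squares estimation described above, the following Python snippet estimates β by minimising the sum of squared residuals with NumPy. The helper name `fit_ols` and the toy data are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def fit_ols(X, y):
    """Return the least squares estimate of beta for the model y = [1, X] beta."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single predictor
    # Prepend a column of ones so that beta[0] is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return beta

# Toy data (assumed): y = 1 + 2x exactly, so all residuals are zero.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
beta_hat = fit_ols(x, y)
```

On this noise-free toy data the estimate should recover the intercept and slope used to generate `y`.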

In regression analysis of usual data, the items are usually represented as a vector of quantitative measurements for which each column represents a variable. In practice, however, this model is too restrictive to represent complex data. In order to take into account the variability and/or uncertainty inherent to the data, variables must assume sets of categories or intervals, possibly even with frequencies or weights. Data of this type have mainly been studied in symbolic data analysis (SDA), which is a domain in the area of knowledge discovery and data management related to multivariate analysis, pattern recognition and artificial intelligence. The aim of SDA is to provide suitable methods (clustering, factorial techniques, decision trees, etc.) for managing aggregated data described by multi-valued variables, for which the cells of the data table contain sets of categories, intervals or weight (probability) distributions (Bock and Diday, 2000).

As mentioned above, the items are usually represented as a vector of quantitative measurements. However, due to recent advances in information technologies, it is now common to record interval data. In the framework of SDA, interval data appear when the observed values of the variables are intervals from the set of real numbers R. Interval data arise in situations such as recording monthly interval temperatures in meteorological stations, daily interval stock prices, etc. Another source of interval data is the aggregation of huge databases into a reduced number of groups, the properties of which are described by symbolic interval variables. Therefore, tools for symbolic interval data analysis are much needed.
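To illustrate how aggregation can produce interval data, the sketch below groups individual records and describes each group by the interval spanned by its observed values. Taking the minimum and maximum as interval bounds is an assumption made here for illustration; the function name `aggregate_to_intervals` and the readings are hypothetical.

```python
from collections import defaultdict

def aggregate_to_intervals(records):
    """Map each group to the (min, max) interval of its observed values."""
    grouped = defaultdict(list)
    for group, value in records:
        grouped[group].append(value)
    return {group: (min(values), max(values)) for group, values in grouped.items()}

# Hypothetical daily temperature readings keyed by month.
readings = [("Jan", -3.0), ("Jan", 4.5), ("Jan", 1.0), ("Feb", 0.5), ("Feb", 7.0)]
intervals = aggregate_to_intervals(readings)
```

Each group is then a single item whose value on the temperature variable is an interval rather than a point.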

Currently, different approaches have been introduced to analyse symbolic interval data. Regarding univariate statistics, Bertrand and Goupil (2000) and Billard and Diday (2003) introduced central tendency and dispersion measures suitable for symbolic interval data. De Carvalho (1995) proposed histograms for symbolic interval data. Factorial methods for analysing symbolic interval data have also been considered in SDA. Cazes et al. (1997) and Lauro and Palumbo (2000) introduced principal component analysis methods suitable for symbolic interval data. Palumbo and Verde (2000) and Lauro et al. (2000) generalised factorial discriminant analysis (FDA) to symbolic interval data. Regarding classification, Ichino et al. (1996) introduced a symbolic classifier as a region-oriented approach for symbolic interval data. Rasson and Lissoir (2000) presented a symbolic kernel classifier based on dissimilarity functions suitable for symbolic interval data. Périnel and Lechevallier (2000) proposed a tree-growing algorithm for classifying symbolic interval data.

SDA provides a number of clustering methods for symbolic data. These methods differ with regard to the type of symbolic data considered, their cluster structures and/or the clustering criteria considered. With hierarchical clustering methods, an agglomerative approach has been introduced that forms composite symbolic objects using a join operator whenever mutual pairs of symbolic objects are selected for agglomeration based on minimum dissimilarity (Gowda and Diday, 1991) or maximum similarity (Gowda and Diday, 1992). Ichino and Yaguchi (1994) defined generalised Minkowski metrics for mixed feature variables and presented dendrograms obtained from the application of standard linkage methods to data sets containing numeric and symbolic feature values. Chavent (1998) proposed a divisive clustering method for symbolic data that simultaneously furnishes a hierarchy of the symbolic data set and a monothetic characterisation of each cluster in the hierarchy. Guru et al. (2004) introduced agglomerative clustering algorithms based on similarity functions that are multi-valued and non-symmetric.

Regarding partitioning clustering algorithms for symbolic interval data, Bock (2002) proposed several clustering algorithms for symbolic data described by interval variables, and presented a sequential clustering and updating strategy for constructing a self-organising map (SOM) to visualise symbolic interval data. Chavent and Lechevallier (2002) proposed a dynamic clustering algorithm for interval data where the class representatives are defined by an optimality criterion based on a modified Hausdorff distance. Souza and De Carvalho (2004) presented partitioning clustering methods for interval data based on (adaptive and non-adaptive) city-block distances. More recently, De Carvalho et al. (2006) proposed an algorithm using an adequacy criterion based on adaptive Hausdorff distances.

This paper addresses linear regression models for predicting symbolic interval data. Billard and Diday (2000) presented the first approach to fitting a linear regression model to symbolic interval data sets from an SDA perspective. Their approach consists of fitting a linear regression model to the mid-points of the interval values assumed by the symbolic interval variables in the learning set. The fitted model is then applied to the lower and upper bounds of the interval values of the independent symbolic interval variables to predict the lower and upper bounds of the interval value of the dependent variable, respectively.
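The mid-point approach described above can be sketched as follows, assuming the bounds are stored as NumPy arrays of shape (n, p) for the independent variables and (n,) for the dependent variable. The function names and the toy intervals are hypothetical, and the snippet is only an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def fit_centre(X_low, X_up, y_low, y_up):
    """Fit a single parameter vector on the interval mid-points."""
    X_mid = (X_low + X_up) / 2.0
    y_mid = (y_low + y_up) / 2.0
    design = np.column_stack([np.ones(len(X_mid)), X_mid])
    beta, *_ = np.linalg.lstsq(design, y_mid, rcond=None)
    return beta

def predict_centre(beta, X_low, X_up):
    """Apply the same beta to the lower bounds and to the upper bounds."""
    y_low_hat = beta[0] + X_low @ beta[1:]
    y_up_hat = beta[0] + X_up @ beta[1:]
    return y_low_hat, y_up_hat

# Toy intervals (assumed): both bounds follow y = 2 + 3x exactly.
X_low = np.array([[1.0], [2.0], [4.0], [5.0]])
X_up = np.array([[3.0], [6.0], [6.0], [9.0]])
y_low = 2.0 + 3.0 * X_low[:, 0]
y_up = 2.0 + 3.0 * X_up[:, 0]

beta = fit_centre(X_low, X_up, y_low, y_up)
y_low_hat, y_up_hat = predict_centre(beta, X_low, X_up)
```

Nothing in this sketch forces the predicted lower bound to stay below the predicted upper bound; with a negative coefficient, for example, the two can be reversed.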

This paper introduces a Centre and Range approach to fitting a linear regression model to symbolic interval data. The probabilistic assumptions that underlie linear regression model theory for classical data will not be considered in the case of symbolic data (symbolic interval variables), as this remains an open research topic. Thus, the problem will be investigated as an optimisation problem, in which we seek to minimise a predefined criterion.

In Table 1, we show the criteria and models that represent the three approaches presented in this paper.

The first method (Billard and Diday, 2000) is based on the minimisation of the mid-point error, since $(\varepsilon_i^L+\varepsilon_i^U)/2=\varepsilon_i^c$. The lower and upper bounds of the dependent variable are predicted, respectively, from the lower and upper bounds of the independent variable using the same vector of parameters β. The second approach (Billard and Diday, 2002) fits two independent linear regression models on the lower and upper bounds of the intervals, respectively, and minimises $\sum_{i=1}^{n}(\varepsilon_i^L)^2+\sum_{i=1}^{n}(\varepsilon_i^U)^2$. The third approach considers the minimisation of the sum of the mid-point square error plus the sum of the range square error, and the reconstruction of the interval bounds is based on the mid-point and range estimates.
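The second and third approaches can be sketched side by side, assuming NumPy arrays of shape (n, p) for the independent bounds and (n,) for the dependent bounds. Here the range is taken as the full width (upper minus lower), which is a convention assumed for this illustration; all function names and the toy data are hypothetical.

```python
import numpy as np

def _ols(X, y):
    """Least squares fit of y on [1, X]."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def _apply(beta, X):
    return beta[0] + X @ beta[1:]

def fit_minmax(X_low, X_up, y_low, y_up):
    """Two independent fits: one on lower bounds, one on upper bounds."""
    return _ols(X_low, y_low), _ols(X_up, y_up)

def predict_minmax(betas, X_low, X_up):
    beta_low, beta_up = betas
    return _apply(beta_low, X_low), _apply(beta_up, X_up)

def fit_crm(X_low, X_up, y_low, y_up):
    """Two fits: one on mid-points, one on ranges."""
    beta_c = _ols((X_low + X_up) / 2.0, (y_low + y_up) / 2.0)
    beta_r = _ols(X_up - X_low, y_up - y_low)
    return beta_c, beta_r

def predict_crm(betas, X_low, X_up):
    """Rebuild the bounds from the predicted mid-point and range."""
    beta_c, beta_r = betas
    centre = _apply(beta_c, (X_low + X_up) / 2.0)
    width = _apply(beta_r, X_up - X_low)
    return centre - width / 2.0, centre + width / 2.0

# Toy data (assumed): mid-points and ranges are each exactly linear.
X_low = np.array([[1.0], [2.0], [4.0], [5.0]])
X_up = np.array([[3.0], [6.0], [6.0], [9.0]])
y_mid = 2.0 + 3.0 * ((X_low + X_up) / 2.0)[:, 0]
y_width = 1.0 + 0.5 * (X_up - X_low)[:, 0]
y_low, y_up = y_mid - y_width / 2.0, y_mid + y_width / 2.0

crm_low, crm_up = predict_crm(fit_crm(X_low, X_up, y_low, y_up), X_low, X_up)
mm_low, mm_up = predict_minmax(fit_minmax(X_low, X_up, y_low, y_up), X_low, X_up)
```

Because the toy mid-points and ranges are generated from exact linear relations, the mid-point and range fits reproduce the bounds here, whereas the two separate bound fits need not.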

In order to show the usefulness of these approaches, the lower and upper bounds of the interval values of an interval-valued variable that is linearly related to a set of independent interval-valued variables will be predicted for independent data sets according to each method. The assessment of the proposed methods will be based on the estimation of the average behaviour of the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo experiment.
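The two assessment measures named above can be computed separately for the lower and the upper bounds, as in the following sketch. The function names and the toy predictions are assumptions for illustration; in a Monte Carlo setting these values would additionally be averaged over replications.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation between observed and predicted values."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)

def evaluate_bounds(y_low, y_up, y_low_hat, y_up_hat):
    """Collect both measures for the lower and the upper bounds."""
    return {
        "rmse_lower": rmse(y_low, y_low_hat),
        "rmse_upper": rmse(y_up, y_up_hat),
        "r2_lower": squared_correlation(y_low, y_low_hat),
        "r2_upper": squared_correlation(y_up, y_up_hat),
    }

# Toy values (assumed): lower-bound predictions are shifted by 0.5,
# upper-bound predictions are exact.
scores = evaluate_bounds(
    y_low=[1.0, 2.0, 3.0, 4.0],
    y_up=[2.0, 4.0, 6.0, 8.0],
    y_low_hat=[1.5, 2.5, 3.5, 4.5],
    y_up_hat=[2.0, 4.0, 6.0, 8.0],
)
```

A constant shift leaves the squared correlation at one while still producing a non-zero root mean square error, which is one reason to report both measures.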

This paper is organised as follows: Section 2.1 presents the Centre (Billard and Diday, 2000) and the MinMax (Billard and Diday, 2002) methods from an optimisation perspective. Section 2.2 presents the Centre and Range approach to fitting a linear regression model to interval-valued data. To show the usefulness of the Centre and Range approach, Section 3 describes the framework of the Monte Carlo simulations and presents experiments with synthetic and real interval-valued data sets. Finally, Section 4 gives the concluding remarks.

Linear regression models for symbolic interval-valued data

In the following sections, we will present different approaches to fitting a linear regression model to symbolic interval-valued data. Each approach will be based on a predefined criterion.

The Monte Carlo experiments

To show the usefulness of the Centre and Range approach proposed in this paper, this section considers experiments with synthetic symbolic interval-valued data sets that present different degrees of difficulty for fitting a linear regression model, together with a cardiological data set (Billard and Diday, 2000).

Concluding remarks

This paper presented a Centre and Range method (CRM) for fitting a linear regression model to interval-valued data. The method uses the information contained in the mid-points and ranges of the intervals based on a predefined criterion. The assessment of the proposed prediction method was based on the average behaviour of the root mean square error and the square of the correlation coefficient in the framework of a Monte Carlo simulation. The synthetic symbolic interval data sets were constructed with (and without)

Acknowledgments

The authors would like to thank CNPq and CAPES (Brazilian Agencies) for their financial support.

References (25)

  • Chavent, M., 1998. A monothetic clustering method. Pattern Recognition Lett.
  • De Carvalho, F.A., et al., 2006. Adaptive Hausdorff distances and dynamic clustering of symbolic data. Pattern Recognition Lett.
  • Guru, D.S., et al., 2004. Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Lett.
  • Bertrand, P., Goupil, F., 2000. Descriptive statistic for symbolic data. In: Bock, H.-H., Diday, E. (Eds.), Analysis of...
  • Bock, H.-H., 2002. Clustering algorithms and Kohonen maps for symbolic data. J. Jpn. Soc. Comput. Statist.
  • Bock, H.H., et al., 2000. Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data.
  • Billard, L., Diday, E., 2000. Regression analysis for interval-valued data. In: Data Analysis, Classification and...
  • Billard, L., Diday, E., 2002. Symbolic regression analysis. In: Classification, Clustering and Data Analysis,...
  • Billard, L., et al., 2003. From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc.
  • Cazes, P., et al., 1997. Extension de l’analyse en composantes principales à des données de type intervalle. Rev. Statist. Appliquée
  • Chavent, M., Lechevallier, Y., 2002. Dynamical clustering algorithm of interval data: optimization of an adequacy...
  • De Carvalho, F.A.T., 1995. Histograms in symbolic data analysis. Ann. Oper. Res.