Least squares estimation of a linear regression model with LR fuzzy response

https://doi.org/10.1016/j.csda.2006.04.036

Abstract

The problem of regression analysis in a fuzzy setting is discussed. A general linear regression model for studying the dependence of an LR fuzzy response variable on a set of crisp explanatory variables is introduced, along with a suitable iterative least squares estimation procedure. This model is then framed within a wider strategy of analysis, capable of managing various types of uncertainty. These include the imprecision of the regression coefficients and the choice of a specific parametric model within a given class of models. The first source of uncertainty is dealt with by exploiting the implicit fuzzy arithmetic relationships between the spreads of the regression coefficients and the spreads of the response variable. For the second kind of uncertainty, a suitable selection procedure is illustrated. This consists of maximizing an appropriately introduced goodness of fit index within the given class of parametric models. The above strategy is illustrated in detail, with reference to an application to real data collected in the framework of an environmental study. In the final remarks, some critical points are underlined, along with a few indications for future research in this field.

Introduction

The study of relationships between real world phenomena is one of the basic aims in science, and plays a fundamental role in decision making in everyday life. Statistical regression analysis is a powerful tool in this domain. It concerns the analysis of the statistical link between a “response” variable (say Y) and a set of “explanatory” or “predictive” variables (say X1, …, Xm), on the basis of a set of observations of the joint behavior of these variables.

Several sources of uncertainty may affect this kind of study, including: (a) the sampling effects due to the selection of the specific set of statistical units on which the analysis is carried out; (b) the ignorance concerning the type of model expressing the dependence relationship of Y on the Xj's (j = 1, …, m); (c) the ignorance about the specific model to be selected within a given class of regression models; (d) the imprecision of the mechanism ruling the dependence relationship, whatever the type of model considered; (e) the imprecision/vagueness of the observed data.

In the literature, the uncertainty stemming from sources (a) and (c) has been widely explored. The classical theory of Linear Models (e.g. Graybill, 1961; Neter et al., 1996) provides the most relevant piece of methodology in this respect. More recent developments enlarge the scope of the traditional approach to non-parametric methods and statistical learning techniques (e.g. Hastie et al., 2001). To a certain extent, uncertainty of type (b) is also dealt with in the latter case. Bayesian analysis provides a different viewpoint in coping with uncertainty of types (a), (b) and (c), making a “full” use of probability in jointly managing the randomness of the data and the uncertainty concerning the models and their parameters (see, e.g., Gelman et al., 1995). In the Bayesian case, the introduction of prior probability distributions over the set of regression parameters represents a way of dealing with the “imprecision” of the regression mechanism (source (d)).

However, in the traditional framework, uncertainty of type (e) is not envisaged. The data are considered as “crisp” empirical information to be fed into the “Statistical Reasoning” process, which may be affected only by other sources of uncertainty.

In this paper, we introduce and develop a statistical regression model enabling us to manage imprecise (fuzzy) data, with particular reference to the response variable, which, in this case, will be denoted by Y˜. The fuzziness of Y˜ may stem from various sources: (i) imprecision in measuring the empirical phenomenon represented by Y; (ii) vagueness of Y when this is expressed in linguistic terms; (iii) partial or total ignorance concerning the values taken by Y on specific observational instances; (iv) “granularity” of the Y-variable, with reference to the way it is defined and used in the analysis (e.g. the age of a person may be described in terms of 5-year intervals, or just as “young”, “middle age”, “old”; to each of these “granulations” there is associated a different amount of uncertainty; see Zadeh, 2005). We argue that, in the above mentioned situations, an appropriate fuzzification of Y may exploit the available information in a more complete and efficient way, than just reducing it to a single value (a number, or a category).

Several approaches to regression analysis for fuzzy data have been developed, starting from the pioneering works by Tanaka et al. (1982), Celminš (1987) and Diamond (1988), based on possibilistic principles (the first one) and on least squares principles (the other two). A brief overview of the various proposals in this domain will be given in Section 2.2.

The present work focuses on the observational situation where the response variable is fuzzy and the explanatory variables are crisp quantitative characters. In this context, we set up a general linear regression model, assuming that the membership function of the response variable belongs to the LR family. This is illustrated in Section 2.3. Then, in Section 3, the estimation procedure is described. This is based on the least squares (LS) principle. In this connection, an appropriate distance function for LR fuzzy variables is introduced and the corresponding LS objective function is defined (Section 3.1). An iterative LS solution is shown in Section 3.2 and some relevant properties of this solution are proved in Section 3.3, while in Section 3.4 specific methodological aspects related to the estimation procedure are discussed. In Section 4, a procedure for assessing the imprecision associated with the estimates of the regression coefficients obtained by the proposed model, is illustrated. This involves the use of an implicit fuzzy regression model with LR fuzzy coefficients, whose parameters (centers and spreads) are estimated by means of independent LS equations with input given by the estimates obtained by the basic regression model (exploiting fuzzy arithmetic relationships).

Next, in Section 5, the problem of model selection is addressed. The corresponding source of uncertainty (type (c) above) is represented by the selection of an “optimal” regression function from among a suitable class of parametric models (Section 5.1). In Section 5.2, a selection tool is suggested, based on an appropriate decomposition of the sum of squares of the response variable and the construction of a multiple determination coefficient. This is used in setting up a suitable selection procedure in the above mentioned parametric class. In Section 6, an application to real world data collected in the framework of an environmental study is used to show the informational capability of the proposed strategy of regression analysis. In this connection, another source of uncertainty is discussed, namely the one related to the sampling variation (source (a)). A possible way of coping with it is suggested, based on a bootstrap procedure for estimating the standard errors of the estimates of the various parameters introduced in the regression model. Finally, in Section 7, we make a few concluding remarks concerning some critical points of the proposed strategy of analysis, and outline possible perspectives of future research in this domain.

Section snippets

The data

We assume that we observe m crisp quantitative explanatory variables X1, …, Xm and a fuzzy response variable, Y˜, on n statistical units. The data will be denoted by (y˜i, xi), i = 1, …, n, where xi = (xi1, …, xim), or, in compact form, by (y˜, X). We will focus on the family of LR fuzzy variables:

$$\tilde{Y} = (m, l, u)_{LR},$$

where m denotes the center and l and u the left and right spreads, respectively, with the following membership function (see examples of membership functions for LR variables in Fig. 1):

$$\mu(y) = \begin{cases} L\left(\dfrac{m - y}{l}\right), & y \le m \ (l > 0), \\ R\left(\dfrac{y - m}{u}\right), & y \ge m \end{cases}$$
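As an illustration of the membership function above, the following minimal sketch evaluates μ(y) for an LR fuzzy number (m, l, u)LR. The function name `lr_membership` is hypothetical, and the default triangular shape functions L(z) = R(z) = max(0, 1 − z) are an assumption made only for this example; the snippet above does not fix a particular choice of L and R.

```python
def lr_membership(y, m, l, u, L=None, R=None):
    """Membership degree of y in the LR fuzzy number (m, l, u)_LR.

    m is the center, l and u are the left and right spreads.
    L and R are the shape functions; if omitted, a triangular shape
    is ASSUMED (illustrative choice, not taken from the paper).
    """
    def triangular(z):
        # Assumed shape function: decreases linearly from 1 to 0 on [0, 1].
        return max(0.0, 1.0 - z)

    L = L or triangular
    R = R or triangular
    if l <= 0 or u <= 0:
        raise ValueError("spreads l and u must be positive")
    if y <= m:
        # Left branch: L((m - y) / l) for y <= m.
        return L((m - y) / l)
    # Right branch: R((y - m) / u) for y > m.
    return R((y - m) / u)


# Usage: a fuzzy observation "about 5", with left spread 2 and right spread 1.
for y in (3.0, 4.0, 5.0, 5.5, 6.0):
    print(y, lr_membership(y, m=5.0, l=2.0, u=1.0))
```

Passing other decreasing shape functions through `L` and `R` (for instance a parabolic shape) would give different members of the LR family with the same center and spreads.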

Distance and objective function

According to the LS criterion, the parameters of model (2.3) should be estimated by minimizing the squared distance between the observed values of the response variable, Y˜, and the corresponding theoretical values Y˜* defined through model (2.3). For this purpose, we introduce the following Euclidean distance, which generalizes a metric proposed by Yang and Ko (1996) for LR fuzzy numbers (see also Coppi and D’Urso, 2003; D’Urso, 2003):

$$d^2(\tilde{y}, \tilde{y}^*) = d^2\big((m, l, u)_{LR}, (\mu, \underline{\delta}_L, \underline{\delta}_U)_{LR}\big) = \Delta_{LR}^2 = \|m - \mu\|^2 + \|(m - \lambda l) - (\mu \ldots$$
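The distance formula is truncated in this snippet, so the sketch below should be read only as an illustration of a Yang and Ko type squared distance, not as the authors' exact definition. It assumes the usual three-term form (centers, left endpoints m − λl, right endpoints m + ρu), and it assumes λ = ρ = 1/2, the values that correspond to triangular shape functions. The function names `lr_sq_distance` and `ls_objective` are hypothetical.

```python
def lr_sq_distance(obs, fit, lam=0.5, rho=0.5):
    """Squared distance between two LR fuzzy numbers.

    obs = (m, l, u): observed center, left spread, right spread.
    fit = (mu, delta_l, delta_u): theoretical center and spreads.
    lam, rho: weights of the spreads; 0.5 is ASSUMED here
    (triangular case), the paper's values are not given in the snippet.
    """
    m, l, u = obs
    mu, delta_l, delta_u = fit
    d_center = m - mu
    # Difference between the lambda-weighted left endpoints.
    d_left = (m - lam * l) - (mu - lam * delta_l)
    # Difference between the rho-weighted right endpoints (assumed third term).
    d_right = (m + rho * u) - (mu + rho * delta_u)
    return d_center ** 2 + d_left ** 2 + d_right ** 2


def ls_objective(observed, fitted, lam=0.5, rho=0.5):
    """Sum of squared distances over all statistical units."""
    return sum(lr_sq_distance(o, f, lam, rho) for o, f in zip(observed, fitted))


# Usage: two units, observed vs. theoretical fuzzy responses.
observed = [(5.0, 2.0, 1.0), (7.0, 1.0, 1.0)]
fitted = [(4.0, 2.0, 1.0), (7.0, 1.0, 1.0)]
print(ls_objective(observed, fitted))
```

An LS estimation procedure would minimize a quantity of this kind with respect to the parameters that generate the fitted centers and spreads.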

Assessment of imprecision of the regression function

In this section we illustrate a procedure enabling us to manage a source of uncertainty affecting the model estimated by means of the procedure described in Section 3. This source refers to the imprecision of the regression coefficients expressing the relationship between Y˜ and the Xj's. In fact, the estimation procedure of model (2.3), described in the previous Section (see formulae (3.8)–(3.12)), provides a crisp evaluation of the regression coefficients, γ̲, which link the theoretical

Class of parametric models

A further source of uncertainty in parametric regression analysis is associated with the choice of the design matrix F. Having observed m quantitative explanatory variables on n statistical units, the simplest linear regression model is characterized by the design matrix F = (fik), whose generic row is given by

$$f_i = (1, x_{i1}, \ldots, x_{ij}, \ldots, x_{im}), \quad i = 1, \ldots, n. \tag{5.1}$$

However, the design vectors in (5.1) can be modified in several ways: by adding non-linear terms in the Xj's, by reducing the number of terms (eliminating the
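The construction of the design vectors in (5.1) can be sketched as follows. The function name `design_matrix` and the `add_squares` option are hypothetical; appending squared terms is just one assumed example of "adding non-linear terms in the Xj's", chosen for illustration.

```python
def design_matrix(X, add_squares=False):
    """Build the design matrix F from crisp explanatory data.

    X: list of n rows, each with the m observed values (x_i1, ..., x_im).
    Each row of F is f_i = (1, x_i1, ..., x_im), as in (5.1).
    add_squares: if True, append the squared terms x_ij^2
    (an ASSUMED example of a modified class member).
    """
    rows = []
    for x in X:
        row = [1.0] + [float(v) for v in x]  # intercept, then linear terms
        if add_squares:
            row += [float(v) ** 2 for v in x]  # optional non-linear terms
        rows.append(row)
    return rows


# Usage: n = 2 units, m = 2 explanatory variables.
X = [[2, 3], [4, 5]]
print(design_matrix(X))
print(design_matrix(X, add_squares=True))
```

Each alternative way of building F defines a different member of the class of parametric models among which a selection is then made.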

Applicative example

In the previous sections we have set out the basis of a systematic procedure for selecting and estimating a suitable regression model enabling us to analyze the dependence of a fuzzy variable on a set of quantitative crisp explanatory variables and possibly to exploit it in predictive terms. Various sources of uncertainty are dealt with by the proposed procedure, including: the fuzziness of the response variable; the fuzziness of the regression coefficients; the uncertainty concerning the

Final remarks

In this work, a strategy of analysis for estimating and selecting a regression model capable of expressing the dependence relationship of a fuzzy variable on a set of crisp explanatory variables has been proposed. Although this proposal is strictly related to previous works by some of the Authors (Coppi and D’Urso, 2003; D’Urso, 2003), several new results are incorporated in the methodology illustrated in the present paper. First of all, a more systematic analysis of uncertainty is associated with

Acknowledgments

We would like to express our gratitude to the Referees for their helpful comments, which improved the quality of the paper. This research was partially supported by the grant PRIN 2005 of the Italian Ministry of Education, University and Research (“Models and methods to handle information and uncertainty in knowledge acquisition processes”).

References (36)

  • H. Tanaka et al. Possibilistic linear systems and their application to the linear regression model. Fuzzy Sets and Systems (1988)
  • H.C. Wu. Fuzzy least squares estimators in linear regression analysis for imprecise input and output data. Comput. Statist. Data Anal. (2003)
  • A. Wünsche et al. Least-squares fuzzy regression with fuzzy random variables. Fuzzy Sets and Systems (2002)
  • M.S. Yang et al. On a class of fuzzy c-numbers clustering procedures for fuzzy data. Fuzzy Sets and Systems (1996)
  • L. Zadeh. Toward a generalized theory of uncertainty (GTU) - an outline. Inform. Sci. (2005)
  • Y.H. Chang et al. Fuzzy regression methods - a comparative assessment. Fuzzy Sets and Systems (2001)
  • P.T. Chang et al. A generalized fuzzy weighted least-squares regression. Fuzzy Sets and Systems (1996)
  • R. Coppi et al. Regression analysis with fuzzy informational paradigm: a least-squares approach using membership function information. Int. J. Pure Appl. Math. (2003)