Principal Component Analysis of symmetric fuzzy data

https://doi.org/10.1016/S0167-9473(02)00352-3

Abstract

Principal Component Analysis (PCA) is a well-known tool often used for the exploratory analysis of a numerical data set. Here an extension of classical PCA is proposed which deals with fuzzy data (PCAF for short), where the elementary datum cannot be expressed exactly by a single number and is instead described by a center, two spread measures and a membership function. Specifically, two different PCAF methods, associated with different hypotheses of interrelation between parts of the solution, are proposed. In the first method, called Centers-related Spread PCAF (CS-PCAF), the size of the spread measures depends on the size of the centers. In the second method, called Loadings-related Spread PCAF (LS-PCAF), the spreads are not related directly to the sizes of the centers, but indirectly, via the component loadings. To analyze how well PCAF works, a simulation study was carried out. On the whole, the PCAF method performed at least as well as PCA, except in a few particular conditions. Finally, the application of PCAF to an empirical fuzzy data set is described.

Introduction

There are many practical situations in which numerical observations of I units with respect to J variables cannot be represented by precisely specified values. Examples can be found in economics (daily exchange rates), in marketing research (assessing public opinion about special events such as government elections or the launch of new consumer goods), in psychology (the mental state or subjective perceptions), and in physics (in ballistics, the velocity a projectile needs to perforate a plate). Another significant situation is the attempt to analyze subjective perception or linguistic variables. In each of the above examples an exact numerical coding is very hard to give: thus, the available information can only be represented approximately.

In the above situations the information is vague or fuzzy and can be summarized using interval valued data instead of numerical data. Thus, the score of the generic observation i on the generic variable j is represented by a pair of values: a lower bound and an upper bound which enclose the exact observation. In contrast to the interval valued data framework, the fuzzy data approach makes it possible to take into account additional information given by the membership function. This function assigns a different “role” to each value of an interval, whereas in the interval valued data approach each value has a uniform importance. This distinction will be clearer after defining a fuzzy number, as follows.

A fuzzy number is defined as the triple $F=(m,l,r)_{LR}$, where $m$ denotes the center and $l$ and $r$ are the left and right spread, respectively, with the following membership function

$$\mu_F(\alpha)=\begin{cases} L\left(\dfrac{m-\alpha}{l}\right), & \alpha\leqslant m \quad (l>0),\\[2ex] R\left(\dfrac{\alpha-m}{r}\right), & \alpha\geqslant m \quad (r>0),\end{cases}$$

where $L$ and $R$ are continuous strictly decreasing functions on $[0,1]$ called shape functions. Moreover these functions must fulfil additional requirements: for example, with reference to $L$, $L(0)=1$, $L(z)<1$ if $z>0$, $L(z)>0$ if $z<1$ and $L(1)=0$. Further details can be found in Dubois and Prade (1980).
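As an illustration, the membership function above can be evaluated with a few lines of Python. This is a minimal sketch under stated assumptions: the linear (triangular) shape function `linear_shape` is only one admissible choice of L and R, the clamp to 0 outside [0,1] is a convenience added here, and the function names are hypothetical rather than taken from the paper.

```python
from typing import Callable


def linear_shape(z: float) -> float:
    """Triangular shape function (an assumed choice): L(0)=1, L(1)=0.

    It is strictly decreasing on [0, 1]; values of z above 1 are
    clamped to 0, which is a convenience added for this sketch.
    """
    return max(0.0, 1.0 - z)


def membership(alpha: float, m: float, l: float, r: float,
               L: Callable[[float], float] = linear_shape,
               R: Callable[[float], float] = linear_shape) -> float:
    """Membership degree of alpha in the LR fuzzy number (m, l, r)."""
    if l <= 0 or r <= 0:
        raise ValueError("both spreads must be strictly positive")
    if alpha <= m:
        # Left branch: distance below the center, scaled by the left spread.
        return L((m - alpha) / l)
    # Right branch: distance above the center, scaled by the right spread.
    return R((alpha - m) / r)


# A fuzzy number with center 5, left spread 2 and right spread 1.
at_center = membership(5.0, 5.0, 2.0, 1.0)
left_half = membership(4.0, 5.0, 2.0, 1.0)
right_half = membership(5.5, 5.0, 2.0, 1.0)
```

With this shape function the membership degree falls linearly from 1 at the center to 0 at m−l and m+r; any other functions satisfying the stated requirements can be passed as `L` and `R`.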

Hereafter, for the sake of simplicity, we consider ‘symmetric’ fuzzy numbers only. For a symmetric fuzzy number, $l=r$ and $L=R$. It follows that a symmetric fuzzy number can be completely identified by the pair $F=(m,s)_{L=R}$, where $s=l=r$ is the spread. Thus, a fuzzy data set is a set of fuzzy numbers which represent the scores of I observation units on J symmetric fuzzy variables. By a symmetric fuzzy variable we mean a variable for which each observed datum cannot be quantified exactly but only by using the pair $F=(m,s)_{L=R}$.

In this paper we generalize classical Principal Component Analysis (PCA) to deal with fuzzy data sets using a least squares approach. A few proposals for this are available in the literature: Yabuuchi et al. (1997) propose a method to perform PCA on fuzzy data, in which fuzzy eigenvalues and crisp (non-fuzzy) eigenvectors are obtained by solving a linear programming problem. Cazes et al. (1997) and extensions of their approach (see, for example, Palumbo and Lauro, 2002) propose to carry out factorial decompositions of interval valued data as well as a probabilistic generalization thereof. The basic idea consists of performing PCA on the bounds or on the centers of the interval valued data. In their probabilistic generalization, they make use of a coefficient the value of which depends on the probability law at hand that modifies the information taken into account in the factorial decomposition.
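The idea of performing PCA on the centers of interval valued data, mentioned above, can be sketched as follows. This is a plain classical PCA computed through the singular value decomposition; it is not the PCAF method of this paper nor a reproduction of the cited proposals. The function name `pca_on_centers` and the synthetic intervals are assumptions made only for illustration.

```python
import numpy as np


def pca_on_centers(centers, P):
    """Classical PCA of an I x J matrix of interval centers, keeping P components."""
    M = np.asarray(centers, dtype=float)
    Mc = M - M.mean(axis=0)                      # column-center the data
    U, d, Vt = np.linalg.svd(Mc, full_matrices=False)
    scores = U[:, :P] * d[:P]                    # I x P component scores
    loadings = Vt[:P].T                          # J x P orthonormal loadings
    return scores, loadings


# Synthetic interval data (made-up numbers): 10 units, 5 variables.
rng = np.random.default_rng(0)
lower = rng.normal(size=(10, 5))
upper = lower + rng.uniform(0.1, 1.0, size=(10, 5))
centers = (lower + upper) / 2.0                  # midpoint of each interval
scores, loadings = pca_on_centers(centers, P=2)
```

Because only the midpoints enter the decomposition, the widths of the intervals play no role in this sketch, which is the kind of information the fuzzy approach aims to retain.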

The need for generalizing PCA to fuzzy data is based on the assumption that, as for single valued data, it can be helpful to synthesize the data without losing relevant information. In the next section, we present a general least squares approach to fuzzy data analysis. In Section 3, we propose two different models that extend PCA to fuzzy data, and in Section 4 an alternating least squares algorithm to estimate the solutions is given. In Section 5, we discuss the representation of each observation unit in the low dimensional subspace obtained by PCAF. In Sections 6 and 7, the results of a simulation study comparing PCAF to classical PCA and an application of PCAF to real fuzzy data are given.


Least squares approach to fuzzy data analysis

The need to study and understand fuzzy data has led to a growing interest in fuzzy data analysis. Many authors have dealt, and continue to deal, with fuzzy regression analysis. In the least squares sense, the analysis involves minimizing a distance function between two sets of values: the empirical data set and the values estimated according to the specific model involved. In this framework we refer to Diamond (1988) who has developed a fuzzy least squares method
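To make the notion of a distance between fuzzy numbers concrete, the following sketch compares two triangular fuzzy numbers through their centers and their two endpoints. It is written in the spirit of a least squares distance of this kind; the exact form used in the paper is not reproduced in this excerpt, so the formula and the name `squared_distance` should be read as assumptions.

```python
def squared_distance(f1, f2):
    """Squared distance between triangular fuzzy numbers f = (m, l, r).

    Assumed form: squared differences of the centers, of the lower
    endpoints (m - l) and of the upper endpoints (m + r), summed.
    """
    m1, l1, r1 = f1
    m2, l2, r2 = f2
    return ((m1 - m2) ** 2
            + ((m1 - l1) - (m2 - l2)) ** 2
            + ((m1 + r1) - (m2 + r2)) ** 2)


# Same spreads, centers two units apart: every term contributes 4.
shifted = squared_distance((0, 1, 1), (2, 1, 1))
# Same center, spreads differing by one: only the endpoints differ.
widened = squared_distance((0, 1, 1), (0, 2, 2))
```

For symmetric fuzzy numbers with spread s, this assumed form reduces to three times the squared difference of the centers plus twice the squared difference of the spreads, so both parts of the datum influence the fit.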

Principal Component Analysis of symmetric fuzzy data (PCAF)

In this paper we propose a principal component model for fuzzy data by extending (3) to deal with matrices instead of vectors. If each observation unit is represented by a score on a single fuzzy variable, the information can be represented as a segment in $\mathbb{R}^1$. It follows that, if one wants to compare two units, it is sufficient to compare the center and the two vertices as in (3). Instead, if two different variables are associated with each unit, a generic unit is represented as a rectangle in R

Estimation procedure: an alternating least squares approach

In this section we propose an alternating least squares algorithm to solve the minimization problem of PCAF in (18). More precisely, we propose two different algorithms to estimate the parameters: one for CS-PCAF and one for LS-PCAF. We note that the following procedures do not prevent the estimated spreads from becoming negative, even though this turns out to happen rarely. Indeed, we are working on a modified version of the algorithm that does guarantee non-negative spread
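The general alternating least squares idea can be sketched for a model in which the centers and the spreads have separate component scores but share one loadings matrix, as described for LS-PCAF. This is a hedged illustration, not the algorithm of this paper: the unweighted loss used below is an assumption and is not the function in (18), the names `als_shared_loadings`, `A`, `B` and `F` are hypothetical, and the data are synthetic.

```python
import numpy as np


def als_shared_loadings(M, S, P, n_iter=500, tol=1e-12, seed=0):
    """Alternating least squares for M ~ A F' and S ~ B F' with shared F.

    Assumed loss (not the paper's (18)):
        ||M - A F'||^2 + ||S - B F'||^2
    """
    M = np.asarray(M, dtype=float)          # I x J matrix of centers
    S = np.asarray(S, dtype=float)          # I x J matrix of spreads
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(M.shape[1], P))    # random start for the J x P loadings
    X = np.vstack([M, S])                   # stacked data, 2I x J
    losses = []
    for _ in range(n_iter):
        # Step 1: with F fixed, each scores matrix is an ordinary regression.
        F_reg = np.linalg.pinv(F).T         # equals F (F'F)^{-1} for full-rank F
        A = M @ F_reg
        B = S @ F_reg
        # Step 2: with A and B fixed, F solves one stacked regression.
        Z = np.vstack([A, B])               # stacked scores, 2I x P
        F = np.linalg.lstsq(Z, X, rcond=None)[0].T
        loss = np.sum((M - A @ F.T) ** 2) + np.sum((S - B @ F.T) ** 2)
        losses.append(loss)
        # Stop when the loss no longer decreases appreciably.
        if len(losses) > 1 and losses[-2] - losses[-1] < tol:
            break
    return A, B, F, losses


# Synthetic example (made-up data, for illustration only).
rng = np.random.default_rng(1)
F_true = rng.uniform(0.5, 1.5, size=(5, 2))
M = rng.normal(size=(20, 2)) @ F_true.T + 0.1 * rng.normal(size=(20, 5))
S = rng.uniform(0.5, 1.5, size=(20, 2)) @ F_true.T + 0.01 * rng.normal(size=(20, 5))
A, B, F, losses = als_shared_loadings(M, S, P=2)
# Nothing in the updates constrains the fitted spreads to be non-negative.
n_negative = int(np.sum(B @ F.T < 0))
```

Each of the two steps solves a least squares problem exactly with the other block held fixed, so the recorded loss cannot increase from one iteration to the next. Counting negative entries of the fitted spreads, as in `n_negative`, is one simple way to check whether the unconstrained updates have produced inadmissible values.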

Plotting procedure to display the observation units

In PCAF it is useful to plot each unit, represented in $\mathbb{R}^J$, as a hypercube in the low dimensional space $\mathbb{R}^P$. In particular, if P=2, each unit can be represented as a rectangle. One of the aims of PCAF is to offer a simpler graphical description of each unit. In this section we suggest two procedures to plot the observation units on the subspace spanned by the columns of F.

In the algorithm in the previous section we did not make any assumption about orthogonality

Simulation study

In this section we give the results of a simulation study carried out to assess how PCAF works. In particular the simulation study aims to answer three questions:

  1. Does PCAF recover the underlying structure in the data better than the classical PCA applied to the centers matrix?

  2. Are the algorithms (for CS-PCAF and LS-PCAF) efficient?

  3. Do the algorithms hit local optima frequently?

To answer the above questions, we have randomly generated fuzzy data sets with a known underlying factor structure and

Application to a real data set

In this section we apply PCAF to a real data set introduced by Ichino (1988). The data set is known as the ‘Fats and Oils data’ and is reproduced in Table 2.

The data set refers to eight oils (I=8) described by four quantitative interval valued variables (J=4). Thus, each oil is described by the ‘Specific Gravity’, the ‘Freezing Point’, the ‘Iodine Value’ and the ‘Saponification’. In fact, there is also a qualitative variable but we only refer to the quantitative ones.

We have assumed that

Conclusion

In this paper we have proposed two PCA procedures to detect the underlying structure of fuzzy data on I observation units and J fuzzy variables. In the first (CS-PCAF), we assume that the estimated spreads are linearly related to the estimated centers matrix. In the second (LS-PCAF), we assume that the spreads are decomposed into a component scores matrix (different from the one of the centers) and the same component loadings matrix as the centers. In this way, LS-PCAF searches a compromise structure

Acknowledgements

The authors are grateful to the Co-Editor and the Referees for their suggestions and comments.

