Principal Component Analysis of symmetric fuzzy data
Introduction
There are many practical situations in which numerical observations of I units with respect to J variables cannot be represented by precisely specified values. Examples can be found in economics (daily exchange rates), in marketing research (assessing public opinion about special events such as government elections or the launch of new consumer goods), in psychology (mental states or subjective perceptions), and in physics (in ballistics, the velocity a projectile needs to perforate a plate). Another significant situation arises when analyzing subjective perceptions or linguistic variables. In each of these examples an exact numerical coding is very hard to give: thus, the available information can only be represented approximately.
In the above situations the information is vague or fuzzy and can be summarized using interval valued data instead of numerical data. Thus, the score of the generic observation i on the generic variable j is represented by a pair of values: a lower bound and an upper bound which enclose the exact observation. In contrast to the interval valued data framework, the fuzzy data approach makes it possible to take into account the additional information given by the membership function. This function assigns a different “role” to each value of an interval, whereas in the interval valued data approach each value has uniform importance. This distinction will become clearer once a fuzzy number has been defined, as follows.
A fuzzy number is defined as the triple F=(m,l,r)_{LR}, where m denotes the center and l and r are the left and right spread, respectively, with the following membership function:
$$\mu_F(x)=\begin{cases} L\left(\dfrac{m-x}{l}\right), & x\le m \quad (l>0),\\[4pt] R\left(\dfrac{x-m}{r}\right), & x\ge m \quad (r>0),\end{cases}$$
where L and R are continuous strictly decreasing functions on [0,1] called shape functions. Moreover, these functions must fulfil additional requirements: for example, with reference to L, L(z)<1 if z>0, L(z)>0 if z<1 and L(1)=0. Further details can be found in Dubois and Prade (1980).
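As an illustration of the definition above, the following minimal sketch evaluates the membership degree of a value x in an LR fuzzy number. It assumes the standard Dubois and Prade form, with argument (m-x)/l to the left of the center and (x-m)/r to the right, and it assumes linear (triangular) shape functions L(z)=R(z)=1-z by default. The function name `lr_membership` is hypothetical and not taken from the paper.

```python
def lr_membership(x, m, l, r, L=None, R=None):
    """Membership degree of x in the LR fuzzy number (m, l, r).

    Assumption: L and R default to the linear shape function 1 - z,
    which yields a triangular fuzzy number.
    """
    linear = lambda z: 1.0 - z
    L = L or linear
    R = R or linear
    if l <= 0 or r <= 0:
        raise ValueError("spreads must be positive")
    on_left = x <= m
    # Normalised distance from the center, measured in units of the spread.
    z = (m - x) / l if on_left else (x - m) / r
    # Outside the support [m - l, m + r] the membership is zero.
    if z >= 1.0:
        return 0.0
    return L(z) if on_left else R(z)


# Symmetric case: l = r, so values equally far from m get equal membership.
print(lr_membership(4.0, 5.0, 2.0, 2.0), lr_membership(6.0, 5.0, 2.0, 2.0))
```

Any other strictly decreasing shape function on [0,1] can be passed through the `L` and `R` arguments.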
Hereafter, for the sake of simplicity, we consider ‘symmetric’ fuzzy numbers only. For a symmetric fuzzy number l=r and L=R. It follows that a symmetric fuzzy number is completely identified by the pair F=(m,s)_{L=R}, where s=l=r is the spread. Thus, a fuzzy data set is a set of fuzzy numbers which represent the scores of I observation units on J symmetric fuzzy variables. By a symmetric fuzzy variable we mean a variable for which each observed datum cannot be quantified exactly but only by means of the pair F=(m,s)_{L=R}.
In this paper we generalize classical Principal Component Analysis (PCA) to deal with fuzzy data sets using a least squares approach. A few proposals for this are available in the literature: Yabuuchi et al. (1997) propose a method to perform PCA on fuzzy data, in which fuzzy eigenvalues and crisp (non-fuzzy) eigenvectors are obtained by solving a linear programming problem. Cazes et al. (1997) and extensions of their approach (see, for example, Palumbo and Lauro, 2002) propose to carry out factorial decompositions of interval valued data as well as a probabilistic generalization thereof. The basic idea consists of performing PCA on the bounds or on the centers of the interval valued data. In their probabilistic generalization, they make use of a coefficient the value of which depends on the probability law at hand that modifies the information taken into account in the factorial decomposition.
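The idea of performing PCA on the centers of interval valued data can be sketched as follows. This is a generic illustration using a singular value decomposition of the column-centered midpoints, not the specific procedure of Cazes et al. (1997); the function name `pca_on_centers` and the randomly generated intervals are assumptions made for the example.

```python
import numpy as np


def pca_on_centers(lower, upper, n_components):
    """Classical PCA applied to the midpoints of interval valued data."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    centers = (lower + upper) / 2.0
    # Column-center the midpoints, as in ordinary PCA.
    centered = centers - centers.mean(axis=0)
    U, d, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * d[:n_components]   # I x P component scores
    loadings = Vt[:n_components].T                    # J x P component loadings
    return scores, loadings


# Hypothetical interval data: 6 units, 3 variables.
rng = np.random.default_rng(0)
lower = rng.normal(size=(6, 3))
upper = lower + rng.uniform(0.1, 1.0, size=(6, 3))
scores, loadings = pca_on_centers(lower, upper, n_components=2)
print(scores.shape, loadings.shape)
```

Note that this approach discards the interval widths entirely, which is the information a fuzzy extension of PCA aims to retain.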
The need for generalizing PCA to fuzzy data is based on the assumption that, as for single valued data, it can be helpful to synthesize the data without losing relevant information. In the next section, we present a general least squares approach to fuzzy data analysis. In Section 3, we propose two different models that extend PCA to fuzzy data (PCAF), and in Section 4 we give an alternating least squares algorithm to estimate the solutions. In Section 5, we discuss the representation of each observation unit in the low dimensional subspace obtained by PCAF. In Sections 6 and 7, we give the results of a simulation study comparing PCAF with classical PCA and an application of PCAF to real fuzzy data.
Least squares approach to fuzzy data analysis
The need to study and understand fuzzy data has led to a growing interest in fuzzy data analysis. Many authors have dealt, and continue to deal, with fuzzy regression analysis. In the least squares sense, the analysis involves minimizing a distance function between two sets of values: the empirical data and the values estimated according to the specific model involved. In this framework we refer to Diamond (1988) who has developed a fuzzy least squares method
Principal Component Analysis of symmetric fuzzy data (PCAF)
In this paper we propose a principal component model for fuzzy data by extending (3) to deal with matrices instead of vectors. If each observation unit is represented by a score on a single fuzzy variable, the information can be represented as a segment in ℝ. It follows that, if one wants to compare two units, it is sufficient to compare the center and the two vertices as in (3). Instead, if two different variables are associated with each unit, a generic unit is represented as a rectangle in
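The geometric idea of comparing units through their centers and vertices can be illustrated with a short sketch. Equation (3) is not reproduced in this excerpt, so the unweighted sum of squared differences used below is an assumption, as are the function names `vertices` and `squared_distance`. With J variables a unit with center m and spread s has 2^J vertices, obtained by adding or subtracting each spread.

```python
from itertools import product

import numpy as np


def vertices(center, spread):
    """All 2**J vertices of the hyperrectangle center +/- spread."""
    center = np.asarray(center, dtype=float)
    spread = np.asarray(spread, dtype=float)
    # Each row of `signs` is one combination of -1/+1 over the J variables.
    signs = np.array(list(product([-1.0, 1.0], repeat=center.size)))
    return center + signs * spread


def squared_distance(c1, s1, c2, s2):
    """Assumed dissimilarity: squared differences of centers plus vertices."""
    d_centers = np.sum((np.asarray(c1, float) - np.asarray(c2, float)) ** 2)
    d_vertices = np.sum((vertices(c1, s1) - vertices(c2, s2)) ** 2)
    return d_centers + d_vertices


# One variable: a segment with two vertices (lower and upper end).
print(vertices([1.0], [1.0]))
# Two variables: a rectangle with four vertices.
print(vertices([0.0, 0.0], [1.0, 2.0]))
```

For a single variable the distance reduces to comparing the center and the two end points of the segment, which matches the verbal description in the text.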
Estimation procedure: an alternating least squares approach
In this section we propose an alternating least squares algorithm to solve the minimization problem of PCAF in (18). More precisely, we propose two different algorithms to estimate the parameters: one for CS-PCAF and one for LS-PCAF. We note that the following procedures do not prevent the estimated spreads from becoming negative, even though this turns out to happen rarely. Indeed, we are working on a modified version of the algorithm that does guarantee non-negative spread
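The general pattern of alternating least squares can be sketched on a deliberately simplified problem. The loss in (18) is not reproduced in this excerpt, so the sketch below is not the authors' algorithm: it assumes a plain sum of squared residuals for a centers matrix M and a spreads matrix S, each with its own component scores (A and B) and a shared loadings matrix F. All names and the random data are hypothetical.

```python
import numpy as np


def als_fuzzy_pca(M, S, P, n_iter=200, tol=1e-10, seed=0):
    """Simplified ALS: minimise ||M - A F'||^2 + ||S - B F'||^2 (assumed loss)."""
    M = np.asarray(M, dtype=float)
    S = np.asarray(S, dtype=float)
    I, J = M.shape
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(J, P))          # random start for the loadings
    X = np.vstack([M, S])                # stacked data, 2I x J
    history = []
    for _ in range(n_iter):
        # Step 1: update the scores with the loadings held fixed.
        Z = np.linalg.lstsq(F, X.T, rcond=None)[0].T   # 2I x P
        # Step 2: update the loadings with the scores held fixed.
        F = np.linalg.lstsq(Z, X, rcond=None)[0].T     # J x P
        loss = np.sum((X - Z @ F.T) ** 2)
        history.append(loss)
        # Stop when the decrease in the loss becomes negligible.
        if len(history) > 1 and history[-2] - loss < tol:
            break
    A, B = Z[:I], Z[I:]
    return A, B, F, history


# Hypothetical data: 10 units, 4 variables, positive spreads.
rng = np.random.default_rng(1)
M = rng.normal(size=(10, 4))
S = rng.uniform(0.1, 1.0, size=(10, 4))
A, B, F, history = als_fuzzy_pca(M, S, P=2)
# Nothing constrains the fitted spreads B F' to be non-negative.
print(history[-1], int(np.sum(B @ F.T < 0)))
```

Each step solves an ordinary least squares problem, so the loss cannot increase from one iteration to the next. The final line counts negative fitted spreads, which illustrates why an unconstrained procedure of this kind can produce them.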
Plotting procedure to display the observation units
In PCAF it is useful to plot each unit, originally represented in the J-dimensional space of the variables, as a hypercube in the low dimensional space of the P components. In particular, if P=2, we can represent each unit as a rectangle. One of the aims of PCAF is to offer a simpler graphical description of each unit. In this section we suggest two procedures to plot the observation units in the low dimensional subspace obtained by PCAF.
In the algorithm in the previous section we did not make any assumption about orthogonality
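For P=2, drawing a unit as a rectangle only requires its four corners. The sketch below assumes that a unit is described in the low dimensional space by two center coordinates and two non-negative half-widths along the component axes; whether this matches either of the two plotting procedures proposed in the paper is not established by this excerpt. The function name `rectangle_corners` is hypothetical.

```python
import numpy as np


def rectangle_corners(center_scores, spread_scores):
    """Corners of an axis-aligned rectangle, listed counter-clockwise."""
    c = np.asarray(center_scores, dtype=float)
    s = np.asarray(spread_scores, dtype=float)
    if c.shape != (2,) or s.shape != (2,):
        raise ValueError("expected two center and two spread coordinates")
    if np.any(s < 0):
        raise ValueError("half-widths must be non-negative")
    lo, hi = c - s, c + s
    return np.array([
        [lo[0], lo[1]],   # bottom-left
        [hi[0], lo[1]],   # bottom-right
        [hi[0], hi[1]],   # top-right
        [lo[0], hi[1]],   # top-left
    ])


print(rectangle_corners([1.0, 2.0], [0.5, 1.0]))
```

The returned array can be passed to any plotting library that draws polygons from a list of vertices.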
Simulation study
In this section we give the results of a simulation study carried out to assess how PCAF works. In particular the simulation study aims to answer three questions:
1. Does PCAF recover the underlying structure in the data better than the classical PCA applied to the centers matrix?
2. Are the algorithms (for CS-PCAF and LS-PCAF) efficient?
3. Do the algorithms hit local optima frequently?
Application to a real data set
In this section we apply PCAF to a real data set introduced by Ichino (1988). The data set, known as the ‘Fats and Oils data’, is reproduced in Table 2.
The data set refers to eight oils (I=8) described by four quantitative interval valued variables (J=4). Thus, each oil is described by the ‘Specific Gravity’, the ‘Freezing Point’, the ‘Iodine Value’ and the ‘Saponification’. In fact, there is also a qualitative variable but we only refer to the quantitative ones.
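To apply a method for symmetric fuzzy data to interval valued variables, each interval has to be expressed as a center and a spread. The excerpt does not state which assumption the authors make, so the sketch below shows one natural choice only: the center is the midpoint of the interval and the spread is its half-width. The function name `intervals_to_symmetric_fuzzy` is hypothetical, and the numbers are placeholders, not the Fats and Oils values.

```python
import numpy as np


def intervals_to_symmetric_fuzzy(lower, upper):
    """Assumed coding: center = midpoint, spread = half-width of each interval."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(upper < lower):
        raise ValueError("each upper bound must be >= its lower bound")
    centers = (lower + upper) / 2.0
    spreads = (upper - lower) / 2.0
    return centers, spreads


# Placeholder intervals for 2 units and 2 variables (not real data).
lower = np.array([[0.0, 10.0], [1.0, 20.0]])
upper = np.array([[2.0, 14.0], [1.5, 30.0]])
M, S = intervals_to_symmetric_fuzzy(lower, upper)
print(M)
print(S)
```

Under this coding the interval bounds are recovered as M - S and M + S.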
We have assumed that
Conclusion
In this paper we have proposed two PCA procedures to detect the underlying structure of fuzzy data on I observation units and J fuzzy variables. In the first (CS-PCAF), we assume that the estimated spreads are linearly related to the estimated centers matrix. In the second (LS-PCAF), we assume that the spreads are decomposed into a component scores matrix (different from the one of the centers) and the same component loadings matrix as the centers. In this way, LS-PCAF searches for a compromise structure
Acknowledgements
The authors are grateful to the Co-Editor and the Referees for their suggestions and comments.
References
Diamond (1988). Fuzzy least squares. Inform. Sci.
et al. (2000). A least-squares approach to fuzzy linear regression analysis. Comput. Statist. Data Anal.
et al. (1996). On a class of c-numbers clustering procedures for fuzzy data. Fuzzy Sets and Systems.
et al. (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data.
Cazes et al. (1997). Extension de l'analyse en composantes principales à des données de type intervalle. Rev. Statist. Appl.
et al. Regression analysis with fuzzy informational paradigm: an adaptive fuzzy regression model.
Dubois and Prade (1980). Fuzzy Sets and Systems: Theory and Applications.