Principal Component Analysis of symmetric fuzzy data

doi:10.1016/S0167-9473(02)00352-3

Computational Statistics & Data Analysis

Volume 45, Issue 3, 10 April 2004, Pages 519-548

Abstract

Principal Component Analysis (PCA) is a well-known tool often used for the exploratory analysis of a numerical data set. Here an extension of classical PCA is proposed, which deals with fuzzy data (in short PCAF), where the elementary datum cannot be recognized exactly by a specific number but by a center, two spread measures and a membership function. Specifically, two different PCAF methods, associated with different hypotheses of interrelation between parts of the solution, are proposed. In the first method, called Centers-related Spread PCAF (CS-PCAF), the size of the spread measures depends on the size of the centers. In the second method, called Loadings-related Spread PCAF (LS-PCAF), the spreads are not related directly to the sizes of the centers, but indirectly, via the component loadings. To analyze how well PCAF works a simulation study was carried out. On the whole, the PCAF method performed better than or equally well as PCA, except in a few particular conditions. Finally, the application of PCAF to an empirical fuzzy data set is described.

Introduction

There are many practical situations in which numerical observations of I units with respect to J variables cannot be represented by precisely specified values. Examples can be found in economics (daily exchange rates), in marketing research (assessing public opinion about special events such as government elections or the launch of new consumer goods), in psychology (mental states or subjective perceptions), and in physics (in ballistics, the velocity a projectile needs to perforate a plate). Another significant situation arises in the analysis of subjective perceptions or linguistic variables. In each of the above examples an exact numerical coding is very hard to give: thus, the available information can only be represented approximately.

In the above situations the information is vague or fuzzy and can be summarized using interval valued data instead of numerical data. Thus, the score of the generic observation i on the generic variable j is represented by a pair of values: a lower bound and an upper bound which enclose the exact observation. In contrast to the interval valued data framework, the fuzzy data approach makes it possible to take into account the additional information given by the membership function. This function assigns a different “role” to each value of an interval, whereas in the interval valued data approach every value has the same importance. This distinction will become clearer once a fuzzy number has been defined, as follows.

A fuzzy number is defined as the triple $F=(m,l,r)_{LR}$, where m denotes the center and l and r are the left and right spreads, respectively, with the following membership function $\mu_{F}(\alpha)=\begin{cases} L\left(\frac{m-\alpha}{l}\right), & \alpha \leqslant m \ (l>0),\\ R\left(\frac{\alpha-m}{r}\right), & \alpha \geqslant m \ (r>0),\end{cases}$ where L and R are continuous strictly decreasing functions on [0,1] called shape functions. Moreover, these functions must fulfil additional requirements: for example, with reference to L, $L(0)=1$, $L(z)<1$ if $z>0$, $L(z)>0$ if $z<1$ and $L(1)=0$. Further details can be found in Dubois and Prade (1980).
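To make the definition concrete, the membership function of an LR fuzzy number can be evaluated as in the following minimal sketch. The linear shape function `linear_shape` (which yields a triangular fuzzy number) and the function names are illustrative assumptions; any shape function that meets the requirements stated above could be used instead.

```python
from typing import Callable


def linear_shape(z: float) -> float:
    """Illustrative shape function (an assumption): equals 1 at z=0,
    decreases strictly on [0, 1], equals 0 at z=1 and beyond."""
    return max(0.0, 1.0 - z)


def membership(
    alpha: float,
    m: float,
    l: float,
    r: float,
    L: Callable[[float], float] = linear_shape,
    R: Callable[[float], float] = linear_shape,
) -> float:
    """Membership degree of alpha in the LR fuzzy number (m, l, r)."""
    if l <= 0 or r <= 0:
        raise ValueError("both spreads must be strictly positive")
    if alpha <= m:
        # Left branch: distance from the center, scaled by the left spread.
        return L((m - alpha) / l)
    # Right branch: distance from the center, scaled by the right spread.
    return R((alpha - m) / r)
```

With center 5, left spread 2 and right spread 1, the membership is 1 at the center, 0.5 halfway along either spread, and 0 at or beyond the bounds 3 and 6.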

Hereafter, for the sake of simplicity, we consider ‘symmetric’ fuzzy numbers only. For a symmetric fuzzy number l=r and L=R. It follows that a symmetric fuzzy number can be completely identified by the couple F=(m,s)_L=R where s=l=r is the spread. Thus, a fuzzy data set is a set of fuzzy numbers, which represent the scores of I observation units on J symmetric fuzzy variables. By a symmetric fuzzy variable we mean a variable for which each observed datum cannot be quantified exactly but only by using the couple F=(m,s)_L=R.
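As an illustration, a symmetric fuzzy data set of I units on J variables can be stored as two I×J arrays, one for the centers and one for the spreads. The sketch below builds such a pair from interval bounds, taking the midpoint as center and the half-width as spread; this conversion, the function name and the array names are assumptions made for the example, not a rule stated in this paper.

```python
import numpy as np


def intervals_to_symmetric_fuzzy(lower, upper):
    """Return (centers, spreads) for interval data given as two I x J arrays.

    Assumption for this sketch: center = midpoint, spread = half-width.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if lower.shape != upper.shape:
        raise ValueError("lower and upper must have the same shape")
    if np.any(upper < lower):
        raise ValueError("each upper bound must be >= its lower bound")
    centers = (lower + upper) / 2.0
    spreads = (upper - lower) / 2.0
    return centers, spreads
```

For example, the interval [10, 14] becomes the couple with center 12 and spread 2.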

In this paper we generalize classical Principal Component Analysis (PCA) to deal with fuzzy data sets using a least squares approach. A few proposals for this are available in the literature: Yabuuchi et al. (1997) propose a method to perform PCA on fuzzy data, in which fuzzy eigenvalues and crisp (non-fuzzy) eigenvectors are obtained by solving a linear programming problem. Cazes et al. (1997) and extensions of their approach (see, for example, Palumbo and Lauro, 2002) propose to carry out factorial decompositions of interval valued data as well as a probabilistic generalization thereof. The basic idea consists of performing PCA on the bounds or on the centers of the interval valued data. In their probabilistic generalization, they make use of a coefficient the value of which depends on the probability law at hand that modifies the information taken into account in the factorial decomposition.
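The idea of performing PCA on the centers only can be sketched as below. This is ordinary PCA computed through a singular value decomposition of the column-centered centers matrix; it is a generic illustration and not a reproduction of the procedure of Cazes et al. (1997). The function name and the choice of centering without scaling are assumptions.

```python
import numpy as np


def pca_on_centers(centers, n_components):
    """Classical PCA of an I x J centers matrix via SVD.

    Returns the component scores (I x P) and the loadings (J x P).
    """
    M = np.asarray(centers, dtype=float)
    # Remove column means so that components describe variation around the mean.
    Mc = M - M.mean(axis=0)
    U, singular_values, Vt = np.linalg.svd(Mc, full_matrices=False)
    scores = U[:, :n_components] * singular_values[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings
```

Because the spreads are ignored here, any information about the imprecision of the data is lost, which is the limitation that motivates a fuzzy extension.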

The need for generalizing PCA to fuzzy data is based on the assumption that, just as for single valued data, it can be helpful to synthesize the data without losing relevant information. In the next section, we present a general least squares approach to fuzzy data analysis. In Section 3, we propose two different models that extend PCA to fuzzy data, and in Section 4 we give an alternating least squares algorithm to estimate the solutions. In Section 5, we discuss the representation of each observation unit in the low dimensional subspace obtained by PCAF. In Sections 6 and 7, we give the results of a simulation study comparing PCAF with classical PCA and an application of PCAF to real fuzzy data.

Section snippets

Least squares approach to fuzzy data analysis

The need to study and understand fuzzy data has led to a growing interest in fuzzy data analysis. Many authors have dealt, and still deal, with fuzzy regression analysis. In the least squares sense, the analysis involves minimizing a distance function between two sets of values: the empirical data set and the values estimated according to the specific model involved. In this framework we refer to Diamond (1988) who has developed a fuzzy least squares method

Principal Component Analysis of symmetric fuzzy data (PCAF)

In this paper we propose a principal component model for fuzzy data by extending (3) to deal with matrices instead of vectors. If each observation unit is represented by a score on a single fuzzy variable, the information can be represented as a segment in $R^{1}$ . It follows that, if one wants to compare two units, it is sufficient to compare the center and the two vertices as in (3). Instead, if two different variables are associated with each unit, a generic unit is represented as a rectangle in $R$
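The comparison of units through their centers and vertices can be sketched as follows. Equation (3) is not reproduced in this excerpt, so the code assumes a Diamond-type squared distance for symmetric fuzzy data: the sum of the squared differences between the centers, between the lower vertices (center minus spread) and between the upper vertices (center plus spread). The function name and argument names are illustrative.

```python
import numpy as np


def fuzzy_squared_distance(m1, s1, m2, s2):
    """Squared distance between two symmetric fuzzy data sets.

    Inputs are centers (m) and spreads (s); scalars or arrays of equal shape.
    Assumed form: centers, lower vertices and upper vertices are compared.
    """
    m1, s1, m2, s2 = (np.asarray(x, dtype=float) for x in (m1, s1, m2, s2))
    center_diff = m1 - m2
    lower_diff = (m1 - s1) - (m2 - s2)
    upper_diff = (m1 + s1) - (m2 + s2)
    return float(np.sum(center_diff**2 + lower_diff**2 + upper_diff**2))
```

Under this assumption the distance reduces to three times the squared difference of the centers plus twice the squared difference of the spreads, so both parts of the datum contribute to the fit.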

Estimation procedure: an alternating least squares approach

In this section we propose an alternating least squares approach to solve the minimization problem of PCAF in (18). Specifically, we propose two different algorithms to estimate the parameters: one for CS-PCAF and one for LS-PCAF. We note that the following procedures do not prevent the estimated spreads from becoming negative, even though this turns out to happen rarely. Indeed, we are working on a modified version of the algorithm that does guarantee non-negative spread
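The alternating pattern itself can be illustrated on a much simpler problem. The sketch below is not the CS-PCAF or LS-PCAF algorithm, whose loss function (18) is not reproduced in this excerpt; it only fits a plain bilinear model, approximating a crisp I×J matrix by the product of a scores matrix A and the transpose of a loadings matrix F, by updating one matrix while the other is held fixed. All names and default values are assumptions.

```python
import numpy as np


def als_low_rank(M, n_components, n_iter=200, tol=1e-10, seed=0):
    """Alternating least squares for M ~ A @ F.T (A: I x P, F: J x P)."""
    M = np.asarray(M, dtype=float)
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((M.shape[1], n_components))  # random start
    prev_loss = np.inf
    for _ in range(n_iter):
        # Update A with F fixed: least squares solution of F @ A.T = M.T.
        A = np.linalg.lstsq(F, M.T, rcond=None)[0].T
        # Update F with A fixed: least squares solution of A @ F.T = M.
        F = np.linalg.lstsq(A, M, rcond=None)[0].T
        loss = float(np.sum((M - A @ F.T) ** 2))
        # Each update cannot increase the loss, so stop when it stalls.
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return A, F, loss
```

Each update is an ordinary least squares problem with a closed-form solution, so the loss is non-increasing over the iterations. As with any such scheme, the result may depend on the starting values, which is why running it from several random starts is a common precaution.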

Plotting procedure to display the observation units

In PCAF it is useful to plot each unit, represented in $R^{J}$ as a hypercube, in the low dimensional space $R^{P}$. In particular, if P=2, each unit can be represented as a rectangle. One of the aims of PCAF is to offer a simpler graphical description of each unit. In this section we suggest two procedures to plot the observation units on the subspace spanned by the columns of $F$.

In the algorithm in the previous section we did not make any assumption about orthogonality

Simulation study

In this section we give the results of a simulation study carried out to assess how PCAF works. In particular the simulation study aims to answer three questions:

1. Does PCAF recover the underlying structure in the data better than the classical PCA applied to the centers matrix?
2. Are the algorithms (for CS-PCAF and LS-PCAF) efficient?
3. Do the algorithms hit local optima frequently?

To answer the above questions, we have randomly generated fuzzy data sets with a known underlying factor structure and

Application to a real data set

In this section we apply PCAF to a real data set, introduced by Ichino (1988). The involved data set is known as ‘Fats and Oils data’ and is reproduced in Table 2.

The data set refers to eight oils (I=8) described by four quantitative interval valued variables (J=4). Thus, each oil is described by the ‘Specific Gravity’, the ‘Freezing Point’, the ‘Iodine Value’ and the ‘Saponification’. In fact, there is also a qualitative variable but we only refer to the quantitative ones.

We have assumed that

Conclusion

In this paper we have proposed two PCA procedures to detect the underlying structure of fuzzy data of I observation units and J fuzzy variables. In the first (CS-PCAF), we assume that the estimated spreads are linearly related to the estimated centers matrix. In the second (LS-PCAF), we assume that the spreads are decomposed into a component scores matrix (different from the one of the centers) and the same component loadings matrix as the centers. In this way, LS-PCAF searches a compromise structure

Acknowledgements

The authors are grateful to the Co-Editor and the Referees for their suggestions and comments.

References (15)

P. Diamond. Fuzzy least squares. Inform. Sci. (1988)
P. D'Urso et al. A least-squares approach to fuzzy linear regression analysis. Comput. Statist. Data Anal. (2000)
M.S. Yang et al. On a class of c-numbers clustering procedures for fuzzy data. Fuzzy Sets and Systems (1996)
H.H. Bock et al. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (2000)
P. Cazes et al. Extension de l'analyse en composantes principales à des données de type intervalle. Rev. Statist. Appl. (1997)
R. Coppi et al. Regression analysis with fuzzy informational paradigm: an adaptive fuzzy regression model
D. Dubois et al. Fuzzy Sets and Systems: Theory and Applications (1980)

There are more references available in the full text version of this article.
