Abstract
In the symbolic data framework, probabilistic symbolic data are data whose components are random variables with general probability distributions. Intervals (uniform distributions), histograms (empirical distributions), and Gaussian and chi-squared distributions are all special cases. Existing approaches share a common shortcoming: they cannot obtain the distributions of linear combinations of random variables (i.e., principal components), especially when the variables are not identically distributed. This paper overcomes this shortcoming by deriving an exact probability density function for each principal component using the inversion theorem. Further, the paper defines a covariance matrix for probabilistic symbolic data and presents a new principal component analysis based on this variance–covariance structure. The effectiveness of the proposed method is illustrated by a simulated numerical experiment and two real-life cases: clustering of oils and fats data, and evaluation of journals indexed in the Science Citation Index.
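The inversion idea can be illustrated numerically. The sketch below is not the paper's algorithm: it assumes the components are independent, picks a uniform and a Gaussian component with hypothetical weights `w1` and `w2`, and recovers the density of their linear combination by numerically inverting the product of the characteristic functions. All function names are illustrative. The closed-form density for this particular pair is used only as a check.

```python
import math
import numpy as np

def cf_uniform(t, low, high):
    """Characteristic function of Uniform(low, high); the t = 0 limit is 1."""
    t = np.asarray(t, dtype=float)
    out = np.ones(t.shape, dtype=complex)
    nz = t != 0.0
    out[nz] = (np.exp(1j * t[nz] * high) - np.exp(1j * t[nz] * low)) / (
        1j * t[nz] * (high - low)
    )
    return out

def cf_normal(t, mu, sigma):
    """Characteristic function of Normal(mu, sigma^2)."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * mu * t - 0.5 * (sigma * t) ** 2)

def pdf_by_inversion(y, cf, t_max=40.0, n=20001):
    """Density via f(y) = (1/pi) * integral_0^inf Re[exp(-i t y) cf(t)] dt.

    The integral is truncated at t_max (an assumption that cf has decayed
    by then) and evaluated with the trapezoidal rule.
    """
    y = np.asarray(y, dtype=float)
    t = np.linspace(0.0, t_max, n)
    dt = t[1] - t[0]
    integrand = np.real(np.exp(-1j * np.outer(y, t)) * cf(t))
    integral = dt * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return integral / np.pi

# Hypothetical weights of one linear combination: Y = w1 * U + w2 * Z,
# with U ~ Uniform(0, 1) and Z ~ Normal(0, 1) assumed independent.
w1, w2 = 0.6, 0.8

def cf_component(t):
    # Independence makes the characteristic function of Y a product.
    return cf_uniform(w1 * t, 0.0, 1.0) * cf_normal(w2 * t, 0.0, 1.0)

y_grid = np.linspace(-5.0, 6.0, 111)
f_num = pdf_by_inversion(y_grid, cf_component)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Closed form for this special case: convolution of Uniform(0, w1) with Normal(0, w2^2).
f_exact = np.array(
    [(std_normal_cdf(v / w2) - std_normal_cdf((v - w1) / w2)) / w1 for v in y_grid]
)
print(float(np.max(np.abs(f_num - f_exact))))
```

The truncation point `t_max` works here because the Gaussian factor makes the characteristic function decay quickly; a combination of only uniform components decays much more slowly and would need a different truncation or integration scheme.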

Acknowledgments
The authors are grateful to the Editor and the anonymous reviewers for their insightful comments, which have helped to improve the quality of this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 71031001, 70771004, 71371019) and by the Program for New Century Excellent Talents in University of the Ministry of Education of China (Grant No. NCET-12-0026).
Cite this article
Chen, M., Wang, H. & Qin, Z. Principal component analysis for probabilistic symbolic data: a more generic and accurate algorithm. Adv Data Anal Classif 9, 59–79 (2015). https://doi.org/10.1007/s11634-014-0178-2
DOI: https://doi.org/10.1007/s11634-014-0178-2