Principal component analysis for probabilistic symbolic data: a more generic and accurate algorithm

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

In the symbolic data framework, probabilistic symbolic data are data whose components are random variables with general probability distributions. Intervals (uniform distributions), histograms (empirical distributions), Gaussian distributions and chi-squared distributions are all special cases. Existing approaches share a common shortcoming: they cannot obtain the distributions of linear combinations of random variables (i.e., the principal components), especially when the variables are not identically distributed. This paper overcomes that shortcoming by using the inversion theorem to provide an exact probability density function for each principal component. Further, the paper defines a covariance matrix for probabilistic symbolic data and presents a new principal component analysis based on this variance–covariance structure. The effectiveness of the proposed method is illustrated by a simulated numerical experiment and by two real-life cases: clustering of oils and fats data, and evaluation of journals indexed in the Science Citation Index.
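
To make the role of the inversion theorem concrete, the following is a minimal numerical sketch, not the authors' algorithm, whose details are not given in this abstract. It assumes the component variables are independent, so that the characteristic function of a linear combination is the product of the rescaled component characteristic functions, and it recovers the density by numerically evaluating the inversion integral. The function names (`cf_uniform`, `cf_normal`, `density_of_linear_combination`), the weights, and the truncation parameters `t_max` and `n` are illustrative assumptions.

```python
import numpy as np


def cf_uniform(t, low, high):
    """Characteristic function of Uniform(low, high), evaluated at t."""
    t = np.asarray(t, dtype=float)
    out = np.ones_like(t, dtype=complex)  # the value at t = 0 is 1
    nz = t != 0
    out[nz] = (np.exp(1j * t[nz] * high) - np.exp(1j * t[nz] * low)) / (
        1j * t[nz] * (high - low)
    )
    return out


def cf_normal(t, mean, std):
    """Characteristic function of Normal(mean, std**2), evaluated at t."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * mean * t - 0.5 * (std * t) ** 2)


def density_of_linear_combination(y, weights, cfs, t_max=40.0, n=4001):
    """Density of Y = sum_j weights[j] * X_j at the points y.

    ASSUMPTION: the X_j are independent, so
        phi_Y(t) = prod_j phi_j(weights[j] * t).
    The inversion integral
        f_Y(y) = (1 / (2*pi)) * integral of exp(-i*t*y) * phi_Y(t) dt
    is truncated to [-t_max, t_max] and evaluated with the trapezoidal
    rule, so the result is a numerical approximation.
    """
    t = np.linspace(-t_max, t_max, n)
    phi = np.ones_like(t, dtype=complex)
    for w, cf in zip(weights, cfs):
        phi = phi * cf(w * t)
    y = np.atleast_1d(np.asarray(y, dtype=float))
    integrand = np.exp(-1j * np.outer(y, t)) * phi
    dt = t[1] - t[0]
    integral = (integrand[:, :-1] + integrand[:, 1:]).sum(axis=1) * dt / 2.0
    return integral.real / (2.0 * np.pi)


# Illustrative use: a hypothetical component with loadings (0.6, 0.8)
# combining a uniform (interval-type) and a Gaussian variable.
ys = np.linspace(-3.0, 3.0, 7)
pdf = density_of_linear_combination(
    ys,
    weights=[0.6, 0.8],
    cfs=[lambda t: cf_uniform(t, 0.0, 1.0), lambda t: cf_normal(t, 0.0, 1.0)],
)
print(np.column_stack([ys, pdf]))
```

The sketch handles components that are not identically distributed without any extra work, because each variable enters only through its own characteristic function. The truncation limit `t_max` must be large enough for the product of characteristic functions to have decayed; it is adequate here because of the Gaussian factor, but it would need care for a combination of uniform variables only.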

Acknowledgments

The authors are grateful to the Editor and anonymous reviewers for their insightful comments, which have helped to improve the quality of this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 71031001, 70771004, 71371019) and by the Program for New Century Excellent Talents in University of the Ministry of Education of China (Grant No. NCET-12-0026).

Author information

Corresponding author

Correspondence to Zhongfeng Qin.

Cite this article

Chen, M., Wang, H. & Qin, Z. Principal component analysis for probabilistic symbolic data: a more generic and accurate algorithm. Adv Data Anal Classif 9, 59–79 (2015). https://doi.org/10.1007/s11634-014-0178-2
