Skip to main content
Log in

Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data

  • Published:
Journal of Classification Aims and scope Submit manuscript

Abstract

Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements—compositions—are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed in the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  • AITCHISON, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman & Hall, reprinted in 2003 by Blackburn Press.

    Book  MATH  Google Scholar 

  • AITCHISON, J. (1992), “On Criteria for Measures of Compositional Difference,” Mathematical Geology, 24, 365–379.

    Article  MathSciNet  MATH  Google Scholar 

  • AITCHISON, J., BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J.A., and PAWLOWSKY-GLAHN, V. (2000), “Logratio Analysis and Compositional Distance,” Mathematical Geology, 32, 271–275.

    Article  MATH  Google Scholar 

  • AITCHISON, J., and GREENACRE, M. (2002), “Biplots for Compositional Data,” Journal of the Royal Statistical Society, Series C, 51, 375–392.

    Article  MathSciNet  MATH  Google Scholar 

  • BAXTER, M.J., and FREESTONE, I.C. (2006), “Log-ratio Compositional Data Analysis in Archeometry,” Archaeometry, 48, 511–531.

    Article  Google Scholar 

  • BERGET, I., MEVIK, B-H., and NAES, T. (2008), “New Modifications and Applications of Fuzzy C-Means Methodology,” Computational Statistics & Data Analysis, 52, 2403–2418.

    Article  MathSciNet  MATH  Google Scholar 

  • BEZDEK, J. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.

    Book  MATH  Google Scholar 

  • BILLHEIMER, D., GUTTORP, P., and FAGAN, W. (2001), “Statistical Interpretation of Species Composition,” Journal of the American Statistical Association, 96, 1205–1214.

    Article  MathSciNet  MATH  Google Scholar 

  • CHACÓN, J.E., MATEU-FIGUERAS, G., and MARTÍN-FERNÁNDEZ, J.A. (2011), “Gaussian Kernels for Density Estimation with Compositional Data,” Computers & Geosciences, 37, 702–711.

    Article  Google Scholar 

  • DESARBO, W.S., RAMASWAMY, V., and LENK, P. (1993), “A Latent Class Procedure for the Structural Analysis of Two-Way Compositional Data,” Journal of Classification, 10, 159–193.

    Article  MATH  Google Scholar 

  • DÖRING, C., LESOT, M-J., and KRUSE, R. (2006), “Data Analysis with Fuzzy Clustering Methods,” Computational Statistics & Data Analysis, 51, 192–214.

    Article  MathSciNet  MATH  Google Scholar 

  • EGOZCUE, J.J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G., and BARCELÓ-VIDAL, C. (2003), “Isometric Logratio Transformations for Compositional Data Analysis,” Mathematical Geology, 35, 279–300.

    Article  MathSciNet  Google Scholar 

  • EGOZCUE, J.J., and PAWLOWSKY-GLAHN, V. (2005), “CoDa-Dendrogram: A New Exploratory Tool,” in Proceedings of the Second Compositional Data Analysis Workshop - CoDaWork’05, Girona, Spain.

  • GABRIEL, K.R. (1971), “The Biplot Graphic Display of Matrices with Application to Principal Component Analysis,” Biometrika, 58, 453–467.

    Article  MathSciNet  MATH  Google Scholar 

  • GAVIN, D.G., OSWALD, W.W., WAHL, E.R., and WILLIAMS, J.W. (2003), “A Statistical Approach to Evaluating Distance Metrics and Analog Assignments for Pollen Records,” Quaternary Research, 60, 356–367.

    Article  Google Scholar 

  • GREENACRE, M. (1988), “Clustering the Rows and Columns of a Contingency Table,” Journal of Classification, 5, 39–51.

    Article  MathSciNet  MATH  Google Scholar 

  • HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley & Sons.

    MATH  Google Scholar 

  • HÖPPNER, F., KLAWONN, F., KRUSE, R., and RUNKLER, T. (1999), Fuzzy Cluster Analysis: Methods for Classification, Data analysis, and Image Recognition, Chichester: John Wiley & Sons.

    MATH  Google Scholar 

  • LEGENDRE, P., and GALLAGHER, E.D. (2001), “Ecologically Meaningful Transformations for Ordination of Species Data,” Oecologia, 129, 271–280.

    Article  Google Scholar 

  • MARTÍN, M.C. (1996), “Performance of Eight Dissimilarity Coefficients to Cluster a Compositional Data Set,” in Abstracts of the Fifth Conference of International Federation of Classification Societies (Vol. 1), Kobe, Japan, pp. 215–217.

  • MARTÍN-FERNÁNDEZ, J.A., BREN, M., BARCELÓ-VIDAL, C., and PAWLOWSKYGLAHN, V. (1999), “A Measure of Difference for Compositional Data Based On Measures of Divergence,” in Proceedings of the Fifth Annual Conference of the International Assotiation for Mathematical Geology (Vol. 1), Trondheim, Norway, pp. 211–215.

  • MARTÍN-FERNÁNDEZ, J.A., BARCELÓ-VIDAL, C., and PAWLOWSKY-GLAHN, V. (2003), “Dealing with Zeros and Missing Values in Compositional Data Sets,” Mathematical Geology, 35, 253–278.

    Article  Google Scholar 

  • MILLER, W.E. (2002), “Revisiting the Geometry of a Ternary Diagram with the Half-Taxi Metric,” Mathematical Geology, 34, 275–290.

    Article  MathSciNet  MATH  Google Scholar 

  • PALAREA-ALBALADEJO, J., MARTÍN-FERNÁNDEZ, J.A., and GÓMEZ-GARCÍA, J. (2007), “A Parametric Approach for Dealing with Compositional Rounded Zeros,” Mathematical Geology, 39, 625–645.

    Article  MATH  Google Scholar 

  • PALAREA-ALBALADEJO, J., and MARTÍN-FERNÁNDEZ, J.A. (2008), “A Modified EM alr-Algorithm for Replacing Rounded Zeros in Compositional Data Sets,” Computers & Geosciences, 34, 902–917.

    Article  Google Scholar 

  • PAWLOWSKY-GLAHN, V., and EGOZCUE, J.J. (2001), “Geometric Approach to Statistical Analysis on the Simplex,” Stochastic Environmental Research and Risk Assessment, 15, 384–398.

    Article  MATH  Google Scholar 

  • PAWLOWSKY-GLAHN, V. (2003), “Statistical Modelling on Coordinates,” in Proceedings of the First Compositional Data Analysis Workshop - CoDaWork’03, Girona, Spain.

  • PAWLOWSKY-GLAHN, V., and EGOZCUE, J.J. (2008), “Compositional Data and Simpson’s Paradox,” in Proceedings of the Third Compositional Data Analysis Workshop - CoDaWork’08, Girona, Spain.

  • SOTO, J., FLORES-SINTAS, A., and PALAREA-ALBALADEJO, J. (2008), “Improving Probabilities in a Fuzzy Clustering Partition,” Fuzzy Sets & Systems, 159, 406–421.

    Article  MathSciNet  MATH  Google Scholar 

  • TEMPL, M., FILZMOSER, P., and REIMANN, C. (2008), “Cluster Analysis Applied to Regional Geochemical Data: Problems and Possibilities,” Applied Geochemistry, 23, 2198–2213.

    Article  Google Scholar 

  • VÊNCIO, R., VARUZZA, L., PEREIRA, C., BRENTANI, H. and SHMULEVICH, I. (2007), “Simcluster: Clustering Enumeration Gene Expression Data on the Simplex Space,” BMC Bioinformatics, 8, 246.

    Article  Google Scholar 

  • WAHL, E.R. (2004), “A General Framework for Determining Cut-off Values to Select Pollen Analogs with Dissimilarity Metrics in the Modern Analog Technique,” Review of Palaeobotany and Palynology, 128, 263–280.

    Article  Google Scholar 

  • WANG, H., LIU, Q., MOK, H.M.K., FU, L., and TSE, W.M. (2007), “A Hyperspherical Transformation Forecasting Model for Compositional Data,” European Journal of Operations Research, 179, 459–468.

    Article  MATH  Google Scholar 

  • WATSON, D.F., and PHILIP, G.M. (1989), “Measures of Variability for Geological Data,” Mathematical Geology, 21, 233–254.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Javier Palarea-Albaladejo.

Additional information

This research has been supported by the Scottish Government, the Spanish Ministry of Science and Innovation under the project “CODA-RSS” Ref. MTM2009-13272; and by the Agència de Gestió d’Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under the project Ref: 2009SGR424. We are in debt with the editor and the referees for their helpful comments and suggestions on an earlier version of this paper.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Palarea-Albaladejo, J., Martín-Fernández, J.A. & Soto, J.A. Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. J Classif 29, 144–169 (2012). https://doi.org/10.1007/s00357-012-9105-4

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00357-012-9105-4

Keywords

Navigation