Abstract
Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements, called compositions, are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed on the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study illustrates the approach using a nutritional data set well known in the clustering literature.
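The scale invariance principle mentioned above can be illustrated with a minimal sketch. It assumes the standard centered log-ratio (clr) transformation and the Aitchison distance computed as the Euclidean distance between clr-transformed vectors; the function names `closure`, `clr`, and `aitchison_distance` are hypothetical helpers chosen for illustration and are not taken from the paper, and the sketch is not the strategy the paper proposes.

```python
import numpy as np

def closure(x):
    """Rescale a strictly positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs.

    Assumes all parts are strictly positive (zeros need prior treatment).
    """
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))

# Two illustrative 3-part compositions (made-up values).
x = closure([1.0, 3.0, 6.0])
y = closure([2.0, 5.0, 3.0])

# Rescaling each vector (e.g. a change of units) leaves the composition,
# and hence the Aitchison distance, unchanged.
d_closed = aitchison_distance(x, y)
d_rescaled = aitchison_distance(100.0 * x, 7.0 * y)

# The plain Euclidean distance on raw values does change under rescaling.
e_closed = float(np.linalg.norm(x - y))
e_rescaled = float(np.linalg.norm(100.0 * x - 7.0 * y))

print(d_closed, d_rescaled)
print(e_closed, e_rescaled)
```

The two Aitchison distances agree up to floating-point error, while the two Euclidean values differ, which is one way to see why a distance applied directly to raw compositions can conflict with scale invariance.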
Additional information
This research has been supported by the Scottish Government; by the Spanish Ministry of Science and Innovation under the project “CODA-RSS” (Ref. MTM2009-13272); and by the Agència de Gestió d’Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under the project Ref. 2009SGR424. We are indebted to the editor and the referees for their helpful comments and suggestions on an earlier version of this paper.
About this article
Cite this article
Palarea-Albaladejo, J., Martín-Fernández, J.A. & Soto, J.A. Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. J Classif 29, 144–169 (2012). https://doi.org/10.1007/s00357-012-9105-4
DOI: https://doi.org/10.1007/s00357-012-9105-4