Clustering of High-Dimensional Data via Finite Mixture Models

McLachlan, Geoff J.; Baek, Jangsun

doi:10.1007/978-3-642-01044-6_3

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Finite mixture models are widely used in applications involving density estimation and clustering. An attractive feature of this approach to clustering is that it provides a sound statistical framework in which to assess how many clusters there are in the data and whether those clusters are valid. We review the application of normal mixture models to high-dimensional continuous data. One way to handle the fitting of normal mixture models to such data is to adopt mixtures of factor analyzers, which enable model-based density estimation and clustering to be undertaken when the number of observations n is not very large relative to their dimension p. In practice, there is often a need to reduce further the number of parameters in the specification of the component-covariance matrices. We focus here on a modified approach that uses common component-factor loadings, which reduces the number of parameters considerably further. Moreover, it allows the data to be displayed in low-dimensional plots.
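To give a rough sense of the parameter savings the abstract refers to, the following Python sketch counts the free covariance parameters under three specifications for g components, dimension p and q latent factors. The formulas are assumptions drawn from the standard factor-analytic literature, not from this abstract: an unrestricted covariance matrix per component, the usual mixture-of-factor-analyzers form (loadings B_i plus diagonal D_i per component), and one commonly cited common-loadings form (shared loadings A and diagonal D, with a q x q matrix per component). The function name and the example values are hypothetical.

```python
def covariance_parameter_counts(g: int, p: int, q: int) -> dict:
    """Count free covariance parameters for g components in p dimensions.

    Assumed parameterisations (not taken from the abstract):
      full   : unrestricted p x p covariance per component
      mfa    : Sigma_i = B_i B_i^T + D_i, with B_i of size p x q and D_i diagonal
      common : Sigma_i = A Omega_i A^T + D, with A (p x q) and diagonal D shared
    """
    if g < 1 or not (0 < q < p):
        raise ValueError("require g >= 1 and 0 < q < p")

    # Symmetric p x p matrix: p(p+1)/2 free elements per component.
    full = g * (p * (p + 1) // 2)

    # Loadings (pq) plus diagonal (p), less q(q-1)/2 rotational constraints.
    mfa = g * (p * q + p - q * (q - 1) // 2)

    # Shared loadings (pq) and diagonal (p), g symmetric q x q matrices,
    # less q^2 constraints for identifiability (assumed form).
    common = p * q + p + g * (q * (q + 1) // 2) - q * q

    return {"full": full, "mfa": mfa, "common": common}


if __name__ == "__main__":
    # Hypothetical setting: 3 clusters, 1000 variables, 5 factors.
    for name, count in covariance_parameter_counts(g=3, p=1000, q=5).items():
        print(f"{name:>6}: {count}")
```

Under these assumed forms, the unrestricted count grows quadratically in p for every component, the factor-analytic count grows linearly in p for every component, and the common-loadings count pays the linear-in-p cost only once.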

Acknowledgements

The work of J. Baek was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund, KRF-2007-521-C00048). The work of G. McLachlan was supported by the Australian Research Council.

Author information

Authors and Affiliations

Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, 4072, Australia
Geoff J. McLachlan

Corresponding author

Correspondence to Geoff J. McLachlan.

Editor information

Editors and Affiliations

Universität der Bundeswehr, Fak. Wirtschafts-/Sozialwissenschaften, Helmut-Schmidt-Universität, Holstenhofweg 85, Hamburg, 22043, Germany
Andreas Fink
Dept. Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom
Berthold Lausen
Universität der Bundeswehr, Fak. Wirtschafts-/Sozialwissenschaften, Helmut-Schmidt-Universität, Holstenhofweg 85, Hamburg, 22043, Germany
Wilfried Seidel
FB 12 Mathematik und Informatik, Datenbionik AG, Universität Marburg, Hans-Meerwein-Straße, Marburg, 35032, Germany
Alfred Ultsch

About this paper

Cite this paper

McLachlan, G.J., Baek, J. (2009). Clustering of High-Dimensional Data via Finite Mixture Models. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_3

DOI: https://doi.org/10.1007/978-3-642-01044-6_3
Published: 31 July 2009
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)