Clustering of High-Dimensional Data via Finite Mixture Models

  • Conference paper
  • In: Advances in Data Analysis, Data Handling and Business Intelligence

Abstract

Finite mixture models are commonly used in a wide range of practical applications involving density estimation and clustering. An attractive feature of this approach to clustering is that it provides a sound statistical framework in which to assess the important questions of how many clusters there are in the data and how valid they are. We review the application of normal mixture models to high-dimensional continuous data. One way to handle the fitting of normal mixture models to such data is to adopt mixtures of factor analyzers. They enable model-based density estimation and clustering to be undertaken for high-dimensional data, where the number of observations n is not very large relative to their dimension p. In practice, there is often a need to reduce further the number of parameters in the specification of the component-covariance matrices. We focus here on a new, modified approach that uses common component-factor loadings, which considerably reduces the number of parameters further. Moreover, it allows the data to be displayed in low-dimensional plots.
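
To give a feel for the parameter savings described in the abstract, the following minimal sketch counts free parameters for a g-component normal mixture in p dimensions under three covariance structures. The formulas are assumptions, not taken from the abstract: they follow the usual parameterizations, namely unrestricted covariances, factor-analytic covariances of the form B_i B_i^T + D_i with q factors, and a common-loadings form with means A ξ_i and covariances A Ω_i A^T + D. The function names are hypothetical and chosen only for illustration.

```python
def full_normal_params(p: int, g: int) -> int:
    """Unrestricted normal mixture: mixing proportions, means, full covariances."""
    return (g - 1) + g * p + g * p * (p + 1) // 2


def mfa_params(p: int, q: int, g: int) -> int:
    """Mixture of factor analyzers (assumed form Sigma_i = B_i B_i^T + D_i).

    Per component: p mean terms, p diagonal terms, and pq loadings less
    q(q-1)/2 rotational constraints.
    """
    return (g - 1) + 2 * g * p + g * (p * q - q * (q - 1) // 2)


def mcfa_params(p: int, q: int, g: int) -> int:
    """Common factor loadings (assumed form mu_i = A xi_i, Sigma_i = A Omega_i A^T + D).

    Shared: pq loadings and p diagonal terms. Per component: q factor means and
    q(q+1)/2 factor covariance terms. Less q^2 identifiability constraints.
    """
    return (g - 1) + p + q * (p + g) + g * q * (q + 1) // 2 - q * q


if __name__ == "__main__":
    # Illustrative sizes only: p = 100 variables, q = 2 factors, g = 3 clusters.
    p, q, g = 100, 2, 3
    print("full covariance :", full_normal_params(p, g))
    print("factor analyzers:", mfa_params(p, q, g))
    print("common loadings :", mcfa_params(p, q, g))
```

Under these assumed counts, and with the illustrative sizes above, the unrestricted model needs 15452 parameters, the factor-analytic model 1199, and the common-loadings model 313, which is the kind of reduction the abstract refers to when n is small relative to p.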

Acknowledgements

The work of J. Baek was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund, KRF-2007-521-C00048). The work of G. McLachlan was supported by the Australian Research Council.

Author information

Corresponding author

Correspondence to Geoff J. McLachlan.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

McLachlan, G.J., Baek, J. (2009). Clustering of High-Dimensional Data via Finite Mixture Models. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_3
