Abstract
Finite mixture models are commonly used in a wide range of applications for density estimation and clustering. An attractive feature of the mixture approach to clustering is that it provides a sound statistical framework in which to assess how many clusters there are in the data and how valid they are. We review the application of normal mixture models to high-dimensional continuous data. One way to fit normal mixture models to such data is to adopt mixtures of factor analyzers. These enable model-based density estimation and clustering to be undertaken for high-dimensional data, where the number of observations n is not very large relative to their dimension p. In practice, there is often a need to reduce further the number of parameters in the specification of the component-covariance matrices. We focus here on a new, modified approach that uses common component-factor loadings, which reduces the number of parameters considerably further. Moreover, it allows the data to be displayed in low-dimensional plots.
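To give a rough sense of the parameter savings described above, the following sketch tallies free parameters for three normal mixture specifications. It is an illustration under stated assumptions, not a formula quoted from the chapter: the unrestricted model has a full covariance matrix per component; the mixture of factor analyzers is assumed to use Sigma_i = B_i B_i^T + D_i with B_i of size p x q and D_i diagonal; and the common-loadings variant is assumed to use mu_i = A xi_i and Sigma_i = A Omega_i A^T + D with a single p x q loading matrix A and a single diagonal D. The function names and the values g = 3, p = 100, q = 2 are hypothetical.

```python
def full_covariance_params(g: int, p: int) -> int:
    """Free parameters with an unrestricted covariance matrix per component."""
    mixing = g - 1
    means = g * p
    covariances = g * p * (p + 1) // 2
    return mixing + means + covariances


def mfa_params(g: int, p: int, q: int) -> int:
    """Free parameters assuming Sigma_i = B_i B_i^T + D_i (component-specific loadings)."""
    mixing = g - 1
    means = g * p
    # Per component: p*q loadings + p diagonal variances,
    # less q(q-1)/2 for the rotational indeterminacy of B_i.
    covariances = g * (p * q + p - q * (q - 1) // 2)
    return mixing + means + covariances


def mcfa_params(g: int, p: int, q: int) -> int:
    """Free parameters assuming mu_i = A xi_i and Sigma_i = A Omega_i A^T + D."""
    mixing = g - 1
    diagonal = p                      # shared diagonal matrix D
    loadings = p * q                  # shared loading matrix A
    factor_means = g * q              # xi_i for each component
    factor_covs = g * q * (q + 1) // 2  # Omega_i for each component
    constraints = q * q               # A is identified only up to a q x q transformation
    return mixing + diagonal + loadings + factor_means + factor_covs - constraints


if __name__ == "__main__":
    g, p, q = 3, 100, 2  # hypothetical: 3 clusters, 100 variables, 2 factors
    print("unrestricted:", full_covariance_params(g, p))
    print("factor analyzers:", mfa_params(g, p, q))
    print("common loadings:", mcfa_params(g, p, q))
```

Under these assumptions the count grows quadratically in p for the unrestricted model but only linearly in p for the two factor-analytic models, and the common-loadings variant no longer multiplies the p-dependent terms by the number of components g.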
Acknowledgements
The work of J. Baek was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund, KRF-2007-521-C00048). The work of G. McLachlan was supported by the Australian Research Council.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McLachlan, G.J., Baek, J. (2009). Clustering of High-Dimensional Data via Finite Mixture Models. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_3
DOI: https://doi.org/10.1007/978-3-642-01044-6_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics (R0)