Abstract
Hybrid species tend to exhibit a mixture of parent characteristics; we propose chimeral clusters as exhibiting a mixture of parent parameters, a type of intercluster structure. Morphometric measurements in the iris dataset describe the hybrid Iris versicolor as intermediate to those of parent species Iris setosa and Iris virginica, which motivates our extension of Gaussian mixture models to allow mixing in the parameter space. We propose a mixing mechanism whereby chimeral clusters are parameterized by a convex combination of fully varying prototype cluster parameters and characterize the identifiability of the postulated mixture model. Estimation of chimeral clustering models is described using variations of the expectation-maximization algorithm and the solution to the continuous-time algebraic Riccati equation. The efficacy of chimeral clustering is demonstrated using morphometric datasets describing iris, Cooper’s hawks, and water striders, with comparisons to typical Gaussian mixture models. We evaluate parameter recovery on a synthetic dataset and demonstrate that parsimonious covariance matrices and chimeral clustering capture different kinds of intercluster structure.









Similar content being viewed by others
References
Anderson, E. (1936). The species problem in Iris. Annals of the Missouri Botanical Garden, 23(3), 457–509.
Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803–821.
Battle, A., Segal, E., & Koller, D. (2005). Probabilistic discovery of overlapping cellular processes and their regulation. Journal of Computational Biology, 12(7), 909–927. pMID 16201912.
Biernacki, C., Celeux, G., & Govaert, G. (2003). Choosing starting values for the em algorithm for getting the highest likelihood in multivariate gaussian mixture models. Computational Statistics & Data Analysis, 41(3), 561–575. recent Developments in Mixture Model.
Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Browne, R.P., & McNicholas, P.D. (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification, 8(2), 217–226.
Cannon, A., Cobb, G., Hartlaub, B., Legler, J., Lock, R., Moore, T., Rossman, A., & Witmer, J. (2019). Stat2data: Datasets for Stat2. R package version 2.0.0.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Clarkson, D.B., & Jennrich, R.I. (1988). Quartic rotation criteria and algorithms. Psychometrika, 53(2), 251–259.
De Leeuw, J., & Heiser, W.J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data, 735–752.
Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
Erosheva, E.A., Fienberg, S.E., & Joutard, C. (2007). Describing disability through individual-level mixture models for multivariate binary data. The Annals of Applied Statistics, 1(2), 346–384. 21687832[pmid].
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Fraley, C., & Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.
Grünbaum, B. (2003). Convex Polytopes. New York: Springer.
Hansen, F., & Pedersen, G.K. (2003). Jensen’s operator inequality. Bulletin of the London Mathematical Society, 35(4), 553–564.
Heller, K.A., Williamson, S., & Ghahramani, Z. (2008). Statistical models for partial membership. In Proceedings of the 25th international conference on machine learning, association for computing machinery, New York, NY, USA, ICML ’08 (pp. 392–399).
Holzmann, H., Munk, A., & Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33 (4), 753–763.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Hunter, D.R., & Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58(1), 30–37.
Kaiser, H.F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3), 187–200.
Klingenberg, C.P., & Spence, J.R. (1993). Heterochrony and allometry: Lessons from the water strider genus limnoporus. Evolution, 47(6), 1834–1853.
Laub, A. (1979). A Schur method for solving algebraic Riccati equations. IEEE Transactions on Automatic Control, 24(6), 913–921.
McNicholas, P.D., & Murphy, T.B. (2008). Parsimonious gaussian mixture models. Statistics and Computing, 18(3), 285–296.
Meng, X.L., & Rubin, D.B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80(2), 267–278.
Ortega, J.M., & Rheinboldt, W.C. (2000). Iterative solution of nonlinear equations in several variables. Society for industrial and applied mathematics.
Pritchard, J.K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945.
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Scrucca, L., Fop, M., Murphy, T.B., & Raftery, A.E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317.
Shapiro, A. (1985). Identifiability of factor analysis: some results and open problems. Linear Algebra and its Applications, 70, 1–7.
Symons, M.J. (1981). Clustering criteria and multivariate normal mixtures. Biometrics, 37(1), 35–43.
Teicher, H. (1961). Maximum likelihood characterization of distributions. Annals of Mathematical Statistics, 32(4), 1214–1222.
Wolfe, J.H. (1963). Object cluster analysis of social areas. PhD thesis, University of California.
Woodbury, M.A., Clive, J., & Garson, A. (1978). Mathematical typology: a grade of membership technique for obtaining disease definition. Computers and Biomedical Research, 11(3), 277–298.
Yakowitz, S.J., & Spragins, J.D. (1968). On the identifiability of finite mixtures. Annals of Mathematical Statistics, 39(1), 209–214.
Zhang, J. (2013). Epistatic clustering: a model-based approach for identifying links between clusters. Journal of the American Statistical Association, 108 (504), 1366–1384.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Hou-Liu, J., Browne, R.P. Chimeral Clustering. J Classif 39, 171–190 (2022). https://doi.org/10.1007/s00357-021-09396-3
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00357-021-09396-3