Clustering bivariate mixed-type data via the cluster-weighted model

Punzo, Antonio; Ingrassia, Salvatore

doi:10.1007/s00180-015-0600-z

Original Paper
Published: 04 July 2015

Computational Statistics, volume 31, pages 989–1013 (2016)

Antonio Punzo¹ &
Salvatore Ingrassia¹

614 Accesses
27 Citations

Abstract

The cluster-weighted model (CWM) is a mixture model with random covariates that allows for flexible clustering/classification and distribution estimation of a random vector composed of a response variable and a set of covariates. Within this class of models, the generalized linear exponential CWM is introduced here specifically for modeling bivariate data of mixed type. Its natural counterpart in the family of latent class models is also defined. Maximum likelihood parameter estimates are derived using the expectation-maximization (EM) algorithm, and some computational issues are detailed. Through Monte Carlo experiments, the classification performance of the proposed model is compared with that of other mixture-based approaches, the consistency of the estimators of the regression coefficients is evaluated, and several likelihood-based information criteria are compared for selecting the number of mixture components. Finally, an application to real data is considered.
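
As a rough illustration of the model structure the abstract describes, the sketch below writes the joint density of a covariate X and a response Y as a mixture in which each component factorizes into a conditional density of Y given x and a marginal density of X, followed by the posterior membership probabilities that an expectation-maximization E-step would compute. The notation (G, \pi_g, \boldsymbol{\xi}_g, \boldsymbol{\psi}_g, z_{ig}) is an assumption borrowed from common usage in the CWM literature and is not taken from this page; the article's own symbols may differ.

```latex
% Joint density of (X, Y) under a cluster-weighted model with G components
% (assumed notation: \pi_g mixing weights, \xi_g conditional parameters,
%  \psi_g marginal parameters)
p(x, y; \boldsymbol{\theta})
  = \sum_{g=1}^{G} \pi_g \, p(y \mid x; \boldsymbol{\xi}_g) \, p(x; \boldsymbol{\psi}_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1.

% E-step: posterior probability that observation (x_i, y_i) belongs to component g
z_{ig}
  = \frac{\pi_g \, p(y_i \mid x_i; \boldsymbol{\xi}_g) \, p(x_i; \boldsymbol{\psi}_g)}
         {\sum_{j=1}^{G} \pi_j \, p(y_i \mid x_i; \boldsymbol{\xi}_j) \, p(x_i; \boldsymbol{\psi}_j)}.
```

In a mixed-type bivariate setting one could, for example, take the conditional density to be Poisson or binomial and the marginal density to be Gaussian; this particular pairing is only an example and is not stated in the abstract.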

Acknowledgments

The authors acknowledge the financial support from the Grant “Finite mixture and latent variable models for causal inference and analysis of socio-economic data” (FIRB 2012-Futuro in ricerca) funded by the Italian Government (RBFR12SHVV).

Author information

Authors and Affiliations

Department of Economics and Business, University of Catania, Corso Italia 55, 95129, Catania, Italy
Antonio Punzo & Salvatore Ingrassia

Corresponding author

Correspondence to Antonio Punzo.

About this article

Cite this article

Punzo, A., Ingrassia, S. Clustering bivariate mixed-type data via the cluster-weighted model. Comput Stat 31, 989–1013 (2016). https://doi.org/10.1007/s00180-015-0600-z

Received: 22 April 2013
Accepted: 20 June 2015
Published: 04 July 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s00180-015-0600-z
