Semi-supervised model-based clustering with positive and negative constraints

Melnykov, Volodymyr; Melnykov, Igor; Michael, Semhar

doi:10.1007/s11634-015-0200-3

Semi-supervised model-based clustering with positive and negative constraints

Regular Article
Published: 25 February 2015

Volume 10, pages 327–349, (2016)
Cite this article

Advances in Data Analysis and Classification Aims and scope Submit manuscript

Volodymyr Melnykov¹,
Igor Melnykov² &
Semhar Michael¹

868 Accesses
15 Citations
Explore all metrics

Abstract

Cluster analysis is a popular technique in statistics and computer science with the objective of grouping similar observations in relatively distinct groups generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of data and unavailable for the rest of observations. In this paper, we discuss a general type of semi-supervised clustering defined by so called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. On the contrary, negative constraints specify that particular points must represent different data groups. We outline a general framework for semi-supervised clustering with constraints naturally incorporating the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Recent Developments in Model-Based Clustering with Applications

Advances in Robust Constrained Model Based Clustering

A Novel Clustering Algorithm Based on a Non-parametric “Anti-Bayesian” Paradigm

References

Anderson E (1935) The Irises of the Gaspe Peninsula. Bull Am Iris Soc 59:2–5
Google Scholar
Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th International Conference on Machine Learning, pp 19–26
Basu S, Bilenko M, Mooney RJ (2004) A Probabilistic Framework for Semi-Supervised Clustering. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 59–68
Basu S, Davidson I, Wagstaff K (2008) Constrained clustering: advances in algorithms, theory, and application. Chapman and Hall/CRC
Bouveyron C, Brunet C (2014) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78
Article MathSciNet Google Scholar
Bridge M (2012) Locating the origins of wood resources: a review of dendroprovenancing. J Archaeol Sci 39:2828–2834
Article Google Scholar
Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus Leptograsus. Aust J Zool 22:417–425
Article Google Scholar
Chen W-C, Maitra R (2011) Model-based clustering of regression time series data via APECM-An AECM Algorithm Sung to an even faster beat. Stat Anal Data Min 4:567–578
Article MathSciNet Google Scholar
Côme E, Oukhellou L, Denœux T, Aknin P (2009) Learning from partially supervised data using mixture models and belief functions. Pattern Recognit 42:334–348
Article MATH Google Scholar
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood for incomplete data via the EM algorithm (with discussion). J Royal Stat Soc, Ser B 39:1–38
MathSciNet MATH Google Scholar
Digalakis VV, Rtischev D, Neumeyer LG (1995) Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans Speech Audio Process 3:357–366
Article Google Scholar
Fisher RA (1936) The use of multiple measurements in taxonomic poblems. Ann Eugen 7:179–188
Article Google Scholar
Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
Fraley C, Raftery AE (1998) How many clusters? Which cluster method? Answers via model-based cluster analysis. Comput J 41:578–588
Article MATH Google Scholar
Fraley C, Raftery AE (2002) Model-based clustering and density estimation. J Am Stat Assoc 97:611–631
Article MathSciNet MATH Google Scholar
Fraley C, Raftery AE (2006) MCLUST Version 3 for R: normal mixture modeling and model-based clustering, Tech. Rep. 504, University of Washington, Department of Statistics, Seattle, WA
Gaffney SJ, Smyth P (1999) Trajectory clustering with mixture of regression model. Proceedings of Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San diego. CA. ACM, USA, pp 63–72
Chapter Google Scholar
Grissino-Mayeri HD, Fritts H (1997) The international tree-ring data bank: an enhanced global database serving the Global Scientific Community. Holocene 7:235–238
Article Google Scholar
Haneca K, Wazny T, Van Acker J, Beeckman H (2005) Provenancing Baltic timber from art historical objects: success and limitations. J Archaeol Sci 32:261–271
Article Google Scholar
Hennig C (2010) Methods for merging Gaussian mixture components. Adv Data Anal Classif 4:3–34
Article MathSciNet MATH Google Scholar
Huang J-T, Hasegawa-Johnson M (2009) On semi-supervised learning of Gaussian mixture models for phonetic classification. In: NAACL HLT workshop on semi-supervised learning
Hughes MK, Swetnam TW, Diaz HF (2009) Dendroclimatology: progress and prospects, vol 11. Princeton, Developments in Paleoenvirnmental ResearchSpringer
Google Scholar
Johnson S (1967) Hierarchical clustering schemes. Psychometrika 32(3):241–254
Article Google Scholar
Law MHC, Topchy A, Jain AK (2005) Model-based clustering with probabilistic constraints. In: 2005 SIAM International Conference on Data Mining, pp 641–645
Liu B, Shen X, Pan W (2013) Semi-supervised spectral clustering with application to detect population stratification. Front Genet 4:1–5
Google Scholar
Lu Z, Leen TK (2007) Penalized probabilistic clustering. Neural Comput 19:1528–1567
Article MathSciNet MATH Google Scholar
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp 1:281–297
MathSciNet MATH Google Scholar
Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19:354–376
Article MathSciNet Google Scholar
Martinez-Uso A, Pla F, Sotoca J (2010) A semi-supervised Gaussian mixture model for image segmentation. In: International Conference on Pattern Recognition, pp 2941–2944
McLachlan G, Peel D (2000) Finite Mixture Models. Wiley, New York
Book MATH Google Scholar
Melnykov V (2012) Efficient estimation in model-based clustering of Gaussian regression time series. Stat Anal Data Min 5:95–99
Article MathSciNet Google Scholar
Melnykov V (2013) On the distribution of posterior probabilities in finite mixture models with application in clustering. J Multivar Anal 122:175–189
Article MathSciNet MATH Google Scholar
Melnykov V, Chen W-C, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
Article Google Scholar
Melnykov V, Maitra R (2010) Finite mixture models and model-based clustering. Stat Surv 4:80–116
Article MathSciNet MATH Google Scholar
Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134
Article MATH Google Scholar
Pan W, Shen X, Jiang A, Hebbel R (2006) Semisupervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics 22(19):2388–2395
Article Google Scholar
Schwarz G (1978) Estimating the dimensions of a model. Ann Stat 6:461–464
Article MATH Google Scholar
Shental N, Bar-Hillel A, Hertz T, Weinshall D (2003) Computing Gaussian mixture models with EM using equivalence constraints. In: Advances in NIPS, vol. 15
Sloane NJA (2014) The online encyclopedia of integer sequences: A001349 Number of connected graphs with n nodes
Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained \(K\)-means Clustering with Background Knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp 577–584
Wang L, Zhu J, Zou H (2007) Hybrid Huberized Support Vector Machines for Microarray Classification. Proceedings of the 24th International Conference on Machine Learning, New York. NY. ACM, USA, pp 983–990
Google Scholar
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Systems, Statistics, and Management Science, University of Alabama, Tuscaloosa, AL, 35487, USA
Volodymyr Melnykov & Semhar Michael
Department of Mathematics and Physics, Colorado State University-Pueblo, Pueblo, CO, 81001, USA
Igor Melnykov

Authors

Volodymyr Melnykov
View author publications
You can also search for this author in PubMed Google Scholar
Igor Melnykov
View author publications
You can also search for this author in PubMed Google Scholar
Semhar Michael
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Volodymyr Melnykov.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Melnykov, V., Melnykov, I. & Michael, S. Semi-supervised model-based clustering with positive and negative constraints. Adv Data Anal Classif 10, 327–349 (2016). https://doi.org/10.1007/s11634-015-0200-3

Download citation

Received: 18 February 2014
Revised: 24 January 2015
Accepted: 09 February 2015
Published: 25 February 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s11634-015-0200-3

Keywords

Mathematics Subject Classification

62H30

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Semi-supervised model-based clustering with positive and negative constraints

Abstract

Access this article

Similar content being viewed by others

Recent Developments in Model-Based Clustering with Applications

Advances in Robust Constrained Model Based Clustering

A Novel Clustering Algorithm Based on a Non-parametric “Anti-Bayesian” Paradigm

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Semi-supervised model-based clustering with positive and negative constraints

Abstract

Access this article

Similar content being viewed by others

Recent Developments in Model-Based Clustering with Applications

Advances in Robust Constrained Model Based Clustering

A Novel Clustering Algorithm Based on a Non-parametric “Anti-Bayesian” Paradigm

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation