Abstract
Semi-supervised clustering aims to aid and bias unsupervised clustering by exploiting a small amount of supervised information. This supervision is generally given as pairwise constraints, which are used either to modify the objective function or to learn a distance measure. Much previous work has shown that clustering algorithms based on a distance metric significantly outperform those based on a probability distribution on some data sets, while the opposite holds on others; how to balance the two approaches therefore becomes a key problem. In this paper, we propose a semi-supervised hybrid clustering algorithm that provides a principled framework for integrating a distance metric into the Gaussian mixture model, considering not only the intrinsic geometric information but also the probability distribution of the data. Rather than using only pairwise constraints, we use the labeled data both to initialize the Gaussian distribution parameters and to construct the weight matrix of the regularizer, and we then adopt the Kullback-Leibler divergence as the "distance" measure in regularizing the objective function. Experiments on several UCI data sets and on real-world Chinese Word Sense Induction data demonstrate the effectiveness of our semi-supervised clustering algorithm.
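To make the idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a KL-divergence-regularized Gaussian mixture objective: mixture posteriors are computed per point, and pairs connected in a weight matrix W (built here, by assumption, from labeled data or must-link constraints) are penalized when their posterior distributions diverge. All function names and the symmetrized-KL choice are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gmm_posteriors(X, means, covs, weights):
    """Component densities and responsibilities p(k | x_i) under a Gaussian mixture."""
    n, d = X.shape
    dens = np.zeros((n, len(weights)))
    for k, (m, c, w) in enumerate(zip(means, covs, weights)):
        diff = X - m
        inv = np.linalg.inv(c)
        norm = w / np.sqrt((2 * np.pi) ** d * np.linalg.det(c))
        # Mahalanobis quadratic form per row: diff_i @ inv @ diff_i
        dens[:, k] = norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
    return dens, dens / dens.sum(axis=1, keepdims=True)

def sym_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def regularized_objective(X, means, covs, weights, W, lam):
    """Data log-likelihood minus a KL-based pairwise smoothness penalty."""
    dens, post = gmm_posteriors(X, means, covs, weights)
    loglik = np.sum(np.log(dens.sum(axis=1)))
    n = X.shape[0]
    penalty = sum(W[i, j] * sym_kl(post[i], post[j])
                  for i in range(n) for j in range(n) if W[i, j] > 0)
    return loglik - lam * penalty
```

In this reading, maximizing the objective trades off fit to the mixture model against posterior agreement for constrained pairs; a must-link pair with similar posteriors contributes almost nothing to the penalty, while a pair assigned to different components is penalized in proportion to its weight in W.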
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which significantly improved the quality of this paper. The research reported in this paper has been partially supported by the National Science Foundation of China under Grant Nos. 61075053 and 71102065, the Ph.D. Programs Foundation of the Ministry of Education of China under Grant No. 20120191110028, and the Fundamental Research Funds for the Central Universities under Project No. CDJZR10090001.
Cite this article
Zhang, Y., Wen, J., Wang, X. et al. Semi-supervised hybrid clustering by integrating Gaussian mixture model and distance metric learning. J Intell Inf Syst 45, 113–130 (2015). https://doi.org/10.1007/s10844-013-0264-5