Semi-supervised hybrid clustering by integrating Gaussian mixture model and distance metric learning

Journal of Intelligent Information Systems

Abstract

Semi-supervised clustering aims to aid and bias unsupervised clustering by employing a small amount of supervised information. The supervised information is generally given as pairwise constraints, which are used either to modify the objective function or to learn the distance measure. Much previous work has shown that clustering algorithms based on a distance metric perform significantly better than those based on a probability distribution on some data sets, while the opposite holds on others, so how to balance the two approaches becomes a key problem. In this paper, we propose a semi-supervised hybrid clustering algorithm that provides a principled framework for integrating a distance metric into the Gaussian mixture model, and which considers not only the intrinsic geometric information but also the probability distribution information of the data. Rather than using only pairwise constraints, we use the labeled data to initialize the Gaussian distribution parameters and to construct the weight matrix of the regularizer, and we then adopt the Kullback-Leibler divergence as the “distance” measure to regularize the objective function. Experiments on several UCI data sets and on real-world data sets for Chinese Word Sense Induction demonstrate the effectiveness of our semi-supervised clustering algorithm.
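The two uses of labeled data named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ridge term `reg`, the same-label rule used to build the weight matrix `W`, and the symmetrized form of the Kullback-Leibler divergence are all assumptions made for illustration, since the abstract does not specify them. The sketch also assumes every cluster has at least one labeled point.

```python
import numpy as np


def init_gaussians_from_labels(X, y, n_clusters, reg=1e-6):
    """Estimate mixture weights, means, and covariances from labeled points only.

    X: (n, d) labeled feature vectors; y: (n,) integer labels in [0, n_clusters).
    reg is a hypothetical ridge term that keeps each covariance invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    weights = np.zeros(n_clusters)
    means = np.zeros((n_clusters, d))
    covs = np.zeros((n_clusters, d, d))
    for k in range(n_clusters):
        Xk = X[y == k]  # assumed non-empty for every cluster
        weights[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / len(Xk) + reg * np.eye(d)
    return weights, means, covs


def same_label_weights(y):
    """Hypothetical weight matrix: W[i, j] = 1 if i != j share a label, else 0."""
    y = np.asarray(y)
    W = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)
    return W


def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def kl_regularizer(P, W):
    """Penalty 0.5 * sum_ij W[i, j] * D(P[i], P[j]) over posterior rows of P.

    P: (n, K) cluster posteriors, one row per point; W: (n, n) weight matrix.
    The penalty is small when strongly linked points have similar posteriors.
    """
    total = 0.0
    for i, j in zip(*np.nonzero(W)):
        total += W[i, j] * symmetric_kl(P[i], P[j])
    return 0.5 * total
```

In a full algorithm along these lines, the seeded parameters would start an EM-style fit, and a penalty of this kind would be subtracted, with some trade-off coefficient, from the log-likelihood being maximized. Both of those steps are left out here because the abstract does not give their exact form.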

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which significantly improved the quality of this paper. The research reported in this paper has been partially supported by National Science Foundations of China under Grant Nos. 61075053 and 71102065, the Ph.D. Programs Foundation of Ministry of Education of China No. 20120191110028, and the Fundamental Research Funds for the Central Universities Project No. CDJZR10090001.

Author information

Corresponding author

Correspondence to Junhao Wen.

About this article

Cite this article

Zhang, Y., Wen, J., Wang, X. et al. Semi-supervised hybrid clustering by integrating Gaussian mixture model and distance metric learning. J Intell Inf Syst 45, 113–130 (2015). https://doi.org/10.1007/s10844-013-0264-5

  • DOI: https://doi.org/10.1007/s10844-013-0264-5
