Multi-type clustering in heterogeneous information networks

  • Regular Paper
Knowledge and Information Systems

Abstract

Heterogeneous information networks have drawn much attention in recent years because of their applications in areas such as text mining, e-commerce, social networks, and bioinformatics. Clustering different types of objects simultaneously, using both the relations among objects of the same type and the relations between objects of different types, can mutually improve the clustering quality of each type. In this paper, we propose a general model that considers homogeneous and heterogeneous relations simultaneously to describe the structure of heterogeneous information networks, and we devise a novel parameter-free multi-type overlapping clustering approach. In this model, the different types of relations between different types of objects are represented by a group of matrices. This representation lets us transform the multi-type clustering problem into an information compression problem. We then propose greedy search approaches that aim to describe the group of relational matrices with the fewest bits. Moreover, by discovering the discriminative clusters among different types of objects, we devise effective parameter-free strategies to discover either overlapping or non-overlapping structure among different types of clusters. Extensive experiments on real-world and synthetic data sets demonstrate that our methods are effective and efficient.
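To make the compression view concrete, the following is a minimal sketch and not the encoding scheme of the paper. It assumes a single binary relation matrix and hard (non-overlapping) cluster labels, and it counts only the bits needed to encode each block with an entropy code; the cost of describing the clustering itself, which a full description-length objective would also include, is omitted. The function name `block_code_length` and the toy matrix `R` are hypothetical. Logarithms are base 2.

```python
import math


def block_code_length(matrix, row_labels, col_labels):
    """Bits needed to entropy-code a binary matrix block by block.

    Each (row cluster, column cluster) pair defines one block. A block with
    n cells and a fraction p of ones costs n * H(p) bits, where H is the
    binary entropy. Model cost is deliberately left out (assumption).
    """
    ones = {}   # number of ones per block
    cells = {}  # number of cells per block
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            cells[key] = cells.get(key, 0) + 1
            ones[key] = ones.get(key, 0) + value

    total = 0.0
    for key, n in cells.items():
        p = ones[key] / n
        if 0.0 < p < 1.0:  # homogeneous blocks cost zero bits
            total -= n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total


# Toy relation matrix with two clean diagonal blocks.
R = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

good = block_code_length(R, [0, 0, 1, 1], [0, 0, 1, 1])  # labels match the blocks
bad = block_code_length(R, [0, 1, 0, 1], [0, 1, 0, 1])   # labels cut across the blocks
print(good, bad)  # 0.0 16.0
```

A clustering that aligns with the structure of the matrix yields homogeneous blocks and therefore a shorter code, which is the signal a greedy search can follow. For a group of relation matrices, one plausible extension under the same assumptions is to sum this cost over all matrices while sharing the labels of each object type.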

Notes

  1. http://www.informatik.uni-trier.de/ley/db/.

  2. All logarithms in this paper are base 2.

  3. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets.

  4. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  5. http://www-users.cs.umn.edu/han/data/.

  6. http://mlg.ucd.ie/datasets.

  7. http://www.cs.uiuc.edu/homes/sun22/data/.

  8. http://arnetminer.org/.

Acknowledgments

Wangqun Lin and Bo Deng are supported by the National Natural Science Foundation of China through Grant 61271252. Philip S. Yu and Yuchen Zhao are supported by NSF through Grant CNS-1115234, a Google Research Award, and the Pinnacle Lab at Singapore Management University.

Author information

Corresponding author

Correspondence to Wangqun Lin.

About this article

Cite this article

Lin, W., Yu, P.S., Zhao, Y. et al. Multi-type clustering in heterogeneous information networks. Knowl Inf Syst 48, 143–178 (2016). https://doi.org/10.1007/s10115-015-0869-9

  • DOI: https://doi.org/10.1007/s10115-015-0869-9
