Abstract
Clustering of objects in a heterogeneous information network, where different types of objects are linked to each other, is an important problem in heterogeneous information network analysis. Several existing clustering approaches deal with star-structured information networks with different central-attribute relations. In real applications, homogeneous links between central objects may also be available and useful for clustering. In this paper, we propose a new approach called CluEstar for clustering networks with an extended star-structure (E-Star), which extends the classic star-structure by further including central–central relations, i.e., links between objects of the central type. In CluEstar, every object has a ranking with respect to each cluster, which reflects its within-cluster representativeness and determines the clusters of the objects that it is linked to. A novel objective function is proposed for clustering E-Star networks by formulating both central-attribute and central–central links in an efficient way. Results of extensive experimental studies with benchmark data sets show that the proposed approach compares favorably with existing ones for clustering E-Star networks in terms of both quality and efficiency.
References
Abdelsadek Y, Chelghoum K, Herrmann F, Kacem I, Otjacques B (2018) Community extraction and visualization in social networks applied to twitter. Inf Sci 424:204–223
Banerjee A, Dhillon I, Ghosh J, Merugu S, Modha DS (2004) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 509–514
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Chen J, Yuan B (2006) Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics 22:2283–2290
Chen Y, Wang L, Dong M (2010) Non-negative matrix factorization for semi-supervised heterogeneous data coclustering. IEEE Trans Knowl Data Eng 22(10):1459–1474
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 89–98
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 269–274
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175
Ding CHQ, He X, Zha H, Gu M, Simon HD (2001) A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings of IEEE international conference on data mining, pp 107–114
Gao B, Liu T-Y, Zheng X, Cheng Q-S, Ma W-Y (2005) Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 41–50
Guo Z, Zhu S, Chi Y, Zhang Z, Gong Y (2009) A latent topic model for linked documents. In: Proceedings of international conference on research and development in information retrieval, pp 720–721
Gu Q, Zhou J (2009) Co-clustering on manifolds. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 359–368
Hofmann T (1999) Probabilistic latent semantic analysis. In: Conference on uncertainty in artificial intelligence, pp 289–296
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) Hindroid: an intelligent android malware detection system based on structured heterogeneous information network. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 1507–1515
Ienco D, Robardet C, Pensa RG, Meo R (2013) Parameter-less co-clustering for star-structured heterogeneous data. Data Min Knowl Discov 26(2):217–254
Ji M, Sun Y, Danilevsky M, Han J, Gao J (2010) Graph regularized transductive classification on heterogeneous information networks. In: Proceedings of European conference on machine learning and data mining, pp 570–586
Krishnamurthy B, Wang J (2000) On network-aware clustering of web clients. SIGCOMM Comput Commun Rev 30:97–110
Kummamuru K, Dhawale A, Krishnapuram R (2003) Fuzzy co-clustering of documents and keywords. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 772–777
Lin W, Yu PS, Zhao Y, Deng B (2016) Multi-type clustering in heterogeneous information networks. Knowl Inf Syst 48(1):143–178
Long B, Zhang Z, Wu X, Yu PS (2006a) Spectral clustering for multi-type relational data. In: Proceedings of 23rd international conference on machine learning, pp 585–592
Long B, Wu X, Zhang Z, Yu PS (2006b) Unsupervised learning on k-partite graphs. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 317–326
Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 470–479
Long B, Zhang Z, Yu PS (2010) A general framework for relation graph clustering. Knowl Inf Syst 24:393–413
McCallum A, Nigam K, Rennie J, Seymore K (2000) Automating the construction of internet portals with machine learning. Inf Retr 3(2):127–163
Mei J-P, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43:1964–1974
Mei J-P, Chen L (2011) Fuzzy clustering approach for star-structured multi-type relational data. In: IEEE international conference on fuzzy systems, pp 2500–2506
Mei J-P, Chen L (2012) A fuzzy approach for multitype relational data clustering. IEEE Trans Fuzzy Syst 20:358–371
Mei Q, Cai D, Zhang D, Zhai CX (2008) Topic modeling with network regularization. In: Proceedings of international world wide web conference, pp 101–110
Mei J-P, Kwoh C-K, Yang P, Li X-L, Zheng J (2013) Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics 29(2):238–245
Miyamoto S, Umayahara K (1998) Fuzzy clustering by quadratic regularization. In: IEEE international conference on fuzzy systems, pp 1394–1399
Pio G, Serafino F, Malerba D, Ceci M (2018) Multi-type clustering and classification from heterogeneous networks. Inf Sci 425:107–126
Serafino F, Pio G, Ceci M (2018) Ensemble learning for multi-type classification in heterogeneous networks. IEEE Trans Knowl Data Eng, 1–1. https://doi.org/10.1109/TKDE.2018.2822307
Shafiei MM, Milios EE (2006) Latent Dirichlet co-clustering. In: Proceedings of IEEE international conference on data mining, pp 542–551
Shi C, Li Y, Zhang J, Sun Y, Yu PS (2017) A survey of heterogeneous information network analysis. IEEE Trans Knowl Data Eng 29:17–37
Shi Y, Zhu Q, Guo F, Zhang C, Han J (2018) Easing embedding learning by comprehensive transcription of heterogeneous information networks. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 2190–2199
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Sun Y, Han J, Gao J, Yu Y (2009a) itopicmodel: Information network-integrated topic modeling. In: Proceedings of IEEE international conference on data mining, pp 493–502
Sun Y, Yu Y, Han J (2009b) Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 797–806
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 267–273
Yamanishi Y, Araki M, Gutteridge A (2008) Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24:i232–i240
Zhang D, Wang F, Zhang C, Li T (2008) Multi-view local learning. In: Proceedings of AAAI conference on artificial intelligence, pp 752–757
Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 487–494
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61502420 and 61772472) and the Zhejiang Provincial Natural Science Foundation (Grant Nos. LY16F020032 and LY17F020020).
Additional information
Responsible editor: Aristides Gionis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Below we give the detailed derivation of the update equations for cluster assignment and ranking. With Lagrange multipliers \(\varvec{\gamma }_t\) and \(\{\varvec{\lambda }_h\}_{h\in {\mathcal {H}}}\), the Lagrangian is formed as
According to the constraints in Eq. (2), each central object \(x^t_i\) may be assigned a membership of 0 in some clusters, and in each cluster c, a number of objects of each \({\mathcal {X}}_h\) may have a ranking of 0. We assume that \({\mathcal {K}}^{t+}_{i}\) is the set of clusters where \(x^t_i\) has positive memberships and \({\mathcal {N}}_c^{h+}\) is the set of objects of type \(h \in {\mathcal {H}}\) that have positive rankings in cluster c, i.e.,
and \(|{\mathcal {K}}_i^{t+}|>1\), \(|{\mathcal {N}}_c^{h+}|>1\). Based on the above definitions, we write the summation constraints as
where
and
Now we derive \(\mathbf {u'}^t_{i}\), the membership vector of central object \(x^t_i\) in all the k clusters. According to the first-order necessary condition
which gives
with
According to Eq. (27), \(\gamma _i^t\) can be calculated and substituted back into Eq. (31) to get
For attribute object \(x^p_i\), it is assigned to a cluster where the total ranking of its linked central objects is the highest, i.e.,
In a similar way, the ranking values of all \(x^h_j \in {\mathcal {X}}_h\) in cluster c can be obtained with the following rule
with
The first term of Eq. (33) determines the membership distribution of each central object over the k clusters, while the second term is a normalization that ensures the summation constraint is satisfied. Similarly, the first term in Eq. (35) determines the distribution of ranking values among objects in \({\mathcal {X}}_h\) in cluster c, and the second term ensures that the rankings of objects of the same type in a cluster sum to 1.
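The attribute-object rule stated earlier in this appendix (each \(x^p_i\) goes to the cluster where the total ranking of its linked central objects is highest) can be illustrated with a minimal Python sketch. The function name `assign_attribute_objects`, the dense list-of-lists layout, and the use of link weights to scale the rankings are assumptions made for illustration and are not taken from the paper.

```python
def assign_attribute_objects(links, ranking):
    """Hard-assign each attribute object to one cluster.

    links[i][j]   : weight of the link between attribute object i and
                    central object j, 0 if they are not linked (assumed layout).
    ranking[j][c] : ranking of central object j in cluster c.
    Returns one cluster index per attribute object.
    """
    k = len(ranking[0])
    labels = []
    for row in links:
        # Total (link-weighted) ranking of the linked central objects, per cluster.
        totals = [sum(w * ranking[j][c] for j, w in enumerate(row))
                  for c in range(k)]
        # Cluster with the highest total; ties resolve to the lowest index.
        labels.append(max(range(k), key=totals.__getitem__))
    return labels


if __name__ == "__main__":
    links = [[1, 1, 0],   # attribute object 0 links to central objects 0 and 1
             [0, 0, 2]]   # attribute object 1 links to central object 2
    ranking = [[0.6, 0.1],
               [0.4, 0.2],
               [0.0, 0.7]]
    print(assign_attribute_objects(links, ranking))
```

With unweighted links, every nonzero entry of `links` would simply be 1, and the rule reduces to summing the rankings of the linked central objects.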
The remaining problem is to determine \({\mathcal {K}}^{t+}_{i}\) and \({\mathcal {N}}^{h+}_{c}\). Following the discussions in Miyamoto and Umayahara (1998) and Mei and Chen (2010), it can be proved that if \(c \in {\mathcal {K}}^{t+}_i\), then \(f \in {\mathcal {K}}^{t+}_i\) for every cluster f with \(g^t_{if}>g^t_{ic}\), and if \(j \in {\mathcal {N}}^{h+}_c\), then \(l \in {\mathcal {N}}^{h+}_c\) for every object l with \(z^h_{lc}>z^h_{jc}\). Based on this, \({\mathcal {K}}^{t+}_{i}\) and \({\mathcal {N}}^{h+}_{c}\) can be obtained incrementally, in a way similar to Procedure-K and Procedure-N given in Mei and Chen (2010).
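To make the incremental idea concrete, the following Python sketch solves a generic quadratically regularized problem of the kind discussed in Miyamoto and Umayahara (1998): maximize \(\sum_c u_c g_c - \frac{\beta}{2}\sum_c u_c^2\) subject to \(\sum_c u_c = 1\) and \(u_c \ge 0\). This is an assumed stand-in, not the paper's objective or its Procedure-K; the function name `sparse_simplex_update` and the regularization weight `beta` are hypothetical. It relies only on the property stated above, namely that the positive set consists of the entries with the largest scores, so the set can be grown in descending score order until the next entry would become non-positive.

```python
def sparse_simplex_update(scores, beta):
    """Closed-form maximizer of sum(u*g) - (beta/2)*sum(u**2)
    subject to sum(u) == 1 and u >= 0 (generic sketch, beta > 0)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Visit entries from the largest score to the smallest.
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    running_sum = 0.0   # sum of scores currently in the positive set
    support_size = 0    # size of the positive set
    for rank, c in enumerate(order, start=1):
        candidate_sum = running_sum + scores[c]
        # Value entry c would receive if the positive set grew to `rank` items:
        # a score-driven term plus a shared normalization term.
        value = scores[c] / beta + (1.0 - candidate_sum / beta) / rank
        if value <= 0.0:
            break  # this entry and all lower-scored ones stay at zero
        running_sum, support_size = candidate_sum, rank
    # Shared normalization term that makes the positive entries sum to 1.
    offset = (1.0 - running_sum / beta) / support_size
    u = [0.0] * len(scores)
    for c in order[:support_size]:
        u[c] = scores[c] / beta + offset
    return u


if __name__ == "__main__":
    print(sparse_simplex_update([3.0, 1.0, 0.0], beta=10.0))
    print(sparse_simplex_update([3.0, 1.0, 0.0], beta=1.0))
```

In this sketch a larger `beta` spreads the values over more entries, while a small `beta` concentrates all mass on the highest-scoring entry. The same routine applies to either a membership vector over clusters or a ranking vector over objects of one type, since both carry a sum-to-one constraint.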
Cite this article
Mei, JP., Lv, H., Yang, L. et al. Clustering for heterogeneous information networks with extended star-structure. Data Min Knowl Disc 33, 1059–1087 (2019). https://doi.org/10.1007/s10618-019-00626-2