Clustering for heterogeneous information networks with extended star-structure

Mei, Jian-Ping; Lv, Huajiang; Yang, Lianghuai; Li, Yanjun

doi:10.1007/s10618-019-00626-2

Published: 10 April 2019

Data Mining and Knowledge Discovery, volume 33, pages 1059–1087 (2019)

Jian-Ping Mei ORCID: orcid.org/0000-0003-1678-6215¹,
Huajiang Lv¹,
Lianghuai Yang¹ &
…
Yanjun Li¹

Abstract

Clustering of objects in a heterogeneous information network, where different types of objects are linked to each other, is an important problem in heterogeneous information network analysis. Several existing clustering approaches deal with star-structured information networks with different central-attribute relations. In real applications, homogeneous links between central objects may also be available and useful for clustering. In this paper, we propose a new approach called CluEstar for clustering networks with an extended star-structure (E-Star), which extends the classic star-structure by further including the central–central relation, i.e., links between objects of the central type. In CluEstar, all objects have a ranking with respect to each cluster that reflects their within-cluster representativeness and determines the clusters of the objects that they are linked to. A novel objective function is proposed for clustering of E-Star networks by formulating both central-attribute and central–central links in an efficient way. Results of extensive experimental studies with benchmark data sets show that the proposed approach compares favorably with existing ones for clustering of E-Star networks, in terms of both clustering quality and efficiency.

References

Abdelsadek Y, Chelghoum K, Herrmann F, Kacem I, Otjacques B (2018) Community extraction and visualization in social networks applied to Twitter. Inf Sci 424:204–223
Banerjee A, Dhillon I, Ghosh J, Merugu S, Modha DS (2004) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 509–514
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Chen J, Yuan B (2006) Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics 22:2283–2290
Chen Y, Wang L, Dong M (2010) Non-negative matrix factorization for semi-supervised heterogeneous data coclustering. IEEE Trans Knowl Data Eng 22(10):1459–1474
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 89–98
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 269–274
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175
Ding CHQ, He X, Zha H, Gu M, Simon HD (2001) A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings of IEEE international conference on data mining, pp 107–114
Gao B, Liu T-Y, Zheng X, Cheng Q-S, Ma W-Y (2005) Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 41–50
Guo Z, Zhu S, Chi Y, Zhang Z, Gong Y (2009) A latent topic model for linked documents. In: Proceedings of international conference on research and development in information retrieval, pp 720–721
Gu Q, Zhou J (2009) Co-clustering on manifolds. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 359–368
Hofmann T (1999) Probabilistic latent semantic analysis. In: Conference on uncertainty in artificial intelligence, pp 289–296
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) Hindroid: an intelligent android malware detection system based on structured heterogeneous information network. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 1507–1515
Ienco D, Robardet C, Pensa RG, Meo R (2013) Parameter-less co-clustering for star-structured heterogeneous data. Data Min Knowl Discov 26(2):217–254
Ji M, Sun Y, Danilevsky M, Han J, Gao J (2010) Graph regularized transductive classification on heterogeneous information networks. In: Proceedings of European conference on machine learning and data mining, pp 570–586
Krishnamurthy B, Wang J (2000) On network-aware clustering of web clients. SIGCOMM Comput Commun Rev 30:97–110
Kummamuru K, Dhawale A, Krishnapuram R (2003) Fuzzy co-clustering of documents and keywords. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 772–777
Lin W, Yu PS, Zhao Y, Deng B (2016) Multi-type clustering in heterogeneous information networks. Knowl Inf Syst 48(1):143–178
Long B, Zhang Z, Wu X, Yu PS (2006a) Spectral clustering for multi-type relational data. In: Proceedings of 23rd international conference on machine learning, pp 585–592
Long B, Wu X, Zhang Z, Yu PS (2006b) Unsupervised learning on k-partite graphs. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 317–326
Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 470–479
Long B, Zhang Z, Yu PS (2010) A general framework for relation graph clustering. Knowl Inf Syst 24:393–413
McCallum A, Nigam K, Rennie J, Seymore K (2000) Automating the construction of internet portals with machine learning. Inf Retr 3(2):127–163
Mei J-P, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43:1964–1974
Mei J-P, Chen L (2011) Fuzzy clustering approach for star-structured multi-type relational data. In: IEEE international conference on fuzzy systems, pp 2500–2506
Mei J-P, Chen L (2012) A fuzzy approach for multitype relational data clustering. IEEE Trans Fuzzy Syst 20:358–371
Mei Q, Cai D, Zhang D, Zhai CX (2008) Topic modeling with network regularization. In: Proceedings of international world wide web conference, pp 101–110
Mei J-P, Kwoh C-K, Yang P, Li X-L, Zheng J (2013) Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics 29(2):238–245
Miyamoto S, Umayahara K (1998) Fuzzy clustering by quadratic regularization. In: IEEE international conference on fuzzy systems, pp 1394–1399
Pio G, Serafino F, Malerba D, Ceci M (2018) Multi-type clustering and classification from heterogeneous networks. Inf Sci 425:107–126
Serafino F, Pio G, Ceci M (2018) Ensemble learning for multi-type classification in heterogeneous networks. IEEE Trans Knowl Data Eng, 1–1. https://doi.org/10.1109/TKDE.2018.2822307
Shafiei MM, Milios EE (2006) Latent Dirichlet co-clustering. In: Proceedings of IEEE international conference on data mining, pp 542–551
Shi C, Li Y, Zhang J, Sun Y, Yu PS (2017) A survey of heterogeneous information network analysis. IEEE Trans Knowl Data Eng 29:17–37
Shi Y, Zhu Q, Guo F, Zhang C, Han J (2018) Easing embedding learning by comprehensive transcription of heterogeneous information networks. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 2190–2199
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Sun Y, Han J, Gao J, Yu Y (2009a) itopicmodel: Information network-integrated topic modeling. In: Proceedings of IEEE international conference on data mining, pp 493–502
Sun Y, Yu Y, Han J (2009b) Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 797–806
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 267–273
Yamanishi Y, Araki M, Gutteridge A (2008) Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24:i232–i240
Zhang D, Wang F, Zhang C, Li T (2008) Multi-view local learning. In: Proceedings of AAAI conference on artificial intelligence, pp 752–757
Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 487–494

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61502420 and 61772472) and the Zhejiang Provincial Natural Science Foundation (Grant Nos. LY16F020032 and LY17F020020).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University of Technology, Liuhe Road No. 288, Xihu District, Hangzhou, 310023, China
Jian-Ping Mei, Huajiang Lv, Lianghuai Yang & Yanjun Li

Corresponding author

Correspondence to Jian-Ping Mei.

Additional information

Responsible editor: Aristides Gionis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Below we give the detailed derivation of the update equations for cluster assignment and ranking. With Lagrange multipliers $\varvec{\gamma }_t$ and $\{\varvec{\lambda }_h\}_{h\in {\mathcal {H}}}$, the Lagrangian is formed as

$$\begin{aligned} L=&J+\varvec{\gamma }_t^T({\mathbf {U}}_t{\mathbf {1}}-{\mathbf {1}}) +\sum _{h \in {\mathcal {H}}}\varvec{\lambda }_h^T({\mathbf {V}}^T_h{\mathbf {1}}-{\mathbf {1}}) \end{aligned}$$

(24)

According to the constraints in Eq. (2), each central object $x^t_i$ may be assigned a membership of 0 in some clusters, and in each cluster c, a number of objects of each ${\mathcal {X}}_h$ may have a ranking of 0. We assume that ${\mathcal {K}}^{t+}_{i}$ is the set of clusters where $x^t_i$ has positive memberships and ${\mathcal {N}}_c^{h+}$ is the set of objects of type $h \in {\mathcal {H}}$ that have positive rankings in cluster c, i.e.,

$$\begin{aligned} {\mathcal {K}}_i^{t+}&=\left\{ f : u^t_{if}>0\right\} \end{aligned}$$

(25)

$$\begin{aligned} {\mathcal {N}}_c^{h+}&=\left\{ l : v^h_{lc}>0\right\} \end{aligned}$$

(26)

and $|{\mathcal {K}}_i^{t+}|>1$, $|{\mathcal {N}}_c^{h+}|>1$. Based on the above definitions, we write the summation constraints as

$$\begin{aligned} {\mathbf {e}}^{tT}_i{\mathbf {u}}'^t_i=1 \ \ \forall \ \ i=[n_t] \ \ \text {and} \ \ {\mathbf {e}}'^{hT}_c{\mathbf {v}}^h_c=1 \ \ \forall \ \ h \in {\mathcal {H}}, c=[k] \end{aligned}$$

(27)

where

$$\begin{aligned} {\mathbf {e}}^{tT}_i=[e^t_{i1}, e^t_{i2}, \ldots , e^t_{ik} ], \quad e^t_{if}=\left\{ \begin{aligned}&1 \quad \text {for} \quad f \in {\mathcal {K}}_i^{t+}\\ {}&0 \quad \text {others}\end{aligned} \right. \end{aligned}$$

(28)

and

$$\begin{aligned} {\mathbf {e}}'^{hT}_c=[e'^h_{1c}, e'^h_{2c}, \ldots , e'^h_{n_h c} ], \quad e'^h_{jc}=\left\{ \begin{aligned}1 \quad&\text {for} \quad j \in {\mathcal {N}}_c^{h+}\\ 0 \quad&\text {others}\end{aligned} \right. \end{aligned}$$

(29)

Now we derive $\mathbf {u'}^t_{i}$, the membership vector of central object $x^t_i$ in all the k clusters. The first-order necessary condition is

$$\begin{aligned} \nabla _{\mathbf {u'}^t_{i}}L={\mathbf {g}}^t_i-\rho _t\mathbf {u'}^t_i+\gamma ^t_i{\mathbf {1}}={\mathbf {0}} \end{aligned}$$

(30)

which gives

$$\begin{aligned} \mathbf {u'}^t_{i}=\frac{1}{\rho _t}({\mathbf {g}}^t_i+\gamma ^t_i{\mathbf {1}}) \end{aligned}$$

(31)

with

$$\begin{aligned} {\mathbf {g}}_i^{t}=[g^t_{i1},g^t_{i2},\ldots , g^t_{ik}]^T=\sum _{h\in {\mathcal {H}}}\beta _{h}{\mathbf {V}}^T_h\mathbf {r'}^{h}_i \end{aligned}$$

(32)

According to Eq. (27), $\gamma _i^t$ can be calculated and substituted back into Eq. (31) to get

$$\begin{aligned} \mathbf {u'}^t_i=\frac{1}{\rho _t}{\mathbf {g}}_i^t +\frac{1}{|{\mathcal {K}}_i^{t+}|} \left( 1-\frac{1}{\rho _t}{\mathbf {e}}^{tT}_i{\mathbf {g}}_i^t\right) {\mathbf {1}} \end{aligned}$$

(33)
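
As a minimal numerical sketch of Eq. (33), the update can be written in Python with NumPy. It assumes that the positive-membership set ${\mathcal {K}}_i^{t+}$ is already known and reads the vector ${\mathbf {1}}$ as restricted to that set, so that memberships outside it stay at 0. The function name `membership_update` and the example values are illustrative and are not taken from the paper.

```python
import numpy as np

def membership_update(g, rho, active):
    """Apply Eq. (33) on a known positive-membership set (boolean mask)."""
    g = np.asarray(g, dtype=float)
    active = np.asarray(active, dtype=bool)
    u = np.zeros_like(g)
    size = active.sum()                      # |K_i^{t+}|
    # e^T g only sums the entries of g inside the active set
    shift = (1.0 - g[active].sum() / rho) / size
    u[active] = g[active] / rho + shift      # first term + normalization term
    return u

# Hypothetical scores g_i^t for k = 3 clusters, with the first two clusters active
u = membership_update([0.9, 0.6, 0.1], rho=1.0, active=[True, True, False])
print(u, u.sum())
```

In this example the two active memberships differ by exactly $(g^t_{i1}-g^t_{i2})/\rho _t$, and the shared shift only moves them so that they sum to 1.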

Each attribute object $x^p_i$ is assigned to the cluster in which the total ranking of its linked central objects is the highest, i.e.,

$$\begin{aligned} l^p_{i}=\arg \max _{c=[k]} {\mathbf {v}}^t_c\mathbf {r'}^p_i \end{aligned}$$

(34)
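
The hard assignment of Eq. (34) can be sketched for all attribute objects of one type at once. This sketch assumes that $\mathbf {r'}^p_i$ is the column of an $n_t \times n_p$ central-attribute relation matrix belonging to attribute object $x^p_i$ and that the central rankings are stored as an $n_t \times k$ matrix; both layouts, as well as the names `assign_attribute_objects`, `R_p` and `V_t`, are assumptions made for illustration.

```python
import numpy as np

def assign_attribute_objects(R_p, V_t):
    """Label each attribute object with the cluster of highest linked ranking."""
    # scores[i, c] = (v_c^t)^T r'_i: total ranking in cluster c of the
    # central objects linked to attribute object i
    scores = R_p.T @ V_t
    return scores.argmax(axis=1)

# Hypothetical toy data: 3 central objects, 2 attribute objects, 2 clusters
V_t = np.array([[0.7, 0.0],
                [0.3, 0.1],
                [0.0, 0.9]])
R_p = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
labels = assign_attribute_objects(R_p, V_t)
print(labels)
```

Here the first attribute object is linked to the two central objects that rank highly in cluster 0, and the second is linked only to the central object that ranks highly in cluster 1.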

In a similar way, the ranking values of all $x^h_j \in {\mathcal {X}}_h$ in cluster c can be obtained with the following rule

$$\begin{aligned} {\mathbf {v}}^h_c=\frac{1}{\eta _h}{\mathbf {z}}_c^h +\frac{1}{|{\mathcal {N}}_c^{h+}|} \left( 1-\frac{1}{\eta _h}{\mathbf {e}}'^{hT}_c{\mathbf {z}}_c^h\right) {\mathbf {1}} \end{aligned}$$

(35)

with

$$\begin{aligned} {\mathbf {z}}_c^h=[z^h_{1c},z^h_{2c},\ldots ,z^h_{n_hc}]^T=\left\{ \begin{array}{l} \sum \limits _{\nu \in {\mathcal {H}}}\beta _{\nu }{\mathbf {R}}_{\nu }{\mathbf {u}}_c^\nu \ \ \text {for} \ \ h=t\\ \beta _{h}{\mathbf {R}}_{h}^T{\mathbf {u}}_c^t \ \ \text {for} \ \ h\in {\mathcal {A}} \end{array} \right. \end{aligned}$$

(36)
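
The two cases of Eq. (36) can be sketched as follows. The sketch assumes that ${\mathcal {H}}$ contains the central type $t$ together with the attribute types, that ${\mathbf {R}}_t$ is the $n_t \times n_t$ central–central relation, that each attribute relation ${\mathbf {R}}_h$ is $n_t \times n_h$, and that ${\mathbf {u}}^\nu _c$ is column c of an $n_\nu \times k$ membership matrix. The dictionary layout and the name `ranking_scores` are hypothetical.

```python
import numpy as np

def ranking_scores(h, c, R, U, beta, central="t"):
    """Return z_c^h of Eq. (36) for object type h and cluster c."""
    if h == central:
        # Central type: sum the contributions of every relation in H
        return sum(beta[nu] * (R[nu] @ U[nu][:, c]) for nu in R)
    # Attribute type: propagate central memberships through R_h^T
    return beta[h] * (R[h].T @ U[central][:, c])

# Hypothetical toy network: 2 central objects, 3 attribute objects, 2 clusters
R = {"t": np.array([[0.0, 1.0], [1.0, 0.0]]),
     "a": np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])}
U = {"t": np.array([[0.8, 0.2], [0.3, 0.7]]),
     "a": np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])}
beta = {"t": 0.5, "a": 1.0}

z_a = ranking_scores("a", 0, R, U, beta)   # scores of attribute objects
z_t = ranking_scores("t", 0, R, U, beta)   # scores of central objects
print(z_a, z_t)
```

The resulting score vector is what Eq. (35) rescales by $1/\eta _h$ and shifts so that the rankings of one type within a cluster sum to 1.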

The first term of Eq. (33) decides the membership distribution of each central object over the k clusters, while the second term is a normalization that ensures the summation constraint is satisfied. Similarly, the first term in Eq. (35) decides the distribution of ranking values among the objects of ${\mathcal {X}}_h$ in cluster c, and the second term ensures that the rankings of objects of the same type in a cluster sum to 1.

The remaining problem is to determine ${\mathcal {K}}^{t+}_{i}$ and ${\mathcal {N}}^{h+}_{c}$. According to the discussions in Miyamoto and Umayahara (1998) and Mei and Chen (2010), it can be proved that if $c \in {\mathcal {K}}^{t+}_i$, then every cluster f with $g^t_{if}>g^t_{ic}$ also belongs to ${\mathcal {K}}^{t+}_i$, and if $j \in {\mathcal {N}}^{h+}_c$, then every object l with $z^h_{lc}>z^h_{jc}$ also belongs to ${\mathcal {N}}^{h+}_c$. Based on this, ${\mathcal {K}}^{t+}_{i}$ and ${\mathcal {N}}^{h+}_{c}$ can be obtained in an incremental way, similar to Procedure-K and Procedure-N given in Mei and Chen (2010).
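
One way to realize such an incremental construction is sketched below. It is not a reproduction of Procedure-K from Mei and Chen (2010), which is not given in this text, but a generic greedy scheme built on the ordering property above: clusters are visited from the largest to the smallest $g^t_{if}$, and a cluster is added only while the membership it would receive under Eq. (33) remains positive. The function name and example values are hypothetical, and the sketch does not enforce the assumption $|{\mathcal {K}}_i^{t+}|>1$.

```python
import numpy as np

def positive_membership_set(g, rho):
    """Greedily grow the set of clusters with positive membership."""
    g = np.asarray(g, dtype=float)
    order = np.argsort(-g)              # clusters from largest to smallest g
    active = np.zeros(g.size, dtype=bool)
    total = 0.0                         # sum of g over clusters accepted so far
    for m, f in enumerate(order, start=1):
        # Membership of cluster f under Eq. (33) if the m largest were active
        candidate = g[f] / rho + (1.0 - (total + g[f]) / rho) / m
        if candidate <= 0:
            break                       # all remaining clusters have smaller g
        active[f] = True
        total += g[f]
    return active

sparse = positive_membership_set([0.9, 0.6, 0.1], rho=1.0)
dense = positive_membership_set([0.9, 0.6, 0.1], rho=4.0)
print(sparse, dense)
```

With the smaller value of `rho` the third cluster is excluded, while the larger value keeps all three clusters active, which illustrates how the regularization weight controls the sparsity of the memberships. The same scheme applies to ${\mathcal {N}}^{h+}_{c}$ with $z^h_{lc}$ and $\eta _h$ in place of $g^t_{if}$ and $\rho _t$.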

About this article

Cite this article

Mei, JP., Lv, H., Yang, L. et al. Clustering for heterogeneous information networks with extended star-structure. Data Min Knowl Disc 33, 1059–1087 (2019). https://doi.org/10.1007/s10618-019-00626-2

Received: 11 April 2018
Accepted: 03 April 2019
Published: 10 April 2019
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s10618-019-00626-2
