Clustering for heterogeneous information networks with extended star-structure

Data Mining and Knowledge Discovery

Abstract

Clustering objects in a heterogeneous information network, where objects of different types are linked to each other, is an important problem in heterogeneous information network analysis. Several existing clustering approaches deal with star-structured information networks, which contain different central-attribute relations. In real applications, homogeneous links between central objects may also be available and useful for clustering. In this paper, we propose a new approach called CluEstar for clustering networks with an extended star-structure (E-Star), which extends the classic star-structure by further including central–central relations, i.e., links between objects of the central type. In CluEstar, every object has a ranking with respect to each cluster that reflects its within-cluster representativeness and helps determine the clusters of the objects it is linked to. A novel objective function is proposed for clustering E-Star networks that formulates both central-attribute and central–central links in an efficient way. Results of extensive experimental studies on benchmark data sets show that the proposed approach compares favorably with existing ones for clustering E-Star networks, achieving high quality and good efficiency.

Notes

  1. http://dblp.org/.

  2. http://www.informatik.uni-trier.de/ley/db.

References

  • Abdelsadek Y, Chelghoum K, Herrmann F, Kacem I, Otjacques B (2018) Community extraction and visualization in social networks applied to Twitter. Inf Sci 424:204–223

  • Banerjee A, Dhillon I, Ghosh J, Merugu S, Modha DS (2004) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 509–514

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Chen J, Yuan B (2006) Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics 22:2283–2290

  • Chen Y, Wang L, Dong M (2010) Non-negative matrix factorization for semi-supervised heterogeneous data coclustering. IEEE Trans Knowl Data Eng 22(10):1459–1474

  • Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 89–98

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 269–274

  • Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175

  • Ding CHQ, He X, Zha H, Gu M, Simon HD (2001) A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings of IEEE international conference on data mining, pp 107–114

  • Gao B, Liu T-Y, Zheng X, Cheng Q-S, Ma W-Y (2005) Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 41–50

  • Guo Z, Zhu S, Chi Y, Zhang Z, Gong Y (2009) A latent topic model for linked documents. In: Proceedings of international conference on research and development in information retrieval, pp 720–721

  • Gu Q, Zhou J (2009) Co-clustering on manifolds. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 359–368

  • Hofmann T (1999) Probabilistic latent semantic analysis. In: Conference on uncertainty in artificial intelligence, pp 289–296

  • Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) Hindroid: an intelligent android malware detection system based on structured heterogeneous information network. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 1507–1515

  • Ienco D, Robardet C, Pensa RG, Meo R (2013) Parameter-less co-clustering for star-structured heterogeneous data. Data Min Knowl Discov 26(2):217–254

  • Ji M, Sun Y, Danilevsky M, Han J, Gao J (2010) Graph regularized transductive classification on heterogeneous information networks. In: Proceedings of European conference on machine learning and data mining, pp 570–586

  • Krishnamurthy B, Wang J (2000) On network-aware clustering of web clients. SIGCOMM Comput Commun Rev 30:97–110

  • Kummamuru K, Dhawale A, Krishnapuram R (2003) Fuzzy co-clustering of documents and keywords. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 772–777

  • Lin W, Yu PS, Zhao Y, Deng B (2016) Multi-type clustering in heterogeneous information networks. Knowl Inf Syst 48(1):143–178

  • Long B, Zhang Z, Wu X, Yu PS (2006a) Spectral clustering for multi-type relational data. In: Proceedings of 23rd international conference on machine learning, pp 585–592

  • Long B, Wu X, Zhang Z, Yu PS (2006b) Unsupervised learning on k-partite graphs. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 317–326

  • Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 470–479

  • Long B, Zhang Z, Yu PS (2010) A general framework for relation graph clustering. Knowl Inf Syst 24:393–413

  • McCallum A, Nigam K, Rennie J, Seymore K (2000) Automating the construction of internet portals with machine learning. Inf Retr 3(2):127–163

  • Mei J-P, Chen L (2010) Fuzzy clustering with weighted medoids for relational data. Pattern Recognit 43:1964–1974

  • Mei J-P, Chen L (2011) Fuzzy clustering approach for star-structured multi-type relational data. In: IEEE international conference on fuzzy systems, pp 2500–2506

  • Mei J-P, Chen L (2012) A fuzzy approach for multitype relational data clustering. IEEE Trans Fuzzy Syst 20:358–371

  • Mei Q, Cai D, Zhang D, Zhai CX (2008) Topic modeling with network regularization. In: Proceedings of international world wide web conference, pp 101–110

  • Mei J-P, Kwoh C-K, Yang P, Li X-L, Zheng J (2013) Drug–target interaction prediction by learning from local information and neighbors. Bioinformatics 29(2):238–245

  • Miyamoto S, Umayahara K (1998) Fuzzy clustering by quadratic regularization. In: IEEE international conference on fuzzy systems, pp 1394–1399

  • Pio G, Serafino F, Malerba D, Ceci M (2018) Multi-type clustering and classification from heterogeneous networks. Inf Sci 425:107–126

  • Serafino F, Pio G, Ceci M (2018) Ensemble learning for multi-type classification in heterogeneous networks. IEEE Trans Knowl Data Eng, 1–1. https://doi.org/10.1109/TKDE.2018.2822307

  • Shafiei MM, Milios EE (2006) Latent Dirichlet co-clustering. In: Proceedings of IEEE international conference on data mining, pp 542–551

  • Shi C, Li Y, Zhang J, Sun Y, Yu PS (2017) A survey of heterogeneous information network analysis. IEEE Trans Knowl Data Eng 29:17–37

  • Shi Y, Zhu Q, Guo F, Zhang C, Han J (2018) Easing embedding learning by comprehensive transcription of heterogeneous information networks. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 2190–2199

  • Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

  • Sun Y, Han J, Gao J, Yu Y (2009a) itopicmodel: Information network-integrated topic modeling. In: Proceedings of IEEE international conference on data mining, pp 493–502

  • Sun Y, Yu Y, Han J (2009b) Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 797–806

  • Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 267–273

  • Yamanishi Y, Araki M, Gutteridge A (2008) Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24:i232–i240

  • Zhang D, Wang F, Zhang C, Li T (2008) Multi-view local learning. In: Proceedings of AAAI conference on artificial intelligence, pp 752–757

  • Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: Proceedings of international conference on research and development in information retrieval, pp 487–494

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61502420 and 61772472) and the Zhejiang Provincial Natural Science Foundation (Grant Nos. LY16F020032 and LY17F020020).

Author information

Corresponding author

Correspondence to Jian-Ping Mei.

Additional information

Responsible editor: Aristides Gionis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Below we give the detailed derivation of the update equations for cluster assignment and ranking. With Lagrange multipliers \(\varvec{\gamma }_t\) and \(\{\varvec{\lambda }_h\}_{h\in {\mathcal {H}}}\), the Lagrangian is formed as

$$\begin{aligned} L=&J+\varvec{\gamma }_t^T({\mathbf {U}}_t{\mathbf {1}}-{\mathbf {1}}) +\sum _{h \in {\mathcal {H}}}\varvec{\lambda }_h^T({\mathbf {V}}^T_h{\mathbf {1}}-{\mathbf {1}}) \end{aligned}$$
(24)

According to the constraints in Eq. (2), each central object \(x^t_i\) may be assigned a membership of 0 in some clusters, and in each cluster c, a number of objects of each \({\mathcal {X}}_h\) may have a ranking of 0. Let \({\mathcal {K}}^{t+}_{i}\) denote the set of clusters in which \(x^t_i\) has positive membership, and let \({\mathcal {N}}_c^{h+}\) denote the set of objects of type \(h \in {\mathcal {H}}\) that have positive rankings in cluster c, i.e.,

$$\begin{aligned} {\mathcal {K}}_i^{t+}&=\left\{ f : u^t_{if}>0\right\} \end{aligned}$$
(25)
$$\begin{aligned} {\mathcal {N}}_c^{h+}&=\left\{ l : v^h_{lc}>0\right\} \end{aligned}$$
(26)

and \(|{\mathcal {K}}_i^{t+}|>1\), \(|{\mathcal {N}}_c^{h+}|>1\). Based on the above definitions, we write the summation constraints as

$$\begin{aligned} {\mathbf {e}}^{tT}_i{\mathbf {u}}'^t_i=1 \ \ \forall \ \ i=[n_t] \ \ \text {and} \ \ {\mathbf {e}}'^{hT}_c{\mathbf {v}}^h_c=1 \ \ \forall \ \ h \in {\mathcal {H}}, c=[k] \end{aligned}$$
(27)

where

$$\begin{aligned} {\mathbf {e}}^{tT}_i=[e^t_{i1}, e^t_{i2}, \ldots , e^t_{ik} ], \quad e^t_{if}=\left\{ \begin{aligned}&1 \quad \text {for} \quad f \in {\mathcal {K}}_i^{t+}\\ {}&0 \quad \text {others}\end{aligned} \right. \end{aligned}$$
(28)

and

$$\begin{aligned} {\mathbf {e}}'^{hT}_c=[e'^h_{1c}, e'^h_{2c}, \ldots , e'^h_{n_h c} ], \quad e'^h_{jc}=\left\{ \begin{aligned}1 \quad&\text {for} \quad j \in {\mathcal {N}}_c^{h+}\\ 0 \quad&\text {others}\end{aligned} \right. \end{aligned}$$
(29)

Now we derive \(\mathbf {u'}^t_{i}\), the membership vector of central object \(x^t_i\) over all k clusters. The first-order necessary condition is

$$\begin{aligned} \nabla _{\mathbf {u'}^t_{i}}L={\mathbf {g}}^t_i-\rho _t\mathbf {u'}^t_i+\gamma ^t_i{\mathbf {1}}={\mathbf {0}} \end{aligned}$$
(30)

which gives

$$\begin{aligned} \mathbf {u'}^t_{i}=\frac{1}{\rho _t}({\mathbf {g}}^t_i+\gamma ^t_i{\mathbf {1}}) \end{aligned}$$
(31)

with

$$\begin{aligned} {\mathbf {g}}_i^{t}=[g^t_{i1},g^t_{i2},\ldots , g^t_{ik}]^T=\sum _{h\in {\mathcal {H}}}\beta _{h}{\mathbf {V}}^T_h\mathbf {r'}^{h}_i \end{aligned}$$
(32)

Using the constraint in Eq. (27), \(\gamma _i^t\) can be solved for and substituted back into Eq. (31) to get

$$\begin{aligned} \mathbf {u'}^t_i=\frac{1}{\rho _t}{\mathbf {g}}_i^t +\frac{1}{|{\mathcal {K}}_i^{t+}|} \left( 1-\frac{1}{\rho _t}{\mathbf {e}}^{tT}_i{\mathbf {g}}_i^t\right) {\mathbf {1}} \end{aligned}$$
(33)
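
As a sketch of the omitted step (assuming, as the definition of \({\mathcal {K}}_i^{t+}\) suggests, that Eq. (31) is applied only to the clusters in \({\mathcal {K}}_i^{t+}\) and that the remaining memberships are zero), left-multiplying Eq. (31) by the indicator vector of Eq. (28) and using Eq. (27) gives

$$\begin{aligned} 1={\mathbf {e}}^{tT}_i\mathbf {u'}^t_i=\frac{1}{\rho _t}\left( {\mathbf {e}}^{tT}_i{\mathbf {g}}^t_i+\gamma ^t_i|{\mathcal {K}}_i^{t+}|\right) \quad \Rightarrow \quad \gamma ^t_i=\frac{\rho _t-{\mathbf {e}}^{tT}_i{\mathbf {g}}^t_i}{|{\mathcal {K}}_i^{t+}|} \end{aligned}$$

so that \(\gamma ^t_i/\rho _t\) is the coefficient of \({\mathbf {1}}\) in Eq. (33).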

Each attribute object \(x^p_i\) is assigned to the cluster in which the total ranking of its linked central objects is the highest, i.e.,

$$\begin{aligned} l^p_{i}=\arg \max _{c=[k]} {\mathbf {v}}^t_c\mathbf {r'}^p_i \end{aligned}$$
(34)
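
The assignment rule in Eq. (34) can be illustrated with a short NumPy sketch. The function name, the matrix layout (`V_t` holding the rankings of the central objects with one column per cluster, `R_p` holding the central-attribute link weights with one column per attribute object), and the toy numbers are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def assign_attribute_objects(V_t, R_p):
    """Hard-assign each attribute object to its best cluster, in the spirit of Eq. (34).

    V_t : (n_t, k) array, assumed rankings of central objects in each cluster.
    R_p : (n_t, n_p) array, assumed link weights between central and attribute objects.
    """
    # scores[i, c] = total ranking in cluster c of the central objects linked to object i
    scores = R_p.T @ V_t
    return scores.argmax(axis=1)

# Toy data (hypothetical): 3 central objects, 2 clusters, 2 attribute objects
V_t = np.array([[0.6, 0.1],
                [0.4, 0.2],
                [0.0, 0.7]])
R_p = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
labels = assign_attribute_objects(V_t, R_p)
```

In this toy case the first attribute object is linked to the two central objects that rank highly in cluster 0, and the second is linked only to the central object that ranks highly in cluster 1, so the two objects receive different labels.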

In a similar way, the ranking values of all \(x^h_j \in {\mathcal {X}}_h\) in cluster c can be obtained with the following rule

$$\begin{aligned} {\mathbf {v}}^h_c=\frac{1}{\eta _h}{\mathbf {z}}_c^h +\frac{1}{|{\mathcal {N}}_c^{h+}|} \left( 1-\frac{1}{\eta _h}{\mathbf {e}}'^{hT}_c{\mathbf {z}}_c^h\right) {\mathbf {1}} \end{aligned}$$
(35)

with

$$\begin{aligned} {\mathbf {z}}_c^h=[z^h_{1c},z^h_{2c},\ldots ,z^h_{n_hc}]^T=\left\{ \begin{array}{l} \sum \limits _{\nu \in {\mathcal {H}}}\beta _{\nu }{\mathbf {R}}_{\nu }{\mathbf {u}}_c^\nu \ \ \text {for} \ \ h=t\\ \beta _{h}{\mathbf {R}}_{h}^T{\mathbf {u}}_c^t \ \ \text {for} \ \ h\in {\mathcal {A}} \end{array} \right. \end{aligned}$$
(36)

The first term of Eq. (33) decides the membership distribution of each central object over the k clusters, while the second term is a normalization that ensures the summation constraint is satisfied. Similarly, the first term in Eq. (35) decides the distribution of ranking values among objects in \({\mathcal {X}}_h\) in cluster c, and the second term ensures that the rankings of objects of the same type in a cluster sum to 1.
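
The roles of the two terms can be checked numerically. The following minimal sketch evaluates Eq. (35) for one attribute type and one cluster, using the second case of Eq. (36) and a given positive set. The function name, the argument layout (`R_h` with one row per central object and one column per attribute object, `u_c` as the memberships of central objects in cluster c), and the toy values are assumptions for illustration.

```python
import numpy as np

def ranking_update(R_h, u_c, beta_h, eta_h, active):
    """Evaluate Eq. (35) for an attribute type h on a given positive set.

    R_h    : (n_t, n_h) array, assumed central-attribute relation matrix.
    u_c    : (n_t,) array, assumed memberships of central objects in cluster c.
    active : (n_h,) boolean mask, assumed indicator of the positive set.
    """
    z = beta_h * (R_h.T @ u_c)            # second case of Eq. (36)
    v = np.zeros_like(z)
    n_active = int(active.sum())
    # First term spreads ranking according to z; second term renormalizes to sum 1
    v[active] = z[active] / eta_h + (1.0 - z[active].sum() / eta_h) / n_active
    return v

# Toy data (hypothetical): 2 central objects, 3 attribute objects
R_h = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
u_c = np.array([0.8, 0.2])
active = np.array([True, True, True])
v = ranking_update(R_h, u_c, beta_h=1.0, eta_h=2.0, active=active)
```

Here z = [0.8, 0.2, 1.0], so the first term already sums to 1 and the correction term vanishes; with a different \(\eta _h\) the correction shifts all active entries equally while keeping their sum at 1.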

The remaining problem is to determine \({\mathcal {K}}^{t+}_{i}\) and \({\mathcal {N}}^{h+}_{c}\). Following the discussions in Miyamoto and Umayahara (1998) and Mei and Chen (2010), it can be proved that if \(c \in {\mathcal {K}}^{t+}_i\), then every cluster f with \(g^t_{if}>g^t_{ic}\) also belongs to \({\mathcal {K}}^{t+}_i\), and if \(j \in {\mathcal {N}}^{h+}_c\), then every object l of type h with \(z^h_{lc}>z^h_{jc}\) also belongs to \({\mathcal {N}}^{h+}_c\). Based on this, \({\mathcal {K}}^{t+}_{i}\) and \({\mathcal {N}}^{h+}_{c}\) can be obtained incrementally, in a way similar to Procedure-K and Procedure-N given in Mei and Chen (2010).
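
Procedure-K and Procedure-N are not reproduced in this text. As a rough illustration of how the ordering property can be used, the sketch below applies the standard sort-based rule for this kind of simplex-constrained quadratic problem: clusters are considered in decreasing order of \(g^t_{if}\), and the positive set is the largest prefix for which Eq. (33) keeps every entry positive. The function name and the toy values are assumptions, and the rule may differ in detail from the procedures in Mei and Chen (2010).

```python
import numpy as np

def membership_update(g, rho):
    """Find a positive set by sorting, then evaluate Eq. (33) on it.

    g   : (k,) array, the vector g_i^t for one central object.
    rho : positive scalar, the regularization weight rho_t.
    Returns the membership vector and a boolean mask of the positive set.
    """
    g = np.asarray(g, dtype=float)
    order = np.argsort(-g)                      # cluster indices, largest g first
    scaled = g[order] / rho
    sizes = np.arange(1, g.size + 1)
    # Value the m-th largest entry would take if the top-m clusters were the positive set
    smallest = scaled + (1.0 - np.cumsum(scaled)) / sizes
    m = int(np.nonzero(smallest > 0)[0].max()) + 1
    active = np.zeros(g.size, dtype=bool)
    active[order[:m]] = True
    u = np.zeros_like(g)
    u[active] = g[active] / rho + (1.0 - g[active].sum() / rho) / m
    return u, active

# Toy example (hypothetical values): three clusters
u, active = membership_update([0.9, 0.5, 0.1], rho=1.0)
u_hard, active_hard = membership_update([0.9, 0.5, 0.1], rho=0.2)
```

With rho = 1.0 the third cluster is dropped and the memberships are approximately [0.7, 0.3, 0]; with the smaller rho = 0.2 only the top cluster stays positive, which illustrates how the regularization weight controls the fuzziness of the assignment in this sketch.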

Cite this article

Mei, JP., Lv, H., Yang, L. et al. Clustering for heterogeneous information networks with extended star-structure. Data Min Knowl Disc 33, 1059–1087 (2019). https://doi.org/10.1007/s10618-019-00626-2

  • DOI: https://doi.org/10.1007/s10618-019-00626-2
