Elsevier

Knowledge-Based Systems

Volume 194, 22 April 2020, 105458
Cross Multi-Type Objects Clustering in Attributed Heterogeneous Information Network

https://doi.org/10.1016/j.knosys.2019.105458

Abstract

Real-world networks usually consist of a large number of interacting, multi-typed components and are commonly referred to as heterogeneous information networks (HIN). An HIN whose nodes are associated with various attributes is defined as an attributed HIN (or AHIN). Clustering is a fundamental task for HIN and AHIN. However, most existing methods focus on nodes of a single type, and very little existing work groups objects of different types into the same cluster. This is largely because object similarities can be either attribute-based or link-based between nodes of the same type, and it is challenging to incorporate both in a unified framework. To bridge this gap, we propose a framework, namely Cross Multi-Type Objects Clustering in Attributed Heterogeneous Information Network, or CMOC-AHIN, to integrate both the attribute information and multi-type node clustering in a principled way. We empirically show the superior performance of CMOC-AHIN on three large-scale, challenging data sets and summarize insights on its performance compared to other state-of-the-art methodologies.

Introduction

In the past decades, homogeneous information networks have attracted much attention, and numerous data mining tasks such as ranking, clustering and classification have been explored. Most contemporary information network analyses make the basic assumption that there is a single type of object or link [1], [2], [3]. However, real systems, such as social interactions, biological networks, and communication networks, usually consist of a large number of interacting, multi-typed components. Such interconnected networks are usually referred to as heterogeneous information networks (HIN) [1]. Compared to the widely studied homogeneous network, an HIN contains richer structural and semantic information, which provides plenty of research opportunities as well as challenges [4], [5], [6], [7]. Furthermore, in some real-world HINs, objects are often associated with various attributes. For example, in a bibliographic network, an author may be associated with attributes like country, organization, and address, and a conference may be associated with attributes like year, topic, and place. An HIN with object attributes is called an attributed HIN, or AHIN for short [6], [8].

Clustering is a fundamental task in data mining. It aims at partitioning a set of data objects (or observations) into a set of clusters, such that objects in the same cluster are similar to each other, yet dissimilar to objects in other clusters. Clustering in HIN has attracted much attention recently, since it gives insight into the structure of the network and may benefit other data mining tasks such as link prediction and ranking [9]. For example, in a bibliographic heterogeneous information network such as DBLP [10], clustering authors reveals the research fields or latent co-author relationships among authors. In a social network such as Facebook, clustering users reveals the social communities or latent interests of users. To facilitate clustering in large complex networks, it has been suggested that the user provide some supplementary information about the data (e.g. pairwise relationships between a few data points), which, when incorporated in the clustering process, could lead to a better data partition [11]. This side-information is usually supplied either by constraining the solution space [12], [13], [14] or by learning a better distance metric in the network [15]. Such clustering is called semi-supervised clustering, and it has been widely studied on real-world data sets.
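
As a minimal sketch of how such pairwise side-information can be represented, the Python snippet below encodes must-link and cannot-link pairs and counts how many of them a given cluster assignment breaks. The names `count_violations`, `must_link`, and `cannot_link`, as well as the toy node ids, are illustrative assumptions and are not taken from the paper.

```python
def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints broken by a cluster assignment.

    labels      : dict mapping node id -> cluster id
    must_link   : pairs of nodes the user says belong together
    cannot_link : pairs of nodes the user says belong apart
    """
    broken_must = sum(1 for a, b in must_link if labels[a] != labels[b])
    broken_cannot = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return broken_must, broken_cannot


# Hypothetical toy assignment of four nodes to two clusters.
labels = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
must_link = [("a1", "a2"), ("a2", "a3")]
cannot_link = [("a1", "a4"), ("a3", "a4")]

violations = count_violations(labels, must_link, cannot_link)
print(violations)  # (broken must-links, broken cannot-links)
```

A constraint-based method would typically penalize or forbid such violations, whereas a metric-learning method would instead use the same pairs to adjust the distance function.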

However, most existing clustering methods in HIN or AHIN target only one node type, namely the target node type. Other node types are only used to help cluster the target node type, which means that no clustering is performed on them. Analyzing the outcomes of these clustering methods reveals only the relationships among nodes of the target type and ignores the whole picture of the HIN. In a real-world HIN, nodes of different types may belong to one particular cluster. For example, in a bibliographic network, authors, papers, and conferences may belong to one cluster that represents one research topic. In a social network, different types of social roles, such as jobs, bank accounts, and social accounts from different social platforms, may belong to one user. Finding such clusters could give us more insight into the relationships between different types of nodes and the latent representations of the clusters. An example of cross multi-type clustering in a bibliographic attributed heterogeneous information network is illustrated in Fig. 1, which contains three types of nodes: author, paper and conference. Traditional clustering methods focus on single-type clustering (the middle subfigure), while our proposed method focuses on multi-type clustering (the right subfigure). Studying the relationships between all kinds of nodes could also iteratively improve the quality of clustering. For example, compared with clustering only authors in the DBLP data set, cross multi-type clustering could show how the conferences and papers are related to each research topic. This makes cross multi-type clustering an interesting task in HIN or AHIN.
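
To make the difference from single-type clustering concrete, the following Python sketch stores a toy bibliographic network with typed nodes and a cluster assignment that covers every node type, then groups the members of each cluster by type. All node ids, the assignment itself, and the helper `group_by_cluster` are hypothetical and serve only as an illustration of the output a cross multi-type method would produce.

```python
from collections import defaultdict

# Hypothetical toy bibliographic network: node id -> node type.
node_type = {
    "author:a1": "author",
    "author:a2": "author",
    "paper:p1": "paper",
    "paper:p2": "paper",
    "conf:c1": "conference",
    "conf:c2": "conference",
}

# Hypothetical cross multi-type assignment: every node, whatever its type,
# receives a cluster id (here, one cluster per research topic).
cluster_of = {
    "author:a1": 0, "paper:p1": 0, "conf:c1": 0,
    "author:a2": 1, "paper:p2": 1, "conf:c2": 1,
}


def group_by_cluster(cluster_of, node_type):
    """Return {cluster id: {node type: [node ids]}}."""
    clusters = defaultdict(lambda: defaultdict(list))
    for node, cluster in cluster_of.items():
        clusters[cluster][node_type[node]].append(node)
    return {c: dict(by_type) for c, by_type in clusters.items()}


clusters = group_by_cluster(cluster_of, node_type)
for cluster, members in sorted(clusters.items()):
    print(cluster, members)
```

A single-type method would populate only the `"author"` entries, so the link between a topic and its papers and conferences would have to be inferred separately.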

Although clustering different types of nodes is important, very few methods have been proposed for this purpose. The main challenges of cross multi-type clustering in HIN or AHIN are as follows:

  • 1.

    All the node types in HIN or AHIN need to be studied together in the same framework and enhance each other in clustering so that the whole HIN can be partitioned into clusters with all types of nodes.

  • 2.

    The similarity measure for clustering should combine both the attribute information and the network structure information. Since a cluster may contain different types of nodes, the measure should also be able to compare nodes of the same type as well as nodes of different types.

  • 3.

    Given the side-information provided by users, label constraints are constructed, and the clustering result should agree with these constraints.

Only a few methods have been proposed so far, and each addresses only some of these challenges. For example, Aggarwal and Sun et al. [8], [16], [17] proposed to integrate the attribute information into clustering analysis on HIN. Deng et al. [18] proposed a joint probabilistic topic model for simultaneously modeling the contents of multi-typed objects of an HIN. However, to the best of our knowledge, there is not much previous work that explicitly investigates both in a unified framework. To bridge this gap, we propose a generic inference framework to integrate both the attribute information and multi-type data clustering in a principled way.

The major contributions of this paper can be summarized as follows:

  • 1.

    We propose a novel framework that groups nodes of different types into clusters in a heterogeneous information network. Similarities between nodes based on node attributes and network topology are learned in a unified framework.

  • 2.

    An efficient EM-style updating algorithm is proposed to learn the cluster assignments as well as the similarity parameters. We provide a time complexity analysis of the proposed method and existing methods.

  • 3.

    We conduct extensive experiments on three real-world datasets to evaluate the effectiveness of the proposed method. We also summarize insights on its performance compared to other state-of-the-art methodologies.

The rest of the paper is organized as follows. In Section 2, we briefly review the related work on clustering in heterogeneous information networks. In Section 3, we introduce the problem definition and the proposed CMOC-AHIN framework. In Section 4, we conduct experiments on two bibliographic networks and a very challenging and sparse real user behavior data set provided by a world-leading e-commerce company. Finally, we conclude the paper in Section 5.

Section snippets

Related work

Most real systems consist of a large number of interacting, multi-typed components [19], such as human social activities, communication and computer systems, and biological networks. In such systems, the interacting components constitute interconnected networks, or information networks. Information network analysis, especially clustering analysis, has gained wide attention from academia as well as industry.

Traditional clustering methods, such as K-Means [20], K-Medoids [21]

The clustering model

In this section, we first provide formal definitions for multi-type object clustering in heterogeneous information networks. We then introduce our proposed CMOC-AHIN model, which combines attribute-based and meta-path-based node similarity. To learn the similarity parameters as well as the clustering results, we further propose an efficient EM-style update algorithm.
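
The snippet above does not show the model's actual similarity formula, so the following Python sketch only illustrates the general idea of mixing an attribute-based and a meta-path-based similarity. It assumes cosine similarity on attribute vectors, PathSim [1] on the symmetric meta path author-paper-author, and a fixed mixing weight `alpha`. The weighted sum, the function names, and the toy matrices are all assumptions for illustration, not the paper's definition, and in the proposed model the similarity parameters are learned rather than fixed.

```python
import numpy as np


def pathsim(adj):
    """PathSim for the symmetric meta path X-Y-X.

    adj is an (n_x, n_y) adjacency matrix; entry (i, j) of the commuting
    matrix m counts the X-Y-X path instances between X nodes i and j.
    """
    m = adj @ adj.T
    diag = np.diag(m)
    denom = diag[:, None] + diag[None, :]
    # s(i, j) = 2 * m[i, j] / (m[i, i] + m[j, j]); 0 where both nodes are isolated.
    return np.divide(2.0 * m, denom, out=np.zeros(m.shape, dtype=float), where=denom > 0)


def cosine_sim(attrs):
    """Pairwise cosine similarity between the rows of an attribute matrix."""
    norms = np.linalg.norm(attrs, axis=1, keepdims=True)
    unit = attrs / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def combined_similarity(adj, attrs, alpha=0.5):
    """Assumed convex combination of attribute-based and link-based similarity."""
    return alpha * cosine_sim(attrs) + (1.0 - alpha) * pathsim(adj)


# Hypothetical toy data: 3 authors, 3 papers, 2 attribute dimensions.
author_paper = np.array([[1, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]], dtype=float)
author_attrs = np.array([[1, 0],
                         [0, 1],
                         [0, 1]], dtype=float)

S = combined_similarity(author_paper, author_attrs, alpha=0.7)
print(np.round(S, 2))
```

In this toy example, authors 0 and 1 share all their papers but have orthogonal attributes, while authors 1 and 2 share attributes but no papers, so the weight `alpha` decides which pair ends up more similar.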

Experimental evaluation

In this section, we empirically show the superior performance of CMOC-AHIN on three challenging data sets compared with other state-of-the-art methodologies. We also test two variations of CMOC-AHIN, one using attribute-based similarity alone and one using link-based similarity alone, and empirically show that the overall similarity proposed in Eq. (5) works. To the best of our knowledge, most existing methods focus on nodes of a single type and there is very limited existing work that groups objects of

Conclusions and future work

In this paper, we introduce a novel and practical model to study the problem of cross multi-type clustering in heterogeneous information network, namely CMOC-AHIN. Given the attributed network information and some semi-supervised constraints, CMOC-AHIN combines node attributes and meta-path information in a constrained way. With an iterative learning process, CMOC-AHIN learns the optimal parameters as well as clustering results. To empirically show the superiority of the mixed edge information

CRediT authorship contribution statement

Sheng Zhou: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Jiajun Bu: Investigation, Writing - review & editing, Supervision, Funding acquisition, Project administration. Zhen Zhang: Software, Validation. Can Wang: Investigation, Data curation, Writing - review & editing, Supervision, Project administration. Lingzhou Ma: Project administration, Funding acquisition. Jianfeng Zhang: Project administration, Funding acquisition.

Acknowledgment

This work is supported by Alibaba-Zhejiang University Joint Institute of Frontier Technologies, National Natural Science Foundation of China (Grant No: U1866602), National Key Research and Development Project (Grant No: 2018AAA0101503, 2019YFB1600700), the National Key R&D Program of China (No. 2018YFC2002603, 2018YFB1403202), Zhejiang Provincial Natural Science Foundation of China (No. LZ13F020001), the National Natural Science Foundation of China (No. 61972349, 61173185, 61173186) and the

References (55)

  • Sun Y. et al. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. Proc. VLDB (2011)
  • Li X. et al. Semi-supervised clustering in attributed heterogeneous information networks
  • Wan C. et al. Classification with active learning and meta-paths in heterogeneous information networks
  • Sun Y. et al. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow. (2012)
  • Sun Y. et al. Rankclus: integrating clustering with ranking for heterogeneous information network analysis
  • Zhou S. et al. Prre: Personalized relation ranking embedding for attributed networks
  • Hennig C. et al. Handbook of Cluster Analysis (2015)
  • Basu S. et al. Active semi-supervision for pairwise constrained clustering
  • R. Bekkerman, M. Sahami, Semi-supervised clustering using combinatorial MRFs, in: ICML-06 Workshop on Learning in...
  • Lange T. et al. Learning with constrained and unlabelled data
  • Basu S. et al. Constrained Clustering: Advances in Algorithms, Theory, and Applications (2008)
  • Aggarwal C. et al. Towards community detection in locally heterogeneous
  • Qi G.-J. et al. On clustering heterogeneous social media objects with outlier links
  • Deng H. et al. Collective topic modeling for heterogeneous networks
  • Han J. Mining heterogeneous information networks by exploring the power of links. Discov. Sci. (2009)
  • K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge, in:...
  • Ng A.Y. et al. On spectral clustering: Analysis and an algorithm

Cited by (17)

    • A coarse-to-fine collective entity linking method for heterogeneous information networks

      2021, Knowledge-Based Systems
      Citation Excerpt :

      For example, in the YAGO network, multiple types of objects, such as person (P), location (L), and organization (O), and multiple types of relationships, such as “lives in”, “works at”, and “is married to”, are connected to form a heterogeneous information network. HINs are of great value in many aspects such as recommendations [3] and object clustering [4]. Despite the existence of a large number of HINs, the information contained in such networks is limited [5].

    • Extracting a core structure from heterogeneous information network using h-subnet and meta-path strength

      2021, Journal of Informetrics
      Citation Excerpt :

      Anil & Singh (2020) studied two bibliometric tasks (co-authorship prediction and author classification based on research area) by quantifying the class imbalance problem in HINs. Zhou et al. (2020) investigated the clustering of different types of nodes by combining attribute- and meta-path-based similarities in HINs. In this study, in contrast to studies that have used meta-paths to denote the relationship between nodes, we extend two forms of meta-paths to represent the relationship between the attribute edges in HINs.

    • A survey about community detection over On-line Social and Heterogeneous Information Networks

      2021, Knowledge-Based Systems
      Citation Excerpt :

      Another community detection method has been designed by Fang et al. [106] through the computation of community cohesiveness based on meta-path concept, that is defined as edges’ sequence between different types of vertices. Some approaches [107,108] have been developed for clustering objects in AHIN. The former, named CMOC-AHIN, aims to cluster multi-type objects by using attribute information and multi-type node clustering while the latter proposed a semi-supervised approach (SCHAIN-IRAM) based on objects’ similarity that considers object attributes and their structural connectedness.

    One or more of the authors of this paper have disclosed potential or pertinent conflicts of interest, which may include receipt of payment, either direct or indirect, institutional support, or association with an entity in the biomedical field which may be perceived to have potential conflict of interest with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105458.