Cross Multi-Type Objects Clustering in Attributed Heterogeneous Information Network

doi:10.1016/j.knosys.2019.105458

Knowledge-Based Systems

Volume 194, 22 April 2020, 105458

https://doi.org/10.1016/j.knosys.2019.105458

Abstract

Real-world networks usually consist of a large number of interacting, multi-typed components and are usually referred to as heterogeneous information networks (HIN). An HIN whose nodes are associated with various attributes is defined as an attributed HIN (or AHIN). Clustering is a fundamental task for HIN and AHIN. However, most existing methods focus on nodes of a single type, and very little existing work groups objects of different types into the same cluster. This is largely because object similarities between nodes of the same type can be either attribute-based or link-based, and it is challenging to incorporate both in a unified framework. To bridge this gap, we propose a framework, namely Cross Multi-Type Objects Clustering in Attributed Heterogeneous Information Network, or CMOC-AHIN, to integrate both the attribute information and multi-type node clustering in a principled way. We empirically show the superior performance of CMOC-AHIN on three large-scale, challenging data sets and also summarize insights on its performance compared to other state-of-the-art methodologies.

Introduction

In the past decades, homogeneous information networks have attracted much attention, and numerous data mining tasks such as ranking, clustering and classification have been explored. Most contemporary information network analyses make the basic assumption that the type of objects or links is unique [1], [2], [3]. However, real systems, such as social interactions, biological networks, and communication networks, usually consist of a large number of interacting, multi-typed components. Such interconnected networks are usually referred to as heterogeneous information networks (HIN) [1]. Compared to the widely studied homogeneous network, an HIN contains richer structural and semantic information, which provides plenty of research opportunities as well as challenges [4], [5], [6], [7]. Furthermore, in some real-world HINs, objects are often associated with various attributes. For example, in a bibliographic network, an author may be associated with attributes such as country, organization and address, and a conference may be associated with attributes such as year, topic and place. An HIN with object attributes is called an attributed HIN, or AHIN for short [6], [8].
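To make the definition concrete, the following is a minimal sketch of how a bibliographic AHIN could be held in memory. The class name `AHIN`, its methods, the relation names and the attribute values are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AHIN:
    """Toy attributed heterogeneous information network (illustrative only)."""
    node_type: dict = field(default_factory=dict)   # node id -> type name
    node_attrs: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: list = field(default_factory=list)       # (source, relation, target)

    def add_node(self, node_id, type_name, **attrs):
        self.node_type[node_id] = type_name
        self.node_attrs[node_id] = attrs

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def type_counts(self):
        # Number of nodes per node type.
        return Counter(self.node_type.values())

    def is_heterogeneous(self):
        # Heterogeneous means more than one node type or more than one link type.
        relations = {rel for _, rel, _ in self.edges}
        return len(set(self.node_type.values())) > 1 or len(relations) > 1


# Hypothetical bibliographic example: one author, one paper, one conference.
net = AHIN()
net.add_node("a1", "author", country="CN", organization="Org A")
net.add_node("p1", "paper", year=2019)
net.add_node("c1", "conference", topic="data mining")
net.add_edge("a1", "writes", "p1")
net.add_edge("p1", "published_in", "c1")
```

A homogeneous network is the special case in which every node shares one type and every edge shares one relation, so `is_heterogeneous()` would return `False` for it.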

Clustering is a fundamental task in data mining. It aims at partitioning a set of data objects (or observations) into a set of clusters, such that objects in the same cluster are similar to each other, yet dissimilar to objects in other clusters. Clustering in HIN has attracted much attention recently, since it gives insight into the structure of the network and may benefit other data mining tasks such as link prediction and ranking [9]. For example, in a bibliographic heterogeneous information network such as DBLP [10], clustering authors reveals the research fields or latent co-author relationships among authors. In a social network such as Facebook, clustering users reveals the social communities or the latent interests of users. To facilitate clustering in large complex networks, it has been suggested that the user provide some supplementary information about the data (e.g. pairwise relationships between a few data points), which, when incorporated into the clustering process, could lead to a better data partition [11]. This side-information is usually used either to constrain the solution space [12], [13], [14] or to learn a better distance metric in the network [15]. Such clustering is called semi-supervised clustering and has been widely studied on real-world data sets.
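As an illustration of the constraint-based style of side-information, the sketch below turns a handful of user-labelled objects into must-link and cannot-link pairs and counts how many of them a candidate clustering breaks. The function names and the toy labels are assumptions made for this example; the paper's own constraint formulation is not reproduced here.

```python
from itertools import combinations


def build_constraints(seed_labels):
    """Derive must-link / cannot-link pairs from a few user-labelled objects."""
    must_link, cannot_link = set(), set()
    for (u, lu), (v, lv) in combinations(sorted(seed_labels.items()), 2):
        # Same label -> the pair should share a cluster; otherwise it should not.
        (must_link if lu == lv else cannot_link).add((u, v))
    return must_link, cannot_link


def count_violations(assignment, must_link, cannot_link):
    """Count constraint pairs that a cluster assignment fails to respect."""
    broken_ml = sum(assignment[u] != assignment[v] for u, v in must_link)
    broken_cl = sum(assignment[u] == assignment[v] for u, v in cannot_link)
    return broken_ml + broken_cl


# Hypothetical seeds: a1 and a2 share a label, a3 has a different one.
seeds = {"a1": 0, "a2": 0, "a3": 1}
ml, cl = build_constraints(seeds)

# A candidate clustering that separates a1 from a2 and merges a2 with a3.
assignment = {"a1": 0, "a2": 1, "a3": 1, "a4": 0}
violations = count_violations(assignment, ml, cl)
```

A constrained method can either reject assignments with a non-zero violation count or add the count as a penalty term to its objective; which of the two is used is a design choice of the specific algorithm.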

However, most existing clustering methods for HIN or AHIN target only one node type, called the target type. The other node types are used only to help cluster the target type, and no clustering is performed on them. The outcomes of these methods therefore show only the relationships among nodes of the target type and ignore the whole picture of the HIN. In a real-world HIN, nodes of different types may belong to the same cluster. For example, in a bibliographic network, authors, papers and conferences may belong to one cluster that represents one research topic. In a social network, different types of social roles may belong to one user, such as jobs, bank accounts and social accounts from different social platforms. Finding such clusters could give us more insight into the relationships between different types of nodes and the latent representations of the clusters. An example of cross multi-type clustering in a bibliographic attributed heterogeneous information network is illustrated in Fig. 1, which contains three types of nodes: author, paper and conference. Traditional clustering methods focus on single-type clustering (the middle subfigure), while our proposed method focuses on multi-type clustering (the right subfigure). Studying the relationships between all kinds of nodes could also iteratively improve the quality of clustering. For example, compared with clustering only authors in the DBLP data set, cross multi-type clustering could show how the conferences and papers are related to each research topic. This makes cross multi-type clustering an interesting task in HIN or AHIN.

Although clustering different types of nodes is important, very few methods have been proposed for this purpose. The main challenges of cross multi-type clustering in HIN or AHIN are as follows:

1. All the node types in an HIN or AHIN need to be studied together in the same framework and enhance each other during clustering, so that the whole HIN can be partitioned into clusters containing all types of nodes.
2. The similarity measure for clustering should combine both attribute information and network structure information. Since a cluster may contain different types of nodes, the measure should also be able to handle pairs of nodes of the same type as well as pairs of different types.
3. Given side-information from users, label constraints are constructed, and the clustering result should agree with these label constraints.

Only a few methods have been proposed to address these challenges, and each overcomes only part of them. For example, Aggarwal and Sun et al. [8], [16], [17] proposed to integrate attribute information into clustering analysis on HIN. Deng et al. [18] proposed a joint probabilistic topic model for simultaneously modeling the contents of multi-typed objects of an HIN. However, to the best of our knowledge, there is not much previous work that explicitly investigates both in a unified framework. To bridge this gap, we propose a generic inference framework to integrate both the attribute information and multi-type data clustering in a principled way.

The major contributions of this paper can be summarized as follows:

1. We propose a novel framework to cluster different types of nodes into clusters in a heterogeneous information network. Similarities between nodes based on node attributes and on network topology are learned in a unified framework.
2. An efficient EM-style updating algorithm is proposed to learn the cluster assignment as well as the parameters of the similarity. We provide a time complexity analysis of the proposed method and of existing methods.
3. We conduct extensive experiments on three real-world data sets to evaluate the effectiveness of the proposed method. We also summarize insights on its performance compared to other state-of-the-art methodologies.

The rest of the paper is organized as follows. In Section 2, we briefly review the related work on clustering in heterogeneous information networks. In Section 3, we introduce the problem definition and the proposed CMOC-AHIN framework. In Section 4, we conduct experiments on two bibliographic networks and on a very challenging and sparse real user behavior data set provided by a world-leading e-commerce company. Finally, we conclude the paper in Section 5.

Section snippets

Related work

Most real systems consist of a large number of interacting, multi-typed components [19], such as human social activities, communication and computer systems, and biological networks. In such systems, the interacting components constitute interconnected networks, or information networks. Information network analysis, especially clustering analysis, has gained extremely wide attention from academia as well as industry.

Traditional clustering methods, such as K-Means [20], K-medoids [21]

The clustering model

In this section, we first provide formal definitions for multi-type objects clustering in heterogeneous information networks. We then introduce our proposed CMOC-AHIN model, which combines attribute-based and meta-path-based node similarity. To learn the parameters of the similarity as well as the clustering results, we further propose an efficient EM-style update algorithm.
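The idea of mixing attribute-based and meta-path-based similarity can be sketched as follows. This is not the paper's formulation, which is not shown in this preview: the link part uses the PathSim measure of Sun et al. for a symmetric meta path between nodes of one type, the attribute part uses plain cosine similarity, and the fixed mixing weight `alpha` is a hypothetical stand-in for the parameters that CMOC-AHIN learns. All function names and the toy data are assumptions.

```python
import math


def commuting_matrix(adjacency):
    """Count meta-path instances A -> B -> A between every pair of A-type nodes."""
    n = len(adjacency)
    return [[sum(a * b for a, b in zip(adjacency[i], adjacency[j]))
             for j in range(n)] for i in range(n)]


def pathsim(adjacency):
    """PathSim: 2 * M[i][j] / (M[i][i] + M[j][j]) on the commuting matrix M."""
    m = commuting_matrix(adjacency)
    n = len(m)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = m[i][i] + m[j][j]
            sim[i][j] = 2.0 * m[i][j] / denom if denom > 0 else 0.0
    return sim


def cosine(u, v):
    """Cosine similarity of two attribute vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm > 0 else 0.0


def combined_similarity(adjacency, features, alpha=0.5):
    # alpha is a hypothetical fixed weight, not a parameter taken from the paper.
    link = pathsim(adjacency)
    n = len(features)
    return [[alpha * cosine(features[i], features[j]) + (1.0 - alpha) * link[i][j]
             for j in range(n)] for i in range(n)]


# Toy author-paper incidence matrix (3 authors x 3 papers) and author attributes.
author_paper = [[1, 1, 0],
                [1, 0, 0],
                [0, 0, 1]]
author_attrs = [[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]]
similarity = combined_similarity(author_paper, author_attrs, alpha=0.5)
```

In this toy input, authors 0 and 1 share one paper and have identical attributes, so their combined score is high, while author 2 shares neither papers nor attributes with the others. Note that PathSim as used here only compares nodes of the same type; comparing nodes of different types, which the paper targets, needs an additional mechanism that this sketch does not attempt.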

Experimental evaluation

In this section, we empirically show the superior performance of CMOC-AHIN on three challenging data sets compared with other state-of-the-art methodologies. We also test CMOC-AHIN against two variations, one using attribute-based similarity alone and one using link-based similarity alone, and empirically show that the overall similarity proposed in Eq. (5) works. To the best of our knowledge, most existing methods focus on single-type nodes and there is very limited existing work that groups objects of

Conclusions and future work

In this paper, we introduce a novel and practical model to study the problem of cross multi-type clustering in heterogeneous information network, namely CMOC-AHIN. Given the attributed network information and some semi-supervised constraints, CMOC-AHIN combines node attributes and meta-path information in a constrained way. With an iterative learning process, CMOC-AHIN learns the optimal parameters as well as clustering results. To empirically show the superiority of the mixed edge information

CRediT authorship contribution statement

Sheng Zhou: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Jiajun Bu: Investigation, Writing - review & editing, Supervision, Funding acquisition, Project administration. Zhen Zhang: Software, Validation. Can Wang: Investigation, Data curation, Writing - review & editing, Supervision, Project administration. Lingzhou Ma: Project administration, Funding acquisition. Jianfeng Zhang: Project administration, Funding acquisition.

Acknowledgment

This work is supported by Alibaba-Zhejiang University Joint Institute of Frontier Technologies, National Natural Science Foundation of China (Grant No: U1866602), National Key Research and Development Project (Grant No: 2018AAA0101503, 2019YFB1600700), the National Key R&D Program of China (No. 2018YFC2002603, 2018YFB1403202), Zhejiang Provincial Natural Science Foundation of China (No. LZ13F020001), the National Natural Science Foundation of China (No. 61972349, 61173185, 61173186) and the

References (55)

Sun Y. et al., Mining heterogeneous information networks: A structural analysis approach, SIGKDD Explor. Newsl. (2013)
Park H.-S. et al., A simple and fast algorithm for K-medoids clustering, Expert Syst. Appl. (2009)
Keikha M.M. et al., Community aware random walk for network embedding, Knowl.-Based Syst. (2018)
Wang H. et al., A study of graph-based system for multi-view clustering, Knowl.-Based Syst. (2019)
Gui L. et al., Learning representations from heterogeneous network for sentiment classification of product reviews, Knowl.-Based Syst. (2017)
Zhou Y. et al., A semantic-rich similarity measure in heterogeneous information networks, Knowl.-Based Syst. (2018)
Goyal P. et al., Graph embedding techniques, applications, and performance: A survey, Knowl.-Based Syst. (2018)
Lichtenwalter R.N. et al., New perspectives and methods in link prediction
Leroy V. et al., Cold start link prediction
Shi C. et al., Relevance search in heterogeneous networks
Sun Y. et al., PathSim: Meta path-based top-k similarity search in heterogeneous information networks, Proc. VLDB (2011)
Li X. et al., Semi-supervised clustering in attributed heterogeneous information networks
Wan C. et al., Classification with active learning and meta-paths in heterogeneous information networks
Sun Y. et al., Relation strength-aware clustering of heterogeneous information networks with incomplete attributes, Proc. VLDB Endow. (2012)
Sun Y. et al., Rankclus: integrating clustering with ranking for heterogeneous information network analysis
Zhou S. et al., Prre: Personalized relation ranking embedding for attributed networks
Hennig C. et al., Handbook of Cluster Analysis (2015)
Basu S. et al., Active semi-supervision for pairwise constrained clustering
R. Bekkerman, M. Sahami, Semi-supervised clustering using combinatorial MRFs, in: ICML-06 Workshop on Learning in...
Lange T. et al., Learning with constrained and unlabelled data
Basu S. et al., Constrained Clustering: Advances in Algorithms, Theory, and Applications (2008)
Aggarwal C. et al., Towards community detection in locally heterogeneous
Qi G.-J. et al., On clustering heterogeneous social media objects with outlier links
Deng H. et al., Collective topic modeling for heterogeneous networks
Han J., Mining heterogeneous information networks by exploring the power of links, Discov. Sci. (2009)
K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge, in:...
Ng A.Y. et al., On spectral clustering: Analysis and an algorithm


^☆: One or more of the authors of this paper have disclosed potential or pertinent conflicts of interest, which may include receipt of payment, either direct or indirect, institutional support, or association with an entity in the biomedical field which may be perceived to have potential conflict of interest with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105458.
