W-MetaPath2Vec: The topic-driven meta-path-based model for large-scaled content-based heterogeneous information network representation learning
Introduction
Most real-world information networks are heterogeneous: their nodes and relations are of different types. In recent years, heterogeneous information network (HIN) analysis and mining have been thoroughly studied and applied in multiple disciplines. Common HINs, such as the World Wide Web (WWW) and social networks (Facebook, Twitter, etc.), are naturally complex and very large, with billions of nodes and links (Sun & Han, 2012; Sun & Han, 2013; Shi et al., 2017). Similarity search is one of the most important tasks in information network mining. It helps to find the set of nodes that are relevant to a given node. Measuring the similarity between nodes is also the basis of many other data mining tasks, such as clustering, classification and recommendation. The meta-path is a central concept in most HIN mining techniques (Sun & Han, 2012). It is defined as a sequence of relations between node types, and it distinguishes the semantics of the different paths connecting two nodes in a network. Several meta-path-based approaches address primitive HIN mining tasks, such as similarity search (PathSim (Sun et al., 2011), HeteSim (Shi et al., 2014), etc.) and ranking and clustering (RankClus (Sun et al., 2009a; Sun, Yu & Han, 2009b), NetClus (Sun, Yu & Han, 2009), etc.). These approaches have gained notable attention in HIN mining. Recently, researchers have focused intensively on representation learning for the nodes and relationships of information networks. Many algorithms have been proposed, such as Node2Vec (Grover & Leskovec, 2016) and Metapath2Vec (Dong, Chawla, & Swami, 2017). Information network embedding can be applied to many HIN mining tasks, such as node similarity search (Sun et al., 2011; Zhang et al., 2015), clustering, classification (Gupta, Kumar, & Bhasker, 2017) and link prediction.
In short, network embedding techniques transform the nodes and edges of a network into feature vectors in a low-dimensional space. From these feature vectors, similarity-related tasks can be solved easily with off-the-shelf distance measures (Euclidean distance, cosine similarity, etc.). Moreover, network embedding works effectively on large-scale heterogeneous networks with millions of nodes, because the learning model only needs to be constructed once. The embedding model can also be updated through reinforcement learning when the data changes later, which takes less time than re-learning the whole network.
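To make the use of off-the-shelf measures concrete, the following minimal sketch ranks nodes by cosine similarity between their embedding vectors. The node names and the 3-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def top_k_similar(query, embeddings, k=2):
    """Rank every node except `query` by cosine similarity to it."""
    scores = [
        (node, cosine_similarity(embeddings[query], vec))
        for node, vec in embeddings.items()
        if node != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "author_1": [0.9, 0.1, 0.0],
    "author_2": [0.8, 0.2, 0.1],
    "author_3": [0.1, 0.9, 0.2],
    "author_4": [0.0, 0.8, 0.3],
}

ranked = top_k_similar("author_1", embeddings, k=2)
print(ranked)
```

Because the embedding is computed once, a query like this only costs one pass over the stored vectors; no path enumeration over the network is needed at query time.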
However, both embedding-based and non-embedding-based similarity measures still face several challenges, such as thoroughly evaluating the topic similarity of text-based nodes, for example “paper” nodes in bibliographic networks (DBLP, DBIS, etc.) or “comment” and “post” nodes in social networks (Facebook, Twitter, etc.). Discovering the topics of nodes in a content-based heterogeneous information network (content-based HIN) is an important task. Topic evaluation over a network's nodes is widely applied, for instance in building news or friend recommendation systems based on users’ interactions on social networks. Typical works discover users’ topics of interest by analyzing their associated nodes, such as “comments” and “viewed tweets” (Michelson and Macskassy, 2010; Xu et al., 2011). Thoroughly evaluating the topics of nodes in the network can improve the accuracy of the similarity search task. For example, in the DBLP network, it is more accurate to cluster “Jiawei Han” with other authors who work on “data mining”. Likewise, recommending possible new co-authors for “Christopher Manning” from among authors who are interested in “natural language processing” and “information retrieval” is more meaningful than recommending authors who do not mainly focus on these two fields. Fig. 1-A illustrates a common case study in DBLP: finding the top-k similar authors with the meta-path A-P-V-P-A. This meta-path captures the relationship between two authors who usually submit their work to the same set of venues/conferences. However, a venue usually has multiple tracks, and each track covers a different topic or sub-topic. It is therefore unfair for “Author 1”, who mainly works on “data mining”, to receive the same similarity score with “Author 3” and “Author 4”, who mainly work on “text processing/NLP”. “Author 1” should be more similar to “Author 2”, who shares the interest in “data mining”, than to the other two authors.
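The limitation described here can be reproduced with a small sketch of the standard PathSim score (Sun et al., 2011), which for a symmetric meta-path is 2·|paths x→y| / (|paths x→x| + |paths y→y|). The author-venue paper counts below are hypothetical, and the helper names are chosen for this example only.

```python
def path_count(author_venue, x, y):
    """Number of A-P-V-P-A path instances between authors x and y.

    Each instance pairs one paper of x with one paper of y at a shared venue,
    so the count is the sum over shared venues of the product of paper counts.
    """
    shared = set(author_venue[x]) & set(author_venue[y])
    return sum(author_venue[x][v] * author_venue[y][v] for v in shared)


def pathsim(author_venue, x, y):
    """PathSim score of x and y under the A-P-V-P-A meta-path."""
    denom = path_count(author_venue, x, x) + path_count(author_venue, y, y)
    if denom == 0:
        return 0.0
    return 2.0 * path_count(author_venue, x, y) / denom


# Hypothetical paper counts per venue. Suppose authors 1 and 2 work on data
# mining and author 3 on NLP: that information is invisible to the links.
author_venue = {
    "author_1": {"venue_a": 2, "venue_b": 1},
    "author_2": {"venue_a": 2, "venue_b": 1},
    "author_3": {"venue_a": 2, "venue_b": 1},
    "author_4": {"venue_a": 1},
}

for other in ("author_2", "author_3", "author_4"):
    print(other, pathsim(author_venue, "author_1", other))
```

In this toy data, "author_2" and "author_3" obtain the same score against "author_1" because they publish at the same venues with the same frequency, whatever their topics are. A purely link-based measure cannot separate them.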
Combining topic attributes with the relationships between nodes in the similarity measure can improve the quality of the output. In addition, taking topic similarity into account helps to tackle the problem of less-linked nodes. Most similarity measures, for both homogeneous and heterogeneous networks, are link-based approaches: they rely mostly on the links between nodes to analyze similarity. In other words, the more two nodes are connected, the more relevant they are considered to be to each other. Depending mostly on the relationships between nodes makes these approaches fail on less-linked nodes that are in fact very similar in other aspects which are not clearly present as network relationships. For example, two scientists may both be interested in “database/data mining” but rarely submit their work to the same venues/journals, as illustrated in Fig. 1-B.
Most previous network representation learning models, such as Node2Vec or Metapath2Vec, work on unweighted networks, which means that every path between two nodes is a binary relation (1 if the relation exists and 0 otherwise). The walker needs to travel all the paths between two nodes in order to calculate the transitional probability (π). These transitional probabilities are then used to rank how similar the destination nodes are to the given source node. In some cases, traveling through all existing paths between two nodes is not efficient, especially in very large networks (Vahedian, Burke, & Mobasher, 2017). Between two nodes, some relations are important whereas others are not. For example, in DBLP (as illustrated in Fig. 2), “Author 1” is the most active researcher and contributes mostly to the “data mining” field (3 papers), but sometimes also to the “big data” field (1 paper). Clearly, the paths that connect “Author 1” to “Author 2” and “Author 3” are more important than the paths that connect to “Author 4” and “Author 5”. Even though all these paths are weighted as 1 (binary relations), semantically the relationships between “Author 1”, “Author 2” and “Author 3” are stronger than the others.
Identifying which paths between two nodes are more important than the others is critical for reducing the effort of calculating transitional probabilities. To evaluate the importance of paths, we need a mechanism that assigns a weight to each path. Then, only the paths whose weights satisfy a threshold (σ) are selected for analysis. In network representation learning with a fixed walk length (l), we only need to examine around l of the most important paths of each node to generate its neighborhood. Examining all possible paths connecting two nodes is not really necessary in this case; only the important paths should be taken into consideration. By limiting the number of paths that need to be evaluated, we can improve the performance of the overall random walk process.
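The thresholding idea can be sketched as follows. This is an illustration under assumptions rather than the model's actual formula: the path weights are hypothetical topic-similarity scores, the comparison is assumed to be "weight ≥ σ", and the surviving weights are simply normalized to obtain π.

```python
def transition_probabilities(weights, sigma):
    """Keep neighbours whose path weight is at least sigma, then normalize.

    `weights` maps a candidate neighbour to a path weight (assumed here to be
    a topic-similarity score in [0, 1]). Returns a dict of probabilities that
    sum to 1, or an empty dict if no path passes the threshold.
    """
    kept = {node: w for node, w in weights.items() if w >= sigma}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {node: w / total for node, w in kept.items()}


# Hypothetical weights of the paths leaving "Author 1".
weights = {"author_2": 0.9, "author_3": 0.6, "author_4": 0.2, "author_5": 0.1}

pi = transition_probabilities(weights, sigma=0.5)
print(pi)
```

With σ = 0.5, only two of the four candidate paths remain, so the walker evaluates half as many transitions. Choosing σ too high would leave some nodes with no outgoing path, which is why the empty case is handled explicitly.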
Last but not least, most existing real-world information networks, such as Facebook and the WWW, are very large, with up to billions of nodes. Most traditional network analysis approaches are designed to work in a standalone environment. It is hard or even impossible to handle big networked data resources such as Facebook or Twitter with a single computer. The massive size of these networks is beyond the capabilities of current heterogeneous network mining approaches. Therefore, we need a new solution to the challenge of large-scale networks. One of the most common approaches to big networked data processing is a distributed computing framework, such as Apache Hadoop or Apache Spark. Apache Spark is considered one of the best choices for massive data handling, thanks to its graph-parallel processing libraries, such as GraphX and GraphFrames. The GraphFrames framework effectively supports common graph analysis tasks, such as path finding and node traversal (BFS, DFS), on large-scale networks.
Our work in this paper focuses on heterogeneous network representation learning problems and introduces the novel W-MetaPath2Vec model. W-MetaPath2Vec is a topic-driven model that aims to capture the distinctive features of nodes in a heterogeneous network following predefined meta-path(s). The topic similarities are obtained by evaluating the text-based nodes that are associated with the investigated nodes along the defined meta-paths. Examples are the “paper” nodes between “author” nodes with the meta-paths A-P-A (author-paper-author) and A-P-V-P-A (author-paper-venue-paper-author), or the “comment” nodes between “user” nodes with the meta-path U-C-P-C-U (user-comment-post-comment-user). This topic-driven meta-path similarity measure was introduced in our previous work as the W-PathSim model (Pham et al., 2018). Fig. 3 shows the relationship between our previous study (the W-PathSim model) and the currently proposed model. Through experiments on real-world DBLP bibliographic networks, we have shown that W-PathSim outperforms the traditional PathSim model. W-PathSim improves meta-path-based similarity measurement by combining it with the topic similarity between nodes in content-based HINs. Building on these achievements, in this paper we introduce the W-MetaPath2Vec model, which extends our previous work to topic-driven heterogeneous network representation learning.
This extended topic-driven skip-gram model guides the process of extracting the nodes’ features, and the extracted features are then used to train the learning model. W-MetaPath2Vec is implemented in the Apache Spark-based GraphFrames distributed environment, which enables it to handle large-scale heterogeneous networks. The ultimate goal of W-MetaPath2Vec is to improve the node embedding process through both link-based and topic-based evaluation in content-based heterogeneous networks. Our contributions in this paper are fivefold:
- First, we introduce the application of the LDA topic model to discover the topic distributions of content-based nodes in a content-based network. These topic distributions are then used to evaluate the topic similarity between nodes along the defined meta-paths.
- Second, we propose a topic-driven meta-path-based random walk mechanism for generating the neighborhood of a specific node. These neighborhood nodes are used to train the network learning model. In the proposed mechanism, the walker's moves to neighbor nodes are restricted not only by the defined meta-path(s) but also by the level of topic similarity of the associated nodes. Evaluating topic similarity while conducting the walk distinguishes W-MetaPath2Vec from previous approaches.
- Third, in the proposed W-MetaPath2Vec model, the walker is guided to travel along the most important paths only. These paths are selected based on their topic-similarity weights. Only the paths whose weights satisfy the threshold (σ) are used to calculate the transitional probability (π). By limiting the number of paths to examine through the threshold (σ), W-MetaPath2Vec reduces the running time of the node embedding process without affecting the output accuracy.
- Next, we implement W-MetaPath2Vec on the Apache Spark-based GraphFrames distributed graph computing framework in order to improve the performance of the proposed model on large-scale networks.
- Finally, we present experimental studies comparing the proposed W-MetaPath2Vec model with other state-of-the-art algorithms, including DeepWalk, Node2Vec, LINE and MetaPath2Vec, on the real-world DBLP/DBIS datasets. The experimental results show that W-MetaPath2Vec improves the quality of heterogeneous network representation learning and scales to large networks with millions of nodes.
The overall process of the proposed W-MetaPath2Vec model is illustrated in Fig. 4. Given a content-based HIN, the LDA topic model is applied to extract topic distributions from the text-based nodes, such as papers in DBLP networks. The topic distributions of the text-based nodes are then used in calculating the transitional probability (π) between nodes along the defined meta-path. This is called the topic-driven meta-path random walk mechanism. Finally, we apply the heterogeneous skip-gram architecture of Metapath2Vec to train the model. The network's nodes, embedded as n-dimensional vectors, can be used to solve multiple network analysis tasks, such as node similarity search, clustering, classification and link prediction. The rest of the paper is organized in four main sections. In the second section, we discuss previous work and preliminaries. In the third section, we formally describe the background concepts, methodology and implementation of the proposed W-MetaPath2Vec model. In the fourth section, we present the experimental studies on W-MetaPath2Vec, with detailed information about the datasets, testing scenarios, methods and evaluation metrics, and we discuss the results. The final section contains our conclusions about the W-MetaPath2Vec approach and our future improvements. (Fig. 5, Fig. 6, Fig. 7)
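The random walk stage of this pipeline can be sketched as below. It is a simplified illustration, not the authors' Spark/GraphFrames implementation: the toy graph, the node identifiers, the function name `meta_path_walk` and the edge weights (placeholders for topic-driven scores) are all assumptions made for this example.

```python
import random


def meta_path_walk(graph, node_type, start, meta_path, walk_length, rng):
    """Generate one walk whose node types follow `meta_path` cyclically.

    `meta_path` lists the types of one cycle without repeating the first type
    at the end, e.g. ["A", "P"] for A-P-A or ["A", "P", "V", "P"] for
    A-P-V-P-A. `graph[u][v]` is the (placeholder) weight of edge u-v.
    """
    if node_type[start] != meta_path[0]:
        raise ValueError("start node must match the first meta-path type")
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        step += 1
        wanted = meta_path[step % len(meta_path)]
        # Only neighbours of the type required by the meta-path are eligible.
        candidates = [
            (n, w) for n, w in graph[walk[-1]].items() if node_type[n] == wanted
        ]
        if not candidates:
            break  # dead end: no neighbour of the required type
        nodes, weights = zip(*candidates)
        # Weighted choice: heavier edges are followed more often.
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy network: three authors (A) and two papers (P).
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}
graph = {
    "a1": {"p1": 1.0, "p2": 1.0},
    "a2": {"p1": 1.0},
    "a3": {"p2": 1.0},
    "p1": {"a1": 1.0, "a2": 1.0},
    "p2": {"a1": 1.0, "a3": 1.0},
}

walk = meta_path_walk(graph, node_type, "a1", ["A", "P"], 7, random.Random(0))
print(walk)
```

Walks produced this way play the role of sentences for a skip-gram model: nodes that co-occur within a window of the same walk are pushed towards similar embedding vectors.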
Section snippets
Heterogeneous information network analysis
Data are naturally interconnected, and these interconnections form what are called information networks. Interactions between data nodes are a critical paradigm of modern information infrastructure and mining (Sun & Han, 2012). Heterogeneous information networks are becoming prevalent and are widely applied in several real-world applications.
In the past, most information network mining techniques were homogeneous-based approaches. In a homogeneous network, all nodes and links are considered to be of the same type.
Methodology and implementation
In this section, we introduce the three main parts of our W-MetaPath2Vec model:
- The application of the LDA model to discover topic distributions from the given content-based heterogeneous information networks.
- Next, we present the topic-driven meta-path-based random walk mechanism, which is used to extract the neighborhood nodes of a given source node. These extracted neighborhoods serve as the learning features that are fed to the network learning model.
- Finally, we
Experiment and discussions
In this section, we conduct thorough empirical studies in order to demonstrate the effectiveness of the W-MetaPath2Vec model. The section is divided into two main parts:
- In the first part, we compare the accuracy of the W-MetaPath2Vec model with previous network embedding models on network analysis tasks including node similarity search, clustering and classification.
- In the second part, we perform the experiment on the scalability of W-MetaPath2Vec with Metapath2Vec model in the
Conclusion and future works
In this paper, we formally present studies related to the problems of heterogeneous information network representation learning. Challenges remain in thoroughly evaluating the topics of text-based nodes in content-based HINs. Moreover, in the era of big data, it is necessary to develop network analysis models that are capable of handling large-scale networks. To address these challenges, our work in this paper focuses on developing the W-MetaPath2Vec model.
Acknowledgement
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCMC) under the grant number B2017-26-02.
References (35)
- et al. HeteClass: A Meta-path based framework for transductive classification of objects in heterogeneous information networks. Expert Systems with Applications (2017)
- et al. Mining heterogeneous information networks: A structural analysis approach
- et al. Top-k similarity search in heterogeneous information networks with x-star network schema. Expert Systems with Applications (2015)
- et al. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013)
- Probabilistic topic models. Communications of the ACM (2012)
- et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
- et al. PME: Projected metric embedding on heterogeneous networks for link prediction
- et al. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems (2016)
- Indexing by latent semantic analysis. Journal of the American Society for Information Science
- metapath2vec: Scalable representation learning for heterogeneous networks
- node2vec: Scalable feature learning for networks
- Probabilistic latent semantic analysis
- Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS)
- SimRank: A measure of structural-context similarity