Efficient m-closest entity matching over heterogeneous information networks

doi:10.1016/j.knosys.2023.110299

Knowledge-Based Systems

Volume 263, 5 March 2023, 110299

https://doi.org/10.1016/j.knosys.2023.110299

Highlights

• A novel m-closest entity matching problem over HINs is devised.
• Pruning abilities at both the non-spatial and spatial layers are explored.
• Our approach can achieve much better performance than the baseline methods.

Abstract

This work investigates a novel $m$-closest entity ($m$CE) matching problem over geographic heterogeneous information networks (Geo-HINs). Given a Geo-HIN $G$ and $m$ query graphs $\{q_{1}, q_{2}, \dots, q_{m}\}$, $m$CE matching aims to find a group of geographic entities (geo-entities) whose patterns match the query graphs $\{q_{1}, q_{2}, \dots, q_{m}\}$ correspondingly, such that the maximum distance between any geo-entity pair in the group (i.e., the diameter) is minimized. As a fundamental problem, $m$CE matching can be applied in many scenarios, e.g., travel itinerary recommendation and city planning. Existing works have not simultaneously considered the characteristics of pattern matching and spatial search, so they cannot solve our problem, which is computationally expensive. To solve this problem efficiently, we propose a unified framework named Fuzzy-Exact that processes entity matching and spatial search jointly, in which pruning abilities at the non-spatial and spatial layers are cooperatively explored. Two mutually adaptive auxiliary data structures named Arc-Tree and Arc-Forest are devised to maintain the intermediate search results, which are exploited to enhance the search process between the non-spatial and spatial layers. Experimental results demonstrate that our algorithm can outperform the baseline methods by 2 orders of magnitude in runtime.

Introduction

Heterogeneous information networks (HINs) [1] consist of multiple types of entities and links and are used to model diverse data such as social networks, biological networks, and knowledge graphs. With the prevalence of location-based services, many entities in HINs carry location-related links that identify their geographic information. For example, restaurants and hotels (entities) in the Yelp network are usually associated with location attributes that store their geographic coordinates. In social networks such as Twitter, user entities usually contain check-in information, and micro-blogs are labeled with geo-tags. A Geo-HIN is a HIN in which at least one type of entity has a geographic location attribute (such an entity is called a geo-entity). Fig. 1 shows a Geo-HIN example from the Yelp network, which contains eight geo-entities (i.e., $e_{1}, \dots, e_{8}$). Each geo-entity $e_{i}$ links to a node $l_{i}$ denoting the location of $e_{i}$. Here a geo-entity $e_{i}$ can be a restaurant, hotel, park, or any object with location information. Besides, there are several types of non-geographic entities (non-geo-entities) in Fig. 1, such as user, feature, and grade. In this paper, we study the $m$-closest entity ($m$CE) matching problem over Geo-HINs. Below is a motivating example.

Example 1

Fig. 1 illustrates a Geo-HIN of the Yelp network. $u_{1}$ is a Yelp customer who requests a travel itinerary plan for an unfamiliar city. Fig. 2 shows three query graphs $\{q_{1}, q_{2}, q_{3}\}$ that express the requirements of $u_{1}$. Specifically, the query graphs encode the following requests: “find a kids-friendly ($t_{1}$) restaurant $e_{x}$ which is rated with five stars ($g_{1}$) and marked by seafood ($f_{1}$)”, “find a kids-friendly ($t_{1}$) hotel $e_{y}$ which has been booked by $u_{1}$'s friend $u_{2}$ and tagged with swimming pool ($f_{2}$)”, and “find a kids-friendly ($t_{1}$) and indoor ($f_{3}$) scenic spot $e_{z}$ that has been visited by $u_{1}$'s friend $u_{2}$ and is rated with five stars”. Through graph matching, we have $e_{x} = \{e_{2}, e_{5}\}$, $e_{y} = \{e_{3}\}$, and $e_{z} = \{e_{4}\}$. Among the combinations of $e_{x}$, $e_{y}$, and $e_{z}$ (i.e., $\{e_{2}, e_{3}, e_{4}\}$ and $\{e_{5}, e_{3}, e_{4}\}$), $m$CE may recommend $\{e_{5}, e_{3}, e_{4}\}$ to $u_{1}$ because its geo-diameter is the smallest.
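The selection step in Example 1 can be illustrated with a small Python sketch. The coordinates below are hypothetical (the example gives no numeric locations), and planar Euclidean distance stands in for geo-distance; only the candidate sets for $e_{x}$, $e_{y}$, and $e_{z}$ come from the example.

```python
from itertools import combinations, product
from math import dist

# Hypothetical planar coordinates; not taken from Fig. 1.
location = {
    "e2": (9.0, 9.0),
    "e3": (1.0, 2.0),
    "e4": (2.0, 1.0),
    "e5": (2.0, 3.0),
}

# Matches reported in Example 1 for e_x, e_y, and e_z.
matches = [["e2", "e5"], ["e3"], ["e4"]]


def geo_diameter(group):
    """Largest pairwise Euclidean distance within the group."""
    return max(dist(location[a], location[b]) for a, b in combinations(group, 2))


# Enumerate one entity per query graph and keep the group with the smallest diameter.
best_group = min(product(*matches), key=geo_diameter)
print(best_group, geo_diameter(best_group))
```

Under these assumed coordinates the group containing $e_{5}$ is selected, which agrees with the recommendation in the example; with different coordinates the other group could win.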

The $m$-closest entity ($m$CE) matching problem over Geo-HINs can be further formulated as follows: given a Geo-HIN $G$ and $m$ query graphs $\{q_{1}, q_{2}, \dots, q_{m}\}$, where each query graph describes the pattern of a geo-entity, $m$CE matching aims to find a group of geo-entities that match the patterns of the query graphs $\{q_{1}, q_{2}, \dots, q_{m}\}$ correspondingly, such that the diameter of the geo-entity group is minimized. Here the diameter of a group (also called the geo-diameter) is defined as the largest geo-distance between any pair of geo-entities in the group.
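One compact way to write this objective is given below. The notation is an assumption made for illustration and is not taken from the paper: $M(q_{i})$ denotes the set of geo-entities in $G$ that match $q_{i}$, and $\mathrm{dist}$ denotes geo-distance.

$$
(e_{1}^{*}, \dots, e_{m}^{*}) \;=\; \operatorname*{arg\,min}_{(e_{1}, \dots, e_{m}) \,\in\, M(q_{1}) \times \cdots \times M(q_{m})} \;\; \max_{1 \le i < j \le m} \mathrm{dist}(e_{i}, e_{j})
$$

The inner maximum is the geo-diameter of a candidate group, and the outer minimization ranges over all ways of picking one matching geo-entity per query graph.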

Applications. $m$CE matching can be used in many applications. As discussed in Example 1, it supports travel itinerary planning that meets sophisticated user requirements. This function is popular in tourist-oriented apps such as Wanderlog and Ctrip. In city planning, the locations of public facilities should be carefully considered. Given a set of specified public facilities, $m$CE matching can be used to evaluate a placement according to the POI group that spans the smallest range. Similar to travel planning, $m$CE matching can also be used in Internet of Things (IoT) systems and location-based service systems to recommend suitable objects or users. Moreover, when each query graph is as simple as possible, e.g., containing only the geo-entity, the $m$CE matching problem reduces to the $m$CK query problem [2]. Thus, $m$CE matching may extend the applications of the $m$CK query.

Existing Studies. To the best of our knowledge, this is the first work to study $m$-closest entity ($m$CE) matching over Geo-HINs. The most closely related work is the $m$-closest keywords ($m$CK) query [2], an important type of spatial keyword query studied in several existing works (e.g., [2], [3], [4]). However, the $m$CK query is defined over spatial databases and aims to find a group of objects that together cover all query keywords while minimizing the geo-diameter of the group. Our $m$CE matching is a more complex problem than the existing $m$CK query: the $m$CK query solves an $m$-closest problem involving only keywords and locations, whereas $m$CE matching needs to address the complex relations between subgraph matching and locations. For instance, in Example 1, $m$CE will return a POI that contains a kid-friendly swimming pool, while $m$CK can only ensure that the POI contains a swimming pool and is a kid-friendly place. The relationship between the swimming pool and kid-friendly keywords cannot be represented by $m$CK, so the returned swimming pool may be for adults only, which is not what the user expects. More generally, all keywords in an $m$CK query are independent and the relationships between them cannot be explicitly specified, so the query cannot precisely represent the user demand and the answer may deviate far from user expectation. We discuss this limitation empirically in Section 5.2. In contrast to $m$CK, $m$CE matching can provide more semantic information and make query requirements more general. Thus, it is necessary to pursue an effective method that integrates subgraph matching and spatial search for $m$CE matching.
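The swimming-pool case can be made concrete with a toy Python sketch. Every entity, relation name, and edge below is hypothetical and does not reflect the paper's schema or query graphs; the point is only that a bag-of-keywords test and a structural pattern test can disagree on the same POI.

```python
# Toy typed graph as (source, relation, target) triples; all names are hypothetical.
edges = {
    ("hotel_a", "has_facility", "pool_a"),
    ("pool_a", "is_a", "swimming pool"),
    ("pool_a", "tagged", "kids-friendly"),
    ("hotel_b", "has_facility", "pool_b"),
    ("pool_b", "is_a", "swimming pool"),
    ("pool_b", "tagged", "adults-only"),
    ("hotel_b", "has_facility", "playroom_b"),
    ("playroom_b", "tagged", "kids-friendly"),
}


def neighbours(node, relation):
    """Targets reachable from node over one edge with the given relation."""
    return {t for (s, r, t) in edges if s == node and r == relation}


def keywords_of(poi):
    """Flatten all facility labels of a POI into one keyword set (keyword-style view)."""
    words = set()
    for facility in neighbours(poi, "has_facility"):
        words |= neighbours(facility, "is_a") | neighbours(facility, "tagged")
    return words


def keyword_match(poi):
    """Keyword-style test: both keywords appear somewhere on the POI."""
    return {"swimming pool", "kids-friendly"} <= keywords_of(poi)


def pattern_match(poi):
    """Pattern-style test: a single facility is both a swimming pool and kids-friendly."""
    return any(
        "swimming pool" in neighbours(f, "is_a")
        and "kids-friendly" in neighbours(f, "tagged")
        for f in neighbours(poi, "has_facility")
    )


for poi in ("hotel_a", "hotel_b"):
    print(poi, keyword_match(poi), pattern_match(poi))
```

In this toy graph `hotel_b` passes the keyword test even though its pool is adults-only, while the pattern test accepts only `hotel_a`.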

Challenges and Our Solution. The challenges are twofold. The first challenge is that the pattern matching of entities and the $m$-closest search among candidate entities are two relatively independent tasks. To improve query efficiency, a novel cooperative framework should be devised to exploit the intermediate results of these two search processes. However, the related existing studies mainly focus on queries that have both textual and spatial constraints [5], [6], [7], [8], [9], [10]. To the best of our knowledge, this is the first work to study queries that consider both entity pattern matching and spatial constraints. Compared with keywords, entity pattern matching can describe richer semantic information, but it also brings higher query complexity. Thus, it is important but hard to efficiently answer queries that consider both entity pattern matching and spatial constraints.

The second challenge is that subgraph matching and $m$-closest search are both NP-hard [11], [2]. The search space grows exponentially with the number of vertices in the query graphs, and no known algorithm can find the exact answer in polynomial time. In existing studies, the general solution is to return an approximate answer to avoid the large time cost [2], [12], [13], [14], [15]. However, approximate solutions may sacrifice some properties of the result, for example by relaxing the spatial constraint or returning partially matched entities. Moreover, the result obtained by terminating an exact algorithm early on a hard case can itself be considered an approximate solution. As the $m$CE problem is studied here for the first time, investigating an exact algorithm is significant.
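To give a rough sense of scale on the spatial side alone, consider naive enumeration of groups. The symbol $C_{i}$, denoting the set of geo-entities that match $q_{i}$, is an assumption introduced here for illustration. The number of candidate groups, and the number of pairwise distance evaluations needed to compute every diameter, are

$$
\#\text{groups} \;=\; \prod_{i=1}^{m} |C_{i}|, \qquad \#\text{distance evaluations} \;=\; \binom{m}{2} \prod_{i=1}^{m} |C_{i}| .
$$

If every $|C_{i}| = n$, this is $n^{m}$ groups, so the cost of exhaustive enumeration grows exponentially in $m$ even before the cost of subgraph matching is counted.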

To address this new and practical problem, we propose an efficient Fuzzy-Exact framework that processes entity matching and spatial search jointly. The cooperation of pruning strategies at both the non-spatial and spatial layers is explored to improve the efficiency of $m$CE matching. Specifically, we design a fuzzy mechanism that avoids exhaustively checking whether each geo-entity is actually eligible. Moreover, two auxiliary data structures named Arc-Tree and Arc-Forest are introduced to maintain the intermediate search results and thereby facilitate the search process.
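The general idea of avoiding exhaustive eligibility checks can be sketched as a filter-then-verify loop. This is a minimal illustration under stated assumptions and not the paper's algorithm: the function name `fuzzy_exact_search`, the `exact_match` callback, and the enumeration order are all hypothetical, Arc-Tree and Arc-Forest are not modeled, and the sketch still enumerates every group, so it saves only exact-match calls.

```python
from itertools import combinations, product
from math import dist


def fuzzy_exact_search(candidates, location, exact_match):
    """Hypothetical filter-then-verify sketch.

    candidates[i]: entities that passed some cheap filter for query i (a superset of true matches).
    exact_match(i, e): expensive exact check, called lazily and cached.
    Returns (group, diameter, number_of_exact_checks).
    """
    verified = {}

    def is_match(i, e):
        if (i, e) not in verified:
            verified[(i, e)] = exact_match(i, e)
        return verified[(i, e)]

    def diameter(group):
        pairs = combinations(group, 2)
        return max((dist(location[a], location[b]) for a, b in pairs), default=0.0)

    # Visit groups by increasing diameter; the first fully verified group is optimal,
    # so entities that appear only in worse groups are never verified.
    for group in sorted(product(*candidates), key=diameter):
        if all(is_match(i, e) for i, e in enumerate(group)):
            return group, diameter(group), len(verified)
    return None, float("inf"), len(verified)


# Usage with hypothetical data: "b" passes the cheap filter but fails the exact check.
location = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (5.0, 5.0), "d": (0.0, 2.0)}
candidates = [["a", "c"], ["b", "d"]]
calls = []


def exact(i, e):
    calls.append((i, e))
    return e != "b"


group, diam, checks = fuzzy_exact_search(candidates, location, exact)
print(group, diam, checks)
```

In this run the far-away candidate `c` is never passed to the exact check, which is the kind of saving that interleaving spatial pruning with matching is meant to provide.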

Contributions. The principal contributions in this paper are summarized as follows.

• To the best of our knowledge, this is the first attempt to investigate the $m$-closest entity matching problem over geographic heterogeneous information networks. Inspired by real-world applications, $m$CE matching considers both pattern matching and spatial search, which makes the problem more general than $m$CK.
• A novel Fuzzy-Exact framework is proposed, in which the pruning abilities at the non-spatial and spatial layers are cooperatively explored by integrating a fuzzy mechanism. Two mutually adaptive data structures named Arc-Tree and Arc-Forest are introduced to enhance the search process and make the framework more scalable.
• Extensive experimental results demonstrate the efficiency and scalability of the proposed framework on five real-world datasets with up to ten million vertices. Our techniques can achieve improvements of 2 orders of magnitude in runtime compared with the baselines.

Section snippets

Preliminaries

In this section, we first formally present some concepts in Section 2.1. Section 2.2 formally defines the problem of $m$CE matching, and Section 2.3 reviews related work.

Table 1 summarizes the notations frequently used in this paper.

Basic solution

To make the logic of the algorithms clearer, we can view the Geo-HIN $G$ as two layers: (1) the non-spatial layer, a vertex-induced subgraph (Definition 6) over all entities except spatial descriptors, i.e., $G [V_{e} \cup V_{a} / V_{s}]$; and (2) the spatial layer, a vertex-induced subgraph over geo-entities and spatial descriptors, i.e., $G [V_{e} \cup V_{s}]$. Fig. 3 shows the two layers of the Geo-HIN in Fig. 1. After dividing the Geo-HIN into two layers, for a high-level idea, we can match entities at the non-spatial layer for

Fuzzy-exact framework

To address the problems posed by the basic solution, we propose the Fuzzy-Exact framework. It aims to avoid exhaustive exact matching for each geo-entity and to effectively exploit the pruning ability at the spatial layer based on the intermediate results generated by the non-spatial layer. As an overview of the Fuzzy-Exact framework, our main idea is that we initially obtain the geo-entities that are probably exact matched geo-entities at a fraction of the cost, then pruning them at the spatial

Experimental evaluation

In this section, extensive experiments are conducted to verify both the effectiveness and the efficiency of the algorithms. All source code and datasets used in the experiments are available at https://github.com/Morgan279/mCE.

Conclusion

In this paper, we proposed a novel Fuzzy-Exact framework to handle a new $m$-closest entity matching problem over geographic heterogeneous information networks, which is both more practical and more challenging. Here geo-entities and non-geo-entities are processed by a unified framework, where exhaustive exact matching for each geo-entity is avoided through the fuzzy mechanism. At the same time, two adaptive data structures named Arc-Tree and Arc-Forest are introduced to maintain the intermediate search

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China under grant 2018AAA0102502.

References (43)

Sun S. et al., Subgraph matching with effective matching order and indexing, IEEE Trans. Knowl. Data Eng. (2020)
Sun Y. et al., PathSim: Meta path-based top-K similarity search in heterogeneous information networks, Proc. VLDB Endow. (2011)
T. Guo, X. Cao, G. Cong, Efficient algorithms for answering the m-closest keywords query, in: Proceedings of the 2015...
Zhang D. et al., Keyword search in spatial databases: Towards searching by document
Zhang D. et al., Locating mapped resources in web 2.0
Cai Z. et al., Diversified spatial keyword search on RDF data, VLDB J. (2020)
Zhang D. et al., Augmented keyword search on spatial entity databases, VLDB J. (2018)
G. Kalamatianos, G.J. Fakas, N. Mamoulis, Proportionality in spatial keyword search, in: Proceedings of the 2021...
A. Mahmood, W.G. Aref, Query processing techniques for big spatial-keyword data, in: Proceedings of the 2017 ACM...
G. Cong, C.S. Jensen, Querying geo-textual data: Spatial keyword queries and beyond, in: Proceedings of the 2016...
J. Lu, Y. Lu, G. Cong, Reverse spatial and textual k nearest neighbor search, in: Proceedings of the 2011 ACM SIGMOD...
Garey M.R. et al., Computers and Intractability: A Guide to the Theory of NP-Completeness (1979)
He H. et al., Closure-tree: An index structure for graph queries
Tian Y. et al., Tale: A tool for approximate large graph matching
Liu L. et al., G-finder: Approximate attributed subgraph matching
Zhang S. et al., SAPPER: Subgraph indexing and approximate matching in large graphs, Proc. VLDB Endow. (2010)
Choi D. et al., Finding the minimum spatial keyword cover
Y. Fang, K. Wang, X. Lin, W. Zhang, Cohesive subgraph search over big heterogeneous information networks: Applications,...
D. Seyler, P. Chandar, M. Davis, An information retrieval framework for contextual suggestion based on heterogeneous...
S. Fan, C. Shi, X. Wang, Abnormal event detection via heterogeneous information network embedding, in: Proceedings of...
H. Hong, Y. Lin, X. Yang, Z. Li, K. Fu, Z. Wang, X. Qie, J. Ye, Heteta: Heterogeneous information network embedding for...


¹: Wancheng Long and Xiaowen Li are the joint first authors.
