Efficient m-closest entity matching over heterogeneous information networks
Introduction
Heterogeneous information networks (HINs) [1] consist of multiple types of entities and links, and are used to model a variety of data such as social networks, biological networks, and knowledge graphs. With the prevalence of location-based services, many entities in HINs carry location-related links that identify their geographic information. For example, restaurants and hotels (entities) in the Yelp network are usually associated with location attributes that store their geographic coordinates. In social networks such as Twitter, user entities usually contain check-in information, and micro-blogs are labeled with geo-tags. A Geo-HIN is a HIN in which at least one type of entity has a geographic location attribute (such an entity is called a geo-entity). Fig. 1 shows a Geo-HIN example of the Yelp network, which consists of eight geo-entities (i.e., ). Each geo-entity links to node denoting the location of . Here a geo-entity can be a restaurant, hotel, park, or any object with location information. In addition, there are several types of non-geographic entities (non-geo-entities) in Fig. 1, such as user, feature, and grade. In this paper, we study the m-closest entity (mCE) matching problem over Geo-HINs. Below is a motivating example.
Example 1 Fig. 1 illustrates a Geo-HIN of the Yelp network. is a customer of Yelp who requests travel itinerary planning for an unfamiliar city. Fig. 2 shows three query graphs that express the requirements of . Specifically, the query graphs express the following requirements: “find a kid-friendly () restaurant which is rated with five stars () and marked by seafood ()”, “find a kid-friendly () hotel which has been booked by 's friend and tagged with swimming pool ()”, and “find a kid-friendly () and indoor () scenic spot that has been visited by 's friend and is rated with five stars”. Through graph matching, we have , , and . Among the combination results of , , and (i.e., , ), mCE may recommend to as the geo-diameter is minimized.
The m-closest entity (mCE) matching problem over Geo-HINs can be formulated as follows: given a Geo-HIN and m query graphs, each describing the pattern of a geo-entity, mCE matching aims to find a group of geo-entities, one matching each query graph, such that the diameter of the group is minimized. Here the diameter of a group (also called the geo-diameter) is defined as the largest geo-distance between any pair of geo-entities in the group.
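The formulation above can be made concrete with a small brute-force sketch. It assumes the pattern-matching step has already produced one candidate list per query graph, that each candidate is a hypothetical `(name, (lat, lon))` pair, and that geo-distance is the great-circle (haversine) distance; the source does not specify the distance function, and none of the function names below come from the paper. The sketch enumerates every combination, so it only illustrates the objective and is not an efficient method.

```python
from itertools import product
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees.

    Assumption: the paper's geo-distance is not specified; haversine is a stand-in.
    """
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def geo_diameter(group, dist=haversine_km):
    """Largest pairwise distance within a list of locations (0.0 for fewer than two)."""
    return max(
        (dist(p, q) for i, p in enumerate(group) for q in group[i + 1:]),
        default=0.0,
    )


def brute_force_closest_group(candidate_sets, dist=haversine_km):
    """Pick one candidate per query graph so that the geo-diameter is minimal.

    candidate_sets: one list per query graph, each holding (name, (lat, lon)) pairs.
    Enumerates all combinations, so the cost is exponential in the number of sets.
    """
    best_group, best_diameter = None, float("inf")
    for combo in product(*candidate_sets):
        diameter = geo_diameter([location for _, location in combo], dist)
        if diameter < best_diameter:
            best_group, best_diameter = combo, diameter
    return best_group, best_diameter
```

Calling `brute_force_closest_group([restaurants, hotels, spots])` with three candidate lists returns the chosen triple together with its geo-diameter in kilometres. Passing a different `dist` function swaps the distance metric without changing the search.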
Applications. mCE matching can be used in many applications. As discussed in Example 1, it supports travel itinerary planning that meets sophisticated user requirements. This function is popular in tourist-oriented apps such as Wanderlog and Ctrip. In city planning, the locations of public facilities should be carefully considered: given the specified public facilities, mCE matching can be used to evaluate a layout according to the POI group within the smallest range. Similarly, mCE matching can be used in Internet of Things (IoT) systems and location-based service systems to recommend suitable objects or users. Moreover, when each query graph is simplified as far as possible, e.g., to contain only a geo-entity, the mCE matching problem is equivalent to the mCK query problem [2]. Thus, mCE matching may extend the applications of the mCK query.
Existing Studies. To the best of our knowledge, this is the first work to study m-closest entity (mCE) matching over Geo-HINs. The most closely related work is the m-closest keywords (mCK) query [2], an important type of spatial keyword query studied in several existing works (e.g., [2], [3], [4]). However, the mCK query is defined over spatial databases and aims to find a group of objects that covers all query keywords while minimizing the geo-diameter of the group. mCE matching is a more complex problem than the mCK query: the mCK query involves only keywords and locations, whereas mCE matching must handle the complex relationship between subgraph matching and locations. For instance, in Example 1, mCE will return a POI that contains a kid-friendly swimming pool, while mCK can only ensure that the POI contains a swimming pool and is a kid-friendly place. Because mCK cannot represent the relationship between the keywords “swimming pool” and “kid-friendly”, the returned swimming pool may be for adults only, which is not what the user expects. More generally, all keywords in an mCK query are independent and the relationships between them cannot be explicitly specified, so the query cannot precisely represent the user's demand and the answer may deviate far from the user's expectation. We discuss this limitation empirically in Section 5.2. In contrast to mCK, mCE matching provides more semantic information and makes query requirements more general. Thus, it is necessary to pursue a method that integrates subgraph matching and spatial search for mCE matching.
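The swimming-pool example can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the edge-list encoding, the relation names `has`, `is_a`, and `tagged`, and the two hotels are hypothetical and do not come from the paper's datasets. It also simplifies the keyword side to a single POI with a flat bag of keywords, rather than a group of objects.

```python
# Hypothetical toy graph as (source, relation, target) edges.
EDGES = [
    ("hotel_a", "has", "pool_a"),
    ("pool_a", "is_a", "swimming pool"),
    ("pool_a", "tagged", "adults only"),
    ("hotel_a", "tagged", "kid-friendly"),  # the hotel is kid-friendly, its pool is not
    ("hotel_b", "has", "pool_b"),
    ("pool_b", "is_a", "swimming pool"),
    ("pool_b", "tagged", "kid-friendly"),
]


def neighbours(node, relation):
    """Targets reachable from `node` over edges labelled `relation`."""
    return {t for s, r, t in EDGES if s == node and r == relation}


def keywords_of(poi):
    """Flatten the labels of a POI and its facilities into one keyword bag."""
    words = set(neighbours(poi, "tagged"))
    for facility in neighbours(poi, "has"):
        words |= neighbours(facility, "is_a") | neighbours(facility, "tagged")
    return words


def keyword_match(poi, required):
    """Keyword-style check: every required keyword appears somewhere on the POI."""
    return required <= keywords_of(poi)


def pattern_match(poi):
    """Pattern-style check: the POI has a swimming pool that is itself kid-friendly."""
    return any(
        "swimming pool" in neighbours(f, "is_a")
        and "kid-friendly" in neighbours(f, "tagged")
        for f in neighbours(poi, "has")
    )
```

Under these assumptions both hotels pass the keyword check for `{"swimming pool", "kid-friendly"}`, but only `hotel_b` passes the pattern check, because the pattern ties the kid-friendly tag to the pool rather than to the hotel as a whole.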
Challenges and Our Solution. The challenges are twofold. The first challenge is that pattern matching of entities and m-closest search among candidate entities are two relatively independent tasks. To improve query efficiency, a novel cooperative framework should be devised to exploit the intermediate results of these two search processes. However, existing related studies mainly focus on queries that have both textual and spatial constraints [5], [6], [7], [8], [9], [10]. To the best of our knowledge, this is the first work to study queries that consider both entity pattern matching and spatial constraints. Compared with keywords, entity pattern matching can describe richer semantic information, but it also brings higher query complexity. Thus, it is important but hard to efficiently answer queries that consider both entity pattern matching and spatial constraints.
The second challenge is that subgraph matching and m-closest search are both NP-hard [11], [2]. The search space grows exponentially with the number of vertices in the query graphs, and no known algorithm finds the exact answer in polynomial time. In existing studies, the general solution is to return an approximate answer to avoid the large time cost [2], [12], [13], [14], [15]. However, approximate solutions may sacrifice some properties of the result, for example by relaxing the spatial constraint or returning partially matched entities. Moreover, the result obtained by terminating an exact algorithm early on a hard case can itself be regarded as an approximate solution. As the mCE problem is studied here for the first time, investigating an exact algorithm is significant.
To address this new and practical problem, we propose an efficient framework that processes entity matching and spatial search jointly. The cooperation of pruning strategies at the non-spatial and spatial layers is explored to improve the efficiency of mCE matching. Specifically, we design a fuzzy mechanism that avoids exhaustively checking whether each geo-entity is actually eligible. Moreover, two auxiliary data structures are introduced to maintain the intermediate search results and facilitate the search process.
Contributions. The principal contributions of this paper are summarized as follows.
- To the best of our knowledge, this is the first attempt to investigate the m-closest entity (mCE) matching problem over geographic heterogeneous information networks. Inspired by real-world applications, mCE matching considers both pattern matching and spatial search, which makes the problem more general than mCK.
- A novel framework is proposed in which the pruning abilities at the non-spatial and spatial layers are cooperatively explored by integrating a fuzzy mechanism. Two mutually adaptive data structures are introduced to enhance the search process and make the framework more scalable.
- Extensive experimental results demonstrate the efficiency and scalability of the proposed framework on five real-world datasets with up to ten million vertices. Our techniques achieve improvements of two orders of magnitude in runtime compared with the baselines.
Section snippets
Preliminaries
In this section, Section 2.1 formally presents some basic concepts, Section 2.2 formally defines the mCE matching problem, and Section 2.3 reviews related work.
Table 1 summarizes the notations frequently used in this paper.
Basic solution
To make the logic of the algorithms clearer, we view the Geo-HIN as two layers: (1) the non-spatial layer, a vertex-induced subgraph (Definition 6) over all entities except spatial descriptors, i.e., . (2) the spatial layer, a vertex-induced subgraph over geo-entities and spatial descriptors, i.e., . Fig. 3 shows the two layers of the Geo-HIN in Fig. 1. After dividing the Geo-HIN into two layers, the high-level idea is that we can match entities at the non-spatial layer for
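The two-layer view described in this snippet can be sketched as a pair of vertex-induced subgraphs. The sketch assumes a hypothetical encoding in which every vertex carries one of the labels `"geo"`, `"non_geo"`, or `"spatial"` (for spatial descriptors) and edges are plain vertex pairs; these labels and function names are illustrative and are not the paper's notation.

```python
def induced_subgraph(vertices, edges):
    """Vertex-induced subgraph: keep the edges whose two endpoints are both kept."""
    kept = set(vertices)
    return kept, {(u, v) for u, v in edges if u in kept and v in kept}


def split_layers(vertex_kind, edges):
    """Split a graph into (non_spatial_layer, spatial_layer).

    vertex_kind maps each vertex to "geo", "non_geo", or "spatial" (hypothetical labels).
    Geo-entities belong to both layers, which is what links the two searches.
    """
    non_spatial = [v for v, kind in vertex_kind.items() if kind != "spatial"]
    spatial = [v for v, kind in vertex_kind.items() if kind in ("geo", "spatial")]
    return induced_subgraph(non_spatial, edges), induced_subgraph(spatial, edges)
```

For a tiny graph with a user, a restaurant, a feature, and the restaurant's location node, the non-spatial layer keeps the user, restaurant, and feature with the edges among them, while the spatial layer keeps only the restaurant, its location node, and the edge between the two.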
Fuzzy-exact framework
To address the problems posed by the basic solution, we propose a fuzzy-exact framework. It aims to avoid exhaustive exact matching for each geo-entity and to effectively exploit the pruning ability at the spatial layer based on the intermediate results generated by the non-spatial layer. As an overview of the framework, the main idea is to first obtain, at a fraction of the cost, the geo-entities that are probably exact matches, and then prune them at the spatial
Experimental evaluation
In this section, extensive experiments are conducted to verify both the effectiveness and the efficiency of the algorithms. All source code and datasets used in the experiments are available at https://github.com/Morgan279/mCE.
Conclusion
In this paper, we proposed a novel framework to handle a new m-closest entity matching problem over geographic heterogeneous information networks, which is both practical and challenging. Geo-entities and non-geo-entities are processed by a unified framework, in which exhaustive exact matching for each geo-entity is avoided through the fuzzy mechanism. At the same time, two adaptive data structures are introduced to maintain the intermediate search
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Key R&D Program of China under grant 2018AAA0102502.
References (43)
- et al., Subgraph matching with effective matching order and indexing, IEEE Trans. Knowl. Data Eng. (2020)
- et al., PathSim: Meta path-based top-K similarity search in heterogeneous information networks, Proc. VLDB Endow. (2011)
- T. Guo, X. Cao, G. Cong, Efficient algorithms for answering the m-closest keywords query, in: Proceedings of the 2015...
- et al., Keyword search in spatial databases: Towards searching by document
- et al., Locating mapped resources in web 2.0
- et al., Diversified spatial keyword search on RDF data, VLDB J. (2020)
- et al., Augmented keyword search on spatial entity databases, VLDB J. (2018)
- G. Kalamatianos, G.J. Fakas, N. Mamoulis, Proportionality in spatial keyword search, in: Proceedings of the 2021...
- A. Mahmood, W.G. Aref, Query processing techniques for big spatial-keyword data, in: Proceedings of the 2017 ACM...
- G. Cong, C.S. Jensen, Querying geo-textual data: Spatial keyword queries and beyond, in: Proceedings of the 2016...
- Computers and Intractability: A Guide to the Theory of NP-Completeness
- Closure-tree: An index structure for graph queries
- Tale: A tool for approximate large graph matching
- G-finder: Approximate attributed subgraph matching
- SAPPER: Subgraph indexing and approximate matching in large graphs, Proc. VLDB Endow.
- Finding the minimum spatial keyword cover
1 Wancheng Long and Xiaowen Li are the joint first authors.