Elsevier

Knowledge-Based Systems

Volume 263, 5 March 2023, 110299

Efficient m-closest entity matching over heterogeneous information networks

https://doi.org/10.1016/j.knosys.2023.110299

Highlights

  • A novel m-closest entity matching problem over HINs is devised.

  • Pruning abilities at both the non-spatial and spatial layers are explored.

  • Our approach can achieve much better performance than the baseline methods.

Abstract

This work investigates a novel m-closest entity (mCE) matching problem over geographic heterogeneous information networks (Geo-HINs). Given a Geo-HIN G and m query graphs {q1, q2, …, qm}, mCE matching aims to find a group of geographic entities (geo-entities) whose patterns match the query graphs {q1, q2, …, qm} correspondingly, such that the maximum distance between any pair of geo-entities in the group (i.e., the diameter) is minimized. As a fundamental problem, mCE matching applies to many scenarios, e.g., travel itinerary recommendation and city planning. Existing works do not consider pattern matching and spatial search simultaneously, so they cannot solve this computationally expensive problem. To solve it efficiently, we propose a unified framework named FuzzyExact that processes entity matching and spatial search jointly, in which the pruning abilities at the non-spatial and spatial layers are cooperatively explored. Two mutually adaptive auxiliary data structures named ArcTree and ArcForest are devised to maintain the intermediate search results, which are exploited to enhance the search process between the non-spatial and spatial layers. Experimental results demonstrate that our algorithm outperforms the baseline methods by 2 orders of magnitude in runtime.

Introduction

Heterogeneous information networks (HINs) [1] are formed by multiple types of entities and multiple types of links, and are used to model various data such as social networks, biological networks, and knowledge graphs. With the prevalence of location-based services, many entities in HINs carry location-related links that identify their geographic information. For example, restaurants or hotels (entities) in the Yelp network are usually associated with location attributes that store their geographic coordinates. In social networks such as Twitter, user entities usually contain check-in information and micro-blogs are labeled with geo-tags. A Geo-HIN is a HIN in which at least one type of entity has a geographic location attribute (such an entity is called a geo-entity). Fig. 1 shows a Geo-HIN example of the Yelp network, which consists of eight geo-entities (i.e., e1, …, e8). Each geo-entity ei links to a node li denoting the location of ei. Here a geo-entity ei can be a restaurant, hotel, park, or any object with location information. Besides, there are several types of non-geographic entities (non-geo-entities) in Fig. 1, such as user, feature, and grade. In this paper, we study the m-closest entity (mCE) matching problem over Geo-HINs. Below is a motivating example.

Example 1

Fig. 1 illustrates a Geo-HIN of the Yelp network. u1 is a Yelp customer who requests a travel itinerary plan for an unfamiliar city. Fig. 2 shows three query graphs {q1, q2, q3} that express the requirements of u1. Specifically, the query graphs encode the following requests: “find a kids-friendly (t1) restaurant ex which is rated with five stars (g1) and marked by seafood (f1)”, “find a kids-friendly (t1) hotel ey which has been booked by u1's friend u2 and tagged with swimming pool (f2)”, and “find a kids-friendly (t1) and indoor (f3) scenic spot ez that has been visited by u1's friend u2 and is rated with five stars”. Through graph matching, we obtain ex={e2,e5}, ey={e3}, and ez={e4}. Among the combinations of ex, ey, and ez (i.e., {e2,e3,e4} and {e5,e3,e4}), mCE recommends {e5,e3,e4} to u1 because its geo-diameter is the smallest.

The m-closest entity (mCE) matching problem over Geo-HINs can be formulated as follows: given a Geo-HIN G and m query graphs {q1, q2, …, qm}, where each query graph describes the pattern of a geo-entity, mCE matching aims to find a group of geo-entities that match the patterns of the query graphs {q1, q2, …, qm} correspondingly such that the diameter of the geo-entity group is minimized. Here the diameter of a group (also called the geo-diameter) is defined as the largest geo-distance between any pair of geo-entities in the group.
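To make the objective concrete, the following minimal sketch enumerates every combination that picks one matched geo-entity per query graph and keeps the one with the smallest diameter. It is a naive illustration of the problem statement only, not the algorithm proposed in this paper. The candidate sets are those of Example 1; the planar coordinates, the use of Euclidean distance as the geo-distance, and the function names `diameter` and `brute_force_mce` are assumptions made for illustration.

```python
from itertools import product
from math import dist


def diameter(group, coords):
    """Largest pairwise Euclidean distance within a group of geo-entities."""
    return max(
        (dist(coords[a], coords[b])
         for i, a in enumerate(group) for b in group[i + 1:]),
        default=0.0,
    )


def brute_force_mce(candidates, coords):
    """Pick one entity per candidate set so that the group diameter is minimal.

    candidates[i] holds the geo-entities already known to match query graph i.
    """
    best_group, best_diam = None, float("inf")
    for group in product(*candidates):
        d = diameter(group, coords)
        if d < best_diam:
            best_group, best_diam = group, d
    return best_group, best_diam


# Hypothetical coordinates; only the candidate sets come from Example 1.
coords = {"e2": (0.0, 0.0), "e3": (4.0, 3.0), "e4": (5.0, 3.0), "e5": (4.0, 4.0)}
candidates = [["e2", "e5"], ["e3"], ["e4"]]  # ex, ey, ez

group, diam = brute_force_mce(candidates, coords)
print(group, round(diam, 3))
```

With these assumed coordinates the sketch selects (e5, e3, e4), in line with Example 1. The number of combinations is the product of the candidate set sizes, which is why exhaustive enumeration does not scale.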

Applications. mCE matching can be used in many applications. As discussed in Example 1, it supports travel itinerary planning that meets sophisticated user requirements. This function is popular in tourist-oriented apps such as Wanderlog and Ctrip. In city planning, the locations of public facilities should be carefully considered; given the specified public facilities, mCE matching can be used to evaluate a layout according to the POI group within the smallest range. Similar to travel planning, mCE matching can also be used in Internet of Things (IoT) systems and location-based service systems to recommend proper objects or users. Moreover, when each query graph is simplified as far as possible, e.g., to contain only a geo-entity, the mCE matching problem is equivalent to the mCK query problem [2]. Thus, mCE matching may extend the applications of the mCK query.

Existing Studies. To the best of our knowledge, this is the first work to study m-closest entity (mCE) matching over Geo-HINs. The most closely related work is the m-closest keywords (mCK) query [2], an important type of spatial keyword query studied in several existing works (e.g., [2], [3], [4]). However, the mCK query is defined over spatial databases and aims to find a group of objects that covers all query keywords while minimizing the geo-diameter of the group. mCE matching is a more complex problem than the mCK query: the mCK query solves the m-closest problem over keywords and locations, whereas mCE matching must handle the complex relations between subgraph matching and locations. For instance, in Example 1, mCE returns a POI that contains a kid-friendly swimming pool, while mCK can only ensure that the POI contains a swimming pool and is a kid-friendly place. Because mCK cannot represent the relationship between the swimming pool and kid-friendly keywords, the returned swimming pool may be for adults only, which is not what the user expects. More generally, all keywords in an mCK query are independent and the relationships between them cannot be explicitly specified, so the query cannot precisely represent the user's demand and the answer may deviate far from the user's expectation. We discuss this limitation empirically in Section 5.2. In contrast to mCK, mCE matching carries more semantic information and makes query requirements more general. Thus, an ingenious method is needed to integrate subgraph matching and spatial search for mCE matching.
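The difference between the two query types can be illustrated with a small sketch. The two hotels, their edges, the relation names, and the functions `keyword_match` and `pattern_match` are all hypothetical toy constructs, not data or code from this paper. The keyword test treats labels as an unordered bag, while the pattern test requires the kids-friendly tag to be attached to the swimming pool itself.

```python
# Hypothetical toy data: each POI is a list of (source, relation, target) edges.
pois = {
    "hotel_a": [
        ("hotel_a", "has", "pool_a"),
        ("pool_a", "is_a", "swimming pool"),
        ("pool_a", "tagged", "kids-friendly"),
    ],
    "hotel_b": [
        ("hotel_b", "has", "pool_b"),
        ("pool_b", "is_a", "swimming pool"),
        ("pool_b", "tagged", "adults-only"),
        ("hotel_b", "has", "playroom_b"),
        ("playroom_b", "tagged", "kids-friendly"),
    ],
}


def keyword_match(edges, keywords):
    """mCK-style test: every keyword appears somewhere in the POI's label bag."""
    bag = {target for _, _, target in edges}
    return keywords <= bag


def pattern_match(edges):
    """Pattern-style test: some swimming pool is itself tagged kids-friendly."""
    pools = {s for s, r, t in edges if r == "is_a" and t == "swimming pool"}
    return any(
        s in pools and r == "tagged" and t == "kids-friendly"
        for s, r, t in edges
    )


keywords = {"swimming pool", "kids-friendly"}
by_keywords = [p for p, edges in pois.items() if keyword_match(edges, keywords)]
by_pattern = [p for p, edges in pois.items() if pattern_match(edges)]
print(by_keywords)  # both hotels pass the keyword test
print(by_pattern)   # only hotel_a has a kids-friendly pool
```

In this toy setting hotel_b passes the keyword test even though its pool is for adults only, which is the kind of mismatch described above.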

Challenges and Our Solution. The challenges are twofold. The first challenge is that pattern matching of entities and m-closest search among candidate entities are two relatively independent tasks. To improve query efficiency, a novel cooperative framework should be devised to exploit the intermediate search results of these two search processes. However, related existing studies mainly focus on queries with both textual and spatial constraints [5], [6], [7], [8], [9], [10]. To the best of our knowledge, this is the first work to study queries that consider both entity pattern matching and spatial constraints. Compared with keywords, entity pattern matching can describe richer semantic information, but it also brings higher query complexity. Thus, it is important but hard to answer such queries efficiently.

The second challenge is that subgraph matching and m-closest search are both NP-hard [11], [2]. The search space increases exponentially with the number of vertices in the query graphs, and no known algorithm can find the exact answer in polynomial time. In existing studies, the general solution is to return an approximate answer to avoid the large time cost [2], [12], [13], [14], [15]. However, approximate solutions may sacrifice some properties of the result, for example by relaxing the spatial constraint or returning partially matched entities. Moreover, the result of terminating an exact algorithm early on a hard case can itself be regarded as an approximate solution. As the mCE problem is studied here for the first time, the investigation of an exact algorithm is significant.

To address this new and practical problem, we propose an efficient FuzzyExact framework to process entity matching and spatial search jointly. The cooperation of pruning strategies at both the non-spatial and spatial layers is explored to improve the efficiency of mCE matching. Specifically, we design a fuzzy mechanism to avoid exhaustively checking whether each geo-entity is actually eligible. Moreover, two auxiliary data structures named ArcTree and ArcForest are introduced to maintain the intermediate search results and thereby facilitate the search process.
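The general idea of postponing exact checks can be sketched with a generic filter-and-verify loop. This is not the FuzzyExact algorithm and does not model ArcTree or ArcForest; it is a simplified illustration under stated assumptions: `fuzzy_candidates` are cheap supersets of the true matches, `exact_match` stands in for an expensive exact subgraph test, and the coordinates and the extra entity e7 are hypothetical. Groups are visited in ascending diameter, so the first group whose members all pass the exact test is optimal, and entities that never appear in a promising group are never verified.

```python
from itertools import product
from math import dist


def lazy_mce(fuzzy_candidates, coords, exact_match):
    """Return (group, diameter, number of exact checks performed).

    fuzzy_candidates[i]: cheap superset of the entities matching query graph i.
    exact_match(i, e):   expensive exact test, called lazily and cached.
    """
    cache = {}

    def verified(i, e):
        if (i, e) not in cache:
            cache[(i, e)] = exact_match(i, e)
        return cache[(i, e)]

    def diameter(group):
        return max(
            (dist(coords[a], coords[b])
             for k, a in enumerate(group) for b in group[k + 1:]),
            default=0.0,
        )

    # Ascending diameter: the first fully verified group is the optimum.
    for group in sorted(product(*fuzzy_candidates), key=diameter):
        if all(verified(i, e) for i, e in enumerate(group)):
            return group, diameter(group), len(cache)
    return None, float("inf"), len(cache)


# Hypothetical setup: e7 passes the cheap filter for query 0 but is not a true match.
coords = {"e2": (0.0, 0.0), "e3": (4.0, 3.0), "e4": (5.0, 3.0),
          "e5": (4.0, 4.0), "e7": (4.5, 3.0)}
fuzzy_candidates = [["e2", "e5", "e7"], ["e3"], ["e4"]]
true_matches = {(0, "e2"), (0, "e5"), (1, "e3"), (2, "e4")}

group, diam, checks = lazy_mce(
    fuzzy_candidates, coords, lambda i, e: (i, e) in true_matches
)
print(group, round(diam, 3), checks)
```

In this run the false positive e7 is rejected by a single exact check, the answer (e5, e3, e4) is confirmed with three more, and e2 is never verified at all. The sketch still sorts every combination, so it only illustrates the saving in exact checks, not a scalable search.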

Contributions. The principal contributions in this paper are summarized as follows.

  • To the best of our knowledge, this is the first attempt to investigate the m-closest entity matching problem over geographic heterogeneous information networks. Inspired by real-world applications, mCE matching considers both pattern matching and spatial search, which makes the problem more general than mCK.

  • A novel FuzzyExact framework is proposed, where the pruning abilities at non-spatial and spatial layers are cooperatively explored by integrating a fuzzy mechanism. Two mutually adaptive data structures named ArcTree and ArcForest are introduced to enhance the search process and make the framework more scalable.

  • Extensive experimental results demonstrate the efficiency and scalability of the proposed framework on five real-world datasets with up to ten million vertices. Our techniques achieve improvements of 2 orders of magnitude in runtime compared with the baselines.

Section snippets

Preliminaries

In this section, we first formally present some concepts in Section 2.1. Section 2.2 formally defines the problem of mCE matching, and Section 2.3 reviews related work.

Table 1 summarizes the notations frequently used in this paper.

Basic solution

To make the logic of the algorithms clearer, we view the Geo-HIN G as two layers: (1) the non-spatial layer, a vertex-induced subgraph (Definition 6) over all entities except spatial descriptors, i.e., G[Ve ∪ Va \ Vs]; and (2) the spatial layer, a vertex-induced subgraph over geo-entities and spatial descriptors, i.e., G[Ve ∪ Vs]. Fig. 3 shows the two layers of the Geo-HIN in Fig. 1. After dividing the Geo-HIN into two layers, for a high-level idea, we can match entities at the non-spatial layer for

Fuzzy-exact framework

To address the problems posed by the basic solution, we propose a FuzzyExact framework. It aims to avoid exhaustive exact matching for each geo-entity and to effectively exploit the pruning ability at the spatial layer based on the intermediate results generated at the non-spatial layer. As an overview of the FuzzyExact framework, our main idea is to first obtain, at a fraction of the cost, the geo-entities that are probably exact matches, then pruning them at the spatial

Experimental evaluation

In this section, extensive experiments are conducted to verify both the effectiveness and the efficiency of the algorithms. All source code and datasets used in the experiments are available at “https://github.com/Morgan279/mCE”.

Conclusion

In this paper, we proposed a novel FuzzyExact framework to handle a new m-closest entity matching problem over geographic heterogeneous information networks, which is both practical and challenging. Here geo-entities and non-geo-entities are processed by a unified framework, where exhaustive exact matching for each geo-entity is avoided through the fuzzy mechanism. At the same time, two adaptive data structures named ArcTree and ArcForest are introduced to maintain the intermediate search

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China under grant 2018AAA0102502.

References (43)

  • Sun S. et al., Subgraph matching with effective matching order and indexing, IEEE Trans. Knowl. Data Eng. (2020)
  • Sun Y. et al., PathSim: Meta path-based top-K similarity search in heterogeneous information networks, Proc. VLDB Endow. (2011)
  • T. Guo, X. Cao, G. Cong, Efficient algorithms for answering the m-closest keywords query, in: Proceedings of the 2015...
  • Zhang D. et al., Keyword search in spatial databases: Towards searching by document
  • Zhang D. et al., Locating mapped resources in web 2.0
  • Cai Z. et al., Diversified spatial keyword search on RDF data, VLDB J. (2020)
  • Zhang D. et al., Augmented keyword search on spatial entity databases, VLDB J. (2018)
  • G. Kalamatianos, G.J. Fakas, N. Mamoulis, Proportionality in spatial keyword search, in: Proceedings of the 2021...
  • A. Mahmood, W.G. Aref, Query processing techniques for big spatial-keyword data, in: Proceedings of the 2017 ACM...
  • G. Cong, C.S. Jensen, Querying geo-textual data: Spatial keyword queries and beyond, in: Proceedings of the 2016...
  • J. Lu, Y. Lu, G. Cong, Reverse spatial and textual k nearest neighbor search, in: Proceedings of the 2011 ACM SIGMOD...
  • Garey M.R. et al., Computers and Intractability: A Guide to the Theory of NP-Completeness (1979)
  • He H. et al., Closure-tree: An index structure for graph queries
  • Tian Y. et al., Tale: A tool for approximate large graph matching
  • Liu L. et al., G-finder: Approximate attributed subgraph matching
  • Zhang S. et al., SAPPER: Subgraph indexing and approximate matching in large graphs, Proc. VLDB Endow. (2010)
  • Choi D. et al., Finding the minimum spatial keyword cover
  • Y. Fang, K. Wang, X. Lin, W. Zhang, Cohesive subgraph search over big heterogeneous information networks: Applications,...
  • D. Seyler, P. Chandar, M. Davis, An information retrieval framework for contextual suggestion based on heterogeneous...
  • S. Fan, C. Shi, X. Wang, Abnormal event detection via heterogeneous information network embedding, in: Proceedings of...
  • H. Hong, Y. Lin, X. Yang, Z. Li, K. Fu, Z. Wang, X. Qie, J. Ye, Heteta: Heterogeneous information network embedding for...

1 Wancheng Long and Xiaowen Li are the joint first authors.
