Elsevier

Knowledge-Based Systems

Volume 228, 27 September 2021, 107286

A coarse-to-fine collective entity linking method for heterogeneous information networks

https://doi.org/10.1016/j.knosys.2021.107286

Abstract

Linking ambiguous entity mentions in a text to their true mapping entities in a heterogeneous information network (HIN) is important. Most existing entity linking methods for HINs assume that the entities in a text are independent and ignore the relationships between the entities in context. Recent studies have shown that collective entity linking methods are more effective than traditional independent entity linking methods because they consider the relationships between different entities in the same text. However, few studies focus on collective entity linking for HINs. Most collective entity linking methods rely largely on special features of Wikipedia and may not be suitable for HINs that are not mapped to Wikipedia. Moreover, existing collective entity linking methods may have high time complexity. Therefore, a Coarse-to-Fine collective Entity Linking algorithm (called CFEL) is proposed for cases in which Wikipedia cannot be used. CFEL is composed of a coarse-grained model and a fine-grained model. In the coarse-grained model, a pruning strategy motivated by the human cognition mechanism is adopted to reduce the number of candidates for each entity mention in a text: candidates in the HIN whose type is inconsistent with the type of the entity mention are deleted. In the fine-grained model, we present a probabilistic method that combines the semantic information in a text with the structural information in HINs. Experimental results on four real-world datasets verify the effectiveness of our algorithm compared with the baselines.

Introduction

A heterogeneous information network (HIN) consists of multitype objects and relations [1]. Many large databases can be modeled as HINs, such as YAGO [2]. For example, in the YAGO network, multiple types of objects, such as person (P), location (L), and organization (O), and multiple types of relationships, such as “lives in”, “works at”, and “is married to”, are connected to form a heterogeneous information network. HINs are of great value in many aspects such as recommendations [3] and object clustering [4].
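To make the notion of typed objects and typed relations concrete, the following is a minimal sketch of how an HIN could be held in memory. The class name `HIN`, its methods, and the relation label `captain_of` are illustrative assumptions, not structures defined in the article or in YAGO; the three nodes are toy data based on the Fig. 1 example.

```python
from collections import defaultdict


class HIN:
    """Toy heterogeneous information network with typed nodes and typed edges.

    This is an illustrative structure, not the representation used by CFEL.
    """

    def __init__(self):
        self.node_type = {}            # node name -> object type
        self.edges = defaultdict(set)  # node name -> {(relation, neighbor)}

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, source, relation, target):
        # Store the inverse relation too, so the network can be walked both ways.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation + "^-1", source))

    def neighbors(self, node, relation=None):
        # Optionally restrict the neighbors to one relation type.
        return sorted(
            target
            for rel, target in self.edges[node]
            if relation is None or rel == relation
        )


hin = HIN()
hin.add_node("Alan_Shearer", "person")
hin.add_node("England_national_football_team", "organization")
hin.add_node("England", "location")
hin.add_edge("Alan_Shearer", "captain_of", "England_national_football_team")

print(hin.neighbors("Alan_Shearer", "captain_of"))
```

Keeping the object type on every node is what later allows candidates to be filtered by type, and keeping the relation label on every edge allows paths to be restricted to particular relation types.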

Despite the existence of a large number of HINs, the information contained in such networks is limited [5]. In addition, new facts continually emerge as the world changes. Therefore, populating existing HINs with new facts has become increasingly important. However, the entities contained in the new facts extracted from text are ambiguous [6]: the same name may express different meanings and refer to different entities. As shown in Fig. 1, the name “England” in the text may be linked to 45 different entities in YAGO, including the country “England”, the city “London”, the team “England_cricket_team”, etc. Consequently, it is essential to link the entity mentions in the text with their true mapping entities in HINs.

To the best of our knowledge, few works have addressed entity linking (EL) with HINs. For instance, Shen et al. [5], [7] first explored the problem of linking entities with HINs and proposed a general unsupervised framework called SHINE. Wang et al. [8] proposed an exploratory entity linking framework to link ambiguous author names to the entities in HINs and discover new authors not in the HIN. These works assume that the entities in a text are independent and ignore the relationships between the entities in the context. Recent research shows that the relationships between entities help improve the accuracy of entity linking algorithms. Take the sentence in Fig. 1 as an example: if we link the entity mention “Alan Shearer” to the football player “Alan Shearer”, the entity mention “England” can be linked to the entity “England_national_football_team”. The reason is that “Alan Shearer” is the captain of “England_national_football_team”, so there is a relationship between the entity “England_national_football_team” and the football player “Alan Shearer” in the HIN. Therefore, we adopt a collective approach that utilizes the relationships between the entities in the same text to solve the entity linking problem for HINs.

Most existing collective entity linking methods rely on various information in Wikipedia, such as Wikipedia articles [9], [10], [11] and links in Wikipedia [10], [11], [12]. For example, the description pages of entities in Wikipedia have been utilized to estimate the context similarity between an entity mention and its candidates [9], [10], [11], [12]. Although an increasing number of HINs have mapping relationships to Wikipedia, there are still some HINs that are unrelated to Wikipedia. The reason is that some domain-specific information contained in HINs does not exist in Wikipedia. For example, the DBLP network mainly contains the scientific research achievements of authors in the computer field, such as authors, papers, and venues, and it is difficult to find this information in Wikipedia. Therefore, these collective approaches cannot be used for HINs that cannot draw on information in Wikipedia.

In addition, collective entity linking methods usually have high time complexity. To make the algorithm more efficient, we adopt a pruning strategy that deletes useless candidate entities from the entity graph. More specifically, when humans encounter an ambiguous entity mention, they usually first determine its type from the context. They then select the most probable mapping entity from the candidates whose type matches the predicted type by combining the contextual information with the relationships between the entities. For example, after reading the text shown in Fig. 1, we conclude that the entity mention “England” has the semantic type “organization”. Next, we only need to find the most probable mapping entity among the candidates whose type is “organization”. Inspired by this idea, we introduce the human hierarchical cognitive mechanism [13]. Following the principle of “global precedence”, we first obtain the type of each entity mention in the text and then apply a preprocessing phase that prunes the entities in the entity graph whose type is inconsistent with the predicted type of the entity mention.
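The pruning idea described above can be sketched in a few lines. This is only an illustration: the function name `prune_candidates` is hypothetical, the predicted type is assumed to be already available (the article's type prediction step is not reproduced here), and the types attached to the four candidates are assumptions made for the example.

```python
def prune_candidates(candidates, predicted_type):
    """Keep only the candidates whose HIN type matches the predicted mention type."""
    return [name for name, entity_type in candidates if entity_type == predicted_type]


# Toy candidate set for the mention "England"; the type labels are assumed.
candidates = [
    ("England", "location"),
    ("London", "location"),
    ("England_cricket_team", "organization"),
    ("England_national_football_team", "organization"),
]

# Assume the coarse-grained step predicted the type "organization" from context.
pruned = prune_candidates(candidates, "organization")
print(pruned)
```

Because the filter runs once per mention before any pairwise comparison between candidates, every candidate it removes also removes that candidate's edges from the entity graph that the later, more expensive step has to process.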

In this paper, we propose a Coarse-to-Fine collective Entity Linking algorithm (CFEL) for HINs. CFEL combines collective entity linking with the human cognition mechanism of “global precedence” [13] in order to reduce the time complexity of the algorithm. CFEL consists of two steps: a coarse-grained step and a fine-grained step. In the coarse-grained step, we first determine the type of each entity mention from its context. For example, in the text snippet shown in Fig. 1, once we determine that the entity mention “England” has the semantic type “organization”, we can restrict its candidate entities to the type “organization”. In the fine-grained step, we propose a probabilistic model that uses the semantic information and the relationships between the entities in a text. As shown in Fig. 1, we can then infer that the entity mention “England” refers to “England_national_football_team” in the HIN. The relationship between our motivations and contributions is summarized in Fig. 2. Our major contributions are as follows.

  • We propose a Coarse-to-Fine Entity Linking algorithm (CFEL) with a heterogeneous information network. Our algorithm combines collective entity linking and the human cognition mechanism of “global precedence” to model the types of and relationships between entities. In addition, we obtain the type of entity mention in a text and adopt the pruning strategy to delete the candidate entities that do not match the type in the entity graph, thus reducing the size of the entity graph and achieving the goal of optimizing the algorithm.

  • We propose a probability-based entity linking model that combines the semantic information of the entity mentions in the text with the relationships between entities in HINs. In addition, we propose an entity relatedness measurement method based on constrained paths to estimate the probability of entities appearing together in the same context.

  • We present an experimental study on four real datasets. The experimental results show that our algorithm outperforms the baseline methods in entity linking.
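As a rough illustration of the second contribution, the sketch below combines a per-candidate context score with a path-based relatedness to entities already linked in the same text. It is not the CFEL model: the relatedness here is a simple hop-limited shortest-path measure standing in for the article's constrained-path measure, the multiplicative combination is an assumption, and the names `path_relatedness`, `link_mention`, `max_hops`, and the numeric context scores are all hypothetical.

```python
from collections import deque


def path_relatedness(graph, source, target, max_hops=2):
    """Return 1 / (shortest path length), or 0.0 if target is beyond max_hops."""
    if source == target:
        return 1.0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand paths longer than the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return 1.0 / (dist + 1)
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return 0.0


def link_mention(graph, candidates, context_score, linked_entities, max_hops=2):
    """Pick the candidate with the best context score weighted by coherence."""
    def score(candidate):
        coherence = sum(
            path_relatedness(graph, candidate, entity, max_hops)
            for entity in linked_entities
        )
        return context_score[candidate] * (1.0 + coherence)

    return max(candidates, key=score)


# Toy undirected graph after type-based pruning (adjacency lists, assumed data).
graph = {
    "Alan_Shearer": ["England_national_football_team"],
    "England_national_football_team": ["Alan_Shearer"],
    "England_cricket_team": [],
}
# Assumed context scores: context alone slightly prefers the cricket team.
context_score = {
    "England_national_football_team": 0.4,
    "England_cricket_team": 0.5,
}

best = link_mention(graph, list(context_score), context_score, ["Alan_Shearer"])
print(best)
```

In this toy setting the context score alone would choose the cricket team, while the link to the already resolved “Alan_Shearer” tips the decision toward the football team, which is the effect that collective linking is meant to capture.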

The organization of this article is as follows. Section 2 summarizes the related work of this paper. Section 3 explains our proposed CFEL algorithm. Section 4 presents the experimental results and analysis. Section 5 concludes this article.

Related work

Entity linking methods include independent entity linking, collaborative entity linking, and collective entity linking [6]. Independent entity linking methods consider that the entities in Web text are independent. In addition, these methods find the most probable entities by measuring the contextual similarity between the entity mention and its candidates. For instance, Mendes et al. [14] developed an open source entity linking system Spotlight, which allows users to configure the system

Our CFEL algorithm

In this section, we first present the definitions used in this article. Then, we describe our proposed algorithm in detail.

Experiment

To verify the performance of our CFEL algorithm, we conducted a thorough experiment. We illustrate the experimental setting in Section 4.1. Then, we discuss the performance of CFEL in Section 4.2.

Conclusion

This study considered the task of collective entity linking in HINs. For the first time, we combined the human brain’s “global precedence” cognitive mechanism with entity linking and proposed a coarse-to-fine grained collective entity linking algorithm called CFEL. To optimize the algorithm, we first determined the entity types in a text using a coarse-grained approach, reducing the number of candidates for each entity mention. Then, we proposed a probability method combining the semantic

CRediT authorship contribution statement

Jiao Li: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing. Chenyang Bu: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Resources, Funding acquisition. Peipei Li: Resources, Writing - original draft, Writing - review & editing. Xindong Wu: Resources, Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2016YFB1000901), the National Natural Science Foundation of China (61806065 and 91746209), and the Fundamental Research Funds for the Central Universities, China (JZ2020HGQA0186).

References (33)

  • Wang C. et al. HEEL: exploratory entity linking for heterogeneous information networks. Knowl. Inf. Syst. (2020)
  • Liu M. et al. A multi-view-based collective entity linking method. ACM Trans. Inf. Syst. (2019)
  • Zhou X. et al. A recurrent model for collective entity linking with adaptive features
  • Wang G. et al. Granular computing: from granularity optimization to multi-granularity joint problem solving. Granul. Comput. (2017)
  • Daiber J. et al. Improving efficiency and accuracy in multilingual entity extraction
  • Hoffart J. et al. KORE: Keyphrase overlap relatedness for entity disambiguation

The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
