Elsevier

Pattern Recognition

Volume 124, April 2022, 108433

Combining embedding-based and symbol-based methods for entity alignment

https://doi.org/10.1016/j.patcog.2021.108433

Highlights

  • We propose a two-stage framework for entity alignment from the perspective of combining the advantages of both symbol-based and embedding-based methods.

  • A series of symbol-based methods are adopted to align the relation pairs in stage I.

  • Symbol-based methods and a hybrid embedding model are combined to match the entity pairs in stage II.

  • Experimental results from real-world datasets demonstrate that our proposed method is effective.

  • Ablation studies illustrate that our proposed strategies are versatile and can also be applied to other embedding models.

Abstract

The objective of entity alignment is to judge whether entities refer to the same object in the real world. Methods for entity alignment can be broadly divided into two groups: conventional symbol-based entity alignment methods and embedding-based entity alignment methods. Both groups of methods have advantages and disadvantages (which are detailed in Section 1). Therefore, combining the advantages of both methods might be a promising strategy. However, to the best of our knowledge, only the RTEA algorithm that was proposed in our previous conference paper (Proceeding of Pacific Rim International Conference on Artificial Intelligence, pp. 162–175, 2019) utilizes this strategy for entity alignment. This manuscript is an extended version of that conference paper, in which an improved algorithm, namely, ESEA (combining embedding-based and symbol-based methods for entity alignment), is proposed based on the following steps. First, a novel method for combining embedding models with symbol-based models is proposed. Entities with high vector similarities are obtained through a hybrid embedding model, and the final aligned entity pairs are calculated via symbol-based methods. Second, a series of symbol-based methods, instead of only the edit distance method in the original version, are combined with embedding-based methods for relation alignment. Third, we combine symbol-based and embedding-based methods in a more complicated framework with the objective of better exploiting the advantages of both methods. The experimental results on real-world datasets demonstrate that the proposed method outperforms several state-of-the-art embedding-based entity alignment approaches as well as our previous RTEA method.

Graphical abstract

An example of fusing multiple financial knowledge graphs (KGs) from different sources. By aligning and integrating multi-source heterogeneous data, we can observe that Jack Ma (also called Mr. Ma) is not only the principal founder of the Alibaba Group, but also a director of Softbank. Therefore, a more complete knowledge graph with rich information can be obtained, which is essential for applications such as financial search and financial question answering.


Introduction

Entity alignment refers to the recognition of whether an entity pair from different sources represents the same object in the real world [1]. It can facilitate the description, identification, and classification of the essential characteristics of objects, which is important for pattern recognition [2]. For example, in a financial knowledge graph (which is a directed graph of entities, e.g., Microsoft Corporation, and relations, e.g., leader of), financial data can be obtained in various ways, such as extraction from systems, purchase from a third party, and crawling from the Internet. Aligning and integrating these sources of information is essential for applications such as financial event prediction [3] and financial question answering [4].

Conventional entity alignment models are mostly based on string similarity calculations [5, 6] or propagation [7, 8]. The similarity or equality between the attributes, character strings, or neighboring nodes can be used to judge whether two entities are equal. Methods of this type have the advantages of high accuracy and no required training data. However, the similarities between source entities and target entities must be calculated when performing alignment, so the time complexity is high. In addition, the reliance on symbols leads to a potential inability of symbol-based entity alignment to handle cross-language or literal heterogeneity scenarios. For example, the Chinese name “马云” and the English name “Jack Ma”, as illustrated in Fig. 1(a), refer to the same person in real life (entrepreneur Jack Ma), who can also be called “Mr. Ma” in various circumstances, as illustrated in Fig. 1(b). Thus, calculation of the string similarity for entity alignment cannot address these two scenarios directly.
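The limitation described above can be illustrated with a minimal sketch of one common string similarity, the normalized edit distance. The function names `edit_distance` and `string_similarity` and the normalization by the longer string length are assumptions made for this illustration; they are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Illustrative normalization (an assumption): 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# The cross-language pair shares no characters, so the similarity is zero.
print(string_similarity("马云", "Jack Ma"))
# The alias pair shares only the suffix " Ma", so the similarity stays low.
print(string_similarity("Mr. Ma", "Jack Ma"))
```

Under this measure, both the cross-language pair and the alias pair score poorly even though all three names denote the same person, which is the difficulty the paragraph above points out.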

To avoid considering the literal form of entities, embedding-based approaches for entity alignment have attracted increasing attention [9, 10]. Embedding-based methods encode entities and relations into low-dimensional continuous vector spaces and learn the vector embeddings of entities and relations, which is an effective approach for graph data processing [11, 12]. Then, vector distances (e.g., cosine distance) of embeddings are utilized to measure similarities between entities. As a result, embedding-based methods have higher computing efficiency and can address cross-language and literal heterogeneity scenarios. Moreover, an embedding model might contain semantic information, such as background information [13] or structural information [14], which might be helpful for the task of entity alignment. However, a method of this type typically requires many aligned entity pairs for model training, while labeling data manually is expensive. In addition, compared to representative symbol-based methods, embedding methods have the disadvantages of lower accuracy and a lack of interpretability.
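As a minimal sketch of how vector similarity can rank entities independently of their surface strings, the following example computes cosine similarity over toy three-dimensional vectors. The vectors, the helper `top_k_candidates`, and the dictionary `TARGET_EMBEDDINGS` are hypothetical values chosen for illustration; real embeddings would be learned by a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_candidates(query_vec, target_embeddings, k):
    """Return the k target entities whose vectors are most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in target_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; not produced by any model from the paper.
TARGET_EMBEDDINGS = {
    "Jack Ma": [0.9, 0.1, 0.0],
    "Alibaba Group": [0.1, 0.9, 0.2],
    "Softbank": [0.0, 0.8, 0.6],
}
QUERY_VEC = [0.8, 0.2, 0.1]  # assumed embedding of the source entity "马云"

for name, score in top_k_candidates(QUERY_VEC, TARGET_EMBEDDINGS, k=2):
    print(name, round(score, 3))
```

In this toy setting the source entity is ranked closest to "Jack Ma" although the two names share no characters, because the comparison uses only the vectors.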

Since both symbol-based and embedding-based approaches have their own advantages and disadvantages according to the above analysis, an intuitive strategy is to combine the advantages of both approaches. However, to the best of our knowledge, only the RTEA algorithm [15], which was proposed in our conference paper, adopts this strategy for entity alignment.

This manuscript is an extended version of our previous paper [15]. The differences between this study and the previous study are as follows:

1) In this study, we propose a novel strategy for combining embedding-based methods with symbol-based methods (detailed in Section 4.2). In the previous study, we aligned entities in stage II using only embedding-based methods, whereas hybrid embedding and symbol string similarity are combined to match entities in this study. A hybrid embedding model that considers both facts and rules is used to learn the vector embeddings of entities. The entities that are closest to the target entity in the embedding space (called candidate entities) can then be identified. Since embedding models have high computing efficiency, the candidate entities of cross-language and literal heterogeneity scenarios can all be selected. Then, conventional symbol-based methods are adopted to identify the truly equivalent entity in the candidate entity set. The number of candidate entities is far smaller than the number of original entities before filtering. Thus, the time complexity of the similarity calculation can be reduced effectively. Moreover, the results that are generated through symbol-based methods retain satisfactory interpretability. Notably, candidate entities can also be generated through other embedding models, which indicates that the framework is extensible.

2) In the original version, the edit-distance method was the only symbol-based method. In this study, however, a series of other symbol-based methods, which include machine translation, preprocessing, common substring comparison, and rule extraction by both AMIE+ and ontology knowledge bases, are combined with embedding-based methods. The experimental results demonstrate that the utilization of these symbol-based methods can substantially improve the performance.

3) To better realize the benefits of both methods, symbol-based and embedding-based methods are combined in a more elaborate framework. For example, in the original version, only the symbol-based method was adopted in the stage of relation alignment, which could be a disadvantage when addressing literal heterogeneity in relation pairs. In contrast, in the improved framework, both embedding-based and symbol-based methods can be adopted for relation alignment.

The proposed method is evaluated on several real-world datasets, and the experimental results demonstrate that our approach outperforms most state-of-the-art methods.
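The candidate-filtering idea in the first difference can be sketched as follows, under stated assumptions. The embeddings are hypothetical toy vectors, `align_entity` is an illustrative helper rather than the ESEA implementation, and `difflib.SequenceMatcher` from the Python standard library stands in for the symbol-based similarity measures used in the paper.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0


def align_entity(source_name, source_vec, targets, k=2):
    """Pick a match for one source entity.

    Step 1: keep the k targets nearest to the source in the embedding space.
    Step 2: compare strings only within that candidate set.
    Returns (best_name, candidates).
    """
    ranked = sorted(targets, key=lambda n: cosine(source_vec, targets[n]),
                    reverse=True)
    candidates = ranked[:k]
    # Stand-in symbol-based score (an assumption, not the paper's measure).
    best_name = max(
        candidates,
        key=lambda n: SequenceMatcher(None, source_name.lower(), n.lower()).ratio(),
    )
    return best_name, candidates


# Hypothetical toy embeddings for four target entities.
TARGETS = {
    "Jack Ma": [0.9, 0.1, 0.0],
    "Alibaba Group": [0.1, 0.9, 0.2],
    "Softbank": [0.0, 0.8, 0.6],
    "Microsoft Corporation": [0.0, 0.5, 0.9],
}

best, candidates = align_entity("Mr. Ma", [0.8, 0.2, 0.1], TARGETS, k=2)
print(candidates)  # the two nearest targets in the embedding space
print(best)        # the candidate with the highest string similarity
```

In this sketch only k string comparisons are made per source entity instead of one per target entity, which mirrors the reduction in similarity calculations described above. The final decision is still made by a string comparison, so it can be inspected directly.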

The remainder of this paper is organized as follows: The background is described in Section 2. The previous study is introduced in Section 3, and the proposed model is detailed in Section 4. The experimental results and various analyses are discussed in Section 5. Finally, the results are summarized and potential future directions of investigation are discussed in the final section.

Section snippets

Background

In this section, important definitions are introduced. Then, typical traditional methods (Section 2.2) and embedding-based methods (Section 2.3) for entity alignment are described.

RTEA algorithm

This section focuses on our previous study in reference [15]. Two components, namely, similarity-based relation alignment with KG embedding (Section 3.1) and hybrid embedding with both fact triples and logical rules (Section 3.2), are detailed.

ESEA algorithm

This section introduces the proposed algorithm. The overall framework is described in Section 4.1. Then, three improved strategies for combining symbol-based methods and embedding-based methods are detailed in Section 4.2, Section 4.3 and Section 4.4.

Experiments

In this section, we evaluate the performance of the proposed method on various datasets. The experiments were conducted on a PC with an Intel Xeon E5 2.40 GHz CPU and 128 GB of RAM.

Conclusions and future work

In this study, we addressed the entity alignment problem by combining the benefits of symbol-based and embedding-based methods. A two-stage framework is introduced to combine the advantages of the above two methods. In Stage I, the task of relation alignment is solved by a series of symbol-based methods. An advantage of this unsupervised symbol-based process is that high-accuracy aligned relation pairs can be discovered, and these relation pairs are beneficial to entity alignment because a

Declaration of Competing Interest

No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. This manuscript is an extended version of our previous conference work (T. Jiang, C. Bu, Y. Zhu, and X. Wu, "Two-Stage Entity Alignment: Combining Hybrid Knowledge Graph Embedding with Similarity-Based Relation Alignment," in: Proceeding of Pacific Rim International Conference on Artificial Intelligence, pp. 162–175, 2019). The work described was original research,

Acknowledgments

This work was partly supported by the National Key Research and Development Program of China (No. 2016YFB1000901), the National Natural Science Foundation of China (No. 61806065 and No. 91746209), the funds for International Cooperation and Exchange of the National Natural Science Foundation of China (No. 62120106008), and the Fundamental Research Funds for the Central Universities (No. JZ2020HGQA0186).

Tingting Jiang is currently a PhD student in the Hefei University of Technology, Hefei, China. Her research interests include knowledge graph, knowledge graph embedding, and entity alignment.

References (43)

  • M. Pershina et al.

    Holistic entity matching across knowledge graphs

  • M. Chen et al.

    Multilingual knowledge graph embeddings for cross-lingual knowledge alignment

  • Z. Sun et al.

    Bootstrapping entity alignment with knowledge graph embedding

  • H. Jin et al.

    Incorporating Chinese characters of words for lexical sememe prediction

  • H. Zhu et al.

    Iterative entity alignment via joint knowledge embeddings

  • T. Jiang et al.

    Two-stage entity alignment: combining hybrid knowledge graph embedding with similarity-based relation alignment

  • Q. Wang et al.

    Knowledge graph embedding: a survey of approaches and applications

    IEEE Trans. Knowl. Data Eng.

    (2017)
  • S. Ji, S. Pan, E. Cambria, P. Marttinen, and P.S. Yu, "A survey on knowledge graphs: representation, acquisition and...
  • A. Traylor et al.

    Learning string alignments for entity aliases

  • R. Klabunde

    Daniel Jurafsky/James H. Martin, Speech and Language Processing

    Zeitschrift für Sprachwissenschaft

    (2002)
  • D. Papachristou and S.D. Baker, "Longest-common-subsequence detection for common synonyms," ed: Google Patents,...

    Chenyang Bu received the PhD degree from University of Science and Technology of China (USTC) in 2017. He is currently an assistant professor with Hefei University of Technology, Hefei, China. He serves as a reviewer for several international journals including IEEE TEVC, IEEE TNNLS, IEEE TCYB, IEEE TII, and IEEE TETCI. His research interests include knowledge graph embedding for dynamic data, and evolutionary dynamic optimization and applications.

    Yi Zhu is currently an assistant professor in the School of Information Engineering, Yangzhou University, China. He received the BS degree from Anhui University, the MS degree from University of Science and Technology of China, and the PhD degree from Hefei University of Technology. His research interests include data mining, knowledge engineering, and recommendation systems.

    Xindong Wu is a Professor of Computer Science with the School of Computing and Informatics, the University of Louisiana at Lafayette, Lafayette, LA, USA, and a Yangtze River Scholar with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China. He received his Bachelor's and Master's degrees in Computer Science from the Hefei University of Technology, China, and his Ph.D. degree in Artificial Intelligence from the University of Edinburgh, United Kingdom. His research interests include data mining, knowledge engineering, and Web information exploration. Dr. Wu is a Fellow of the IEEE and the AAAS. He is the Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), the Editor-in-Chief of Knowledge and Information Systems (KAIS, by Springer), and an Editor-in-Chief of the Springer Book Series on Advanced Information and Knowledge Processing (AI & KP). He was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE) between 2005 and 2008. He served as a program committee chair/co-chair for ICDM 2003 (the 3rd IEEE International Conference on Data Mining), KDD 2007 (the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining), CIKM 2010 (the 19th ACM Conference on Information and Knowledge Management), and ICBK 2017 (the 8th IEEE International Conference on Big Knowledge).