Cross-language article linking with deep neural network based paragraph encoding

https://doi.org/10.1016/j.csl.2021.101279

Highlights

  • Cross-language article linking helps create a multilingual unified knowledge base.

  • An attention-based neural network learns to attend to the vital parts of articles.

  • The novel method does not rely on feature engineering and is scalable to large data.

Abstract

Cross-language article linking (CLAL), the task of generating links between articles in different languages from different encyclopedias, is critical for facilitating sharing among online knowledge bases. Some previous CLAL research has been done on creating links among Wikipedia wikis, but much of this work depends heavily on simple language patterns and encyclopedia format or metadata. In this paper, we propose a new CLAL method based on deep learning paragraph embeddings to link English Wikipedia articles with articles in Baidu Baike, the most popular online encyclopedia in mainland China. To measure article similarity for link prediction, we employ several neural networks with attention mechanisms, such as CNN and LSTM, to train paragraph encoders that create vector representations of the articles’ semantics based only on article text, rather than link structure, as input data. Using our “Deep CLAL” method, we compile a data set consisting of Baidu Baike entries and corresponding English Wikipedia entries. Our approach does not rely on linguistic or structural features and can be easily applied to other language pairs by using pre-trained word embeddings, regardless of whether the two languages are on the same encyclopedia platform.

Introduction

To help bridge the gap between languages online, more data are required to feed the machine learning systems that power automatic translation software. Online encyclopedias are an excellent source of multilingual data due to their diversity of topics and ease of access. To find meaningful correlations between foreign languages in such literature, it is desirable to first link articles with corresponding content, a process referred to as cross-language article linking (CLAL). A large amount of previous CLAL research has focused on engineering features for Wikipedia (Sorg and Cimiano, 2008, Oh et al., 2008). Most of these approaches rely on the structural information of Wikipedia, such as inter-language links among its different language versions, common categories, and infoboxes. Although such approaches can yield acceptable performance, they often sacrifice system scalability. One issue is that other online encyclopedias, e.g. Baidu Baike, may not contain the same information as Wikipedia: structural information such as inter-language links, infoboxes, and categories may differ or be absent. In addition, the amount of textual content can vary widely between encyclopedias; an article may be very informative in one while its counterpart in the other has only a few sentences. Therefore, a more generalizable approach is required in order to link articles from diverse encyclopedia platforms.

CLAL has become an important research topic because it can help create a multilingual unified knowledge base useful in many fields, such as knowledge inference, entity linking, and machine translation. Most past approaches to CLAL are based on feature engineering, which requires specific knowledge of the encyclopedias of interest as well as time to design and process the features. The task usually requires multiple features, which increases both the complexity of the system and the time needed to process the data. We can broadly categorize previous systems into two classes: link-based and text-based systems. Link-based systems take advantage of the link structure within the encyclopedias and design features based on the link statistics of paired articles (Sorg and Cimiano, 2008, Itakura and Clarke, 2007, Jenkinson et al., 2008, Tang et al., 2011, Wang et al., 2012). The central idea of these approaches is that any equivalent article pair will share at least one common link. Because they require link statistics, these approaches are usually applied to encyclopedias where cross-language links are easy to obtain, e.g. Wikipedia.
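The common-link idea behind link-based systems can be sketched in a few lines. In this illustrative (hypothetical) version, each article's outgoing links are mapped into the other encyclopedia's title space via known cross-language correspondences, and a candidate pair is kept only if the mapped sets intersect:

```python
def shares_common_link(links_a, links_b, cross_lang_map):
    """Check whether article A (from K1) and article B (from K2) share a link.

    links_a, links_b: sets of outgoing link targets in each article.
    cross_lang_map: known K1 -> K2 title correspondences, e.g. harvested
    from existing inter-language links (a hypothetical resource here).
    """
    # Map A's link targets into K2's title space, then intersect with B's links.
    mapped = {t2 for t1, t2 in cross_lang_map.items() if t1 in links_a}
    return bool(mapped & links_b)

# Toy example: the K1 title "France" is known to correspond to "法国" in K2.
shares_common_link({"France", "Paris"}, {"法国"}, {"France": "法国"})  # True
```

This also illustrates the scalability limitation discussed above: the heuristic is only as good as the coverage of `cross_lang_map`, which is exactly the resource that may be missing outside Wikipedia.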

Text-based systems predict article pairs by comparing the most informative text of the articles. Earlier approaches (Zeng and Bloniarz, 2004, Zhang and Kamps, 2008, Miao et al., 2013) first translate the text to be compared and then compute either the cosine similarity or the edit distance between the texts. More recent approaches (Tsai and Roth, 2016, Sil et al., 2018) learn an embedding for each article from the text (context) in which it is mentioned, then create features based on the learned embeddings and feed them into a classifier. Other works combine textual and structural information, such as topics, categories, and infoboxes, as features to build a ranker that generates links between two encyclopedias (Dopichaj et al., 2008, Granitzer et al., 2008, Milne and Witten, 2008, Wang et al., 2014). However, these methods depend heavily on the detailed structural information of encyclopedia articles and also require manual feature engineering to build a supervised classifier.
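The cosine-similarity comparison used by the earlier text-based systems can be sketched in pure Python. The vectors below are hypothetical stand-ins for the translated-text representations those systems would actually compute:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for a translated article pair.
en_vec = [0.2, 0.7, 0.1]
zh_vec = [0.25, 0.65, 0.05]
score = cosine_similarity(en_vec, zh_vec)  # close to 1.0 for similar texts
```

A pair is then predicted as equivalent when the score exceeds some threshold, or the highest-scoring candidate is taken, depending on the system.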

The systems mentioned above typically learn a classifier to integrate multiple features, each of which considers only a small Wiki-specific context of the article rather than a whole paragraph. In doing so, they restrict themselves to the information those features can ever provide. We argue that an intelligent system should be able to predict the result from raw inputs alone, excluding feature engineering entirely. In this paper, we achieve this goal with an attention-based neural network that learns to attend to the vital parts of its inputs. Our proposed CLAL method predicts similarity based only on the first paragraph of each article in order to link related articles between the English Wikipedia and Baidu Baike online encyclopedias. In our approach, the first paragraph of each article is transformed into a vector, and a similarity matrix is built to measure the similarity between the vectors from the two encyclopedias. During training, the model learns to assign higher attention weights to terms that are instrumental in predicting equivalent article pairs. We show that this neural network based approach predicts equivalent article pairs better than methods that use multiple features.
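The attention step described above can be viewed as weighted pooling: each word vector is scored against a learned query vector, the scores are normalized with a softmax, and the paragraph vector is the weighted sum. A minimal sketch, where the query vector and the two-dimensional toy embeddings are illustrative assumptions rather than the paper's actual parameters:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vecs, query):
    """Pool word vectors into one paragraph vector via attention weights.

    Each word vector is scored by a dot product with a (learned) query
    vector; softmax-normalized scores weight the sum.
    """
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    return [sum(weights[i] * word_vecs[i][d] for i in range(len(word_vecs)))
            for d in range(dim)]
```

In the full model the query is trained jointly with the encoder, so terms that help discriminate equivalent pairs receive the higher weights.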

Our contributions are three-fold:

  • 1.

    A novel approach to the CLAL problem that does not rely on feature engineering.

  • 2.

    Our approach can easily scale to large datasets, because the paragraph encoding of each article can be computed in advance, making the processing time during testing linear in the model parameters.

  • 3.

    It can be generalized to any encyclopedia as long as we can extract the first paragraph instead of encyclopedia-specific features.
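The scalability claim in contribution 2 rests on caching: every article is encoded once offline, so test-time linking reduces to similarity lookups against the cached vectors. A minimal sketch, in which the bag-of-characters encoder and dot-product similarity are toy stand-ins for the paper's paragraph encoders:

```python
def precompute_encodings(articles, encode):
    """Encode every article's first paragraph once, offline."""
    return {aid: encode(text) for aid, text in articles.items()}

def link(query_vec, index, sim):
    """At test time only similarity lookups remain: pick the best candidate."""
    return max(index, key=lambda aid: sim(query_vec, index[aid]))

# Toy stand-in encoder: character counts of a 'paragraph'.
def toy_encode(text):
    return {c: text.count(c) for c in set(text)}

def toy_sim(u, v):
    return sum(u.get(k, 0) * v.get(k, 0) for k in u)

index = precompute_encodings({"a1": "cat", "a2": "dog"}, toy_encode)
link(toy_encode("cats"), index, toy_sim)  # "a1", the closer article
```

The design choice is that the expensive step (encoding) happens once per article, not once per candidate pair at query time.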

Section snippets

Related work

CLAL, linking articles in different languages from different online encyclopedias, is a relatively new research task that emerged out of work on linking articles across different language versions of Wikipedia. In addition, real-valued vector representations (namely "embeddings") of words and documents have become a widely used technique for embedding semantics and measuring the similarity between two words or documents. These approaches are also useful for cross-language article linking. In the

Method

Cross-language article linking between different knowledge bases can be formulated as follows. Each knowledge base K, a collection of human-written articles, can be defined as K = {a_i}_{i=1}^{n}, where a_i is an article in K and n is the size of K. Article linking can then be defined as follows: Given two knowledge bases K1 and K2, cross-language article linking is the task of finding, for each article a_i from knowledge base K1, the corresponding equivalent article a_j from knowledge base K2. Equivalent
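In general terms, the definition above casts link prediction as similarity maximization over the target knowledge base. Writing E for a paragraph encoder and sim for a similarity score (both kept abstract here, since the snippet does not spell out the exact form):

```latex
\hat{a}_j \;=\; \operatorname*{arg\,max}_{a_j \in K_2} \; \mathrm{sim}\!\left(E(a_i),\, E(a_j)\right)
\qquad \text{for each } a_i \in K_1 .
```

Candidate selection narrows the arg max to a small subset of K2 before the more expensive link-prediction model is applied.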

Experiments

To create our dataset, we obtained 4,134,839 Wikipedia articles from the English Wikipedia dump.1 We also crawled 1,323,269 Baidu articles. We then discarded articles in Baidu that are either dictionary entries or classical Chinese poems, as these articles are rarely found in Wikipedia.

To evaluate the performance of article linking, we created 66,632 pairs of Wikipedia and Baidu articles. To avoid extensive manual labeling, we rely on the inter-language links

Error case analysis with the BERT-based baseline

In our experiments, the BERT-based baseline does not achieve results comparable to those of our proposed method. To analyze its performance in depth, we observe the attention weights of each word in the paragraphs to examine how the BERT-based baseline links the English and Chinese articles. For each word, the multi-head attention weights at the [CLS] position are summed with linear interpolation to estimate the word's contribution.
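The weight-summing step described above can be sketched as follows, assuming `attn[h][j]` holds head h's attention weight from the [CLS] position to token j; equal interpolation weights across heads are an assumption made for illustration:

```python
def cls_token_importance(attn):
    """Combine multi-head [CLS] attention weights into one score per token.

    attn[h][j]: head h's attention from the [CLS] position to token j.
    Equal head weights are assumed for the linear interpolation.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    return [sum(attn[h][j] for h in range(num_heads)) / num_heads
            for j in range(num_tokens)]

# Two toy heads over three tokens; token 1 draws the most attention overall.
attn = [[0.1, 0.8, 0.1],
        [0.3, 0.5, 0.2]]
importance = cls_token_importance(attn)  # highest score at index 1
```

Tokens with high combined scores are the ones the baseline relies on when judging whether two paragraphs describe the same entity.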

In several cases, the BERT-based baseline can

Conclusion

Cross-language article linking (CLAL) is the task of generating links between articles in different languages from different encyclopedias. In this work, we propose a novel method based on deep learning models to link articles in the English Wikipedia with their counterparts in Chinese Baidu Baike. Our approach is composed of two steps: candidate selection and link prediction. Candidate selection is formulated as an English-Chinese cross-language information retrieval task. To measure article

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors would like to thank the Ministry of Science and Technology, Taiwan, for financially supporting this research (MOST 109-2221-E-008-058-MY3).

References (40)

  • Bahdanau, D. et al., 2014. Neural machine translation by jointly learning to align and translate.
  • Bengio, Y. et al., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
  • Bojanowski, P. et al., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist.
  • Brown, P.F., Lai, J.C., Mercer, R.L., 1991. Aligning sentences in parallel corpora. In: Proceedings of the 29th Annual...
  • Chang, M.-W. et al., 2019. Language model pre-training for hierarchical document representations.
  • Devlin, J. et al., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Dopichaj, P. et al. Stealing anchors to link the wiki.
  • Gale, W.A., Church, K.W., 1991. A program for aligning sentences in bilingual corpora. In: Proceedings of Association...
  • Granitzer, M. et al. Context based Wikipedia linking.
  • Harris, Z.S., 1954. Distributional structure. Word.
  • Hasan, M.M., Matsumoto, Y., 2001. Multilingual document alignment - A study with Chinese and Japanese. In: Proceedings of...
  • Hochreiter, S. et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
  • Hochreiter, S. et al., 1997. Long short-term memory. Neural Comput.
  • Itakura, K.Y., Clarke, C.L., 2007. University of Waterloo at INEX 2007: Adhoc and Link-the-Wiki tracks. In:...
  • Jenkinson, D., Leung, K.-C., Trotman, A., 2008. Wikisearching and wikilinking. In: International Workshop of the...
  • Kim, Y., 2014. Convolutional neural networks for sentence classification.
  • Kiros, R. et al. Skip-thought vectors.
  • Miao, Q., Fang, R., Meng, Y., Zhang, S., 2013. FRDC's cross-lingual entity linking system at TAC 2013. In: Proceedings...
  • Mikolov, T. et al., 2013. Efficient estimation of word representations in vector space.
  • Milne, D. et al. Learning to link with Wikipedia.