
1 Introduction

Knowledge graphs (KGs) contain knowledge about the world and provide a structured representation of this knowledge. Current KGs contain only a small subset of what is true in the world [7]. Different types of information can be incomplete, for example, the set of entities (not all movies are mentioned in a movie KG), the set of predicates, or the links between existing entities. Different types of incompleteness are usually addressed with different methods. For example, Named Entity Recognition is successfully applied to find new entities of given classes [10]. In this work we consider the last of these problems, link prediction (LP), i.e., finding triples (subject s, predicate p, object o), where p is defined in the schema of the KG and s and o are known instances contained in the existing KG.

LP approaches can roughly be subdivided into two classes:

  • Leveraging the knowledge from the existing KG

    • Rule induction methods [3, 6] learn rules over the KG that capture patterns in the data. In a generic domain one can learn, for example, that a person has a home address or that a consumer good has a price. Such rules help to identify potential gaps in an incomplete KG. In order to fill a gap, one then needs to verify the veracity of the potential new triple.

    • Embedding methods project symbolic entities and relations into a continuous vector space; vector arithmetic is then used to predict new links [8, 12] (see the sketch after this list).

  • Extracting knowledge from other sources.

    • Transformation of information from a structured source.

    • Relation Extraction (RE) methods [11] employ trained models to recognize triples in the text and add those triples to the existing KG.
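To make the embedding idea concrete, the following is a minimal sketch of a TransE-style scoring function for ranking candidate objects; the concrete models in [8, 12] differ, and the embeddings, entity names, and dimensionality below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely illustrative "pre-trained" embeddings: name -> 50-dimensional vector.
entity_vec = {name: rng.normal(size=50) for name in ("niraparib", "ovarian_cancer", "aspirin")}
relation_vec = {"treatsDisease": rng.normal(size=50)}

def transe_score(s: str, p: str, o: str) -> float:
    """TransE-style plausibility: the smaller ||s + p - o||, the more likely the link."""
    return -float(np.linalg.norm(entity_vec[s] + relation_vec[p] - entity_vec[o]))

# Rank candidate objects for the incomplete pair (niraparib, treatsDisease).
candidates = ["ovarian_cancer", "aspirin"]
print(sorted(candidates, key=lambda o: transe_score("niraparib", "treatsDisease", o), reverse=True))
```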

In this work we combine both approaches. Given a KG, we use rules and heuristics to identify potential gaps in it. Then we employ a Question Answering (QA) framework to find and/or verify new triples. As a result, for each new triple we can provide a supporting piece of text. It is interesting to compare this with [1], which describes a ranking method for providing descriptive explanations of relationships already in the KG; in our work, by contrast, we look for relationships that are expressed in the text but not yet present in the KG.

The introduced methodology differs from existing approaches in two respects:

  • LP typically considers only the KG itself and not the textual data.

  • RE struggles to find new relations that are not present in its training set.

2 Approach

After applying the learned rules or heuristics we obtain a set of pairs (s, p) and triples (s, p, o). Given a pair (s, p), the task is to find a set of objects O such that \(\{(s,p,o)\ |\ o \in O\}\) is a set of valid triples. Given a triple (s, p, o), the task is to verify it. Our approach consists of the following steps (a sketch of the overall pipeline follows the list):

  1. Question formulation,

  2. Retrieving documents potentially containing answers from the corpus,

  3. Employing QA over the retrieved documents to get candidate answers and their scores,

  4. Choosing the correct answers.
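Read together, the four steps form a simple pipeline. The sketch below only fixes the interfaces assumed in the rest of this section; the function names and signatures are our own illustration, and concrete sketches of the four step functions are given under the corresponding paragraphs.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]
Answer = Tuple[str, float]  # (answer text, confidence score)

def fill_gap(subject: str, predicate: str, kg: Set[Triple], corpus: List[str],
             formulate_question: Callable[[str, str], str],
             retrieve_documents: Callable[[str, List[str]], List[str]],
             answer_question: Callable[[str, List[str]], List[Answer]],
             verify_and_integrate: Callable[[str, str, List[Answer], Set[Triple]], Set[Triple]]
             ) -> Set[Triple]:
    """Run steps 1-4 for one potential gap (subject, predicate)."""
    question = formulate_question(subject, predicate)             # step 1: question formulation
    documents = retrieve_documents(question, corpus)              # step 2: document retrieval
    answers = answer_question(question, documents)                # step 3: QA with confidence scores
    return verify_and_integrate(subject, predicate, answers, kg)  # step 4: choosing correct answers
```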

Question Formulation. The goal is to go from a pair (s, p) to a question such that a correct answer o defines a valid triple (s, p, o). In order to articulate this natural language question q we employ lexicalization techniques [5].
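As a simplified stand-in for the lexicalization techniques of [5], predicate-specific templates can map a pair (s, p) to a question; the templates and predicate names below are illustrative only.

```python
# Illustrative predicate-specific templates; a lexicalization component as in [5]
# would generate such questions more flexibly.
QUESTION_TEMPLATES = {
    "treatsDisease": "Which diseases does {subject} treat?",
    "hasPrice": "What is the price of {subject}?",
}

def formulate_question(subject: str, predicate: str) -> str:
    """Step 1: turn an incomplete pair (s, p) into a natural language question q."""
    return QUESTION_TEMPLATES[predicate].format(subject=subject)

print(formulate_question("niraparib", "treatsDisease"))
# -> Which diseases does niraparib treat?
```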

Retrieving Documents. Given a question q, we use query expansion techniques [2] to formulate a search query and retrieve relevant documents from the corpus.
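A minimal sketch of this step, with a hand-written expansion lexicon and simple term-overlap ranking standing in for the query expansion techniques of [2] and a real search engine; all terms are illustrative.

```python
from collections import Counter
from typing import Dict, List

# Illustrative expansion lexicon; real query expansion [2] would supply synonyms
# and related terms automatically.
EXPANSIONS: Dict[str, List[str]] = {
    "treat": ["treats", "treatment", "therapy"],
    "diseases": ["disease", "cancer"],
}

def expand_query(question: str) -> List[str]:
    """Add related terms to the question terms to form a search query."""
    terms = question.lower().rstrip("?").split()
    return terms + [t for term in terms for t in EXPANSIONS.get(term, [])]

def retrieve_documents(question: str, corpus: List[str], k: int = 5) -> List[str]:
    """Step 2: rank documents by overlap with the expanded query and keep the top k."""
    query = Counter(expand_query(question))
    overlap = lambda doc: sum(query[token] for token in doc.lower().split())
    return sorted(corpus, key=overlap, reverse=True)[:k]
```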

Question Answering. Our end-to-end QA system accepts a natural language question and a set of documents as input and outputs a set of pairs (answer, confidence score). The score indicates the confidence of the system that the answer is correct. QA systems are usually computationally demanding, so it is not feasible to feed the whole corpus to the QA system; this is why the retrieval step above is needed.

We use an implementation of BERT [4] that we trained on the large set of question-answer pairs from the SQuAD dataset [9]. BERT is a deep learning based model that outperforms previous approaches and comes close to human performance on QA tasks.
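A sketch of this step using the Hugging Face transformers library and a publicly available BERT model fine-tuned on SQuAD; the library and the model name are assumptions made for illustration and need not match the implementation used in the paper.

```python
from transformers import pipeline

# A publicly available BERT model fine-tuned on SQuAD; the paper's own fine-tuned
# model may differ.
qa_model = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

def answer_question(question, documents, top_k=3):
    """Step 3: run extractive QA over each retrieved document and collect (answer, score) pairs."""
    candidates = []
    for doc in documents:
        result = qa_model(question=question, context=doc)  # dict with "answer" and "score"
        candidates.append((result["answer"], result["score"]))
    # Keep the highest-scoring candidates across documents.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
```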

Verification and Integration. Given the answers produced by the QA system, we try to match each answer to an instance in the KG. If an answer cannot be matched, it is discarded. For the remaining answers we check whether the class of the matched instance complies with the range restriction of the predicate's definition in the schema. If the range restriction is satisfied, we add the triple (subject, predicate, matched instance) to the KG.

In the special case where we know that at most one triple is allowed for a pair (subject, predicate), we choose the answer with the highest QA confidence score, provided it satisfies the range restriction.
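A sketch of the verification step against a toy in-memory schema; exact label lookup stands in for real instance matching, and the classes, range definitions, and scores are illustrative.

```python
# Toy schema: instance label -> class, and predicate -> expected range class.
INSTANCE_CLASS = {"BRCA-mutant ovarian cancer": "Disease", "niraparib": "Drug"}
PREDICATE_RANGE = {"treatsDisease": "Disease"}

def verify_and_integrate(subject, predicate, answers, kg, at_most_one=False):
    """Step 4: match answers to KG instances, check the range restriction, add surviving triples."""
    for answer, score in sorted(answers, key=lambda pair: pair[1], reverse=True):
        matched = answer if answer in INSTANCE_CLASS else None    # stand-in for instance matching
        if matched is None:
            continue                                              # unmatched answers are discarded
        if INSTANCE_CLASS[matched] != PREDICATE_RANGE[predicate]:
            continue                                              # range restriction violated
        kg.add((subject, predicate, matched))
        if at_most_one:                                           # keep only the best verified answer
            break
    return kg

kg = set()
verify_and_integrate("niraparib", "treatsDisease",
                     [("BRCA-mutant ovarian cancer", 0.87), ("some unmatched answer", 0.20)], kg)
print(kg)  # {('niraparib', 'treatsDisease', 'BRCA-mutant ovarian cancer')}
```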

3 Preliminary Results and Conclusion

The first corpus is a collection of over 3300 biomedical paper abstracts from PubMed containing the term “BRCA1”. The questions and the first three answers are provided in Table 1. For the first question, “Which diseases does niraparib treat?”, the top answer yields the candidate triple (niraparib, treatsDisease, BRCA-mutant ovarian cancer). If this triple had failed the verification stage, we would have continued with the remaining candidates. In this case the answer passes verification, as “BRCA-mutant ovarian cancer” is indeed a disease.

In all three examples we asked what kind of disease is treated by the drugs niraparib, rucaparib, and olaparib. All of these drugs are anti-cancer agents, and the system successfully links each of them to the treatment of cancer. Moreover, the system is able to identify the specific types of cancer and additional details.

The second corpus is a collection of paper abstracts containing the term “rs1045642”, the identifier of a single-nucleotide polymorphism (SNP) in the human genome. Here we target data about mutations; we expect the corresponding KG to be incomplete, since databases of mutations are not kept up to date. Adding such frequently updated data requires inspecting the corresponding literature, so an automated system would clearly be of benefit.

Conclusion. We considered the important and practically relevant task of LP in KGs. Our approach combines existing techniques for identifying potential gaps in a KG with a QA system that extracts the knowledge concealed in a text corpus and formulates new triples. The first experiments show promising results even for domain-specific datasets.

Table 1. Questions and answers.