Elsevier

Journal of Web Semantics

Volume 60, January 2020, 100546
Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop

https://doi.org/10.1016/j.websem.2019.100546

Abstract

The Semantic Web movement has produced a wealth of curated collections of entities and facts, often referred to as Knowledge Graphs. Creating and maintaining such Knowledge Graphs is far from being a solved problem: it is crucial to constantly extract new information from the vast amount of heterogeneous data sources on the Web. In this work we address the task of Knowledge Graph population. Specifically, given any target relation between two entities, we propose an approach to extract positive instances of the relation from various Web sources. Our relation extraction approach introduces a human-in-the-loop component in the extraction pipeline, which delivers a significant advantage over solely automatic approaches. We test our solution on the ISWC 2018 Semantic Web Challenge, with the objective of identifying supply-chain relations among organizations in the Thomson Reuters Knowledge Graph. Our human-in-the-loop extraction pipeline achieves top performance among all competing systems.

Introduction

The vision of the Semantic Web (SW) is to make information on the Web machine understandable. To do so, the Semantic Web promotes machine-readable and machine-understandable graph-based representations of data, which results in large Knowledge Graphs on the Web. Knowledge Graphs are the backbone of many information systems that require access to structured knowledge, and they have been recognized as a valuable source of background information in many data mining, information retrieval, natural language processing, and knowledge extraction tasks [1], [2].

Following the same idea, many organizations build their own Knowledge Graphs (KG) [3], i.e. curated collections of interlinked descriptions of entities and factual information in their business domain of interest. Maintenance of such graphs is crucial: whenever new data is available, it needs to be added to the KG. Knowledge graph population relies on extracting new entities – or new relations between entities – from different sources, which can be unstructured text or other existing knowledge graphs.

To address the challenge of knowledge graph population, in this paper we design a large-scale information extraction system that extracts relations between entities from Web documents and Knowledge Graphs.

Given a set of KG entities, the system is able to extract user-defined target relations between the entities. In the first step, the system generates a large focused crawl of Web documents, which is used for relation extraction. In the second step, to improve the performance of the system, we integrate a human-in-the-loop component. This component is able to (i) generate meaningful training data for the relation extraction system and (ii) effectively reduce the amount of data on which to perform the relation extraction. In the final step, the system uses a state-of-the-art deep neural network for relation mining from text. Furthermore, the system makes use of external knowledge graphs for relation mining, i.e. identifying relations between entities in the knowledge graph that are not explicitly present in the graph. To do so, we combine state-of-the-art knowledge graph mining approaches and deep neural networks. The strength of our solution is that while the external KG is very useful in the case of popular entities, our human-in-the-loop component fills in the gaps for the long-tail entities.
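The three steps described above can be illustrated with a minimal Python sketch. It is not the system described in the paper: the names `Candidate`, `find_candidates`, `human_filter` and `extract_relations` are hypothetical, the crawl is replaced by an in-memory list of documents, the expert input is assumed to take the form of cue terms, and the deep neural network is replaced by an arbitrary scoring function passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A sentence that mentions two KG entities (hypothetical structure)."""
    first: str      # one entity of the pair, in the order given in `entities`
    second: str     # the other entity of the pair
    sentence: str   # the sentence in which both entities occur


def find_candidates(documents: Sequence[str], entities: Sequence[str]) -> List[Candidate]:
    """Step 1 (simplified): scan crawled text for sentences mentioning two entities."""
    found = []
    for doc in documents:
        for raw in doc.split("."):
            sentence = raw.strip()
            mentioned = [e for e in entities if e in sentence]
            # Emit every unordered pair of entities that co-occur in the sentence.
            for i, a in enumerate(mentioned):
                for b in mentioned[i + 1:]:
                    found.append(Candidate(a, b, sentence))
    return found


def human_filter(candidates: Sequence[Candidate], cue_terms: Sequence[str]) -> List[Candidate]:
    """Step 2 (assumption): an expert supplies cue terms; other candidates are dropped."""
    return [
        c for c in candidates
        if any(term in c.sentence.lower() for term in cue_terms)
    ]


def extract_relations(
    candidates: Sequence[Candidate],
    score: Callable[[Candidate], float],
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Step 3: keep the entity pairs whose score reaches the threshold.

    `score` stands in for a trained relation classifier.
    """
    return [(c.first, c.second) for c in candidates if score(c) >= threshold]
```

The point of the sketch is the ordering: the expert-driven filter runs before the classifier, so the most expensive component only sees the reduced candidate set.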

The major contribution of this work is the novel incorporation of human-in-the-loop strategies within a large-scale Information Extraction system. While full automation is often desirable, we show that with strategic and minimal human involvement we can tackle large-scale fact extraction for both head and tail entities of a KG. Furthermore, the system can be easily adopted for any type of KG population task, the only requirement being interaction with a human expert. Indeed, we have also used the method to populate a KG in the pharmaceutical domain [4].

We evaluate our system on the Semantic Web Challenge 2018, with the objective of augmenting the Thomson Reuters permid.org open knowledge graph with facts extracted from Internet sources. The Thomson Reuters permid.org open dataset contains organizations, people and financial entities. The target relation that we populate is the supply chain relation. Given two entities of type organization in the graph, the goal is to assess whether a supplier/customer relation exists between them. One of the major challenges in the given task is that the majority of target organizations are long-tail entities, i.e. they are under-represented in publicly available data. The official evaluation system, provided by the challenge organizers, showed that our approach is able to successfully tackle the challenge, achieving the best performance among the participating solutions.

The rest of this paper is structured as follows. In Section 2, we give an overview of related work. In Section 3, we present our relation extraction approach, followed by an evaluation in Section 4. We conclude with a summary and an outlook on future work.

Section snippets

Relation extraction

The task of Relation Extraction has been extensively addressed in the literature. There is no one-size-fits-all model to solve the task, as much depends on the specific relation to extract and the data at hand. State-of-the-art systems range from early solutions based on SVMs and tree kernels [5], [6], [7], [8], [9] to more recent ones exploiting neural architectures [10], [11], [12]. Regardless of the model, one of the key hurdles – as in many machine learning tasks – is obtaining sufficient relevant

Approach

We develop two approaches for relation extraction: one that extracts relations from Web documents, which we refer to as W-Rex, and one that extracts relations from knowledge graphs, which we refer to as G-Rex.

Evaluation

The evaluation is performed in the context of the Semantic Web Challenge 2018, with the objective of augmenting the Thomson Reuters permid.org open dataset with facts extracted from Internet sources. The target for the evaluation is the supplier/customer relation between two organizations.

Conclusion and future work

Extracting knowledge from heterogeneous data sources remains a fundamentally hard problem, especially when high accuracy is a requirement. Many state-of-the-art Information Extraction systems are either imprecise or suffer from low recall, especially in very specific domains.

In this work we show that adding a human-in-the-loop component to the extraction system nearly doubles the precision and increases the recall by nearly 50%. Interestingly, adding a Knowledge Graph Mining

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (41)

  • Ristoski P. et al. Semantic web in data mining and knowledge discovery: A comprehensive survey. Web Semant.: Sci. Serv. Agents World Wide Web (2016)
  • Ristoski P. et al. Mining the web of linked data with RapidMiner. Web Semant.: Sci. Serv. Agents World Wide Web (2015)
  • Lehmberg O. et al. The Mannheim search join engine. Web Semant. (2015)
  • Paulheim H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semant. Web (2017)
  • Auer S. et al. Towards a knowledge graph for science
  • Gentile A.L. et al. Personalized knowledge graphs for the pharmaceutical domain
  • Bunescu R.C. et al. A shortest path dependency kernel for relation extraction
  • Culotta A. et al. Dependency tree kernels for relation extraction
  • Mooney R.J. et al. Subsequence kernels for relation extraction
  • Zelenko D. et al. Kernel methods for relation extraction. J. Mach. Learn. Res. (2003)
  • Zhao S. et al. Extracting relations with integrated information using kernel methods
  • Nguyen T.H. et al. Relation extraction: Perspective from convolutional neural networks
  • Zeng D. et al. Relation classification via convolutional deep neural network
  • Vu N.T. et al. Combining recurrent and convolutional neural networks for relation classification
  • Augenstein I. et al. Distantly supervised web relation extraction for knowledge base population. Semant. Web (2016)
  • Gentile A.L. et al. Unsupervised wrapper induction using linked data
  • Ji G. et al. Distant supervision for relation extraction with sentence-level attention and entity descriptions
  • Ratner A.J. et al. Data programming: Creating large training sets, quickly
  • Roth B. et al. A survey of noise reduction methods for distant supervision
  • Angeli G. et al. Combining distant and partial supervision for relation extraction