Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop
Introduction
The vision of the Semantic Web (SW) is to make information on the Web machine-understandable. To do so, the Semantic Web promotes machine-readable and machine-understandable graph-based representations of data, which results in large Knowledge Graphs on the Web. Knowledge Graphs are the backbone of many information systems that require access to structured knowledge, and they have been recognized as a valuable source of background information in many data mining, information retrieval, natural language processing, and knowledge extraction tasks [1], [2].
Following the same idea, many organizations build their own Knowledge Graphs (KGs) [3], i.e. curated collections of interlinked descriptions of entities and factual information in their business domain of interest. Maintenance of such graphs is crucial: whenever new data becomes available, it needs to be added to the KG. Knowledge graph population relies on extracting new entities – or new relations between entities – from different sources, which can be unstructured text or other existing knowledge graphs.
To address the challenge of knowledge graph population, in this paper we design a large-scale information extraction system that extracts relations between entities from Web documents and Knowledge Graphs.
Given a set of KG entities, the system extracts user-defined target relations between them. In the first step, the system generates a large focused crawl of Web documents, which is used for relation extraction. In the second step, to improve performance, we integrate a human-in-the-loop component that (i) generates meaningful training data for the relation extraction system and (ii) effectively reduces the amount of data on which relation extraction is performed. In the final step, the system uses a state-of-the-art deep neural network for relation mining from text. Furthermore, the system makes use of external knowledge graphs for relation mining, i.e. identifying relations between entities that are not explicitly present in the graph. To do so, we combine state-of-the-art knowledge graph mining approaches and deep neural networks. The strength of our solution is that, while the external KG is very useful for popular entities, our human-in-the-loop component fills in the gaps for long-tail entities.
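The three steps above can be pictured as a simple pipeline. The following is only a minimal sketch under stated assumptions: the names `Candidate`, `generate_candidates`, `human_filter` and `extract_relations` are hypothetical, the "crawl" is a list of in-memory strings, and plain callables stand in for the human expert and for the deep neural network, whose actual design is not described in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """A sentence mentioning two KG entities that may stand in the target relation."""
    subject: str
    obj: str
    sentence: str


def generate_candidates(documents: List[str], entities: List[str]) -> List[Candidate]:
    """Step 1 stand-in: keep sentences of the crawl that mention two distinct entities."""
    candidates = []
    for doc in documents:
        for sentence in doc.split("."):  # naive sentence splitting (assumption)
            mentioned = [e for e in entities if e in sentence]
            for i, first in enumerate(mentioned):
                for second in mentioned[i + 1:]:
                    candidates.append(Candidate(first, second, sentence.strip()))
    return candidates


def human_filter(candidates: List[Candidate],
                 is_relevant: Callable[[Candidate], bool]) -> List[Candidate]:
    """Step 2 stand-in: a human judgement prunes the data passed to the extractor."""
    return [c for c in candidates if is_relevant(c)]


def extract_relations(candidates: List[Candidate],
                      classifier: Callable[[Candidate], bool]) -> Set[Tuple[str, str]]:
    """Step 3 stand-in: a classifier decides whether the target relation holds."""
    return {(c.subject, c.obj) for c in candidates if classifier(c)}


# Toy data; the organization names are invented for illustration.
docs = ["Acme supplies parts to Globex. Acme and Initech share an office building."]
entities = ["Acme", "Globex", "Initech"]

candidates = generate_candidates(docs, entities)
kept = human_filter(candidates, lambda c: "suppl" in c.sentence)  # hypothetical human rule
relations = extract_relations(kept, lambda c: True)               # placeholder classifier
print(relations)
```

Keeping each step behind a plain function boundary makes it easy to see where human input enters: it only decides which candidates reach the (much more expensive) extraction model.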
The major contribution of this work is the novel incorporation of human-in-the-loop strategies within a large-scale Information Extraction system. While full automation is often desirable, we show that with strategic and minimal human involvement we can tackle large-scale fact extraction for both head and tail entities of a KG. Furthermore, the system can be easily adapted to any type of KG population task, the only requirement being interaction with a human expert. Indeed, we have also used the method to populate a KG in the pharmaceutical domain [4].
We evaluate our system on the Semantic Web Challenge 2018, whose objective is to augment the Thomson Reuters permid.org open knowledge graph with facts extracted from Internet sources. The Thomson Reuters permid.org open dataset contains organizations, people and financial entities. The target relation that we populate is the supply chain relation: given two entities of type organization in the graph, the goal is to assess whether a supplier/customer relation exists between them. One of the major challenges in this task is that the majority of target organizations are long-tail entities, i.e. they are under-represented in publicly available data. The official evaluation system, provided by the challenge organizers, showed that our approach successfully tackles the challenge, achieving the best performance among the participating solutions.
The rest of this paper is structured as follows. In Section 2, we give an overview of related work. In Section 3, we present our relation extraction approach, followed by an evaluation in Section 4. We conclude with a summary and an outlook on future work.
Relation extraction
The task of Relation Extraction has been extensively addressed in the literature. There is no one-size-fits-all model for the task, as much depends on the specific relation to extract and the data at hand. State-of-the-art systems range from early solutions based on SVMs and tree kernels [5], [6], [7], [8], [9] to the most recent ones exploiting neural architectures [10], [11], [12]. Regardless of the model, one of the key hurdles – as in many machine learning tasks – is obtaining sufficient relevant
Approach
We develop two approaches for relation extraction: one that extracts relations from Web documents, which we refer to as W-Rex, and one that extracts relations from knowledge graphs, which we refer to as G-Rex.
Evaluation
The evaluation is performed in the context of the Semantic Web Challenge 2018, whose objective is to augment the Thomson Reuters permid.org open dataset with facts extracted from Internet sources. The target for the evaluation is the supplier/customer relation between two organizations.
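Results for this kind of task are typically summarized with precision and recall over the predicted relation instances. The following is only a generic sketch, not the challenge's official scorer, which is not described here: the function name `precision_recall` is hypothetical, the organization names are invented, and treating each fact as an ordered (supplier, customer) pair is an assumption.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (supplier, customer); the ordering is an assumption


def precision_recall(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Compute precision and recall of predicted pairs against a gold standard."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy example with invented organizations.
gold = {("Acme", "Globex"), ("Initech", "Globex"), ("Acme", "Umbrella")}
predicted = {("Acme", "Globex"), ("Globex", "Acme"), ("Initech", "Globex")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because pairs are ordered in this sketch, a prediction with supplier and customer swapped counts as an error; an evaluation that ignores direction would need to normalize the pairs first.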
Conclusion and future work
Extracting knowledge from heterogeneous data sources remains a fundamentally hard problem, especially when high accuracy is required. Many state-of-the-art Information Extraction systems are either imprecise or suffer from low recall, especially in very specific domains.
In this work we show that adding a human-in-the-loop component to the extraction system nearly doubles the precision and increases the recall by nearly 50%. Interestingly, adding a Knowledge Graph Mining
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (41)
- et al.
Semantic web in data mining and knowledge discovery: A comprehensive survey
Web Semant.: Sci. Serv. Agents World Wide Web
(2016)
- et al.
Mining the web of linked data with RapidMiner
Web Semant.: Sci. Serv. Agents World Wide Web
(2015)
- et al.
The Mannheim search join engine
Web Semant.
(2015)
Knowledge graph refinement: A survey of approaches and evaluation methods
Semant. Web
(2017)
- et al.
Towards a knowledge graph for science
- et al.
Personalized knowledge graphs for the pharmaceutical domain
- et al.
A shortest path dependency kernel for relation extraction
- et al.
Dependency tree kernels for relation extraction
- et al.
Subsequence kernels for relation extraction
- et al.
Kernel methods for relation extraction
J. Mach. Learn. Res.
(2003)
Extracting relations with integrated information using kernel methods
Relation extraction: Perspective from convolutional neural networks
Relation classification via convolutional deep neural network
Combining recurrent and convolutional neural networks for relation classification
Distantly supervised web relation extraction for knowledge base population
Semant. Web
Unsupervised wrapper induction using linked data
Distant supervision for relation extraction with sentence-level attention and entity descriptions
Data programming: Creating large training sets, quickly
A survey of noise reduction methods for distant supervision
Combining distant and partial supervision for relation extraction