Sar-graphs: A language resource connecting linguistic knowledge with semantic relations from knowledge graphs
Introduction
Knowledge graphs are vast networks which store entities and their semantic types, properties and relations. In recent years considerable effort has been invested into constructing these large knowledge bases in academic research, community-driven projects and industrial development. Prominent examples include Freebase [1], Yago [2], [3], DBpedia [4], NELL [5], [6], WikiData [7], PROSPERA [8], Google’s Knowledge Graph [9] and also the Google Knowledge Vault [10]. A parallel and in part independent development is the emergence of several large-scale knowledge resources with a more language-centered focus, such as UWN [11], BabelNet [12], ConceptNet [13], and UBY [14]. These resources are important contributions to the linked data movement, where repositories of world-knowledge and linguistic knowledge complement each other. In this article, we present a method that aims to bridge these two types of resources by automatically building an intermediate resource.
In comparison to (world-)knowledge graphs, the underlying representation and semantic models of linguistic knowledge resources exhibit a greater degree of diversity. ConceptNet makes use of natural-language representations for modeling common-sense information. BabelNet integrates entity information from Wikipedia with word senses from WordNet, as well as with many other resources such as Wikidata and Wiktionary [15]. UWN automatically builds a multilingual WordNet from various resources, similar to UBY, which integrates multiple resources via linking on the word-sense level. Few, if any, of the existing linguistic resources, however, provide a feasible approach to explicitly linking semantic relations from knowledge graphs with their linguistic representations. We aim to fill this gap with the resource whose structure we define in Section 2 and whose construction method we detail in Section 3. Instances of this resource are graphs of semantically-associated relations, which we refer to by the name sar-graphs. Our definition is a formalization of the idea sketched in [16]. We believe that sar-graphs are examples of a new type of knowledge repository, language graphs, as they represent the linguistic patterns for relations in a knowledge graph. A language graph can be thought of as a bridge between the language and knowledge encoded in a knowledge graph, a bridge that characterizes the ways in which a language can express instances of one or several relations, and thus a mapping between strings and things.
The construction strategies of the described (world-)knowledge resources range from (1) integrating existing structured or semi-structured knowledge (e.g., Wikipedia infoboxes) via (2) crowd-sourcing to (3) automatic extraction from semi- and unstructured resources, where often (4) combinations of these are implemented. At the same time the existence of knowledge graphs enabled the development of new technologies for knowledge engineering, e.g., distantly supervised machine-learning methods [8], [17], [18], [19], [20]. Relation extraction is one of the central technologies contributing to the automatic creation of fact databases [10]; conversely, it benefits from the growing number of available factual resources by using them for automatic training and improvement of extraction systems. In Section 3, we describe how our own existing methods [18], which exploit factual knowledge bases for the automatic gathering of linguistic constructions, can be employed for the purpose of sar-graphs. In turn, one of many potential applications of sar-graphs is relation extraction, which we illustrate in Section 7.
An important aspect of the construction of sar-graphs is the disambiguation of their content words with respect to lexical semantics knowledge repositories, thereby generalizing content words with word senses. In addition to making sar-graphs more adjustable to the varying granularity needs of possible applications, this positions sar-graphs as a link hub between a number of formerly independent resources (see Fig. 1). Sar-graphs represent linguistic constructions for semantic relations from factual knowledge bases and incorporate linguistic structures extracted from mentions of knowledge-graph facts in free texts, while at the same time anchoring this information in lexical semantic resources. We go into further detail on this matter in Section 6.
The distantly supervised nature of the proposed construction methodology requires means for automatic and manual confidence estimation for the extracted linguistic structures, presented in Section 4. This is of particular importance when unstructured web texts are exploited for finding linguistic patterns which express semantic relations. Our contribution is the combination of established confidence-estimation strategies [18], [21] with a large manual verification effort for linguistic structures. In our experiments (Section 5), we continue from our earlier work [18], [22], i.e., we employ Freebase as our source of semantic relations and the lexical knowledge base BabelNet for linking word senses. We create sar-graphs for 25 relations, which demonstrates the feasibility of the proposed method, and we make the resource publicly available for this core set of relations.
We demonstrate the usefulness of sar-graphs by applying them to the task of relation extraction, where we identify and compose mentions of argument entities and projections of n-ary semantic relations. We believe that sar-graphs will prove to be a valuable resource for numerous other applications, such as adaptation of parsers to special recognition tasks, text summarization, language generation, query analysis and even interpretation of telegraphic style in highly elliptical texts as found in SMS, Twitter, headlines or brief spoken queries. We therefore make this resource freely available to the community, and hope that other parties will find it of interest (Section 8).
Section snippets
Sar-graphs: a linguistic knowledge resource
Sar-graphs [16] extend the current range of knowledge graphs, which represent factual, relational and common-sense information for one or more languages, with linguistic knowledge, namely, linguistic variants of how semantic relations between abstract concepts and real-world entities are expressed in natural language text.
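As a rough illustration of how linguistic variants of one relation might be held in a single graph, the sketch below merges a few dependency patterns by their labeled edges. This is only a toy model under stated assumptions: the function name `merge_patterns`, the triple layout `(head, label, dependent)`, the argument slots `SPOUSE1`, `SPOUSE2`, `DATE`, and the dependency labels are all hypothetical and are not taken from the article's formal definition.

```python
from collections import defaultdict


def merge_patterns(patterns):
    """Merge dependency patterns into one graph keyed by labeled edge.

    Each pattern is an iterable of (head, label, dependent) triples.
    Returns a dict mapping each edge to the set of pattern indices
    that contain it, so an edge shared by several patterns is stored once.
    """
    graph = defaultdict(set)
    for index, pattern in enumerate(patterns):
        for edge in pattern:
            graph[edge].add(index)
    return dict(graph)


# Toy patterns for a marriage relation (hypothetical slots and labels).
patterns = [
    [("marry", "nsubj", "SPOUSE1"), ("marry", "dobj", "SPOUSE2")],
    [
        ("marry", "nsubj", "SPOUSE1"),
        ("marry", "dobj", "SPOUSE2"),
        ("marry", "prep_in", "DATE"),
    ],
]
sar_graph = merge_patterns(patterns)
```

Keeping the supporting pattern indices on each edge is one simple way to remember which original constructions an edge came from; the actual resource may store this information differently.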
Sar-graph construction
In this section, we describe a general method for constructing sar-graphs. Our method is language- and relation-independent, and relies solely on the availability of a set of seed relation instances from an existing knowledge base. Fig. 4 outlines this process. Given a target relation, a set of seed instances of this relation, and a language, we can create a sar-graph with the following procedure.
- 1. Acquire a set of textual mentions of all seed instances from a text corpus.
- 2.
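Step 1 of the procedure can be illustrated with a minimal distant-supervision sketch. It assumes, as a simplification not taken from the article, that a sentence counts as a mention of a seed instance when it contains the surface strings of at least `min_args` of the instance's arguments. The names `SeedInstance` and `find_mentions`, the `min_args` parameter, and the toy corpus with invented person names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedInstance:
    """A known fact: a relation name plus its argument strings (hypothetical layout)."""
    relation: str
    arguments: tuple


def find_mentions(seeds, sentences, min_args=2):
    """Return (seed, sentence) pairs where the sentence names enough arguments.

    Assumption: case-insensitive substring matching stands in for the
    entity recognition and linking that a real pipeline would need.
    """
    mentions = []
    for sentence in sentences:
        lowered = sentence.lower()
        for seed in seeds:
            hits = sum(1 for arg in seed.arguments if arg.lower() in lowered)
            if hits >= min_args:
                mentions.append((seed, sentence))
    return mentions


# Toy data with invented names, for illustration only.
seeds = [SeedInstance("marriage", ("Alice Example", "Bob Sample", "2001"))]
corpus = [
    "Alice Example married Bob Sample in 2001.",
    "Alice Example gave a talk on Tuesday.",
]
mentions = find_mentions(seeds, corpus)
```

Only the first toy sentence is kept, since the second names a single argument. String matching of this kind is noisy, which is one reason a confidence-estimation step for the extracted structures is needed afterwards.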
Quality control
As discussed in the previous section, our approach to sar-graph construction uses distant supervision for collecting textual mentions of a given target relation. In this section, we present several approaches to automatically compute confidence metrics for candidate dependency structures, and to learn validation thresholds. We also describe an annotation process for manual, expert-driven quality control of extracted dependency structures, and briefly describe the linguistic annotation tool and
Implementation
So far, we have described our methodology for creating the proposed resource of combined lexical, syntactic, and lexical semantic information. In this section, we outline the concrete experiments carried out to compile sar-graphs for 25 semantic relations.
Related work
In the previous sections, we have motivated the construction of sar-graphs and outlined a method of building them from an alignment of web text with known facts. Taking into account the implemented construction methodology, it may seem that sar-graphs can be regarded as a side-product of pattern discovery for relation extraction. However, sar-graphs are a further development of this, i.e., a novel linguistic knowledge resource on top of the results of pattern discovery.
In comparison to
Applications and experiments
We believe that sar-graphs, in addition to their role as an anchor in the linked data world and as a repository of relation phrases, are also a very useful resource for a variety of natural-language processing tasks and real-world applications. In particular, sar-graphs are well-suited for (a) the generation of phrases and sentences from database facts and (b) the detection of fact mentions in text.
The first aspect makes sar-graphs a good candidate for the employment in, e.g., business
Public availability, release
We release our sar-graph data to the public, hoping to support research in the areas of relation extraction, question answering, paraphrase generation, and others. The data is available for download at http://sargraph.dfki.de.
Properties of the release. The released dataset contains English sar-graphs for 25 target relations (Table 2) about corporations, award topics, as well as biographic information, many of them linking more than just two arguments. For now, we limit the released data to
Conclusion
In this article, we present a new linguistic resource called sar-graph, which aggregates knowledge about the means a language provides for expressing a given semantic target relation. We describe a general approach for automatically accumulating such linguistic knowledge, and for merging it into a single connected graph. Furthermore, we discuss different ways of assessing the relevancy of expressions and phrases with respect to the target relation, and outline several graph merging strategies.
Acknowledgments
This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects Deependance (contract 01IW11003), ALL SIDES (contract 01IW14002) and BBDC (contract 01IS14013E), as well as by the ERC Starting Grant MultiJEDI No. 259234, and a Google Focused Research Award granted in July 2013.
References (66)
- et al.
YAGO: A large ontology from Wikipedia and WordNet
Web Semant. Sci. Serv. Agents World Wide Web
(2008)
- et al.
BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network
Artificial Intelligence
(2012)
- et al.
ATOLL—a framework for the automatic induction of ontology lexica
Data Knowl. Eng.
(2014)
- et al.
Freebase: A collaboratively created graph database for structuring human knowledge
- et al.
Yago: A core of semantic knowledge
- et al.
DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia
Semant. Web J.
(2015)
- A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka, T.M. Mitchell, Toward an architecture for never-ending...
- T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J....
- et al.
Wikidata: A free collaborative knowledgebase
Commun. ACM
(2014)
- et al.
Scalable knowledge harvesting with high precision and high recall
Knowledge vault: A web-scale approach to probabilistic knowledge fusion
Towards a universal wordnet by learning from combined evidence
ConceptNet 5: A large semantic network for relational knowledge
Uby: A large-scale unified lexical-semantic resource based on LMF
From strings to things SAR-graphs: A new type of resource for connecting knowledge and language
Distant supervision for relation extraction without labeled data
Large-scale learning of relation-extraction rules with distant supervision from the web
Using distant supervision for extracting relations on a large scale
Distant supervision for relation extraction with an incomplete knowledge base
Semantic rule filtering for web-scale relation extraction
Improvement of n-ary relation extraction by adding lexical semantics to distant-supervision rule learning
A seed-driven bottom-up machine learning framework for extracting relations of various complexity
Bootstrapping relation extraction from semantic seeds
Fact distribution in information extraction
Lang. Resour. Eval.
Inter-sentential relations in information extraction corpora
Information extraction: Capabilities and challenges, Tech. rep
Word sense disambiguation: A survey
ACM Comput. Surv.
Incorporating non-local information into information extraction systems by gibbs sampling