Elsevier

Journal of Web Semantics

Volumes 37–38, March 2016, Pages 112-131
Sar-graphs: A language resource connecting linguistic knowledge with semantic relations from knowledge graphs

https://doi.org/10.1016/j.websem.2016.03.004

Abstract

Recent years have seen significant growth in, and increased usage of, large-scale knowledge resources in both academic research and industry. We can distinguish two main types of knowledge resources: those that store factual information about entities in the form of semantic relations (e.g., Freebase), so-called knowledge graphs, and those that represent general linguistic knowledge (e.g., WordNet or UWN). In this article, we present a third type of knowledge resource which completes the picture by connecting the first two types. Instances of this resource are graphs of semantically-associated relations (sar-graphs), whose purpose is to link semantic relations from factual knowledge graphs with their linguistic representations in human language.

We present a general method for constructing sar-graphs using a language- and relation-independent, distantly supervised approach which, apart from generic language processing tools, relies solely on the availability of a lexical semantic resource, providing sense information for words, as well as a knowledge base containing seed relation instances. Using these seeds, our method extracts, validates and merges relation-specific linguistic patterns from text to create sar-graphs. To cope with the noisily labeled data arising in a distantly supervised setting, we propose several automatic pattern confidence estimation strategies, and also show how manual supervision can be used to improve the quality of sar-graph instances. We demonstrate the applicability of our method by constructing sar-graphs for 25 semantic relations, of which we make a subset publicly available at http://sargraph.dfki.de.

We believe sar-graphs will prove to be useful linguistic resources for a wide variety of natural language processing tasks, and in particular for information extraction and knowledge base population. We illustrate their usefulness with experiments in relation extraction and in computer assisted language learning.

Introduction

Knowledge graphs are vast networks which store entities and their semantic types, properties and relations. In recent years considerable effort has been invested into constructing these large knowledge bases in academic research, community-driven projects and industrial development. Prominent examples include Freebase [1], Yago [2], [3], DBpedia [4], NELL [5], [6], WikiData [7], PROSPERA [8], Google’s Knowledge Graph [9] and also the Google Knowledge Vault [10]. A parallel and in part independent development is the emergence of several large-scale knowledge resources with a more language-centered focus, such as UWN [11], BabelNet [12], ConceptNet [13], and UBY [14]. These resources are important contributions to the linked data movement, where repositories of world-knowledge and linguistic knowledge complement each other. In this article, we present a method that aims to bridge these two types of resources by automatically building an intermediate resource.

In comparison to (world-)knowledge graphs, the underlying representation and semantic models of linguistic knowledge resources exhibit a greater degree of diversity. ConceptNet makes use of natural-language representations for modeling common-sense information. BabelNet integrates entity information from Wikipedia with word senses from WordNet, as well as with many other resources such as Wikidata and Wiktionary [15]. UWN automatically builds a multilingual WordNet from various resources, similar to UBY, which integrates multiple resources via linking on the word-sense level. Few, if any, of the existing linguistic resources, however, provide a feasible approach to explicitly linking semantic relations from knowledge graphs with their linguistic representations. We aim to fill this gap with the resource whose structure we define in Section 2 and whose construction method we detail in Section 3. Instances of this resource are graphs of semantically-associated relations, which we refer to by the name sar-graphs. Our definition is a formalization of the idea sketched in [16]. We believe that sar-graphs are examples of a new type of knowledge repository, language graphs, as they represent the linguistic patterns for relations in a knowledge graph. A language graph can be thought of as a bridge between the language and knowledge encoded in a knowledge graph, a bridge that characterizes the ways in which a language can express instances of one or several relations, and thus a mapping between strings and things.

The construction strategies of the described (world-)knowledge resources range from (1) integrating existing structured or semi-structured knowledge (e.g., Wikipedia infoboxes) via (2) crowd-sourcing to (3) automatic extraction from semi- and unstructured resources, where often (4) combinations of these are implemented. At the same time, the existence of knowledge graphs has enabled the development of new technologies for knowledge engineering, e.g., distantly supervised machine-learning methods [8], [17], [18], [19], [20]. Relation extraction is one of the central technologies contributing to the automatic creation of fact databases [10]; conversely, it benefits from the growing number of available factual resources by using them for automatic training and improvement of extraction systems. In Section 3, we describe how our own existing methods [18], which exploit factual knowledge bases for the automatic gathering of linguistic constructions, can be employed for the purpose of sar-graphs. In turn, one of many potential applications of sar-graphs is relation extraction, which we illustrate in Section 7.

An important aspect of the construction of sar-graphs is the disambiguation of their content words with respect to lexical semantics knowledge repositories, thereby generalizing content words with word senses. In addition to making sar-graphs more adjustable to the varying granularity needs of possible applications, this positions sar-graphs as a link hub between a number of formerly independent resources (see Fig. 1). Sar-graphs represent linguistic constructions for semantic relations from factual knowledge bases and incorporate linguistic structures extracted from mentions of knowledge-graph facts in free texts, while at the same time anchoring this information in lexical semantic resources. We go into further detail on this matter in Section 6.

The distantly supervised nature of the proposed construction methodology requires means for automatic and manual confidence estimation for the extracted linguistic structures, presented in Section 4. This is of particular importance when unstructured web texts are exploited for finding linguistic patterns which express semantic relations. Our contribution is the combination of battle-tested confidence-estimation strategies [18], [21] with a large manual verification effort for linguistic structures. In our experiments (Section 5), we continue from our earlier work [18], [22], i.e., we employ Freebase as our source of semantic relations and the lexical knowledge base BabelNet for linking word senses. We create sar-graphs for 25 relations, which demonstrates the feasibility of the proposed method, and we make the resource publicly available for this core set of relations.

We demonstrate the usefulness of sar-graphs by applying them to the task of relation extraction, where we identify and compose mentions of argument entities and projections of n-ary semantic relations. We believe that sar-graphs will prove to be a valuable resource for numerous other applications, such as adaptation of parsers to special recognition tasks, text summarization, language generation, query analysis and even interpretation of telegraphic style in highly elliptical texts as found in SMS, Twitter, headlines or brief spoken queries. We therefore make this resource freely available to the community, and hope that other parties will find it of interest (Section 8).

Section snippets

Sar-graphs: a linguistic knowledge resource

Sar-graphs [16] extend the current range of knowledge graphs, which represent factual, relational and common-sense information for one or more languages, with linguistic knowledge, namely, linguistic variants of how semantic relations between abstract concepts and real-world entities are expressed in natural language text.

Sar-graph construction

In this section, we describe a general method for constructing sar-graphs. Our method is language- and relation-independent, and relies solely on the availability of a set of seed relation instances from an existing knowledge base. Fig. 4 outlines this process. Given a target relation $r$, a set of seed instances $I_r$ of this relation, and a language $l$, we can create a sar-graph $G_{r,l}$ with the following procedure.

  • 1.

    Acquire a set of textual mentions $M_{r,l}$ of instances $i$ for all $i \in I_r$ from a text corpus.

  • 2.
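As a rough illustration of the mention-acquisition step, the following minimal Python sketch pairs seed instances with sentences that name at least two of their arguments. The function name `acquire_mentions`, the `min_args` parameter, the toy seed and the plain substring matching are all assumptions made for this example; the method described in the article relies on generic language processing tools rather than string lookup.

```python
def acquire_mentions(seeds, sentences, min_args=2):
    """Collect (seed, sentence) pairs for distant supervision.

    A sentence counts as a candidate mention of a seed instance when it
    contains at least `min_args` of the seed's argument strings.
    Substring matching is a simplifying assumption for this sketch.
    """
    mentions = []
    for sentence in sentences:
        lowered = sentence.lower()
        for seed in seeds:
            hits = [arg for arg in seed if arg.lower() in lowered]
            if len(hits) >= min_args:
                mentions.append((seed, sentence))
    return mentions


# Hypothetical seed instance of a "marriage" relation: (spouse, spouse, year).
seeds = [("Marie Curie", "Pierre Curie", "1895")]
sentences = [
    "Marie Curie married Pierre Curie in 1895.",
    "Marie Curie won the Nobel Prize in Physics.",
    "Pierre Curie and Marie Curie worked in Paris.",
]

mentions = acquire_mentions(seeds, sentences)
for seed, sentence in mentions:
    print(sentence)
```

The third sentence is matched although it does not express the marriage relation. This is the kind of noisy label that a distantly supervised setting produces and that later validation has to handle.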

Quality control

As discussed in the previous section, our approach to sar-graph construction uses distant supervision for collecting textual mentions of a given target relation. In this section, we present several approaches to automatically compute confidence metrics for candidate dependency structures, and to learn validation thresholds. We also describe an annotation process for manual, expert-driven quality control of extracted dependency structures, and briefly describe the linguistic annotation tool and
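To make the idea of automatic confidence estimation with a validation threshold more concrete, here is a small Python sketch. It scores each candidate pattern by how exclusively it occurs in mentions of the target relation and keeps patterns above a threshold. The relative-frequency score, the function names, the threshold value and the toy data are assumptions for illustration only and are not claimed to be the metrics used in the article; plain strings stand in for dependency structures.

```python
from collections import Counter


def pattern_confidence(matches_by_relation, target):
    """Score each pattern seen for `target` by its relative frequency:
    occurrences in target-relation mentions / occurrences in all mentions."""
    totals = Counter()
    for patterns in matches_by_relation.values():
        totals.update(patterns)
    target_counts = Counter(matches_by_relation[target])
    return {p: target_counts[p] / totals[p] for p in target_counts}


def validate(confidences, threshold):
    """Keep only patterns whose confidence reaches the threshold."""
    return {p for p, c in confidences.items() if c >= threshold}


# Hypothetical pattern occurrences per relation.
matches = {
    "marriage": ["X married Y", "X married Y", "X and Y", "X married Y"],
    "sibling": ["X and Y", "X and Y", "X and Y", "brother of X"],
}

conf = pattern_confidence(matches, "marriage")
kept = validate(conf, threshold=0.5)
print(conf)
print(kept)
```

In this toy setup, a generic pattern such as "X and Y" occurs mostly outside the target relation and is filtered out, while a relation-specific pattern survives. In practice the threshold would be learned or tuned rather than fixed by hand.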

Implementation

So far, we have described our methodology for creating the proposed resource of combined lexical, syntactic, and lexical semantic information. In this section, we outline the concrete experiments carried out to compile sar-graphs for 25 semantic relations.

Related work

In the previous sections, we have motivated the construction of sar-graphs and outlined a method of building them from an alignment of web text with known facts. Given the implemented construction methodology, it may seem that sar-graphs can be regarded as a by-product of pattern discovery for relation extraction. However, sar-graphs are a further development of this, i.e., a novel linguistic knowledge resource built on top of the results of pattern discovery.

In comparison to

Applications and experiments

We believe that sar-graphs, in addition to their role as an anchor in the linked data world and as a repository of relation phrases, are also a very useful resource for a variety of natural-language processing tasks and real-world applications. In particular, sar-graphs are well-suited for (a) the generation of phrases and sentences from database facts and (b) the detection of fact mentions in text.

The first aspect makes sar-graphs a good candidate for the employment in, e.g., business

Public availability, release

We release our sar-graph data to the public, hoping to support research in the areas of relation extraction, question answering, paraphrase generation, and others. The data is available for download at http://sargraph.dfki.de.

Properties of the release. The released dataset contains English sar-graphs for 25 target relations (Table 2) about corporations, award topics, as well as biographic information, many of them linking more than just two arguments. For now, we limit the released data to

Conclusion

In this article, we present a new linguistic resource called sar-graph, which aggregates knowledge about the means a language provides for expressing a given semantic target relation. We describe a general approach for automatically accumulating such linguistic knowledge, and for merging it into a single connected graph. Furthermore, we discuss different ways of assessing the relevancy of expressions and phrases with respect to the target relation, and outline several graph merging strategies.

Acknowledgments

This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects Deependance (contract 01IW11003), ALL SIDES (contract 01IW14002) and BBDC (contract 01IS14013E), as well as by the ERC Starting Grant MultiJEDI No. 259234, and a Google Focused Research Award granted in July 2013.

References (66)

  • A. Singhal, Introducing the Knowledge Graph: things, not strings, Google Official Blog, May 2012....
  • X. Dong et al., Knowledge vault: A web-scale approach to probabilistic knowledge fusion
  • G. de Melo et al., Towards a universal wordnet by learning from combined evidence
  • R. Speer et al., ConceptNet 5: A large semantic network for relational knowledge
  • I. Gurevych et al., Uby: A large-scale unified lexical-semantic resource based on LMF
  • Wikimedia Foundation, Wiktionary. URL...
  • H. Uszkoreit et al., From strings to things SAR-graphs: A new type of resource for connecting knowledge and language
  • M. Mintz et al., Distant supervision for relation extraction without labeled data
  • S. Krause et al., Large-scale learning of relation-extraction rules with distant supervision from the web
  • L. Jean-Louis et al., Using distant supervision for extracting relations on a large scale
  • B. Min et al., Distant supervision for relation extraction with an incomplete knowledge base
  • A. Moro et al., Semantic rule filtering for web-scale relation extraction
  • H. Li et al., Improvement of n-ary relation extraction by adding lexical semantics to distant-supervision rule learning
  • F. Xu et al., A seed-driven bottom-up machine learning framework for extracting relations of various complexity
  • F. Xu, Bootstrapping relation extraction from semantic seeds (2007)
  • M.-C. de Marneffe, C.D. Manning, Stanford dependencies manual, 2008. URL...
  • M. Stevenson, Fact distribution in information extraction, Lang. Resour. Eval. (2006)
  • K. Swampillai et al., Inter-sentential relations in information extraction corpora
  • R. Grishman, Information extraction: Capabilities and challenges, Tech. rep. (2012)
  • R. Navigli, Word sense disambiguation: A survey, ACM Comput. Surv. (2009)
  • G.R. Doddington, A. Mitchell, M.A. Przybocki, L.A. Ramshaw, S. Strassel, R.M. Weischedel, The automatic content...
  • J.R. Finkel et al., Incorporating non-local information into information extraction systems by Gibbs sampling
