An Extensible Schema for Building Large Weakly-Labeled Semantic Corpora

https://doi.org/10.1016/j.procs.2018.03.009
Under a Creative Commons license
open access

Abstract

In NLP, data drives research, as evidenced by how frequently seminal works of database engineering such as the Penn Treebank have been employed as a basis for experimentation. Traditionally, large-scale, expertly annotated corpora have been expensive and time-consuming to produce.

This cost drove researchers to adopt automated methods for generating labeled data from available resources such as Freebase, DBpedia, and the “infoboxes” found on Wikipedia pages. These knowledge bases have been, or are in the process of being, subsumed by Wikidata, an initiative to concentrate such disparate data repositories in an organized, machine-readable format. This resource is an important research tool. In this paper, we review our experience using Wikidata to construct a large annotated corpus under distant supervision. We also make our materials and the code used to generate our annotations freely available to all interested parties.

Keywords

Distant-supervision
information extraction
corpus annotation
