An Open Linguistic Infrastructure for Annotated Corpora

Ide, Nancy

doi:10.1007/978-3-642-35085-6_10

Nancy Ide

Chapter
First Online: 01 January 2013

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

One means to offset the high cost of corpus creation is to distribute effort among members of the research community, and thereby distribute the cost as well. To this end, the American National Corpus (ANC) project undertook to provide data and linguistic annotations to serve as the base for a collaborative, community-wide resource development effort (the ANC Open Linguistic Infrastructure, ANC-OLI). The fundamental premises of the effort are, first, that all data and annotations must be freely available to all members of the community, without restriction on use or redistribution, and second, that once a base of data and annotations is established, the resources will grow as community members contribute their enhancements and derived data. To ensure maximum flexibility and usability, the project has also developed an infrastructure for representing linguistically annotated resources. This infrastructure is intended to solve some of the usability problems that arise when annotations are produced at different sites, by harmonizing their representation formats. We describe here the resources and infrastructure developed to support this collaborative community development and the efforts to ensure full community engagement.

Notes

1.
www.anc.org
2.
The NSF workshop, held October 29–30, 2006, included the following participants: Collin Baker, Hans Boas, Branimir Boguraev, Nicoletta Calzolari, Christopher Cieri, Christiane Fellbaum, Charles Fillmore, Sanda Harabagiu, Rebecca Hwa, Nancy Ide, Judith Klavans, Adam Meyers, Martha Palmer, Rebecca Passonneau, James Pustejovsky, Janyce Wiebe, and funding organization representatives Tatiana Korelsky (NSF) and Joseph Olive (DARPA). A report summarizing the consensus of the workshop participants is available at http://anc.org/nsf-workshop-2006.
3.
creativecommons.org/licenses/by/3.0/
4.
www.gnu.org/licenses/gpl.html
5.
creativecommons.org/licenses/by-sa/2.5/
6.
Based on entries in the LRE Map, http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml
7.
www.icsi.berkeley.edu/~framenet
8.
nlp.cs.nyu.edu/nomlex/index.html
9.
The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.
10.
The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/
11.
www.anc.org/OANC/index.html
12.
However, since 2005 the ANC project has had no funding for production of additional data.
13.
NSF CRI 0708952
14.
MASC contains about 4 K words of the 10 K LU corpus, eliminating non-English and translated LU texts as well as texts that are not free of usage and redistribution restrictions.
15.
The MASC project commissioned the remainder of the annotation from the Penn Treebank project.
16.
Lack of funding for processing the data currently prevents its publication.
17.
www.biomedcentral.com
18.
www.plos.org
19.
www.anc.org/contribute.html
20.
www.ldc.upenn.edu
21.
liberalarts.iupui.edu/icic/research/corpus_of_philanthropic_fundraising_discourse
22.
newsouthvoices.uncc.edu/
23.
http://quod.lib.umich.edu/m/micase/
24.
Allowing annotations to reference other annotations differentiates GrAF from other representation formats, such as Annotation Graphs [2]
25.
For more details, see Chiarcos, et al., in this volume.
26.
linguistics.okfn.org/llod
27.
http://sourceforge.net/projects/iso-graf/
28.
General Architecture for Text Engineering; http://gate.ac.uk
29.
Taken from Field of Dreams; see http://en.wikipedia.org/wiki/Field_of_Dreams
30.
The Charniak and Johnson (2005) parser, MaltParser, and LHT dependency converter.
31.
http://aclweb.org/aclwiki/index.php?title=SemEval_Portal
32.
http://opennlp.apache.org
33.
Such repositories were set up to answer the call for resource reusability which, no doubt in large part because information added to these resources was until recently unlikely to be usable by others, always referred to the consumer-only model.
34.
http://www.languagelibrary.eu
35.
http://anawiki.essex.ac.uk/phrasedetectives/

References

Basile V, Bos J, Evang K, Venhuizen N (2012) Developing a large semantically annotated corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey, pp 3196–3200
Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1–2):23–60
Chiarcos C (2008) An ontology of linguistic annotations. LDV Forum 23(1):1–16
Chiarcos C (2012) Ontologies of linguistic annotation: survey and perspectives. In: Proceedings of the eighth international conference on language resources and evaluation (LREC), Istanbul, Turkey
Chiarcos C, Ritz J, Stede M (2012) By all these lovely tokens…merging conflicting tokenizations. Lang Resour Eval 46(1):53–74
Dipper S (2005) XML-based stand-off representation and exploitation of multi-level linguistic annotation. In: Eckstein R, Tolksdorf R (eds) Berliner XML Tage, pp 39–50
Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages. Springer, Dordrecht
Ferrucci D, Lally A (2004) UIMA: an architectural approach to unstructured information processing in the corporate research environment. J Nat Lang Eng 10(3–4):327–348
Fillmore CJ, Jurafsky D, Ide N, Macleod C (1998) An American national corpus: a proposal. In: Proceedings of the first annual conference on language resources and evaluation. European Language Resources Association, Paris, pp 965–969
Ide N (2012) MultiMASC: An open linguistic infrastructure for language research. In: Proceedings of the fifth workshop on building and using comparable corpora, Istanbul, Turkey
Ide N, Romary L (2004) International standard for a linguistic annotation framework. J Nat Lang Eng 10(3–4):211–225
Ide N, Suderman K (2006) An open linguistic infrastructure for American English. In: Proceedings of the fifth language resources and evaluation conference (LREC). European Language Resources Association, Paris, Genoa, Italy
Ide N, Suderman K (2007) GrAF: a graph-based format for linguistic annotations. In: Proceedings of the first linguistic annotation workshop, Prague, Czech Republic, pp 1–8
Ide N, Suderman K (Submitted) The linguistic annotation framework: a standard for annotation interchange and merging. Lang Resour Eval, in press
ISO 24612 (2012) Language resource management – linguistic annotation framework. International Standard ISO 24612
Janin A, Baron D, Edwards J, Ellis D, Gelbart D, Morgan N, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) The ICSI meeting corpus. In: Proceedings of ICASSP-03, Hong Kong, pp 364–367
Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright SE (2008) ISOcat: corralling data categories in the wild. In: Proceedings of the sixth international conference on language resources and evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA)
Klyne G, Carroll JJ (2004) Resource description framework (RDF): concepts and abstract syntax. World Wide Web Consortium, Recommendation REC-RDF-Concepts-20040210
Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330
Miller S, Guinness J, Zamanian A (2004) Name tagging with word clusters and discriminative training. In: Dumais S, Marcu D, Roukos S (eds) HLT-NAACL 2004: main proceedings. Association for Computational Linguistics, Boston, MA, USA, pp 337–342
Nowak S, Rüger S (2010) How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the international conference on multimedia information retrieval. ACM, New York. MIR ’10, pp 557–566. doi:10.1145/1743384.1743478, http://doi.acm.org/10.1145/1743384.1743478
Passonneau RJ, Baker CF, Fellbaum C, Ide N (2012) The MASC word sense corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA)
Pradhan SS, Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2007) OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the international conference on semantic computing. IEEE Computer Society, Washington, DC, pp 517–526
Prud’hommeaux E, Seaborne A (2007) SPARQL query language for RDF (working draft). Technical report, W3C. http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/

Acknowledgements

This work was supported in part by National Science Foundation grant CRI-0708952.

Author information

Authors and Affiliations

Vassar College, Poughkeepsie, NY, USA
Nancy Ide

Corresponding author

Correspondence to Nancy Ide.

Editor information

Editors and Affiliations

Department of Computer Science Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt, Darmstadt, Germany
Iryna Gurevych & Jungi Kim

About this chapter

Cite this chapter

Ide, N. (2013). An Open Linguistic Infrastructure for Annotated Corpora. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_10

DOI: https://doi.org/10.1007/978-3-642-35085-6_10
Published: 21 February 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)
