An Open Linguistic Infrastructure for Annotated Corpora

Abstract

One means to offset the high cost of corpus creation is to distribute effort among members of the research community, and thereby distribute the cost as well. To this end, the American National Corpus (ANC) project undertook to provide data and linguistic annotations to serve as the base for a collaborative, community-wide resource development effort (the ANC Open Linguistic Infrastructure, ANC-OLI). The fundamental premises of the effort are, first, that all data and annotations must be freely available to all members of the community, without restriction on use or redistribution, and second, that once a base of data and annotation is established, the resources will grow as community members contribute their enhancements and derived data. To ensure maximum flexibility and usability, the project has also developed an infrastructure for representing linguistically annotated resources, which harmonizes representation formats in order to solve some of the usability problems that arise when annotations are produced at different sites. We describe here the resources and infrastructure developed to support this collaborative community development and the efforts to ensure full community engagement.

Notes

  1. www.anc.org

  2. The NSF workshop, held October 29–30, 2006, included the following participants: Collin Baker, Hans Boas, Branimir Boguraev, Nicoletta Calzolari, Christopher Cieri, Christiane Fellbaum, Charles Fillmore, Sanda Harabagiu, Rebecca Hwa, Nancy Ide, Judith Klavans, Adam Meyers, Martha Palmer, Rebecca Passonneau, James Pustejovsky, Janyce Wiebe, and funding organization representatives Tatiana Korelsky (NSF) and Joseph Olive (DARPA). A report summarizing the consensus of the workshop participants is available at http://anc.org/nsf-workshop-2006.

  3. creativecommons.org/licenses/by/3.0/

  4. www.gnu.org/licenses/gpl.html

  5. creativecommons.org/licenses/by-sa/2.5/

  6. Based on entries in the LRE Map, http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml

  7. www.icsi.berkeley.edu/~framenet

  8. nlp.cs.nyu.edu/nomlex/index.html

  9. The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.

  10. The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/

  11. www.anc.org/OANC/index.html

  12. However, since 2005 the ANC project has had no funding for production of additional data.

  13. NSF CRI 0708952

  14. MASC contains about 4 K words of the 10 K LU corpus, eliminating non-English and translated LU texts as well as texts that are not free of usage and redistribution restrictions.

  15. The MASC project commissioned the remainder of the annotation from the Penn Treebank project.

  16. Lack of funding for processing the data currently prevents its publication.

  17. www.biomedcentral.com

  18. www.plos.org

  19. www.anc.org/contribute.html

  20. www.ldc.upenn.edu

  21. liberalarts.iupui.edu/icic/research/corpus_of_philanthropic_fundraising_discourse

  22. newsouthvoices.uncc.edu/

  23. http://quod.lib.umich.edu/m/micase/

  24. Allowing annotations to reference other annotations differentiates GrAF from other representation formats, such as Annotation Graphs [2].

  25. For more details, see Chiarcos et al. in this volume.

  26. linguistics.okfn.org/llod

  27. http://sourceforge.net/projects/iso-graf/

  28. General Architecture for Text Engineering; http://gate.ac.uk

  29. Taken from Field of Dreams; see http://en.wikipedia.org/wiki/Field_of_Dreams

  30. The Charniak and Johnson (2005) parser, MaltParser, and LHT dependency converter.

  31. http://aclweb.org/aclwiki/index.php?title=SemEval_Portal

  32. http://opennlp.apache.org

  33. Such repositories were set up to answer the call for resource reusability which, no doubt in large part because information added to these resources was until recently unlikely to be usable by others, always referred to the consumer-only model.

  34. http://www.languagelibrary.eu

  35. http://anawiki.essex.ac.uk/phrasedetectives/

References

  1. Basile V, Bos J, Evang K, Venhuizen N (2012) Developing a large semantically annotated corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey, pp 3196–3200

  2. Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1–2):23–60

  3. Chiarcos C (2008) An ontology of linguistic annotations. LDV Forum 23(1):1–16

  4. Chiarcos C (2012) Ontologies of linguistic annotation: survey and perspectives. In: Proceedings of the eighth international conference on language resources and evaluation (LREC), Istanbul, Turkey

  5. Chiarcos C, Ritz J, Stede M (2012) By all these lovely tokens…merging conflicting tokenizations. Lang Resour Eval 46(1):53–74

  6. Dipper S (2005) XML-based stand-off representation and exploitation of multi-level linguistic annotation. In: Eckstein R, Tolksdorf R (eds) Berliner XML Tage, pp 39–50

  7. Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages. Springer, Dordrecht

  8. Ferrucci D, Lally A (2004) UIMA: an architectural approach to unstructured information processing in the corporate research environment. J Nat Lang Eng 10(3–4):327–348

  9. Fillmore CJ, Jurafsky D, Ide N, Macleod C (1998) An American national corpus: a proposal. In: Proceedings of the first annual conference on language resources and evaluation. European Language Resources Association, Paris, pp 965–969

  10. Ide N (2012) MultiMASC: An open linguistic infrastructure for language research. In: Proceedings of the fifth workshop on building and using comparable corpora, Istanbul, Turkey

  11. Ide N, Romary L (2004) International standard for a linguistic annotation framework. J Nat Lang Eng 10(3–4):211–225

  12. Ide N, Suderman K (2006) An open linguistic infrastructure for American English. In: Proceedings of the fifth language resources and evaluation conference (LREC). European Language Resources Association, Paris, Genoa, Italy

  13. Ide N, Suderman K (2007) GrAF: a graph-based format for linguistic annotations. In: Proceedings of the first linguistic annotation workshop, Prague, Czech Republic, pp 1–8

  14. Ide N, Suderman K (Submitted) The linguistic annotation framework: a standard for annotation interchange and merging. Lang Resour Eval, in press

  15. ISO 24612 (2012) Language resource management – linguistic annotation framework. International Standard ISO 24612

  16. Janin A, Baron D, Edwards J, Ellis D, Gelbart D, Morgan N, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) The ICSI meeting corpus. In: Proceedings of ICASSP-03, Hong Kong, pp 364–367

  17. Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright SE (2008) ISOcat: corralling data categories in the wild. In: Proceedings of the sixth international conference on language resources and evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA)

  18. Klyne G, Carroll JJ (2004) Resource description framework (RDF): concepts and abstract syntax. World Wide Web Consortium, Recommendation REC-RDF-Concepts-20040210

  19. Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330

  20. Miller S, Guinness J, Zamanian A (2004) Name tagging with word clusters and discriminative training. In: Dumais S, Marcu D, Roukos S (eds) HLT-NAACL 2004: main proceedings. Association for Computational Linguistics, Boston, MA, USA, pp 337–342

  21. Miller S, Guinness J, Zamanian A (2004) Name tagging with word clusters and discriminative training. In: Proceedings of human language technologies, Boston, MA, USA, pp 337–342

  22. Nowak S, Rüger S (2010) How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the international conference on multimedia information retrieval. ACM, New York. MIR ’10, pp 557–566. doi:10.1145/1743384.1743478, http://doi.acm.org/10.1145/1743384.1743478

  23. Passonneau RJ, Baker CF, Fellbaum C, Ide N (2012) The MASC word sense corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA)

  24. Pradhan SS, Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2007) OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the international conference on semantic computing. IEEE Computer Society, Washington, DC, pp 517–526

  25. Prud’hommeaux E, Seaborne A (2007) SPARQL query language for RDF (working draft). Technical report, W3C. http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/

Acknowledgements

This work was supported in part by National Science Foundation grant CRI-0708952.

Author information

Correspondence to Nancy Ide.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ide, N. (2013). An Open Linguistic Infrastructure for Annotated Corpora. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_10

  • DOI: https://doi.org/10.1007/978-3-642-35085-6_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35084-9

  • Online ISBN: 978-3-642-35085-6

  • eBook Packages: Computer Science (R0)
