Abstract
One way to offset the high cost of corpus creation is to distribute the effort, and thereby the cost, among members of the research community. To this end, the American National Corpus (ANC) project undertook to provide data and linguistic annotations to serve as the base for a collaborative, community-wide resource development effort (the ANC Open Linguistic Infrastructure, ANC-OLI). The effort rests on two premises: first, that all data and annotations must be freely available to all members of the community, without restriction on use or redistribution; and second, that once a base of data and annotations is established, the resources will grow as community members contribute their enhancements and derived data. To ensure maximum flexibility and usability, the project has also developed an infrastructure for representing linguistically annotated resources. It is intended to solve some of the usability problems that arise when annotations are produced at different sites, by harmonizing their representation formats. We describe here the resources and infrastructure developed to support this collaborative community development and the efforts to ensure full community engagement.
Notes
- 1.
- 2.
The NSF workshop, held October 29–30, 2006, included the following participants: Collin Baker, Hans Boas, Branimir Boguraev, Nicoletta Calzolari, Christopher Cieri, Christiane Fellbaum, Charles Fillmore, Sanda Harabagiu, Rebecca Hwa, Nancy Ide, Judith Klavans, Adam Meyers, Martha Palmer, Rebecca Passonneau, James Pustejovsky, Janyce Wiebe, and funding organization representatives Tatiana Korelsky (NSF) and Joseph Olive (DARPA). A report summarizing the consensus of the workshop participants is available at http://anc.org/nsf-workshop-2006.
- 3.
creativecommons.org/licenses/by/3.0/
- 4.
- 5.
creativecommons.org/licenses/by-sa/2.5/
- 6.
Based on entries in the LRE Map, http://www.resourcebook.eu/LreMap/faces/views/resourceMap.xhtml
- 7.
- 8.
- 9.
The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.
- 10.
The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/
- 11.
- 12.
However, since 2005 the ANC project has had no funding for production of additional data.
- 13.
NSF CRI 0708952
- 14.
MASC contains about 4 K words of the 10 K LU corpus, eliminating non-English and translated LU texts as well as texts that are not free of usage and redistribution restrictions.
- 15.
The MASC project commissioned the remainder of the annotation from the Penn Treebank project.
- 16.
Lack of funding for processing the data currently prevents its publication.
- 17.
- 18.
- 19.
- 20.
- 21.
liberalarts.iupui.edu/icic/research/corpus_of_philanthropic_fundraising_discourse
- 22.
newsouthvoices.uncc.edu/
- 23.
- 24.
Allowing annotations to reference other annotations differentiates GrAF from other representation formats, such as Annotation Graphs [2].
- 25.
For more details, see Chiarcos et al. in this volume.
- 26.
linguistics.okfn.org/llod
- 27.
- 28.
General Architecture for Text Engineering; http://gate.ac.uk
- 29.
Taken from Field of Dreams; see http://en.wikipedia.org/wiki/Field_of_Dreams
- 30.
The Charniak and Johnson (2005) parser, MaltParser, and LHT dependency converter.
- 31.
- 32.
- 33.
Such repositories were set up to answer the call for resource reusability, which always referred to the consumer-only model, no doubt in large part because information added to these resources was until recently unlikely to be usable by others.
- 34.
- 35.
References
Basile V, Bos J, Evang K, Venhuizen N (2012) Developing a large semantically annotated corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey, pp 3196–3200
Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1–2):23–60
Chiarcos C (2008) An ontology of linguistic annotations. LDV Forum 23(1):1–16
Chiarcos C (2012) Ontologies of linguistic annotation: survey and perspectives. In: Proceedings of the eighth international conference on language resources and evaluation (LREC), Istanbul, Turkey
Chiarcos C, Ritz J, Stede M (2012) By all these lovely tokens…merging conflicting tokenizations. Lang Resour Eval 46(1):53–74
Dipper S (2005) XML-based stand-off representation and exploitation of multi-level linguistic annotation. In: Eckstein R, Tolksdorf R (eds) Berliner XML Tage, pp 39–50
Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: an ontology for the semantic web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages. Springer, Dordrecht
Ferrucci D, Lally A (2004) UIMA: an architectural approach to unstructured information processing in the corporate research environment. J Nat Lang Eng 10(3–4):327–348
Fillmore CJ, Jurafsky D, Ide N, Macleod C (1998) An American national corpus: a proposal. In: Proceedings of the first annual conference on language resources and evaluation. European Language Resources Association, Paris, pp 965–969
Ide N (2012) MultiMASC: an open linguistic infrastructure for language research. In: Proceedings of the fifth workshop on building and using comparable corpora, Istanbul, Turkey
Ide N, Romary L (2004) International standard for a linguistic annotation framework. J Nat Lang Eng 10(3–4):211–225
Ide N, Suderman K (2006) An open linguistic infrastructure for American English. In: Proceedings of the fifth language resources and evaluation conference (LREC), Genoa, Italy. European Language Resources Association, Paris
Ide N, Suderman K (2007) GrAF: a graph-based format for linguistic annotations. In: Proceedings of the first linguistic annotation workshop, Prague, Czech Republic, pp 1–8
Ide N, Suderman K (Submitted) The linguistic annotation framework: a standard for annotation interchange and merging. Lang Resour Eval, in press
ISO 24612 (2012) Language resource management – linguistic annotation framework. International Standard ISO 24612
Janin A, Baron D, Edwards J, Ellis D, Gelbart D, Morgan N, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) The ICSI meeting corpus. In: Proceedings of ICASSP-03, Hong Kong, pp 364–367
Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright SE (2008) ISOcat: corralling data categories in the wild. In: Proceedings of the sixth international conference on language resources and evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA)
Klyne G, Carroll JJ (2004) Resource description framework (RDF): concepts and abstract syntax. World Wide Web Consortium, Recommendation REC-RDF-Concepts-20040210
Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330
Miller S, Guinness J, Zamanian A (2004) Name tagging with word clusters and discriminative training. In: Dumais S, Marcu D, Roukos S (eds) HLT-NAACL 2004: main proceedings. Association for Computational Linguistics, Boston, MA, USA, pp 337–342
Nowak S, Rüger S (2010) How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the international conference on multimedia information retrieval. ACM, New York. MIR ’10, pp 557–566. doi:10.1145/1743384.1743478, http://doi.acm.org/10.1145/1743384.1743478
Passonneau RJ, Baker CF, Fellbaum C, Ide N (2012) The MASC word sense corpus. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA)
Pradhan SS, Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2007) OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the international conference on semantic computing. IEEE Computer Society, Washington, DC, pp 517–526
Prud’hommeaux E, Seaborne A (2007) SPARQL query language for RDF (working draft). Technical report, W3C. http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/
Acknowledgements
This work was supported in part by National Science Foundation grant CRI-0708952.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ide, N. (2013). An Open Linguistic Infrastructure for Annotated Corpora. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_10
DOI: https://doi.org/10.1007/978-3-642-35085-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)