Abstract
In this chapter we review the corpora currently available for studying anaphoric interpretation and the tools that can be used to create new ones. The survey of annotated corpora ranges from the corpora and guidelines developed for the Message Understanding Conferences MUC-6 (1996) and MUC-7 (1998), which were seminal to the field, to the resources recently made available as part of the 2010 SemEval evaluation campaign. The fundamental design decisions regarding annotation formats and standards are described, and the relevant properties of the corpora are presented in a uniform, structured way. In addition, three widely used and freely available annotation tools (CALLISTO, MMAX2, and Palinka) are described; they can be used when new annotation work proves necessary.
Notes
- 1.
- 2.
As discussed in chapter “Linguistic and Cognitive Evidence About Anaphora”, many types of expressions are anaphoric to some degree, but by far the most studied type of anaphoric reference in computational linguistics is reference via noun phrases. In this chapter, as in the rest of the book, we therefore focus on coding schemes and corpora for NP anaphoric reference.
- 3.
- 4.
- 5.
These are persons, organizations, locations, temporal expressions, and numerical expressions—see, e.g., [22].
- 6.
- 7.
npaper 9801.139
- 8.
- 9.
- 10.
- 11.
- 12.
The anaphorically annotated versions of LDC corpora such as the RST Discourse Treebank and the TRAINS-93 corpus require previous purchase of the original corpora.
- 13.
- 14.
- 15.
The portion of AnCora annotated with coreference information (AnCora-CO) amounts to a total of 400,000 words for each language.
- 16.
Elliptical subjects were manually inserted as part of the treebank.
- 17.
- 18.
- 19.
The morphosyntactic and semantic tag sets differ between languages.
- 20.
In the case of OntoNotes, the singletons were heuristically added.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
A notable exception is the OntoNotes corpus, where all semantic levels are stored in a unified format in a database [64].
- 31.
- 32.
In MUC and other projects, the MUC scoring metric was used (see chapter “Evaluation Campaigns”). The MUC-6 annotators reached an agreement level of F1 = 0.83 [30], comparable with later efforts such as the German TüBa-D/Z corpus (F1 = 0.83, [78]), or the Dutch COREA corpus (F1 = 0.76, [27]), which relied on more refined annotation guidelines.
- 33.
Researchers working on languages for which not even chunkers exist need to be aware that the corpora they create will probably only be usable for linguistic studies.
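The agreement figures in note 32 are F1 values under the MUC scoring metric. As a minimal sketch, assuming the link-based formulation usually attributed to Vilain et al. (1995) and treating one annotator's coreference chains as the key and the other's as the response, such a score could be computed as follows. The function names and the toy chains are illustrative only and are not taken from any of the corpora discussed here.

```python
def muc_recall(key, response):
    """Link-based recall: sum(|K| - p(K)) / sum(|K| - 1) over key chains K,
    where p(K) is the number of pieces K is split into by the response."""
    # Map every mention in the response to the index of its chain.
    chain_of = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = denominator = 0
    for chain in key:
        if len(chain) < 2:
            continue  # singletons contain no links and do not contribute
        # Mentions missing from the response each count as their own piece.
        pieces = {chain_of[m] for m in chain if m in chain_of}
        missing = sum(1 for m in chain if m not in chain_of)
        numerator += len(chain) - (len(pieces) + missing)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_f1(key, response):
    """Harmonic mean of link-based recall and precision.
    Precision is recall with the roles of key and response swapped."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical example: annotator A marks one chain of four mentions,
# annotator B splits the same mentions into two chains.
annotator_a = [{"a", "b", "c", "d"}]
annotator_b = [{"a", "b"}, {"c", "d"}]
score = muc_f1(annotator_a, annotator_b)
```

Because the F1 value is symmetric in its two arguments, it does not matter which annotator is treated as the key when the score is used as an agreement measure.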
References
ACE: Annotation guidelines for entity detection and tracking (EDT) (2004). Version 4.2.6
Aone, C., Bennett, S.: Evaluating automated and manual acquisition of anaphora resolution strategies. In: Proceedings of ACL, Cambridge (1995)
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34 (4), 555–596 (2008). An early version of this paper has been circulating since 2005 as “Kappa³ = Alpha (or Beta)”. This version is still available from the ARRAU website
Bard, E.G., Anderson, A.H., Sotillo, C., Aylett, M., Doherty-Sneddon, G., Newlands, A.: Controlling the intelligibility of referring expressions. J. Mem. Lang. 42, 1–22 (2000)
Bird, S., Day, D., Garofolo, J., Henderson, J., Laprun, C., Liberman, M.: Atlas: a flexible and extensible architecture for linguistic annotation. http://arxiv.org/abs/cs/0007022 (2000)
Botley, S.P.: Indirect anaphora: testing the limits of corpus-based linguistics. Int. J. Corpus Linguist. 11 (1), 73–112 (2006)
Bretonnel Cohen, K., Verspoor, K., Bada, M., Funk, C., Hunter, L.: The Colorado richly annotated full text (CRAFT) corpus: multi-model annotation in the biomedical domain. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation. Springer, Berlin (forthcoming)
Bruneseaux, F., Romary, L.: Codage des références et coréférences dans les dialogues homme-machine. In: Proceedings of ACH-ALLC, Kingston (1997)
Byron, D.: Resolving pronominal references to abstract entities. In: Proceedings of the ACL, Philadelphia, pp. 80–87 (2002)
Carletta, J.: Assessing agreement on classification tasks: the kappa statistic. Comput. Linguist. 22 (2), 249–254 (1996)
Carlson, L., Marcu, D., Okurowski, M.E.: Building a discourse-tagged corpus in the framework of rhetorical structure theory. In: van Kuppevelt, J., Smith, R. (eds.) Current Directions in Discourse and Dialogue, pp. 85–112. Kluwer Academic, Dordrecht/Boston (2003)
Chafe, W.L.: The Pear Stories: Cognitive, Cultural and Linguistic Aspects of Narrative Production. Ablex, Norwood (1980)
Cheng, H.: Modelling aggregation motivated interactions in descriptive text generation. Ph.D. thesis, Division of Informatics, the University of Edinburgh, Edinburgh (2001)
Chinchor, N.A.: Overview of MUC-7/MET-2. In: Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax (1998)
Collovini, S., Carbonel, T., Thielsen Fuchs, J., Coelho, J.C., Rino, L., Vieira, R.: Summit: um corpus anotado com informações discursivas visando à sumarização automática. In: 52nd Workshop em Tecnologia da Informação e da Linguagem Humana (TIL’2007), Rio de Janeiro (2007)
van Deemter, K., Kibble, R.: On coreferring: coreference in MUC and related annotation schemes. Comput. Linguist. 26 (4), 629–637 (2000). Squib
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., Weischedel, R.: The automatic content extraction (ACE) program–tasks, data, and evaluation. In: Proceedings of LREC, Athens (2000)
Gardent, C., Manuélian, H.: Création d’un corpus annoté pour le traitement des descriptions définies. Traitement Automatique des Langues 46 (1), 115–140 (2005). http://www.loria.fr/~gardent/publis/tal2005.pdf
Gasperin, C., Karamanis, N., Seal, R.: Annotation of anaphoric relations in biomedical full-text articles using a domain-relevant scheme. In: Proceedings of DAARC 2007, Lagos, pp. 19–24 (2007)
Ge, N., Hale, J., Charniak, E.: A statistical approach to anaphora resolution. In: Proceedings of Sixth Workshop on Very Large Corpora (WVLC/EMNLP) (1998)
Grishman, R.: Coreference task definition. Technical report, NYU (1995). http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_1.html
Grishman, R.: Named entity task definition. Technical report, NYU (1995). http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html
Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia (1995)
Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: Proceedings of the 16th COLING, COLING ’96, pp. 466–471. Association for Computational Linguistics, Stroudsburg (1996). doi:10.3115/992628.992709
Hajič, J., Böhmová, A., Hajičová, E., Vidová-Hladká, B.: The Prague dependency treebank: a three-level annotation scenario. In: Abeillé, A. (ed.) Treebanks: Building and Using Parsed Corpora, pp. 103–127. Kluwer Academic, Amsterdam (2000)
Hasler, L., Orasan, C., Naumann, K.: NPs for events: experiments in coreference annotation. In: Proceedings of LREC, Genoa (2006)
Hendrickx, I., Bouma, G., Coppens, F., Daelemans, W., Hoste, V., Kloosterman, G., Mineur, A.M., Van Der Vloet, J., Verschelde, J.L.: A coreference corpus and resolution system for Dutch. In: Proceedings of LREC, Marrakech (2008)
Hinrichs, E., Kübler, S., Naumann, K.: A unified representation for morphological, syntactic, semantic and referential annotations. In: ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, Ann Arbor (2005)
Hirschman, L.: MUC-7 coreference task definition, version 3.0. In: Chinchor, N. (ed.) Proceedings of the 7th Message Understanding Conference (1998). Available at http://www.muc.saic.com/proceedings/muc_7_toc.html
Hirschman, L., Robinson, P., Burger, J., Vilain, M.: Automating coreference: the role of annotated training data. In: Proceedings of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing (1997). http://arxiv.org/pdf/cmp-lg/9803001
Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of LREC, Granada (1998)
Ide, N., Pustejovsky, J. (eds.): Handbook of Linguistic Annotation. Springer, Berlin (forthcoming)
Iida, R., Komachi, M., Inui, K., Matsumoto, Y.: Annotating a Japanese text corpus with predicate-argument and coreference relations. In: Proceedings of the ACL Linguistic Annotation Workshop (LAW), Prague, pp. 132–139 (2007)
Iida, R., Poesio, M.: A cross-lingual ILP solution to zero anaphora resolution. In: Proceedings of ACL. ACL, Boulder (2011)
Isard, A.: An XML architecture for the HCRC map task corpus. In: Kühnlein, P., Rieser, H., Zeevat, H. (eds.) Proceedings of BI-DIALOG (2001)
Kabadjov, M.A.: Task-oriented evaluation of anaphora resolution. Ph.D. thesis, Department of Computing and Electronic Systems, University of Essex, Colchester (2007)
Karamanis, N.: Entity coherence for descriptive text structuring. Ph.D. thesis, University of Edinburgh, Informatics (2003)
Klein, M., Bernsen, N.O., Davies, S., Dybkjaer, L., Garrido, J., Kasch, H., Mengel, A., Pirelli, V., Poesio, M., Quazza, S., Soria, C.: Supported coding schemes. Deliverable 1.1, The MATE Consortium. mate.nis.sdu.dk/about/deliverables.html (1998)
Krasavina, O., Chiarcos, C.: The Potsdam coreference scheme. In: Proceedings of the 1st Linguistic Annotation Workshop, pp. 156–163 (2007)
Kučová, L., Hajičová, E.: Coreferential relations in the Prague dependency treebank. In: Proceedings of DAARC, pp. 94–102 (2004)
LDC: ACE (Automatic Content Extraction) English annotation guidelines for entities, version 5.6.1 (2004)
Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Lenzi, V.B., Sprugnoli, R.: I-CAB: the Italian content annotation bank. In: Proceedings of LREC, Genoa (2006)
McCarthy, J.F., Lehnert, W.G.: Using decision trees for coreference resolution. In: Proceedings of IJCAI, Montréal (1995)
McKelvie, D., Isard, A., Mengel, A., Moeller, M.B., Grosse, M., Klein, M.: The MATE workbench – an annotation tool for XML corpora. Speech Commun. 33 (1–2), 97–112 (2001)
Moser, M., Moore, J.D.: Toward a synthesis of two accounts of discourse structure. Comput. Linguist. 22 (3), 409–419 (1996)
Müller, M.C.: Fully automatic resolution of it, this and that in unrestricted multi-party dialog. Ph.D. thesis, Universität Tübingen (2008)
Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. In: Braun, S., Kohn, K., Mukherjee, J. (eds.) Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. English Corpus Linguistics, vol. 3, pp. 197–214. Peter Lang, New York (2006)
Navaretta, C.: Pronominal types and abstract reference in the Danish and Italian DAD Corpora. In: Proceedings of the Second Workshop on Anaphora Resolution (WAR II), Bergen. NEALT Proceedings Series, vol. 2, pp. 63–71 (2008)
Nguyen, N.L.T., Kim, J.D., Tsujii, J.: Challenges in pronoun resolution system for biomedical text. In: Proceedings of LREC, Marrakech (2008)
Orasan, C.: Palinka: a highly customizable tool for discourse annotation. In: Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, Sapporo (2003)
Poesio, M.: Annotating a corpus to develop and evaluate discourse entity realization algorithms: issues and preliminary results. In: Proceedings of the 2nd LREC, Athens, pp. 211–218 (2000)
Poesio, M.: The GNOME Annotation Scheme Manual. University of Edinburgh, HCRC and Informatics, Scotland, fourth version edn. (2000). Available from http://cswww.essex.ac.uk/Research/nle/corpora/GNOME/anno_manual_4.htm
Poesio, M.: The MATE/GNOME scheme for anaphoric annotation, revisited. In: Proceedings of SIGDIAL, Boston (2004)
Poesio, M., Artstein, R.: The reliability of anaphoric annotation, reconsidered: taking ambiguity into account. In: Meyers, A. (ed.) Proceedings of ACL Workshop on Frontiers in Corpus Annotation, Ann Arbor, pp. 76–83 (2005)
Poesio, M., Artstein, R.: Anaphoric annotation in the ARRAU corpus. In: Proceedings of LREC, Marrakesh (2008)
Poesio, M., Bruneseaux, F., Romary, L.: The MATE meta-scheme for coreference in dialogues in multiple languages. In: Walker, M. (ed.) Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, pp. 65–74 (1999)
Poesio, M., Delmonte, R., Bristot, A., Chiran, L., Tonelli, S.: The VENEX corpus of anaphoric information in spoken and written Italian (2004, in preparation). Available online at http://cswww.essex.ac.uk/staff/poesio/publications/VENEX04.pdf
Poesio, M., Patel, A., Di Eugenio, B.: Discourse structure and anaphora in tutorial dialogues: an empirical analysis of two theories of the global focus. Res. Lang. Comput. 4, 229–257 (2006). Special Issue on Generation and Dialogue
Poesio, M., Stevenson, R., Di Eugenio, B., Hitzeman, J.M.: Centering: a parametric theory and its instantiations. Comput. Linguist. 30 (3), 309–363 (2004)
Poesio, M., Sturt, P., Artstein, R., Filik, R.: Underspecification and anaphora: theoretical issues and preliminary evidence. Discourse Process. 42 (2), 157–175 (2006)
Poesio, M., Vieira, R.: A corpus-based investigation of definite description use. Comput. Linguist. 24 (2), 183–216 (1998). Also available as Research Paper CCS-RP-71, Centre for Cognitive Science, University of Edinburgh
Postal, P.M.: Anaphoric islands. In: Binnick, R.I., et al. (ed.) Papers from the Fifth Regional Meeting of the Chicago Linguistic Society, pp. 205–235. University of Chicago, Chicago (1969)
Pradhan, S.S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: a unified relational semantic representation. Int. J. Semant. Comput. 1 (4), 405–419 (2007)
Pradhan, S., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R., Xue, N.: CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), Portland (2011)
Pradhan, S., Moschitti, A., Xue, N., Uryupina, O., Zhang, Y.: CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In: Joint Conference on EMNLP and CoNLL – Shared Task, pp. 1–40. Association for Computational Linguistics, Jeju Island (2012). http://www.aclweb.org/anthology/W12-4501
Pradhan, S., Ramshaw, L., Weischedel, R., MacBride, J., Micciulla, L.: Unrestricted coreference: identifying entities and events in OntoNotes. In: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), Irvine (2007)
Recasens, M., Màrquez, L., Sapena, E., Martí, M.A., Taulé, M., Hoste, V., Poesio, M., Versley, Y.: SemEval-2010 task 1: coreference resolution in multiple languages. In: Proceedings of SEMEVAL, Uppsala (2010)
Recasens, M., Martí, M.A.: AnCora-CO: coreferentially annotated corpora for Spanish and Catalan. Lang. Resour. Eval. 44, 315–345 (2010)
Rodriguez, K.J., Delogu, F., Versley, Y., Stemle, E., Poesio, M.: Anaphoric annotation of Wikipedia and blogs in the Live Memories corpus. In: Proceedings of LREC (poster), Malta (2010)
Sgall, P., Hajičová, E., Panevová, J. (eds.): The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel, Dordrecht/Boston (1986)
Sobha, L., Bandyopadhyay, S., Vijay Sundar Ram, R., Akilandeswari, A.: NLP tool contest @ICON2011 on anaphora resolution in Indian languages. In: Proceedings of ICON, Singapore (2011)
Sperberg-McQueen, C.M., Burnard, L. (eds.): Guidelines for Electronic Text Encoding and Interchange (TEI P3). Text Encoding Initiative, Oxford (1994)
Stede, M.: The Potsdam Commentary Corpus. In: ACL’04 Workshop on Discourse Annotation, Barcelona (2004)
Su, J., Yang, X., Hong, H., Tateisi, Y., Tsujii, J.: Coreference resolution in biomedical texts: a machine learning approach. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Seminar Proceedings. 08131 – Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives (2008)
Taulé, M., Martí, M.A., Recasens, M.: AnCora: multilevel annotated corpora for Catalan and Spanish. In: Proceedings of LREC, Marrakech, pp. 96–101 (2008)
Tutin, A., Trouilleux, F., Clouzot, C., Gaussier, E., Zaenen, A., Rayot, S., Antoniadis, G.: Annotating a large corpus with anaphoric links. In: Proceedings of DAARC, Lancaster (2000)
Versley, Y.: Vagueness and referential ambiguity in a large-scale annotated corpus. Res. Lang. Comput. 6 (3–4), 333–353 (2008)
Wagner, A., Zeisler, B.: A syntactically annotated corpus of Tibetan. In: Proceedings of LREC, Lisbon (2004)
Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 Multilingual Training Corpus. LDC2006T06. Linguistic Data Consortium, Philadelphia (2006)
Ward, G., Sproat, R., McKoon, G.: A pragmatic analysis of so-called anaphoric islands. Language 67, 439–474 (1991)
Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., Ramshaw, L., Xue, N.: OntoNotes: a large training corpus for enhanced processing. In: Olive, J., Christianson, C., McCary, J. (eds.) Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer, New York (2011)
Weischedel, R., Pradhan, S., Ramshaw, L., Palmer, M., Xue, N., Marcus, M., Taylor, A., Greenberg, C., Hovy, E., Belvin, R., Houston, A.: OntoNotes Release 2.0. LDC2008T04. Linguistic Data Consortium, Philadelphia (2008)
Acknowledgements
This work was supported in part by a PhD studentship offered by Cogito/Expert Systems (Kepa Rodriguez), in part by the LiveMemories project (Poesio), and in part by the SENSEI project (Poesio).
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Poesio, M., Pradhan, S., Recasens, M., Rodriguez, K., Versley, Y. (2016). Annotated Corpora and Annotation Tools. In: Poesio, M., Stuckardt, R., Versley, Y. (eds) Anaphora Resolution. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47909-4_4
DOI: https://doi.org/10.1007/978-3-662-47909-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47908-7
Online ISBN: 978-3-662-47909-4
eBook Packages: Computer Science, Computer Science (R0)