Abstract
Coreference is a complex phenomenon involving a variety of linguistic factors, ranging from surface similarity and morphological agreement to specific syntactic constraints, semantics, salience and encyclopedic knowledge. Any coreference resolution system must therefore rely on a rich linguistic representation of the document to be analyzed. This chapter focuses on preprocessing technology: it surveys the external tools needed to create such representations and shows how to combine them in a Preprocessing Pipeline that extracts the mentions of entities in a given document and describes their linguistic properties.
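To make the idea of combining tools into a pipeline concrete, the following is a minimal sketch and not the chapter's actual implementation. It assumes each tool can be wrapped as a function that reads a shared document object and adds one annotation layer. All names here (`Document`, `split_sentences`, `tokenize`, `extract_mentions`, `run_pipeline`) are hypothetical, and the mention heuristic (pronouns plus runs of capitalized tokens) is a toy stand-in for the parsers and named entity recognizers a real system would call.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    text: str
    # Each stage stores its output under its own key (hypothetical layout).
    layers: Dict[str, list] = field(default_factory=dict)


Stage = Callable[[Document], Document]

# Small illustrative pronoun list, not an exhaustive inventory.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def split_sentences(doc: Document) -> Document:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.layers["sentences"] = [p for p in parts if p]
    return doc


def tokenize(doc: Document) -> Document:
    # Words and single punctuation marks become separate tokens.
    doc.layers["tokens"] = [
        re.findall(r"\w+|[^\w\s]", s) for s in doc.layers["sentences"]
    ]
    return doc


def extract_mentions(doc: Document) -> Document:
    # Toy heuristic: pronouns, plus maximal runs of capitalized tokens.
    mentions: List[dict] = []
    for sent_id, tokens in enumerate(doc.layers["tokens"]):
        i = 0
        while i < len(tokens):
            tok = tokens[i]
            if tok.lower() in PRONOUNS:
                mentions.append(
                    {"sentence": sent_id, "text": tok, "type": "pronoun"}
                )
                i += 1
            elif tok[0].isupper():
                j = i
                while (
                    j < len(tokens)
                    and tokens[j][0].isupper()
                    and tokens[j].lower() not in PRONOUNS
                ):
                    j += 1
                mentions.append(
                    {
                        "sentence": sent_id,
                        "text": " ".join(tokens[i:j]),
                        "type": "name",
                    }
                )
                i = j
            else:
                i += 1
    doc.layers["mentions"] = mentions
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    # Apply the stages in order; later stages read layers added earlier.
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    stages = [split_sentences, tokenize, extract_mentions]
    result = run_pipeline("John Smith arrived. He sat down.", stages)
    for mention in result.layers["mentions"]:
        print(mention)
```

Because every stage shares the same signature, one component (for example, the tokenizer) can be swapped for an external tool without touching the rest of the chain, provided it writes the same layer.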
Notes
- 1.
In all the examples in this chapter we use square brackets to indicate correct (gold) mention boundaries.
- 2.
The MUC guidelines require annotation of the SLUG, DATE, NWORDS, PREAMBLE and TEXT parts of a document.
- 3.
- 4.
An exception is the TK-EMD module of BART [48] that uses tree kernels to identify relevant parse nodes and classify them as ±mentions.
- 5.
- 6.
- 7.
References
Baroni, M., Chantree, F., Kilgarriff, A., Sharoff, S.: CleanEval: a competition for cleaning web pages. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech (2008)
Benajiba, Y., Diab, M., Rosso, P.: Arabic named entity recognition: a feature-driven study. IEEE Trans. Audio Speech Lang. Process. 15 (5), 926–934 (2009)
Biggio, S.M.B., Speranza, M., Zanoli, R.: Entity mention detection using a combination of redundancy-driven classifiers. In: Proceedings of the Seventh conference on International Language Resources and Evaluation, Valletta (2010)
Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: A high-performance learning namefinder. In: Proceedings of ANLP-97, Washington, DC, pp. 194–201 (1997)
Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: An algorithm that learns what’s in a name. Mach. Learn. 34 (1), 211–231 (1999)
Borthwick, A.: A maximum entropy approach to named entity recognition. Ph.D. thesis, New York University (1999)
Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Proceedings of the Sixth ACL Workshop on Very Large Corpora, Montreal (1998)
Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, Washington, DC (2000)
Brill, E.: Transformation-based error-driven parsing. In: Proceedings of the Third International Workshop on Parsing Technologies, Tilburg/Durbuy (1993)
Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong, pp. 286–293 (2000)
Briscoe, E.J., Carroll, J.: Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Comput. Linguist. 19 (1), 25–59 (1993)
Bryl, V., Giuliano, C., Serafini, L., Tymoshenko, K.: Supporting natural language processing with background knowledge: coreference resolution case. In: Proceedings of the 9th International Semantic Web Conference, Shanghai (2010)
Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, pp. 132–139 (2000)
Charniak, E., Johnson, M.: Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor (2005)
Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy approach. In: Proceedings of CoNLL-2003, Edmonton, pp. 160–163 (2003)
Collins, M.: Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania (1999)
Cutting, D., Kupiec, J., Pederson, J., Sibun, P.: A practical part-of-speech tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing, Trento, pp. 133–140 (1997)
Daume III, H., Marcu, D.: A large-scale exploration of effective global features for a joint entity detection and tracking model. In: Proceedings of the 2005 Conference on Empirical Methods in Natural Language Processing, Vancouver (2005)
Denis, P., Baldridge, J.: Global joint models for coreference resolution and named entity classification. In: Procesamiento del Lenguaje Natural 42. SEPLN, Barcelona (2009)
Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, pp. 363–370 (2005)
Florian, R., Ittycheriah, A., Jing, H., Zhang, T.: Named entity recognition through classifier combination. In: Proceedings of CoNLL-2003, Edmonton, pp. 168–171 (2003)
Giesbrecht, E., Evert, S.: Is part-of-speech tagging a solved task? an evaluation of POS taggers for the German Web as corpus. In: Proceedings of the 5th Web as Corpus Workshop, San Sebastian (2009)
Greene, B., Rubin, G.: Grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence (1971)
Haghighi, A., Klein, D.: Unsupervised coreference resolution in a nonparametric Bayesian model. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague (2007)
Harabagiu, S., Maiorano, S.: Multilingual coreference resolution. In: Proceedings of the Language Technology Joint Conference on Applied Natural Language Processing and the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL2000), Seattle, pp. 142–149 (2000)
Kennedy, C., Boguraev, B.: Anaphora for everyone: pronominal anaphora resolution without a parser. In: Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, pp. 113–118 (1996)
Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, pp. 423–430 (2003)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, Williamstown, pp. 282–289 (2001)
Lappin, S., Leass, H.: An algorithm for pronominal anaphora resolution. Comput. Linguist. 20 (4), 535–561 (1994)
Lita, L.V., Ittycheriah, A., Roukos, S., Kambhatla, N.: tRuEcasIng. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo (2003)
Luo, X., Florian, R., Ward, T.: Improving coreference resolution by using conversational metadata. In: Proceedings of The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder (2009)
Magnini, B., Negri, M., Prevete, R., Tanev, H.: A Wordnet-based approach to named-entities recognition. In: Proceedings of the COLING02 Workshop on SEMANET, Taipei (2002)
McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of Conference on Computational Natural Language Learning, Edmonton, pp. 188–191 (2003)
Miller, G.: Wordnet: an on-line lexical database. Int. J. Lexicogr. 3 (4), 235–312 (Winter 1990)
Ng, V.: Unsupervised models for coreference resolution. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, pp. 640–649 (2008)
Norvig, P.: How to write a spelling corrector (2007). http://norvig.com/spell-correct.html
Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney (2006)
Ponzetto, S.P., Strube, M.: Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Morristown, pp. 192–199 (2006)
Pradhan, S., Luo, X., Recasens, M., Hovy, E.H., Ng, V., Strube, M.: Scoring coreference partitions of predicted mentions: a reference implementation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, 22–27 June 2014, vol. 2, Short Papers, pp. 30–35 (2014)
Preiss, J.: Choosing a parser for anaphora resolution. In: Proceedings of the 4th Discourse Anaphora and Anaphor Resolution Colloquium, Lisbon, pp. 175–180 (2002)
Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Proceedings of the Third ACL Workshop on Very Large Corpora, Cambridge, MA (1995)
Reynar, J.C., Ratnaparkhi, A.: A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC (1997)
Sang, E.F.T.K., Veenstra, J.: Representing text chunks. In: Proceedings of EACL’99, Bergen (1999)
Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003, Edmonton, pp. 142–147 (2003)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, pp. 44–49 (1994)
Uryupina, O.: Knowledge acquisition for coreference resolution. Ph.D. thesis, Saarland University (2007)
Uryupina, O.: Corry: a system for coreference resolution. In: Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval’10), Uppsala (2010)
Uryupina, O., Moschitti, A.: Multilingual mention detection for coreference resolution. In: Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP’13), Nagoya (2013)
Uryupina, O., Poesio, M., Giuliano, C., Tymoshenko, K.: Disambiguation and filtering methods in using Web knowledge for coreference resolution. In: Proceedings of The 24th Florida Artificial Intelligence Research Society Conference (FLAIRS-24), Palm Beach (2011)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., Moschitti, A.: BART: a modular toolkit for coreference resolution. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, Prague, pp. 9–12 (2008)
Yamada, H., Kudoh, T., Matsumoto, Y.: Japanese named entity extraction using support vector machines. Information Processing Society of Japan, SIG Notes NL 142-17 (2001)
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Uryupina, O., Zanoli, R. (2016). Preprocessing Technology. In: Poesio, M., Stuckardt, R., Versley, Y. (eds) Anaphora Resolution. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47909-4_7
DOI: https://doi.org/10.1007/978-3-662-47909-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47908-7
Online ISBN: 978-3-662-47909-4
eBook Packages: Computer Science, Computer Science (R0)