Preprocessing Technology

Uryupina, Olga; Zanoli, Roberto

doi:10.1007/978-3-662-47909-4_7

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Coreference is a complex phenomenon involving a variety of linguistic factors: from surface similarity to morphological agreement, specific syntactic constraints, semantics, salience and encyclopedic knowledge. It is therefore essential for any coreference resolution system to rely on a rich linguistic representation of a document to be analyzed. This chapter focuses on the preprocessing technology, taking into consideration a variety of external tools needed to create such representations, and shows how to combine them in a Preprocessing Pipeline, in order to extract mentions of entities in a given document, describing their linguistic properties.

Notes

1. In all the examples in this chapter we use square brackets to indicate correct (gold) mention boundaries.
2. The MUC guidelines require annotation of the SLUG, DATE, NWORDS, PREAMBLE and TEXT parts of a document.
3. http://projects.ldc.upenn.edu/ace/
4. An exception is the TK-EMD module of BART [48] that uses tree kernels to identify relevant parse nodes and classify them as ±mentions.
5. http://www.evalita.it/2009/tasks/entity
6. http://chasen.org/~taku/software/yamcha/
7. http://www.livememories.org/

References

Baroni, M., Chantree, F., Kilgarriff, A., Sharoff, S.: CleanEval: a competition for cleaning web pages. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech (2008)
Benajiba, Y., Diab, M., Rosso, P.: Arabic named entity recognition: a feature-driven study. IEEE Trans. Audio Speech Lang. Process. 15 (5), 926–934 (2009)
Biggio, S.M.B., Speranza, M., Zanoli, R.: Entity mention detection using a combination of redundancy-driven classifiers. In: Proceedings of the Seventh conference on International Language Resources and Evaluation, Valletta (2010)
Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: A high-performance learning namefinder. In: Proceedings of ANLP-97, Washington, DC, pp. 194–201 (1997)
Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: An algorithm that learns what’s in a name. Mach. Learn. 34 (1), 211–231 (1999)
Borthwick, A.: A maximum entropy approach to named entity recognition. Ph.D. thesis, New York University (1999)
Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Proceedings of the Sixth ACL Workshop on Very Large Corpora, Montreal (1998)
Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, Washington, DC (2000)
Brill, E.: Transformation-based error-driven parsing. In: Proceedings of the Third International Workshop on Parsing Technologies, Tilburg/Durbuy (1993)
Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong, pp. 286–293 (2000)
Briscoe, E.J., Carroll, J.: Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Comput. Linguist. 19 (1), 25–59 (1993)
Bryl, V., Giuliano, C., Serafini, L., Tymoshenko, K.: Supporting natural language processing with background knowledge: coreference resolution case. In: Proceedings of the 9th International Semantic Web Conference, Shanghai (2010)
Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, pp. 132–139 (2000)
Charniak, E., Johnson, M.: Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor (2005)
Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy approach. In: Proceedings of CoNLL-2003, Edmonton, pp. 160–163 (2003)
Collins, M.: Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania (1999)
Cutting, D., Kupiec, J., Pedersen, J., Sibun, P.: A practical part-of-speech tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing, Trento, pp. 133–140 (1997)
Daume III, H., Marcu, D.: A large-scale exploration of effective global features for a joint entity detection and tracking model. In: Proceedings of the 2005 Conference on Empirical Methods in Natural Language Processing, Vancouver (2005)
Denis, P., Baldridge, J.: Global joint models for coreference resolution and named entity classification. In: Procesamiento del Lenguaje Natural 42. SEPLN, Barcelona (2009)
Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, pp. 363–370 (2005)
Florian, R., Ittycheriah, A., Jing, H., Zhang, T.: Named entity recognition through classifier combination. In: Proceedings of CoNLL-2003, Edmonton, pp. 168–171 (2003)
Giesbrecht, E., Evert, S.: Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as corpus. In: Proceedings of the 5th Web as Corpus Workshop, San Sebastian (2009)
Greene, B., Rubin, G.: Grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence (1971)
Haghighi, A., Klein, D.: Unsupervised coreference resolution in a nonparametric Bayesian model. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague (2007)
Harabagiu, S., Maiorano, S.: Multilingual coreference resolution. In: Proceedings of the Language Technology Joint Conference on Applied Natural Language Processing and the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL2000), Seattle, pp. 142–149 (2000)
Kennedy, C., Boguraev, B.: Anaphora for everyone: pronominal anaphora resolution without a parser. In: Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, pp. 113–118 (1996)
Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, pp. 423–430 (2003)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, Williamstown, pp. 282–289 (2001)
Lappin, S., Leass, H.: An algorithm for pronominal anaphora resolution. Comput. Linguist. 20 (4), 535–561 (1994)
Lita, L.V., Ittycheriah, A., Roukos, S., Kambhatla, N.: tRuEcasIng. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo (2003)
Luo, X., Florian, R., Ward, T.: Improving coreference resolution by using conversational metadata. In: Proceedings of The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder (2009)
Magnini, B., Negri, M., Prevete, R., Tanev, H.: A Wordnet-based approach to named-entities recognition. In: Proceedings of the COLING02 Workshop on SEMANET, Taipei (2002)
McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of Conference on Computational Natural Language Learning, Edmonton, pp. 188–191 (2003)
Miller, G.: Wordnet: an on-line lexical database. Int. J. Lexicogr. 3 (4), 235–312 (Winter 1990)
Ng, V.: Unsupervised models for coreference resolution. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, pp. 640–649 (2008)
Norvig, P.: How to write a spelling corrector (2007). http://norvig.com/spell-correct.html
Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, 17–21 July 2006
Ponzetto, S.P., Strube, M.: Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Morristown, pp. 192–199 (2006)
Pradhan, S., Luo, X., Recasens, M., Hovy, E.H., Ng, V., Strube, M.: Scoring coreference partitions of predicted mentions: a reference implementation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, 22–27 June 2014, vol. 2, Short Papers, pp. 30–35 (2014)
Preiss, J.: Choosing a parser for anaphora resolution. In: Proceedings of the 4th Discourse Anaphora and Anaphor Resolution Colloquium, Lisbon, pp. 175–180 (2002)
Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Proceedings of the Third ACL Workshop on Very Large Corpora, Cambridge, MA (1995)
Reynar, J.C., Ratnaparkhi, A.: A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC (1997)
Tjong Kim Sang, E.F., Veenstra, J.: Representing text chunks. In: Proceedings of EACL’99, Bergen (1999)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003, Edmonton, pp. 142–147 (2003)
Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, pp. 44–49 (1994)
Uryupina, O.: Knowledge acquisition for coreference resolution. Ph.D. thesis, Saarland University (2007)
Uryupina, O.: Corry: a system for coreference resolution. In: Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval’10), Uppsala (2010)
Uryupina, O., Moschitti, A.: Multilingual mention detection for coreference resolution. In: Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP’13), Nagoya (2013)
Uryupina, O., Poesio, M., Giuliano, C., Tymoshenko, K.: Disambiguation and filtering methods in using Web knowledge for coreference resolution. In: Proceedings of The 24th Florida Artificial Intelligence Research Society Conference (FLAIRS-24), Palm Beach (2011)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., Moschitti, A.: BART: a modular toolkit for coreference resolution. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, Prague, pp. 9–12 (2008)
Yamada, H., Kudoh, T., Matsumoto, Y.: Japanese named entity extraction using support vector machines. Information Processing Society of Japan, SIG Notes NL 142-17 (2001)

Author information

Authors and Affiliations

DISI University of Trento, Trento, Italy
Olga Uryupina
Fondazione Bruno Kessler, Trento, Italy
Roberto Zanoli

Corresponding author

Correspondence to Olga Uryupina.

Editor information

Editors and Affiliations

Trento, Italy
Massimo Poesio
Frankfurt am Main, Germany
Roland Stuckardt
Heidelberg, Germany
Yannick Versley

About this chapter

Cite this chapter

Uryupina, O., Zanoli, R. (2016). Preprocessing Technology. In: Poesio, M., Stuckardt, R., Versley, Y. (eds) Anaphora Resolution. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47909-4_7

DOI: https://doi.org/10.1007/978-3-662-47909-4_7
Published: 05 August 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47908-7
Online ISBN: 978-3-662-47909-4
eBook Packages: Computer Science, Computer Science (R0)
