Abstract
Ambiguous strings are strings of non-whitespace characters, typically coinciding with orthographic contractions of word forms, that, depending on the specific occurrence, are to be considered as consisting of one token or of more than one token. Strings of this sort are shown to raise the problem of undesired circularity between tokenization and tagging. This paper presents a strategy to resolve ambiguous strings and dissolve this circularity.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Branco, A.H., Silva, J.R. (2003). Contractions: Breaking the Tokenization-Tagging Circularity. In: Mamede, N.J., Trancoso, I., Baptista, J., das Graças Volpe Nunes, M. (eds) Computational Processing of the Portuguese Language. PROPOR 2003. Lecture Notes in Computer Science, vol 2721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45011-4_24
DOI: https://doi.org/10.1007/3-540-45011-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40436-1
Online ISBN: 978-3-540-45011-5
eBook Packages: Springer Book Archive