Abstract
One of the most important prerequisites for robust part-of-speech tagging is the correct tokenization, or segmentation, of the texts. This task can involve processes far more complex than simply identifying the sentences in a text and their individual components, yet it is often overlooked in many current applications.
Nevertheless, this preprocessing step is indispensable in practice, and it is particularly difficult to tackle with scientific precision without falling repeatedly into a case-by-case analysis of every phenomenon detected.
In this work, we have developed a preprocessing scheme oriented towards the disambiguation and robust tagging of Galician. Nevertheless, we propose a general architecture that can be applied to other languages, such as Spanish, with very slight modifications.
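To illustrate why tokenization is harder than naive sentence splitting, the sketch below contrasts a splitter that breaks on every sentence-final punctuation mark with one that consults a small abbreviation list. This is only a minimal illustration of the problem the paper addresses, not the authors' method; the abbreviation list and example sentence are hypothetical.

```python
import re

# Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
def naive_split(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"Sr.", "Dra.", "etc."}

# Abbreviation-aware splitter: a token ending in sentence punctuation closes
# a sentence only if it is not a known abbreviation.
def split_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok[-1] in ".!?" and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "O Sr. Graña chegou. Falou con todos."
print(naive_split(text))      # wrongly breaks after "Sr."
print(split_sentences(text))  # keeps "Sr. Graña" in one sentence
```

The naive splitter produces three fragments because it cannot distinguish the period of "Sr." from a sentence boundary; the second splitter yields the two intended sentences. Real tokenizers must also handle numbers, ellipses, contractions, and multiword expressions, which is precisely the casuistry the paper seeks to treat formally.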
This work has been partially supported by the European Union (under FEDER project 1FD97-0047-C04-02), by the Spanish Government (under project TIC2000-0370-C02-01), and by the Galician Government (under project PGIDT99XI10502B).
© 2002 Springer-Verlag Berlin Heidelberg
Graña, J., Barcala, F.M., Vilares, J. (2002). Formal Methods of Tokenization for Part-of-Speech Tagging. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2