TENOR: A Lexical Normalisation Tool for Spanish Web 2.0 Texts

Mosquera, Alejandro; Moreda, Paloma

doi:10.1007/978-3-642-32790-2_65

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7499)

Abstract

Its lexical richness and the ease of access it offers to large volumes of information make the Web 2.0 an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena can make automatic processing challenging. In this study we therefore propose normalising non-normative lexical variants in Spanish Web 2.0 texts. We evaluate our system by restoring the canonical version of Twitter texts, improving on the F1 measure of a state-of-the-art approach for English texts by 10%.

Author information

Authors and Affiliations

DLSI, Universidad de Alicante, Alicante, Spain
Alejandro Mosquera & Paloma Moreda

Editor information

Editors and Affiliations

Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Mosquera, A., Moreda, P. (2012). TENOR: A Lexical Normalisation Tool for Spanish Web 2.0 Texts. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_65

DOI: https://doi.org/10.1007/978-3-642-32790-2_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32789-6
Online ISBN: 978-3-642-32790-2
eBook Packages: Computer Science, Computer Science (R0)
