Lexical Normalization of Spanish Tweets

Authors:

Jhon Adrián Cerón-Guzmán,

Elizabeth León-Guzmán

WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web

Pages 605 - 610

https://doi.org/10.1145/2872518.2890558

Published: 11 April 2016

Abstract

Twitter data offer new opportunities to follow what happens in the world in real time and to study human subjectivity on a wide range of issues and topics at a scale that would not be feasible with traditional methods. However, although these data are a valuable source, they also contain a vast amount of noise. Because of the brevity of the texts and the widespread use of mobile devices, non-standard word forms abound in tweets and degrade the performance of Natural Language Processing tools. This paper presents a lexical normalization system for tweets written in Spanish. The system suggests normalization candidates for out-of-vocabulary (OOV) words based on similarity of graphemes or phonemes, and then uses contextual information to select the best correction candidate for each word. Experimental results show that the system correctly detects OOV words and, in most cases, suggests the proper corrections. The results also indicate room for improvement in the selection of correction candidates. Compared with other methods, the overall performance of the system is above average and competitive with different approaches in the literature.
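The pipeline outlined in the abstract (detect OOV words, propose candidates by grapheme or phoneme similarity, pick one using context) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon `VOCAB`, the bigram counts `BIGRAMS`, the rough respelling rules in `phonetic_key`, and the edit-distance threshold of 1 are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy resources. A real system would use a full Spanish
# lexicon and a language model trained on a large corpus.
VOCAB = {"que", "quiero", "casa", "caza", "la", "voy", "a", "mi", "hoy", "de"}
BIGRAMS = Counter({("mi", "casa"): 5, ("a", "casa"): 6, ("la", "caza"): 3})

def phonetic_key(word):
    """Rough, assumed respelling that merges similar-sounding spellings."""
    w = word.lower()
    rules = (("ch", "C"), ("qu", "k"), ("ce", "se"), ("ci", "si"), ("c", "k"),
             ("ll", "y"), ("v", "b"), ("z", "s"), ("h", ""), ("C", "ch"))
    for src, dst in rules:
        w = w.replace(src, dst)
    return w

def edit_distance(a, b):
    """Levenshtein distance between two strings (grapheme similarity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidates(word):
    """In-vocabulary words that are close in spelling or share a phonetic key."""
    key = phonetic_key(word)
    return {v for v in VOCAB
            if edit_distance(word, v) <= 1 or phonetic_key(v) == key}

def normalize(tokens):
    """Replace each OOV token with the candidate that best fits the left context."""
    out = []
    for tok in tokens:
        word = tok.lower()
        cands = candidates(word) if word not in VOCAB else set()
        if not cands:
            out.append(word)  # in-vocabulary, or no candidate found
            continue
        prev = out[-1] if out else None
        # Prefer the highest bigram count, then the smallest edit distance;
        # sorting first makes ties deterministic.
        best = max(sorted(cands),
                   key=lambda c: (BIGRAMS[(prev, c)], -edit_distance(word, c)))
        out.append(best)
    return out
```

For instance, under these assumptions `normalize(["voy", "a", "mi", "kasa"])` keeps the first three tokens and maps `kasa` to `casa`: both `casa` and `caza` share its phonetic key, and the bigram count for `("mi", "casa")` decides between them.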

Index Terms

Lexical Normalization of Spanish Tweets
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web

April 2016, 1094 pages

ISBN: 9781450341448

General Chairs: Jacqueline Bourdeau (Tele-university (TELUQ), Montreal, QC, Canada), Jim A. Hendler (Rensselaer Polytechnic Institute, Troy, NY, USA), Roger Nkambou (Université du Québec à Montréal, Montreal, QC, Canada)

Program Chairs: Ian Horrocks (University of Oxford, UK), Ben Y. Zhao (University of California at Santa Barbara, CA, USA)

Copyright © 2016 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Conference

WWW '16: 25th International World Wide Web Conference

April 11 - 15, 2016

Montréal, Québec, Canada

Acceptance Rates

WWW '16 Companion Paper Acceptance Rate: 115 of 727 submissions, 16%

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
