DOI: 10.1145/2872518.2890558

Lexical Normalization of Spanish Tweets

Published: 11 April 2016

Abstract

Twitter data have opened new opportunities to learn what is happening in the world in real time and to study human subjectivity on a wide range of issues and topics at a scale that would not be feasible with traditional methods. However, although these data are a valuable source, they also contain a vast amount of noise. Because of the brevity of the texts and the widespread use of mobile devices, non-standard word forms abound in tweets and degrade the performance of Natural Language Processing tools. This paper presents a lexical normalization system for tweets written in Spanish. The system suggests normalization candidates for out-of-vocabulary (OOV) words based on grapheme or phoneme similarity, and then uses contextual information to select the best correction candidate for each word. Experimental results show that the system correctly detects OOV words and, in most cases, suggests the proper corrections. The results also indicate room for improvement in correction candidate selection. Compared with other methods, the overall performance of the system is above average and competitive with different approaches in the literature.
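
The three stages named in the abstract (OOV detection, candidate generation by grapheme or phoneme similarity, and context-based selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the author tags point to finite-state transducers and language modeling, whereas the sketch swaps in plain Levenshtein distance, a crude hand-written phonetic folding, and toy bigram counts. The names `LEXICON`, `BIGRAMS`, `phonetic_key`, `candidates`, and `normalize`, and all data values, are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy resources (assumptions, not from the paper): a real system
# would use a full Spanish lexicon and a trained language model.
LEXICON = {"que", "qué", "te", "quiero", "mucho", "casa", "caza",
           "hola", "voy", "a", "la", "de"}
BIGRAMS = Counter({("la", "casa"): 5, ("la", "caza"): 1, ("te", "quiero"): 8})


def edit_distance(a, b):
    """Levenshtein distance, computed with one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def phonetic_key(word):
    """Crude phonetic folding (assumed rules, far from a full phonology)."""
    w = word.lower().translate(str.maketrans("áéíóúü", "aeiouu"))
    for src, dst in (("qu", "k"), ("ce", "se"), ("ci", "si"), ("c", "k"),
                     ("z", "s"), ("v", "b"), ("ll", "y"), ("h", "")):
        w = w.replace(src, dst)
    return w


def candidates(word):
    """Lexicon words that are graphemically or phonetically close to `word`."""
    key = phonetic_key(word)
    return sorted(w for w in LEXICON
                  if edit_distance(word, w) <= 1 or phonetic_key(w) == key)


def normalize(tokens):
    """Replace each OOV token with its best candidate given the previous word."""
    out = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON:          # in-vocabulary: keep as is
            out.append(tok)
            continue
        cands = candidates(word)
        if not cands:                # nothing similar enough: leave untouched
            out.append(tok)
            continue
        prev = out[-1].lower() if out else "<s>"
        # Prefer the candidate seen most often after `prev`; break ties by
        # smaller edit distance to the original token.
        best = max(cands, key=lambda c: (BIGRAMS[(prev, c)],
                                         -edit_distance(word, c)))
        out.append(best)
    return out
```

With these toy resources, `normalize(["te", "kiero", "mucho"])` returns `["te", "quiero", "mucho"]`, and `normalize(["la", "kasa"])` picks `casa` over `caza` because the bigram count for `("la", "casa")` is higher. The context step is where the abstract reports room for improvement, and a single preceding word is a much weaker signal than a trained language model would provide.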

Cited By

  • (2022) Massive Text Normalization via an Efficient Randomized Algorithm. Proceedings of the ACM Web Conference 2022, pages 2946-2956. DOI: 10.1145/3485447.3512015. Online publication date: 25-Apr-2022
  • (2018) Text Normalization on Thai Twitter Messages using IPA Similarity Algorithm. 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pages 1-5. DOI: 10.1109/iSAI-NLP.2018.8692908. Online publication date: Nov-2018

    Published In

    WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web
    April 2016
    1094 pages
    ISBN:9781450341448

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    International World Wide Web Conferences Steering Committee

    Republic and Canton of Geneva, Switzerland

    Author Tags

    1. finite-state transducers
    2. language modeling
    3. lexical normalization
    4. out-of-vocabulary words
    5. spanish tweets
    6. twitter

    Qualifiers

    • Abstract

    Conference

    WWW '16
    Sponsor:
    • IW3C2
    WWW '16: 25th International World Wide Web Conference
    April 11 - 15, 2016
    Montréal, Québec, Canada

    Acceptance Rates

    WWW '16 Companion Paper Acceptance Rate 115 of 727 submissions, 16%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 3
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 05 Mar 2025
