TweetNorm: a benchmark for lexical normalization of Spanish tweets

  • Original Paper
  • Language Resources and Evaluation

Abstract

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent text processing and, consequently, to boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework and an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and have identified the main shortcomings to be addressed in future work.

Notes

  1. Details about the workshop can be found at http://komunitatea.elhuyar.org/tweet-norm/.

  2. http://nil.fdi.ucm.es/sepln2013/.

  3. The term “ill-formed” has also been used in the literature to refer to these non-standard word forms. We opted for the term “non-standard word form” because some of the words that fall into this category, such as abbreviations or acronyms, are not necessarily misspellings.

  4. http://creativecommons.org/licenses/by/3.0/legalcode.

  5. http://dev.twitter.com/docs/api.

  6. http://nlp.cs.upc.edu/freeling.

  7. RAE, or Real Academia Española, is the institution responsible for regulating the Spanish language.

  8. http://dev.twitter.com/terms/api-terms.

  9. http://komunitatea.elhuyar.org/tweet-norm/files/2013/06/download_tweets.py.

  10. http://komunitatea.elhuyar.org/tweet-norm/resources/#Downloads.

  11. http://www.efe.com/.

  12. Out of 20 initially registered participants, 13 groups sent results.

  13. http://es.wikipedia.org.

  14. http://aspell.net.

  15. http://hunspell.sourceforge.net.

  16. http://jazzy.sourceforge.net.

  17. http://code.google.com/p/foma/.

  18. http://code.google.com/p/phonetisaurus/.

  19. http://www.opengrm.org.

  20. http://www.speech.sri.com/projects/srilm/.

  21. http://komunitatea.elhuyar.org/tweet-norm/.

Acknowledgments

We would like to thank all the members of the organizing committee. This work has been supported by the following projects: Spanish MICINN projects Tacardi (Grant No. TIN2012-38523-C02-01), Skater (Grant No. TIN2012-38584-C06-01), TextMESS2 (TIN2009-13391-C04-01), OntoPedia (Grant No. FFI2010-14986) and Holopedia (TIN2010-21128-C02-01); Xlike FP7 project (Grant No. FP7-ICT-2011.4.2-288342); UNED project (2012V/PUNED/0004); ENEUS-Marie Curie Actions (FP7/2012-2014 under REA Grant Agreement No. 302038); Celtic CDTI FEDER-INNTER-CONECTA project (Grant No. ITC-20113031); Research Network MA2VICMR (S-2009/TIC-1542); and HPCPLN (Grant No. EM13/041, Xunta de Galicia).

Author information

Corresponding author

Correspondence to Iñaki San Vicente.

Appendix: List of unresolved OOV words

Table 6 lists the words from the corpus for which none of the systems found the correct variation. The left column shows each word as originally spelled in the corpus, and the right column shows the manually annotated correct variation.

Table 6 List of OOV words for which none of the participants found the correct variation

About this article

Cite this article

Alegria, I., Aranberri, N., Comas, P.R. et al. TweetNorm: a benchmark for lexical normalization of Spanish tweets. Lang Resources & Evaluation 49, 883–905 (2015). https://doi.org/10.1007/s10579-015-9315-6

  • DOI: https://doi.org/10.1007/s10579-015-9315-6
