Standardizing Tweets with Character-Level Machine Translation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

This paper presents the results of standardizing Slovene tweets, which are full of colloquial, dialectal and foreign-language elements. To minimize the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation (CSMT) system. The best results were obtained by combining the manually constructed lexicon with CSMT as a fallback, yielding an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. For the task at hand, preparing the data manually as a lexicon proved more efficient than normalizing running text. Finally, we performed an extrinsic evaluation in which we automatically lemmatized the test corpus, taking as input either the original or the automatically standardized wordforms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, thus demonstrating that standardization has significant benefits for downstream processing.
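The lexicon-plus-fallback combination and the per-token accuracy measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy lexicon entries, the known-word check and the `toy_csmt` stand-in (a one-rule lookup in place of a trained character-level translation model) are all assumptions made for illustration. The only detail taken from common CSMT practice is that tokens are presented to the model as space-separated character sequences.

```python
from typing import Callable, Dict, List, Set


def to_char_sequence(token: str) -> str:
    """Split a token into space-separated characters (typical CSMT input)."""
    return " ".join(token)


def from_char_sequence(chars: str) -> str:
    """Join a space-separated character sequence back into a token."""
    return chars.replace(" ", "")


def toy_csmt(token: str) -> str:
    """Hypothetical stand-in for a trained character-level SMT system."""
    rules = {"t u d": "t u d i"}  # toy rule, not taken from the paper
    chars = to_char_sequence(token)
    return from_char_sequence(rules.get(chars, chars))


def standardize(
    tokens: List[str],
    lexicon: Dict[str, str],
    known_words: Set[str],
    fallback: Callable[[str], str],
) -> List[str]:
    """Use the manual lexicon first; send remaining OOV tokens to the fallback."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in lexicon:
            result.append(lexicon[key])      # manually normalized entry
        elif key in known_words:
            result.append(token)             # in-vocabulary, left unchanged
        else:
            result.append(fallback(token))   # OOV and not in lexicon
    return result


def per_token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of positions where the predicted token equals the gold token."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data, assumed for illustration only.
tweet = ["jst", "tud", "grem", "domov"]
gold = ["jaz", "tudi", "grem", "domov"]
lexicon = {"jst": "jaz"}
known_words = {"grem", "domov"}

standardized = standardize(tweet, lexicon, known_words, toy_csmt)
accuracy_before = per_token_accuracy(tweet, gold)
accuracy_after = per_token_accuracy(standardized, gold)
```

In this toy run the lexicon handles `jst`, the fallback handles `tud`, and the two in-vocabulary tokens pass through untouched, so accuracy against the gold forms rises from 0.5 to 1.0. The figures reported in the abstract come from the authors' own corpus and system, not from anything resembling this example.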

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ljubešić, N., Erjavec, T., Fišer, D. (2014). Standardizing Tweets with Character-Level Machine Translation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_14

  • DOI: https://doi.org/10.1007/978-3-642-54903-8_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54902-1

  • Online ISBN: 978-3-642-54903-8

  • eBook Packages: Computer Science, Computer Science (R0)
