Standardizing Tweets with Character-Level Machine Translation

Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja

doi:10.1007/978-3-642-54903-8_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents the results of standardizing Slovene tweets, which are full of colloquial, dialectal and foreign-language elements. To minimize the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation (CSMT) system. The best results were obtained by combining the manually constructed lexicon with CSMT as a fallback, yielding an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. For this task, preparing data manually in lexicon form proved more efficient than normalizing running text. Finally, we performed an extrinsic evaluation in which we automatically lemmatized the test corpus, taking as input either the original or the automatically standardized word forms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, demonstrating that standardization has significant benefits for downstream processing.
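The combination described in the abstract (manual lexicon first, character-level model as fallback) and the per-token accuracy measure can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the order of the lookups, the lowercasing step, and the toy lexicon entries are all hypothetical, and `toy_fallback` is a trivial stand-in that collapses repeated letters; it is not a CSMT system.

```python
import re
from typing import Callable, Mapping, Sequence, Set


def standardize_token(
    token: str,
    lexicon: Mapping[str, str],
    vocabulary: Set[str],
    fallback: Callable[[str], str],
) -> str:
    """Standardize one token: lexicon first, fallback model for remaining OOVs."""
    key = token.lower()  # assumption: lookups are case-insensitive
    if key in lexicon:
        # Manually normalized entry takes priority.
        return lexicon[key]
    if key in vocabulary:
        # In-vocabulary tokens are assumed to be standard already.
        return token
    # OOV token not covered by the lexicon: defer to the fallback model.
    return fallback(token)


def toy_fallback(token: str) -> str:
    """Hypothetical stand-in for a CSMT system: collapse runs of 3+ equal characters."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def per_token_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of positions where the predicted form equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Illustrative data only; these entries are not taken from the paper's lexicon.
demo_lexicon = {"jst": "jaz", "tud": "tudi"}
demo_vocabulary = {"jaz", "sem", "tudi"}
demo_tokens = ["jst", "sem", "tud", "jaaaa"]
demo_output = [
    standardize_token(t, demo_lexicon, demo_vocabulary, toy_fallback)
    for t in demo_tokens
]
```

In a real setup the `fallback` argument would wrap a trained character-level translation model, which is why it is passed in as a plain callable here.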

Author information

Authors and Affiliations

Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Tomaž Erjavec
Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
Darja Fišer

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Ljubešić, N., Erjavec, T., Fišer, D. (2014). Standardizing Tweets with Character-Level Machine Translation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_14

DOI: https://doi.org/10.1007/978-3-642-54903-8_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)
