Social media text normalization for Turkish

GÜLŞEN ERYİĞİT; DİLARA TORUNOĞLU-SELAMET

doi:10.1017/S1351324917000134

Published online by Cambridge University Press: 02 June 2017

Affiliation (both authors): Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
e-mail: gulsen.cebiroglu@itu.edu.tr, torunoglud@itu.edu.tr

Abstract

Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).
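The exact match scores above, including the case insensitive variant, can be illustrated with a minimal sketch. It assumes a token-by-token comparison between system output and a manually normalized gold standard; the abstract does not describe the evaluation protocol in detail. The function names `tr_lower` and `exact_match`, the Turkish-aware lowercasing, and the toy tokens are all hypothetical and are not taken from the article.

```python
def tr_lower(text: str) -> str:
    """Lowercase using Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'.

    Plain str.lower() maps 'I' to 'i' and 'İ' to 'i' plus a combining dot,
    which is wrong for Turkish, so both letters are mapped explicitly first.
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def exact_match(predicted, gold, case_insensitive=False):
    """Fraction of tokens whose predicted form equals the gold form exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    if case_insensitive:
        predicted = [tr_lower(p) for p in predicted]
        gold = [tr_lower(g) for g in gold]
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy data invented for illustration; not drawn from the article's datasets.
gold = ["İstanbul", "çok", "güzel", "Işık"]
pred = ["istanbul", "çok", "guzel", "IŞIK"]

print(exact_match(pred, gold))                         # 0.25
print(exact_match(pred, gold, case_insensitive=True))  # 0.75
```

In this toy example two of the four tokens differ from the gold standard only in letter case, so they count as correct under the case insensitive setting, while the missing diacritic in "guzel" remains an error in both.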

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 6, November 2017, pp. 835–875

DOI: https://doi.org/10.1017/S1351324917000134
Copyright: © Cambridge University Press 2017
