Abstract
Short texts are typically composed of a small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise-to-signal ratio relatively high for this category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as for machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of the text for processing and analysis. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts, focusing specifically on SMS messages. We base our normalization approach on a statistical machine translation system that translates from noisy data to clean data and is trained on a small, manually annotated set. We then study several automatic methods for extracting more general rules from the normalizations generated by the statistical machine translation system. We illustrate the proposed methodology with experiments on a Haitian-Créole SMS data collection. To evaluate the performance of our methodology, we use several Haitian-Créole dictionaries, the well-known perplexity criterion and the achieved reduction in vocabulary size.
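The three evaluation criteria named in the abstract (dictionary look-up, perplexity and vocabulary reduction) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the function names are hypothetical, the toy English messages stand in for real SMS data, and the add-one-smoothed unigram model is a simple stand-in for whatever language model the experiments actually use.

```python
import math
from collections import Counter


def vocabulary_size(sentences):
    """Number of distinct whitespace-separated tokens in a corpus."""
    return len({tok for s in sentences for tok in s.split()})


def vocabulary_reduction(noisy, normalized):
    """Relative drop in vocabulary size after normalization (0.0 to 1.0)."""
    before = vocabulary_size(noisy)
    after = vocabulary_size(normalized)
    return (before - after) / before


def dictionary_coverage(sentences, dictionary):
    """Fraction of running tokens that appear in a reference word list."""
    tokens = [tok for s in sentences for tok in s.split()]
    return sum(tok in dictionary for tok in tokens) / len(tokens)


def unigram_perplexity(train, test):
    """Perplexity of `test` under an add-one-smoothed unigram model of `train`."""
    counts = Counter(tok for s in train for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    tokens = [tok for s in test for tok in s.split()]
    log_prob = sum(math.log((counts[tok] + 1) / (total + vocab)) for tok in tokens)
    return math.exp(-log_prob / len(tokens))


# Toy data (hypothetical, English for readability).
noisy = ["c u 2morrow", "see u tmrw", "cu tomorrow"]
normalized = ["see you tomorrow"] * 3
clean_reference = ["see you tomorrow"]      # clean text used to fit the model
word_list = {"see", "you", "tomorrow"}      # stand-in for a dictionary

reduction = vocabulary_reduction(noisy, normalized)
coverage_before = dictionary_coverage(noisy, word_list)
coverage_after = dictionary_coverage(normalized, word_list)
ppl_before = unigram_perplexity(clean_reference, noisy)
ppl_after = unigram_perplexity(clean_reference, normalized)
```

Under these assumptions, a successful normalization should raise dictionary coverage, lower perplexity with respect to a model of clean text, and shrink the vocabulary, which is the direction of change each criterion is meant to capture.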
Acknowledgments
The authors want to thank the anonymous reviewers for their valuable comments and suggestions, which helped improve this paper. The authors also want to thank Barcelona Media Innovation Center and Institute for Infocomm Research for their support and permission to publish this research. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva fellowship program and by the Seventh Framework Programme of the European Commission through the T4ME contract (grant agreement no.: 249119).
Cite this article
Costa-jussà, M.R., Banchs, R.E. Automatic normalization of short texts by combining statistical and rule-based techniques. Lang Resources & Evaluation 47, 179–193 (2013). https://doi.org/10.1007/s10579-012-9187-y
DOI: https://doi.org/10.1007/s10579-012-9187-y