
Automatic normalization of short texts by combining statistical and rule-based techniques

  • Original Paper
Language Resources and Evaluation

Abstract

Short texts typically consist of a small number of words, many of which are abbreviations, typos, or other kinds of noise. This makes the noise-to-signal ratio relatively high for this category of text. A high proportion of noise is undesirable both for analysis procedures and for machine learning applications. Text normalization techniques reduce the noise and improve the quality of the text for processing and analysis. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts, focusing on SMS messages. Our normalization approach is based on a statistical machine translation system that translates from noisy data to clean data and is trained on a small, manually annotated set. We then study several automatic methods for extracting more general rules from the normalizations generated by the statistical machine translation system. We illustrate the proposed methodology with experiments on an SMS Haitian-Créole data collection. To evaluate the performance of our methodology, we use several Haitian-Créole dictionaries, the well-known perplexity criterion, and the achieved reduction in vocabulary size.
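Two of the evaluation criteria named in the abstract, perplexity and vocabulary reduction, can be illustrated with a minimal sketch. It is not the authors' evaluation setup: the add-one-smoothed unigram model, the function names, and the toy message strings are all assumptions made for illustration, and the abstract does not specify the language model actually used.

```python
import math
from collections import Counter


def vocabulary_reduction(noisy_msgs, normalized_msgs):
    """Relative drop in the number of distinct token types after normalization."""
    noisy_vocab = {tok for msg in noisy_msgs for tok in msg.split()}
    clean_vocab = {tok for msg in normalized_msgs for tok in msg.split()}
    return 1.0 - len(clean_vocab) / len(noisy_vocab)


def unigram_perplexity(train_msgs, test_msgs):
    """Perplexity of test_msgs under an add-one-smoothed unigram model.

    Assumption: a unigram model stands in for whatever language model
    the paper used; it only illustrates the criterion.
    """
    counts = Counter(tok for msg in train_msgs for tok in msg.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens
    log_prob = 0.0
    n_tokens = 0
    for msg in test_msgs:
        for tok in msg.split():
            p = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical toy data, not taken from the paper's corpus.
reference = ["mwen bezwen dlo"]
noisy = ["mwen bzwen dlo", "mwen bezwen dlo", "mw bezwen dlo"]
normalized = ["mwen bezwen dlo", "mwen bezwen dlo", "mwen bezwen dlo"]

reduction = vocabulary_reduction(noisy, normalized)
ppl_noisy = unigram_perplexity(reference, noisy)
ppl_normalized = unigram_perplexity(reference, normalized)
```

On this toy data the vocabulary shrinks from five token types to three, and the normalized messages score a lower perplexity than the noisy ones under the reference model. That is the direction of change that both criteria are meant to capture.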




Notes

  1. http://caw2.barcelonamedia.org/

  2. http://en.wikipedia.org/wiki/Haitian_Creole_language

  3. http://www.statmt.org/wmt11/featured-translation-task.html


Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and suggestions, which helped improve this paper. The authors also thank Barcelona Media Innovation Center and the Institute for Infocomm Research for their support and permission to publish this research. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva fellowship program and by the Seventh Framework Programme of the European Commission through the T4ME contract (grant agreement no. 249119).

Author information


Corresponding author

Correspondence to Marta R. Costa-jussà.


About this article

Cite this article

Costa-jussà, M.R., Banchs, R.E. Automatic normalization of short texts by combining statistical and rule-based techniques. Lang Resources & Evaluation 47, 179–193 (2013). https://doi.org/10.1007/s10579-012-9187-y


  • DOI: https://doi.org/10.1007/s10579-012-9187-y
