Automatic normalization of short texts by combining statistical and rule-based techniques

Costa-jussà, Marta R.; Banchs, Rafael E.

doi:10.1007/s10579-012-9187-y

Original Paper
Published: 24 May 2012

Volume 47, pages 179–193, (2013)

Language Resources and Evaluation

Abstract

Short texts are typically composed of a small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise-to-signal ratio relatively high for this category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as for machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of text for processing and analysis. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts, focusing on SMS messages. We base our normalization approach on a statistical machine translation system that translates from noisy data to clean data and is trained on a small manually annotated set. We then study several automatic methods for extracting more general rules from the normalizations generated by the statistical machine translation system. We illustrate the proposed methodology through experiments with an SMS Haitian-Créole data collection. To evaluate the performance of our methodology, we use several Haitian-Créole dictionaries, the well-known perplexity criterion and the achieved reduction of vocabulary.
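The three evaluation criteria named in the abstract (dictionary lookup, perplexity and vocabulary reduction) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names are hypothetical, the perplexity uses a simple add-one-smoothed unigram model rather than whatever language model the paper actually employs, and the toy data is English chat-speak rather than the Haitian-Créole collection.

```python
import math
from collections import Counter


def vocabulary_reduction(noisy_tokens, normalized_tokens):
    """Relative drop in the number of distinct word types after normalization."""
    before = len(set(noisy_tokens))
    after = len(set(normalized_tokens))
    return (before - after) / before


def dictionary_coverage(tokens, dictionary):
    """Fraction of running tokens that appear in a reference word list."""
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model.

    All unseen words share a single extra vocabulary slot, so the
    probabilities of the seen words plus that slot sum to one.
    """
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # +1 for the shared "unseen word" slot
    total = len(train_tokens)
    log_prob = 0.0
    for t in test_tokens:
        p = (counts[t] + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))


# Hypothetical toy data (English, for illustration only).
clean_reference = "see you tomorrow".split()
noisy = "c u 2moro see u tmrw".split()
normalized = "see you tomorrow see you tomorrow".split()
dictionary = set(clean_reference)

print(vocabulary_reduction(noisy, normalized))
print(dictionary_coverage(noisy, dictionary),
      dictionary_coverage(normalized, dictionary))
print(unigram_perplexity(clean_reference, noisy),
      unigram_perplexity(clean_reference, normalized))
```

Under these assumptions, a successful normalization shows up as a smaller vocabulary, a higher share of tokens found in the dictionary, and a lower perplexity with respect to a model estimated on clean text.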

Acknowledgments

The authors want to thank the anonymous reviewers for their valuable comments and suggestions, which helped improve this paper. The authors also want to thank Barcelona Media Innovation Center and Institute for Infocomm Research for their support and permission to publish this research. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva fellowship program and by the Seventh Framework Programme of the European Commission through the T4ME contract (grant agreement no.: 249119).

Author information

Authors and Affiliations

Barcelona Media Innovation Center, Av. Diagonal 177, 08018, Barcelona, Spain
Marta R. Costa-jussà
Institute for Infocomm Research, 1 Fusionopolis Way, Singapore, 138632, Singapore
Rafael E. Banchs

Corresponding author

Correspondence to Marta R. Costa-jussà.

About this article

Cite this article

Costa-jussà, M.R., Banchs, R.E. Automatic normalization of short texts by combining statistical and rule-based techniques. Lang Resources & Evaluation 47, 179–193 (2013). https://doi.org/10.1007/s10579-012-9187-y

Issue Date: March 2013
DOI: https://doi.org/10.1007/s10579-012-9187-y
