Abstract
Spell checking (or spelling correction) has recently regained attention due to the need to normalize user-generated content (UGC) on the web. UGC presents new challenges to spellers, as its register is much more informal and contains much more variability than traditional spelling correction systems can handle. This paper proposes two new approaches to spelling correction of UGC in Brazilian Portuguese (BP), both of which take phonetic errors into account. The first approach is based on three phonetic modules running in a pipeline. The second is based on machine learning with soft decision making and considers context-sensitive misspellings. We compared our methods with others on a human-annotated UGC corpus of product reviews. The machine learning approach surpassed all other methods, with a 78.0% correction rate, a very low false positive rate (0.7%), and a false negative rate of 21.9%.
Notes
- 3. Currently, a BP version of the phonetic rules can be found at http://sourceforge.net/projects/metaphoneptbr/.
- 6. The dictionary is available upon request.
Acknowledgments
Part of the results presented in this paper were obtained through research activity in the project titled “Semantic Processing of Brazilian Portuguese Texts”, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law number 8.248/91.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
de Mendonça Almeida, G.A., Avanço, L., Duran, M.S., Fonseca, E.R., Nunes, M.d.G.V., Aluísio, S.M. (2016). Evaluating Phonetic Spellers for User-Generated Content in Brazilian Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_37
DOI: https://doi.org/10.1007/978-3-319-41552-9_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)