A Combination of BERT and Transformer for Vietnamese Spelling Correction

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13757))

Abstract

Recent studies have shown that Bidirectional Encoder Representations from Transformers (BERT) is effective across a range of Natural Language Processing (NLP) tasks. In particular, English spelling correction systems that combine an Encoder-Decoder architecture with BERT have achieved state-of-the-art results. To our knowledge, however, no such approach has yet been implemented for Vietnamese. In this study, we therefore propose combining the Transformer architecture (the state of the art for Encoder-Decoder models) with BERT for Vietnamese spelling correction. Experimental results show that our model outperforms other approaches, as well as the Google Docs spell checking tool, achieving a BLEU score of 86.24 on this task.
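For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, following the standard definition (clipped n-gram precision for n = 1 to 4, geometric mean, brevity penalty). It is not the evaluation script used in the paper, which this page does not describe; the function names `ngrams` and `corpus_bleu`, whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis.

    Hypothetical helper for illustration: no smoothing is applied, so the
    score is 0 whenever some n-gram order has no match in the whole corpus.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Assumed example: the corrected output differs from the reference in one token.
reference = "hôm nay tôi đi học ở trường".split()
corrected = "hôm nay tôi đi học ở trương".split()
score = corpus_bleu([corrected], [reference])
```

Because spelling correction leaves most tokens unchanged, BLEU between the corrected output and the reference is typically high, and differences between systems show up in the few tokens that were edited.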

Notes

  1. https://norvig.com/spell-correct.html

  2. GitHub: https://github.com/google-research/bert

  3. GitHub: https://github.com/VinAIResearch/PhoBERT

  4. GitHub: https://github.com/binhvq/news-corpus

  5. GitHub: https://github.com/tranhamduong/Vietnamese-Spelling-Correction-testset

  6. The tool can be found on the Google Docs website (https://docs.google.com/). We collected samples using a Selenium-based web browser behavior simulator that drives the Google spell checking tool and applies all of its suggested corrections.

Author information

Corresponding author

Correspondence to Trung Hieu Ngo.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ngo, T.H., Tran, H.D., Huynh, T., Hoang, K. (2022). A Combination of BERT and Transformer for Vietnamese Spelling Correction. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_43

  • DOI: https://doi.org/10.1007/978-3-031-21743-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21742-5

  • Online ISBN: 978-3-031-21743-2

  • eBook Packages: Computer Science, Computer Science (R0)
