A Combination of BERT and Transformer for Vietnamese Spelling Correction

Ngo, Trung Hieu; Tran, Ham Duong; Huynh, Tin; Hoang, Kiem

doi:10.1007/978-3-031-21743-2_43

Trung Hieu Ngo¹³,
Ham Duong Tran¹³,
Tin Huynh¹³ &
…
Kiem Hoang¹³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13757))

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

853 Accesses
1 Citations

Abstract

Recently, many studies have shown the efficiency of using Bidirectional Encoder Representations from Transformers (BERT) in various Natural Language Processing (NLP) tasks. Specifically, English spelling correction task that uses Encoder-Decoder architecture and takes advantage of BERT has achieved state-of-the-art result. However, to our knowledge, there is no implementation in Vietnamese yet. Therefore, in this study, a combination of Transformer architecture (state-of-the-art for Encoder-Decoder model) and BERT was proposed to deal with Vietnamese spelling correction. The experiment results have shown that our model outperforms other approaches as well as the Google Docs Spell Checking tool, achieves an 86.24 BLEU score on this task.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
https://norvig.com/spell-correct.html.
2.
Github: https://github.com/google-research/bert.
3.
Github: https://github.com/VinAIResearch/PhoBERT.
4.
Github: https://github.com/binhvq/news-corpus.
5.
Github: https://github.com/tranhamduong/Vietnamese-Spelling-Correction-testset.
6.
The tool can be found on the Google Docs website (https://docs.google.com/). We collected samples by using a web browser behavior simulator based on Selenium framework that manipulate the Google spell checking tool to correct all of its possible suggestions.

References

Bassil, Y., Alwani, M.: Ocr post-processing error correction algorithm using google online spelling suggestion. J. Emerging Trends Comput. Inf. Sci. (2012)
Google Scholar
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding, pp. 4171–4186 (2019)
Google Scholar
of Education Vietnam M: Ministry of Education Publisher (2002)
Google Scholar
Fivez, P., Šuster, S., Daelemans, W.: Unsupervised context-sensitive spelling correction of clinical free-text with word and character n-gram embeddings. In: BioNLP 2017, pp. 143–148. Association for Computational Linguistics, Vancouver, Canada Aug 2017
Google Scholar
Fivez, P., Suster, S., Daelemans, W.: Unsupervised context-sensitive spelling correction of english and dutch clinical free-text with word and character n-gram embeddings (2017)
Google Scholar
Hao, C.X.: Youth Publisher (2003)
Google Scholar
Hladek, D., Staš, J., Pleva, M.: Survey of automatic spelling correction. Electronics 9, 1670 (2020)
Article Google Scholar
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–80 (1997)
Article Google Scholar
Kaneko, M., Mita, M., Kiyono, S., Suzuki, J., Inui, K.: Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4248–4254. Association for Computational Linguistics (2020)
Google Scholar
Khanh, P.H.: Good spelling of vietnamese texts, one aspect of computational linguistics in vietnam. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000 p. 1–2. Association for Computational Linguistics, USA (2000)
Google Scholar
Kissos, I., Dershowitz, N.: Ocr error correction using character correction and feature-based word classification. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 198–203. IEEE (2016)
Google Scholar
Kiyono, S., Suzuki, J., Mita, M., Mizumoto, T., Inui, K.: An empirical study of incorporating pseudo data into grammatical error correction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1236–1242. Association for Computational Linguistics, Hong Kong, China, Nov 2019
Google Scholar
Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.: OpenNMT: Open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, System Demonstrations, pp. 67–72. Association for Computational Linguistics, Vancouver, Canada (Jul 2017)
Google Scholar
Liu, J., Cheng, F., Wang, Y., Shindo, H., Matsumoto, Y.: Automatic error correction on Japanese functional expressions using character-based neural machine translation. In: Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation. Association for Computational Linguistics, Hong Kong, 1–3 Dec 2018
Google Scholar
Liu, Y., et al.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, 2–4 May 2013. Workshop Track Proceedings (2013)
Google Scholar
Nguyen, D.Q., Nguyen, A.T.: PhoBERT: Pre-trained language models for Vietnamese. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1037–1042 (2020)
Google Scholar
Nguyen, H.T., Dang, T.B., Nguyen, L.M.: Deep learning approach for vietnamese consonant misspell correction. In: Nguyen, L.-M., Phan, X.-H., Hasida, K., Tojo, S. (eds.) PACLING 2019. CCIS, vol. 1215, pp. 497–504. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-6168-9_40
Chapter Google Scholar
Nguyen, H., Dang, T., Nguyen, T.T., Le, C.: Using large n-gram for vietnamese spell checking. Adv. Intell. Syst. Comput. 326, 617–627 (2015)
Google Scholar
Nguyen, P.H., Ngo, T.D., Phan, D.A., Dinh, T.P., Huynh, T.Q.: Vietnamese spelling detection and correction using bi-gram, minimum edit distance, soundex algorithms with some additional heuristics. In: 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, pp. 96–102. IEEE (2008)
Google Scholar
Nguyen, Q.D., Le, D.A., Zelinka, I.: Ocr error correction for unconstrained vietnamese handwritten text, pp. 132–138 (12 2019)
Google Scholar
Ott, M., et al.: fairseq: A fast, extensible toolkit for sequence modeling. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). pp. 48–53. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)
Google Scholar
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA (Jul 2002)
Google Scholar
Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, Doha, Qatar (Oct 2014)
Google Scholar
Pham, N.L., Nguyen, T.H., Nguyen, V.V.: Grammatical error correction for vietnamese using machine translation. In: Nguyen, L.-M., Phan, X.-H., Hasida, K., Tojo, S. (eds.) PACLING 2019. CCIS, vol. 1215, pp. 505–512. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-6168-9_41
Chapter Google Scholar
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS 2014, pp. 3104–3112. MIT Press, Cambridge, MA, USA (2014)
Google Scholar
Tedjopranoto, M., Wijaya, A., Santoso, L., Suhartono, D.: Correcting typographical error and understanding user intention in chatbot by combining n-gram and machine learning using schema matching technique. Int. J. Mach. Learn. Comput. 9, 471–476 (2019)
Article Google Scholar
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Xuan, P.: Solutions to spelling mistakes in written vietnamese. VNU J. Sci. Educ. Research 33(2) (2017)
Google Scholar
Yuan, Z., Briscoe, T.: Grammatical error correction using neural machine translation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 380–386. Association for Computational Linguistics (Jun 2016)
Google Scholar
Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., Liu, T.: Incorporating BERT into neural machine translation. In: Eighth International Conference on Learning Representations (2020)
Google Scholar

Download references

Author information

Authors and Affiliations

The Saigon International University, Ho Chi Minh City, Vietnam
Trung Hieu Ngo, Ham Duong Tran, Tin Huynh & Kiem Hoang

Authors

Trung Hieu Ngo
View author publications
You can also search for this author in PubMed Google Scholar
Ham Duong Tran
View author publications
You can also search for this author in PubMed Google Scholar
Tin Huynh
View author publications
You can also search for this author in PubMed Google Scholar
Kiem Hoang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Trung Hieu Ngo .

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Vietnam National University, Ho Chi Minh City, Ho Chi Minh City, Vietnam
Tien Khoa Tran
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Ualsher Tukayev
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
University of Newcastle, Newcastle, NSW, Australia
Edward Szczerbicki

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ngo, T.H., Tran, H.D., Huynh, T., Hoang, K. (2022). A Combination of BERT and Transformer for Vietnamese Spelling Correction. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science(), vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_43

Download citation

DOI: https://doi.org/10.1007/978-3-031-21743-2_43
Published: 09 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21742-5
Online ISBN: 978-3-031-21743-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Combination of BERT and Transformer for Vietnamese Spelling Correction