
BERT-Inspired Progressive Stacking to Enhance Spelling Correction in Bengali Text

Published: 08 August 2024

Abstract

Common spell checkers in the current digital era struggle with languages such as Bengali, which employ letters differently from English. In response, we have built an improved Bidirectional Encoder Representations from Transformers (BERT)–based spell checker that incorporates a convolutional neural network (CNN) sub-model (Semantic Network). Our novelty, which we term progressive stacking, focuses on improving BERT model training while speeding up the correction process. Comparing shallow and deep variants, we found that deeper models can require less training time, and this technique holds promise for improving spelling correction. As a test set, we categorized and used a 6,300-word dataset supplied by Nayadiganta Mohiuddin, some of whose words contained spelling errors. Its most frequent words matched those in the Prothom-Alo artificial error dataset.
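For readers unfamiliar with progressive stacking (introduced by Gong et al., 2019), the sketch below illustrates the core idea in PyTorch: a shallow Transformer encoder is trained first, and its trained layers are then copied to initialize a deeper encoder, which continues training from that warm start. This is an illustrative sketch only, not the paper's implementation; the names make_encoder and progressive_stack, the layer sizes, and the plain nn.TransformerEncoder are assumptions standing in for the authors' actual BERT-plus-CNN architecture.

```python
import torch.nn as nn

def make_encoder(num_layers: int, d_model: int = 256, nhead: int = 4) -> nn.TransformerEncoder:
    """Build a plain Transformer encoder of the given depth (sizes are illustrative)."""
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def progressive_stack(shallow: nn.TransformerEncoder) -> nn.TransformerEncoder:
    """Double the depth by re-using the trained shallow layers to initialize a deeper encoder."""
    old_layers = list(shallow.layers)
    deep = make_encoder(num_layers=2 * len(old_layers))
    for i, layer in enumerate(old_layers):
        # Layer i and layer i + L of the deep encoder both start from shallow layer i.
        deep.layers[i].load_state_dict(layer.state_dict())
        deep.layers[i + len(old_layers)].load_state_dict(layer.state_dict())
    return deep

# Usage: pretrain a 3-layer encoder, stack it to 6 layers, then continue training the deep model.
shallow = make_encoder(num_layers=3)
# ... train `shallow` on the spelling-correction objective ...
deep = progressive_stack(shallow)
```

Because the deeper model starts from weights that already encode useful representations, the remaining training typically converges faster than training the deep model from scratch, which is the training-time saving the abstract refers to.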


      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 8
      August 2024, 343 pages
      EISSN: 2375-4702
      DOI: 10.1145/3613611

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 August 2024
      Online AM: 05 July 2024
      Accepted: 24 May 2024
      Revised: 05 April 2024
      Received: 10 October 2023
      Published in TALLIP Volume 23, Issue 8

      Author Tags

      1. BERT
      2. neural networks
      3. progressive stacking
      4. Bengali
      5. text tagging

      Qualifiers

      • Research-article
