research-article

Fast Recurrent Neural Network with Bi-LSTM for Handwritten Tamil Text Segmentation in NLP

Authors:

S. Lakshmana Pandian

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 5

Article No.: 68, Pages 1 - 20

https://doi.org/10.1145/3643808

Published: 10 May 2024

Abstract

Tamil text segmentation is a long-standing challenge in language understanding that entails dividing a document into contiguous segments based on its semantic structure. Each segment is meaningful in its own right. Depending on the purpose of the content analysis, the segments may be text groups, sentences, phrases, words, characters or any other unit of data. This research performs that segmentation with a fast neural network and presents deep-learning-based text segmentation methods for natural language processing (NLP). The study proposes a bidirectional long short-term memory (Bi-LSTM) neural network model in which fast recurrent neural networks (FRNNs) are used to learn Tamil text group embeddings and phrases are segmented using text-oriented data. As a result, the model can handle context of variable size, and the study provides a large new dataset for automatically segmenting Tamil text. In addition, we develop a segmentation model using this dataset as a base and show how well it generalises to unseen natural text. With Bi-LSTM, the segmentation precision of the FRNN is superior to that of several other segmentation approaches, although it remains inferior to that of certain other techniques. In the proposed framework, every input is scaled to the required size so that it is directly usable for training. That is, each word in a scaled Tamil text is used as segmented content to train the neural network. The results show that the proposed framework achieves high segmentation rates for handwritten material, nearly equivalent to those of segmentation-based schemes.
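The abstract does not spell out how segment boundaries are represented, so the following is only a minimal sketch of one common framing, assumed here for illustration: each unit (a character, word or phrase) receives a label of 1 if it ends a segment and 0 otherwise, which is the kind of per-unit target a sequence labeller such as a Bi-LSTM could be trained to predict. The function names `segments_to_labels`, `labels_to_segments` and `boundary_precision` are hypothetical and do not come from the article.

```python
from typing import List, Sequence


def segments_to_labels(segments: Sequence[Sequence[str]]) -> List[int]:
    """Encode segments as per-unit labels: 1 marks the last unit of a segment."""
    labels: List[int] = []
    for segment in segments:
        if not segment:
            continue  # an empty segment contributes no units
        labels.extend([0] * (len(segment) - 1) + [1])
    return labels


def labels_to_segments(units: Sequence[str], labels: Sequence[int]) -> List[List[str]]:
    """Decode per-unit boundary labels back into a list of segments."""
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")
    segments: List[List[str]] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            segments.append(list(units[start:i + 1]))
            start = i + 1
    if start < len(units):
        # Trailing units with no closing boundary form a final segment.
        segments.append(list(units[start:]))
    return segments


def boundary_precision(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predicted boundaries that coincide with gold boundaries."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    predicted_boundaries = sum(1 for p in predicted if p == 1)
    if predicted_boundaries == 0:
        return 0.0  # convention assumed here: no predictions means zero precision
    hits = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    return hits / predicted_boundaries


if __name__ == "__main__":
    gold_segments = [["a", "b"], ["c"], ["d", "e", "f"]]  # placeholder units
    gold = segments_to_labels(gold_segments)
    units = [u for seg in gold_segments for u in seg]
    print(labels_to_segments(units, gold))
    print(boundary_precision([0, 1, 0, 0, 1, 1], gold))
```

The units are kept as a list of strings rather than a single string so that the same code applies whether a unit is a word, a phrase or a character; for Tamil, one written character can span several Unicode code points, so character-level units would need to be split accordingly before use.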

Index Terms

Fast Recurrent Neural Network with Bi-LSTM for Handwritten Tamil Text Segmentation in NLP
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 23, Issue 5

May 2024

297 pages

EISSN: 2375-4702

DOI: 10.1145/3613584

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2024

Online AM: 07 February 2024

Accepted: 24 December 2023

Revised: 23 August 2023

Received: 13 July 2022

Published in TALLIP Volume 23, Issue 5

Article Metrics

Total Citations: 0
Total Downloads: 190
Downloads (Last 12 months): 138
Downloads (Last 6 weeks): 3

Reflects downloads up to 02 Mar 2025
