research-article

Fast Recurrent Neural Network with Bi-LSTM for Handwritten Tamil Text Segmentation in NLP

Published: 10 May 2024

Abstract

Tamil text segmentation is a long-standing challenge in language understanding: it involves dividing a document into contiguous segments according to its semantic structure. Each segment carries meaning in its own right. Depending on the purpose of the content analysis, the segments may be text groups, sentences, phrases, words, characters or any other unit of data. This research performs that segmentation with a fast recurrent neural network (FRNN) and presents deep-learning-based text segmentation methods for natural language processing (NLP). It proposes a bidirectional long short-term memory (Bi-LSTM) neural network model in which FRNNs learn Tamil text-group embeddings and phrases are segmented using text-oriented data. As a result, the model can handle context information of variable size, and the study contributes a large new dataset for naturally segmenting Tamil text. In addition, we develop a segmentation model and, using this dataset as a base, show how well it generalises to unseen natural text. With Bi-LSTM, the segmentation accuracy of the FRNN is superior to that of other segmentation approaches; however, it is still inferior to that of certain other techniques. In the proposed framework, every input is scaled to the required size and is then directly available for training. That is, each word in a scaled Tamil text is used as segmented content to train the neural network. The results show that the proposed framework achieves high segmentation rates for handwritten material, nearly equivalent to those of segmentation-based schemes.
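The abstract states that every input is scaled to the required size before training but does not say how. The following is a minimal sketch of one common way to do this, nearest-neighbour resampling of a pixel grid. The list-of-rows image representation, the function name `resize_nearest`, the sample data and the choice of nearest-neighbour sampling are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List

# Assumed representation: rows of pixel values (0 = background, 1 = ink).
Image = List[List[int]]


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Scale a 2-D pixel grid to (out_h, out_w) by nearest-neighbour sampling.

    Illustrative only: the paper does not specify its scaling method.
    """
    if out_h <= 0 or out_w <= 0:
        raise ValueError("target size must be positive")
    in_h = len(image)
    in_w = len(image[0]) if in_h else 0
    if in_h == 0 or in_w == 0:
        raise ValueError("input image must be non-empty")
    # Map each output pixel back to the source pixel that covers it.
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical 2x3 crop of a handwritten word, scaled up to 4x6.
word = [
    [0, 1, 0],
    [1, 1, 1],
]
scaled = resize_nearest(word, 4, 6)
```

Bringing every crop to one fixed size in this way lets a network with a fixed input dimension accept words of different original sizes. A real pipeline would more likely use an image library with interpolation rather than plain nearest-neighbour sampling.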


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 23, Issue 5
May 2024
297 pages
EISSN: 2375-4702
DOI: 10.1145/3613584

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 13 July 2022
Revised: 23 August 2023
Accepted: 24 December 2023
Online AM: 07 February 2024
Published: 10 May 2024

Author Tags

1. Tamil text segmentation
2. fast-RNN
3. bidirectional LSTM
4. natural language processing
5. offline handwriting
6. segmentation accuracy
