A Low Resource Multi-lingual Simultaneous Script Identification and Text Recognition Model

Mukherjee, Jayati; Roy, Utpal

doi:10.1007/s42979-024-03107-6

Original Research
Published: 31 July 2024

Volume 5, article number 740 (2024)

SN Computer Science

Abstract

In this paper, we propose a multi-task learning model for multi-lingual Optical Character Recognition. The model performs script identification and text recognition simultaneously on offline machine-printed documents. Spatial and temporal features of a line image are extracted by a combination of several CNN and BLSTM layers, and these features are shared between the script identification and text recognition modules. A fully connected layer followed by softmax identifies the script. The identified script then works as a case selector for the text recognizer, which is a CTC layer, and the text recognizer finally recognizes the text. The model is applied to two public datasets, ISIDDI and RETAS, which contain degraded Bengali pages and English pages. We have also created a dataset of Devnagari/Hindi and Tamil scripts to test the model. The model achieves 99.2% accuracy for script recognition. The text recognition accuracies on the Bengali, English, Hindi, and Tamil scripts are 91.68%, 97.07%, 95.68%, and 92.27%, respectively.
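The routing described in the abstract (shared features, a softmax script classifier, and a script-selected CTC recognizer) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the CNN and BLSTM feature extractor is replaced by hand-written toy feature vectors, and the mean pooling, the bias-free linear layers, the per-script output heads, the best-path decoding, and every name and weight below are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)

def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Bias-free linear layer: one output score per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def identify_script(features, script_weights, scripts) -> str:
    """Mean-pool the shared frame features, then classify with linear + softmax."""
    dim = len(features[0])
    pooled = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    probs = softmax(linear(script_weights, pooled))
    return scripts[probs.index(max(probs))]

def ctc_greedy_decode(frame_scores, alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    out: List[str] = []
    prev = BLANK
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK and best != prev:
            out.append(alphabet[best])
        prev = best
    return "".join(out)

def recognize(features, script_weights, scripts, heads, alphabets) -> Tuple[str, str]:
    """Use the predicted script as a case selector for the output head."""
    script = identify_script(features, script_weights, scripts)
    frame_scores = [linear(heads[script], f) for f in features]
    return script, ctc_greedy_decode(frame_scores, alphabets[script])

# Toy, hand-picked parameters (hypothetical; a real model would learn them).
# Each frame feature is [script cue, cue for symbol 1, cue for symbol 2].
SCRIPTS = ["English", "Bengali"]
SCRIPT_WEIGHTS = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
HEADS: Dict[str, List[List[float]]] = {
    "English": [[0.5, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],
    "Bengali": [[-0.5, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],
}
ALPHABETS = {"English": "-ab", "Bengali": "-কখ"}  # position 0 stands for blank

english_line = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
bengali_line = [[-1, 1, 0], [-1, 0, 0], [-1, 0, 1]]

print(recognize(english_line, SCRIPT_WEIGHTS, SCRIPTS, HEADS, ALPHABETS))
print(recognize(bengali_line, SCRIPT_WEIGHTS, SCRIPTS, HEADS, ALPHABETS))
```

In this toy setup the first line is routed to the English head and the second to the Bengali head. The repeated first symbol in the English line survives decoding only because a blank frame separates the two occurrences, which is the property that lets CTC represent doubled characters.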

Data availability

The ISIDDI database is available at https://www.isical.ac.in/~ujjwal/download/ISIDDI.html. The RETAS database is available at https://ciir.cs.umass.edu/downloads/ocr-evaluation/. The data that support the findings of this study are available from the corresponding author upon reasonable request.

Author information

Authors and Affiliations

Computer Science and Engineering, Academy of Technology, Hoogly, 712502, West Bengal, India
Jayati Mukherjee
Department of Computer and System Sciences, Visva-Bharati, Santiniketan, 731235, West Bengal, India
Utpal Roy

Corresponding author

Correspondence to Jayati Mukherjee.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mukherjee, J., Roy, U. A Low Resource Multi-lingual Simultaneous Script Identification and Text Recognition Model. SN COMPUT. SCI. 5, 740 (2024). https://doi.org/10.1007/s42979-024-03107-6

Received: 08 December 2022
Accepted: 02 July 2024
Published: 31 July 2024
DOI: https://doi.org/10.1007/s42979-024-03107-6
