Abstract
In this paper, we propose a multi-task learning model for multi-lingual Optical Character Recognition (OCR). The model performs script identification and text recognition simultaneously on offline machine-printed documents. Spatial and temporal features of a line image are extracted by a combination of several CNN and BLSTM layers, and these features are shared between the script identification and text recognition modules. A fully connected layer followed by a softmax identifies the script. The identified script then acts as a case selector for the text recognizer, which is a CTC layer, and the text recognizer produces the final text. The model is applied to two public datasets, ISIDDI and RETAS, which contain degraded Bengali pages and English pages, respectively. We have also created a dataset of Devnagari/Hindi and Tamil scripts to test the model. The model achieves 99.2% accuracy for script identification. The text recognition accuracies for Bengali, English, Hindi, and Tamil are 91.68%, 97.07%, 95.68%, and 92.27%, respectively.
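The routing step described above, in which the identified script selects the text recognizer, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, alphabets, or decoding rule, so the names `ALPHABETS`, `identify_script`, `ctc_greedy_decode`, and `recognize` are hypothetical, the two tiny alphabets are placeholders, and best-path (greedy) CTC decoding is assumed. The per-frame scores stand in for the outputs that the shared CNN and BLSTM features would feed into each script-specific CTC head.

```python
import math

BLANK = 0  # assumed convention: index 0 is the CTC blank in every head

# Hypothetical per-script alphabets (placeholders, not from the paper).
ALPHABETS = {
    "english": ["<blank>", "a", "b", "c"],
    "hindi": ["<blank>", "क", "ख", "ग"],
}
SCRIPTS = list(ALPHABETS)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identify_script(script_logits):
    """Script branch: softmax over the script logits, then argmax."""
    probs = softmax(script_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SCRIPTS[best], probs[best]


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    out = []
    prev = BLANK
    for scores in frame_scores:
        k = max(range(len(scores)), key=scores.__getitem__)
        if k != prev and k != BLANK:
            out.append(alphabet[k])
        prev = k
    return "".join(out)


def recognize(script_logits, head_scores):
    """Use the identified script as a case selector for the CTC head."""
    script, confidence = identify_script(script_logits)
    text = ctc_greedy_decode(head_scores[script], ALPHABETS[script])
    return script, confidence, text
```

Here `head_scores` maps each script name to a list of per-frame score vectors, one vector per time step of the line image. Only the head chosen by `identify_script` is decoded, which mirrors the case-selector behaviour described in the abstract.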







Data availability
The ISIDDI database can be found at https://www.isical.ac.in/~ujjwal/download/ISIDDI.html. The RETAS database can be found at https://ciir.cs.umass.edu/downloads/ocr-evaluation/. The data that support the findings of this study are available from the corresponding author upon reasonable request.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
About this article
Cite this article
Mukherjee, J., Roy, U. A Low Resource Multi-lingual Simultaneous Script Identification and Text Recognition Model. SN COMPUT. SCI. 5, 740 (2024). https://doi.org/10.1007/s42979-024-03107-6
DOI: https://doi.org/10.1007/s42979-024-03107-6