A Low Resource Multi-lingual Simultaneous Script Identification and Text Recognition Model

  • Original Research
SN Computer Science

Abstract

In this paper, we propose a multi-task learning model for multi-lingual Optical Character Recognition (OCR). The model performs script identification and text recognition simultaneously on offline machine-printed documents. Spatial and temporal features of a line image are extracted by a combination of several CNN and BLSTM layers, and these features are shared between the script identification and text recognition modules. A fully connected layer followed by softmax identifies the script. The identified script then acts as a case selector for the text recognizer, which is a CTC layer, and the text recognizer produces the final text. The model is applied to two public datasets, ISIDDI and RETAS, which contain degraded Bengali pages and English pages. We have also created a dataset of Devnagari/Hindi and Tamil scripts to test our model. The model achieves 99.2% accuracy for script recognition. The text recognition accuracies for Bengali, English, Hindi, and Tamil are 91.68%, 97.07%, 95.68%, and 92.27%, respectively.
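The "case selector" idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the CNN and BLSTM feature extractor is omitted, and the script logits and per-frame character scores below are hand-written stand-ins for what such a network might output. The function names (`recognize_line`, `ctc_greedy_decode`), the two-script setup, the tiny alphabets, and the use of greedy (best-path) CTC decoding at inference time are all assumptions made for illustration.

```python
import math

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    decoded = []
    previous = BLANK
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK and best != previous:
            decoded.append(alphabet[best - 1])  # index 0 is the blank
        previous = best
    return "".join(decoded)


def recognize_line(script_logits, head_outputs, scripts, alphabets):
    """Pick the most probable script, then decode only that script's head."""
    probs = softmax(script_logits)
    script_id = max(range(len(probs)), key=probs.__getitem__)
    script = scripts[script_id]  # the identified script acts as the selector
    text = ctc_greedy_decode(head_outputs[script], alphabets[script])
    return script, text


# Hypothetical toy data: two scripts with two-character alphabets.
scripts = ["English", "Hindi"]
alphabets = {"English": "ab", "Hindi": "कख"}
head_outputs = {
    # Each row is one time step: [blank, first char, second char].
    "English": [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
    "Hindi": [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.1, 0.7, 0.2]],
}

print(recognize_line([2.0, 0.5], head_outputs, scripts, alphabets))
print(recognize_line([0.1, 3.0], head_outputs, scripts, alphabets))
```

In this sketch the script classifier and the text heads read the same line, mirroring the shared-feature design, and only the head chosen by the classifier is decoded.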

Data availability

The ISIDDI database is available at https://www.isical.ac.in/~ujjwal/download/ISIDDI.html. The RETAS database is available at https://ciir.cs.umass.edu/downloads/ocr-evaluation/. The data that support the findings of this study are available from the corresponding author upon reasonable request.

Author information

Corresponding author

Correspondence to Jayati Mukherjee.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mukherjee, J., Roy, U. A Low Resource Multi-lingual Simultaneous Script Identification and Text Recognition Model. SN COMPUT. SCI. 5, 740 (2024). https://doi.org/10.1007/s42979-024-03107-6

  • DOI: https://doi.org/10.1007/s42979-024-03107-6
