A Deep OCR for Degraded Bangla Documents

Published: 25 August 2022

Abstract

Despite the significant success of document image analysis techniques, efficient Optical Character Recognition (OCR) of degraded document images remains an open problem. Although a body of work has been reported on degraded document recognition for the English language, little attention has been paid to Indic scripts. In this work, we focus on developing a degraded-document OCR for Bangla, a major Indian language. In general, an OCR system includes segmentation of the foreground text from the background, followed by recognition of the extracted text. The text segmentation module assigns a foreground or background label to each pixel of the document image. In this paper, we present a new OCR system that is particularly suitable for degraded Bangla document images. The contribution is twofold. In the first phase, we use a semi-supervised Markov Random Field (MRF)-based Generative Adversarial Network (GAN) model (which we call MRF-GAN) for foreground segmentation of text from degraded documents. In the proposed MRF-GAN, we extend the GAN framework to a multitask learning mechanism in which the discriminator-classifier network both differentiates between real and fake images and assigns a foreground or background label to each pixel. In the second phase, we propose a new encoder-decoder based recognizer that incorporates an attention-based character-to-word prediction model capable of minimizing the Word Error Rate (WER). We optimize this network using a Multitask-based Transfer Learning (MTTL) scheme. We perform experiments on a publicly available degraded Bangla document image dataset as well as on a new degraded printed Hindi document image dataset created as part of the present study. Experimental results demonstrate the efficacy of the proposed OCR.
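The recognizer in the second phase is trained to minimize the Word Error Rate. For reference, WER is conventionally computed as the word-level edit distance between a hypothesis and its reference transcription, normalized by the reference length. The sketch below is a generic dynamic-programming implementation of that standard metric, not code from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

A perfect transcription gives a WER of 0; one substituted word out of three gives 1/3.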

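The multitask discriminator-classifier objective described in the abstract can be thought of as the sum of an adversarial real/fake term and a per-pixel foreground/background segmentation term. The function below is a minimal illustration of that idea only; the function name, the flattened-pixel representation, and the equal weighting of the two terms are assumptions, not the paper's formulation.

```python
import math

def multitask_discriminator_loss(real_fake_logit, pixel_logits, is_real, pixel_labels):
    """Illustrative multitask loss: adversarial BCE plus mean per-pixel BCE.

    real_fake_logit: scalar logit scoring the whole image as real (1) or fake (0).
    pixel_logits:    flattened per-pixel logits for the foreground class.
    is_real:         1 if the image is real, 0 if generated.
    pixel_labels:    per-pixel ground truth, 1 = foreground, 0 = background.
    """
    # Adversarial term: binary cross-entropy on the real/fake score.
    p = 1.0 / (1.0 + math.exp(-real_fake_logit))
    adv = -(is_real * math.log(p) + (1 - is_real) * math.log(1 - p))
    # Segmentation term: binary cross-entropy averaged over all pixels.
    seg = 0.0
    for logit, label in zip(pixel_logits, pixel_labels):
        q = 1.0 / (1.0 + math.exp(-logit))
        seg += -(label * math.log(q) + (1 - label) * math.log(1 - q))
    seg /= len(pixel_labels)
    return adv + seg
```

Combining both terms in one network is what makes the discriminator a discriminator-classifier: its gradients carry both the real/fake signal and the pixel-labeling signal back to the shared features.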


• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5 (September 2022), 486 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3533669


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 August 2022
      • Online AM: 20 April 2022
      • Revised: 1 January 2022
      • Accepted: 1 January 2022
      • Received: 1 July 2021

      Qualifiers

      • research-article
      • Refereed
