A Novel Machine Annotated Balanced Bangla OCR Corpus

  • Conference paper
Computer Vision and Image Processing (CVIP 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1377)

Abstract

We present a balanced, 100% machine-annotated Bangla OCR corpus of nearly eight and a half million characters, a first-of-its-kind effort for the Bangla language. Although Bangla is among the five most widely used languages in the world, it is still considered a resource-poor language because even the most basic language processing tools are not available. This corpus goes a long way toward mitigating that shortcoming. This work continues our previous effort to build a synthetic corpus for training Bangla OCRs, and it is an intermediate step toward our ultimate goal: a true gold-standard corpus for Bangla OCRs. The paper discusses not only the corpus but also the entire processing pipeline, from scanning pages to identifying lines to segmenting words and finally characters, which are then annotated using a homegrown OCR. We also present a comprehensive review of the corpus characteristics and specifications, demonstrating that the corpus is very nearly balanced, a highly desirable property in corpus design.
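The abstract does not say how balance is measured. As a minimal sketch of one way to quantify it, the Python snippet below summarises a list of character labels with two common indicators: the ratio of the largest to the smallest class, and the Shannon entropy of the class distribution normalised to the range 0 to 1. The function name `balance_report`, the choice of metrics, and the sample labels are all illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def balance_report(labels):
    """Summarise how evenly character classes are represented.

    Hypothetical helper for illustration only; the paper's own
    balance criteria are not stated in the abstract.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")

    # Shannon entropy of the class distribution, in bits.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

    # Divide by the maximum possible entropy (uniform distribution).
    # A corpus with a single class is treated as trivially balanced.
    if len(counts) > 1:
        normalised = entropy / math.log2(len(counts))
    else:
        normalised = 1.0

    return {
        "classes": len(counts),
        "samples": total,
        # 1.0 means every class has the same number of samples.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
        # 1.0 means a perfectly uniform class distribution.
        "normalised_entropy": normalised,
    }


if __name__ == "__main__":
    # Toy labels standing in for machine-annotated character classes.
    print(balance_report(["ক", "ক", "খ", "খ", "গ", "গ"]))
```

Under this sketch, a nearly balanced corpus would show an imbalance ratio close to 1 and a normalised entropy close to 1; what threshold counts as "nearly" is a design choice left open here.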

Author information

Corresponding author

Correspondence to Md Jamiur Rahman Rifat.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rifat, M.J.R., Banik, M., Hasan, N., Nahar, J., Rahman, F. (2021). A Novel Machine Annotated Balanced Bangla OCR Corpus. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds) Computer Vision and Image Processing. CVIP 2020. Communications in Computer and Information Science, vol 1377. Springer, Singapore. https://doi.org/10.1007/978-981-16-1092-9_13

  • DOI: https://doi.org/10.1007/978-981-16-1092-9_13

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-1091-2

  • Online ISBN: 978-981-16-1092-9

  • eBook Packages: Computer Science, Computer Science (R0)
