CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This paper describes the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. It is the first handwritten database in this area to be made available as an open-source resource. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 in Bangla text mixed with English words. The off-line handwritten pages were collected from different data sources. After collection, all documents were preprocessed and divided into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. We also provide ground truth images for line segmentation. To generate them, we first labeled each line in a document page automatically using one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors using our tool GT Gen 1.1. Our algorithm achieves line extraction accuracies of 90.6% and 92.38% on the two databases, respectively. Both databases, along with the ground truth annotations and the ground truth generation tool, are freely available at http://code.google.com/p/cmaterdb.
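
The abstract does not say how line extraction accuracy is scored against the ground truth images. As a rough illustration only, the sketch below uses one common convention, the pixel-overlap matching of the ICDAR 2009 Handwriting Segmentation Contest (reference 20): a detected line counts as correct when its overlap with a ground truth line, measured as intersection over union of labeled pixels, reaches a threshold. The label-map encoding (0 for background, a positive integer per line), the function names, and the 0.95 default threshold are assumptions made for this example; they are not taken from the paper or from GT Gen 1.1.

```python
from collections import Counter


def match_counts(gt, pred):
    """Count labeled pixels per line and per (ground truth, predicted) pair.

    gt and pred are equally sized 2D lists of ints. 0 marks background and
    any positive value is a line id (a hypothetical encoding).
    """
    gt_size, pred_size, overlap = Counter(), Counter(), Counter()
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g:
                gt_size[g] += 1
            if p:
                pred_size[p] += 1
            if g and p:
                overlap[(g, p)] += 1
    return gt_size, pred_size, overlap


def line_accuracy(gt, pred, threshold=0.95):
    """Return (detection rate, recognition accuracy) for one page.

    A pair of lines is a one-to-one match when intersection over union of
    their pixels is at least `threshold` (assumed value, not from the paper).
    """
    gt_size, pred_size, overlap = match_counts(gt, pred)
    one_to_one = 0
    for (g, p), inter in overlap.items():
        union = gt_size[g] + pred_size[p] - inter
        if inter / union >= threshold:
            one_to_one += 1
    n, m = len(gt_size), len(pred_size)
    detection_rate = one_to_one / n if n else 0.0
    recognition_accuracy = one_to_one / m if m else 0.0
    return detection_rate, recognition_accuracy
```

Calling `line_accuracy(gt, pred)` on a ground truth label map and a detected label map of the same size returns the fraction of ground truth lines that were matched and the fraction of detected lines that were matched. A line that the detector splits in two matches neither fragment, so it lowers both values.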

References

  1. Wilkinson, R.A., Geist, J., Janet, S., Grother, P., Burges, C.J.C., Creecy, R., Hammond, B., Hull, J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: The first census optical character recognition systems conference. Technical Report NISTIR 4912, The U.S Bureau of Census and the National Institute of Standards and Technology, Gaithersburg (July 1992)

  2. Marti, U., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: Proceedings of fifth international conference on document analysis and recognition, pp. 705–708. Bangalore (1999)

  3. Suen C.Y., Nadal C., Legault R., Mai T.A., Lam L.: Computer recognition of unconstrained handwritten numerals. Proc. IEEE 80(7), 1162–1180 (1992)

  4. Kim, D.H., Hwang, Y.S., Park, S.T., Kim, E.J., Paek, S.H., Bang, S.Y.: Handwritten Korean character image database PE92. In: Proceedings of the Second International Conference on Document Analysis and Recognition, pp. 470–473 (1993)

  5. Saito T., Yamada H., Yamamoto K.: On the database ELT9 of Handprinted characters in JIS Chinese characters and Its Analysis (in Japanese). IECEJ Trans. Vol. J. 68-D(4), 757–764 (1985)

  6. http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/, “ICDAR2009 Handwriting Segmentation Contest”

  7. Basu S., Chaudhuri C., Kundu M., Nasipuri M., Basu D.K.: Text line extraction from multi-skewed handwritten documents. Pattern Recognit. 40(6), 1825–1839 (2007)

  8. Gonzalez R.C., Woods R.E.: Digital Image Processing. Prentice-Hall, India (1992)

  9. http://www.isical.ac.in/~ujjwal/download/database.html

  10. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Line and word segmentation of handwritten documents. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 247–252, August 19–21, Canada (2008)

  11. Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line detection in handwritten documents. Pattern Recognit. 41(12), 3758–3772 (2008)

  12. Yin, F., Liu, C.: Handwritten text line segmentation by clustering with distance metric learning. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 229–234, August 19–21, Canada (2008)

  13. Du, X., Pan, W., Bui, T.D.: Text line segmentation in handwritten documents using Mumford-Shah model. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 253–258, August 19–21, Canada (2008)

  14. Li Y., Zheng Y., Doermann D.: Script-independent text line segmentation in freestyle handwritten documents. IEEE Trans. PAMI 30(8), 1313–1329 (2008)

  15. Roy, P.P., Pal, U., Llados, J.: Morphology based handwritten Line segmentation using foreground and background information. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 241–246, August 19–21, Canada (2008)

  16. Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line and word segmentation of handwritten documents. Pattern Recognit. 42(12), 3169–3183 (2009)

  17. Yin F., Liu C.: Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognit. 42(12), 3146–3157 (2009)

  18. Statistical Summaries, Ethnologue, 2005. Retrieved 2007-03-03

  19. Languages spoken by more than 10 million people. Encarta Encyclopedia. Retrieved 2007-03-03, (2007)

  20. Gatos, B., Stamatopoulos, N., Louloudis, G.: ICDAR 2009 handwriting segmentation contest. In: Proceedings of ICDAR 2009, pp. 1393–1397 (2009)

  21. Sarkar, R., Basu, S., Das, N., Mollah, A.F., Kundu, M., Nasipuri, M.: Line extraction from unconstrained handwritten document pages using piece-wise water-flow technique. In: Proceedings (CD) of 4th Indian International Conference on Artificial Intelligence (IICAI), pp. 1861–1872. Tumkur, India, 16–18 Dec (2009)

  22. Khandelwal, A., Choudhury, P., Sarkar, R., Basu, S., Nasipuri, M., Das, N.: Text line segmentation for unconstrained handwritten document images using neighborhood connected component analysis. In: Proceedings of International Conference on PReMI, pp. 369–374, Dec (2009)

  23. Bhattacharya U., Chaudhuri B.B.: Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)

  24. Luthy, F., Varga, T., Bunke, H.: Using hidden Markov models as a tool for handwritten text line segmentation. In: Proceedings of Ninth International Conference on Document Analysis and Recognition, pp. 8–12. Curitiba, Brazil (2007)

  25. Huang, C., Srihari, S.: Word segmentation of off-line handwritten documents. In: Proceedings of the Document Recognition and Retrieval (DRR) XV, IS&T/SPIE Annual Symposium, San Jose, CA, USA, January (2008)

  26. Hull J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)

  27. Al-Ohali Y., Cheriet M., Suen C.: Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 36, 111–121 (2003)

  28. Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida, T.: Tegaki Suji Database ‘IPTP CD-ROM1’ no ichi bunseki (in Japanese). Autumn Meeting of IEICE, Vol. D-309, Sept. (1994)

  29. http://www.computerjagat.org/, “Computer Jagat”

  30. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: Word level script identification from Bangla and Devanagri handwritten texts mixed with Roman script. J. Comput. 2(2), 103–108, ISSN 2151-9617 (2010)

  31. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A two-stage approach for segmentation of handwritten Bangla word images. In: Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 403–408. Montreal, Canada, (2008)

  32. Ikeda, H., Ogawa, Y., Koga, M., Nishimura, H., Sako, H., Fujisawa, H.: A recognition method for touching Japanese handwritten characters. In: Proceedings of ICDAR, pp. 641–644. Bangalore, India (1999)

  33. Yi-Kai C., Jhing-Fa W.: Segmentation of single or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Trans. PAMI 22(11), 1304–1317 (2000)

  34. Lu Y.: Machine printed character segmentation-an overview. Pattern Recognit. 28(7), 67–80 (1995)

Author information

Corresponding author

Correspondence to Mita Nasipuri.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (DOC 22 kb)

About this article

Cite this article

Sarkar, R., Das, N., Basu, S. et al. CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image. IJDAR 15, 71–83 (2012). https://doi.org/10.1007/s10032-011-0148-6

  • DOI: https://doi.org/10.1007/s10032-011-0148-6
