CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image

Sarkar, Ram; Das, Nibaran; Basu, Subhadip; Kundu, Mahantapas; Nasipuri, Mita; Basu, Dipak Kumar

doi:10.1007/s10032-011-0148-6

Original Paper
Published: 24 February 2011

Volume 15, pages 71–83, (2012)
International Journal on Document Analysis and Recognition (IJDAR)

Ram Sarkar¹,
Nibaran Das¹,
Subhadip Basu¹,
Mahantapas Kundu¹,
Mita Nasipuri¹ &
Dipak Kumar Basu²

Abstract

In this paper, we describe the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and of Bangla text mixed with English words. It is the first handwritten database of this kind to be made available as an open-source resource. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 are written in Bangla text mixed with English words. The pages were collected from different data sources. After collection, all pages were preprocessed and divided into two groups: CMATERdb1.1.1, containing pages written in Bangla script only, and CMATERdb1.2.1, containing pages written in Bangla text mixed with English words. We also provide ground truth images for line segmentation. To generate them, we first labeled each line in a document page automatically using one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors using our tool GT Gen 1.1. Our algorithm achieves line extraction accuracies of 90.6% and 92.38% on the two databases, respectively. Both databases, along with the ground truth annotations and the ground truth generation tool, are freely available at http://code.google.com/p/cmaterdb.
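
The abstract reports line extraction accuracies but does not say how they are computed. As a rough illustration only, the sketch below scores a predicted line labeling against a ground truth labeling. It assumes both are 2-D grids of integer labels (0 for background, a positive id per text line) and counts a ground truth line as correctly extracted when some predicted line overlaps it with intersection-over-union at or above a threshold. The function name, the label encoding, and the threshold are assumptions for this sketch, not details taken from the paper or from GT Gen 1.1.

```python
from collections import Counter


def line_extraction_accuracy(gt, pred, threshold=0.9):
    """Fraction of ground-truth lines matched by some predicted line.

    gt and pred are equally sized 2-D lists of integer labels:
    0 is background, any positive value is a text-line id.
    The IoU-based matching rule is an assumed criterion, not one
    taken from the paper.
    """
    gt_size = Counter()    # pixels per ground-truth line
    pred_size = Counter()  # pixels per predicted line
    overlap = Counter()    # shared pixels per (gt line, predicted line) pair
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g:
                gt_size[g] += 1
            if p:
                pred_size[p] += 1
            if g and p:
                overlap[(g, p)] += 1
    if not gt_size:
        return 0.0
    matched = set()
    for (g, p), inter in overlap.items():
        union = gt_size[g] + pred_size[p] - inter
        if inter / union >= threshold:
            matched.add(g)
    return len(matched) / len(gt_size)


# Toy page with two ground-truth lines; the second is only partly recovered.
gt_page = [[1, 1, 0],
           [2, 2, 2]]
pred_page = [[5, 5, 0],
             [7, 7, 0]]
accuracy = line_extraction_accuracy(gt_page, pred_page)
```

With the default threshold only the first line of the toy page counts as extracted; lowering the threshold makes the criterion more lenient. A real evaluation would load the labels from the ground truth images rather than from hand-written lists.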

References

Wilkinson, R.A., Geist, J., Janet, S., Grother, P., Burges, C.J.C., Creecy, R., Hammond, B., Hull, J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: The first census optical character recognition systems conference. Technical Report NISTIR 4912, The U.S Bureau of Census and the National Institute of Standards and Technology, Gaithersburg (July 1992)
Marti, U., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: Proceedings of fifth international conference on document analysis and recognition, pp. 705–708. Bangalore (1999)
Suen C.Y., Nadal C., Legault R., Mai T.A., Lam L.: Computer recognition of unconstrained handwritten numerals. Proc. IEEE 80(7), 1162–1180 (1992)
Kim, D.H., Hwang, Y.S., Park, S.T., Kim, E.J., Paek, S.H., Bang, S.Y.: Handwritten Korean character image database PE92. In: Proceedings of the Second international Conference on Document Analysis and Recognition, pp. 470–473, (1993)
Saito T., Yamada H., Yamamoto K.: On the database ELT9 of Handprinted characters in JIS Chinese characters and Its Analysis (in Japanese). IECEJ Trans. Vol. J. 68-D(4), 757–764 (1985)
ICDAR2009 Handwriting Segmentation Contest. http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/
Basu S., Chaudhuri C., Kundu M., Nasipuri M., Basu D.K.: Text line extraction from multi-skewed handwritten documents. Pattern Recognit. 40(6), 1825–1839 (2007)
Gonzalez R.C., Woods R.E.: Digital Image Processing. Prentice-Hall, India (1992)
http://www.isical.ac.in/~ujjwal/download/database.html
Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Line and word segmentation of handwritten documents. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 247–252. August 19-21, Canada (2008)
Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line detection in handwritten documents. Pattern Recognit. 41(12), 3758–3772 (2008)
Yin, F., Liu, C.: Handwritten text line segmentation by clustering with distance metric learning. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 229–234, August 19–21, Canada (2008)
Du, X., Pan, W., Bui, T.D.: Text line segmentation in handwritten documents using Mumford-Shah model. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 253–258. August 19–21, Canada (2008)
Li Y., Zheng Y., Doermann D.: Script-independent text line segmentation in freestyle handwritten documents. IEEE Trans. PAMI 30(8), 1313–1329 (2008)
Roy, P.P., Pal, U., Llados, J.: Morphology based handwritten Line segmentation using foreground and background information. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 241–246, August 19–21, Canada (2008)
Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line and word segmentation of handwritten documents. Pattern Recognit. 42(12), 3169–3183 (2009)
Yin F., Liu C.: Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognit. 42(12), 3146–3157 (2009)
Statistical Summaries, Ethnologue, 2005. Retrieved 2007-03-03
Languages spoken by more than 10 million people. Encarta Encyclopedia. Retrieved 2007-03-03, (2007)
Basilios Gatos, Nikolaos Stamatopoulos, Georgios Louloudis. ICDAR 2009 Handwriting Segmentation Contest. ICDAR 2009, pp. 1393–1397
Sarkar, R., Basu, S., Das, N., Mollah, A.F., Kundu, M., Nasipuri, M.: Line extraction from unconstrained handwritten document pages using piece-wise water-flow technique. In: Proceedings (CD) of 4th Indian International Conference on Artificial Intelligence (IICAI), pp. 1861–1872. Tumkur, India, 16–18 Dec (2009)
Khandelwal, A., Choudhury, P., Sarkar, R., Basu, S., Nasipuri, M., Das, N.: Text line segmentation for unconstrained handwritten document images using neighborhood connected component analysis. In: Proceedings of International Conference on PReMI, pp. 369–374, Dec (2009)
Bhattacharya U., Chaudhuri B.B.: Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)
Luthy, F., Varga, T., Bunke, H.: Using hidden Markov models as a tool for handwritten text line segmentation. In: Proceedings of Ninth International Conference on Document Analysis and Recognition, pp. 8–12. Curitiba, Brazil (2007)
Huang, C., Srihari, S.: Word segmentation of off-line handwritten documents. In: Proceedings of the Document Recognition and Retrieval (DRR) XV, IST/SPIE Annual Symposium, San Jose, CA, USA, January (2008)
Hull J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
Al-Ohali Y., Cheriet M., Suen C.: Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 36, 111–121 (2003)
Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida, T.: Tegaki Suji Database ‘IPTP CD-ROM1’ no ichi bunseki (in Japanese). Autumn Meeting of IEICE, Vol. D-309, Sept. (1994)
Computer Jagat. http://www.computerjagat.org/
Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: Word level script identification from Bangla and Devanagri handwritten texts mixed with Roman script. J. Comput. vol. 2(2): 103–108, Feb, ISSN 2151-9617, (2010)
Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A two-stage approach for segmentation of handwritten Bangla word images. In: Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 403–408. Montreal, Canada, (2008)
Ikeda, H., Ogawa, Y., Koga, M., Nishimura, H., Sako, H., Fujisawa, H.: A Recognition Method for Touching Japanese Handwritten Characters. In: Proceedings of ICDAR, pp. 641-644. Bangalore, India (1999)
Yi-Kai C., Jhing-Fa W.: Segmentation of single or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Trans. PAMI 22(11), 1304–1317 (2000)
Lu Y.: Machine printed character segmentation-an overview. Pattern Recognit. 28(7), 67–80 (1995)

Author information

Authors and Affiliations

Computer Science and Engineering Department, Jadavpur University, Kolkata, 700032, India
Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu & Mita Nasipuri
A.I.C.T.E. Emeritus Fellow, Computer Science and Engineering Department, Jadavpur University, Kolkata, 700032, India
Dipak Kumar Basu

Corresponding author

Correspondence to Mita Nasipuri.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (DOC 22 kb)

About this article

Cite this article

Sarkar, R., Das, N., Basu, S. et al. CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image. IJDAR 15, 71–83 (2012). https://doi.org/10.1007/s10032-011-0148-6

Received: 03 June 2010
Revised: 18 August 2010
Accepted: 24 January 2011
Published: 24 February 2011
Issue Date: March 2012
DOI: https://doi.org/10.1007/s10032-011-0148-6
