Abstract
In this paper, we describe the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. This is the first handwritten database in this area to be made available as an open-source resource. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 are written in Bangla text mixed with English words. The off-line handwritten pages were collected from different data sources. After collection, all the documents were preprocessed and distributed into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we also provide ground truth images for line segmentation. To generate the ground truth images, we first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors using our tool GT Gen 1.1. Using our algorithm, line extraction accuracies of 90.6% and 92.38% are achieved on the two databases, respectively. Both databases, along with the ground truth annotations and the ground truth generating tool, are freely available at http://code.google.com/p/cmaterdb.
References
Wilkinson, R.A., Geist, J., Janet, S., Grother, P., Burges, C.J.C., Creecy, R., Hammond, B., Hull, J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: The first census optical character recognition systems conference. Technical Report NISTIR 4912, The U.S Bureau of Census and the National Institute of Standards and Technology, Gaithersburg (July 1992)
Marti, U., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: Proceedings of fifth international conference on document analysis and recognition, pp. 705–708. Bangalore (1999)
Suen C.Y., Nadal C., Legault R., Mai T.A., Lam L.: Computer recognition of unconstrained handwritten numerals. Proc. IEEE 80(7), 1162–1180 (1992)
Kim, D.H., Hwang, Y.S., Park, S.T., Kim, E.J., Paek, S.H., Bang, S.Y.: Handwritten Korean character image database PE92. In: Proceedings of the Second International Conference on Document Analysis and Recognition, pp. 470–473 (1993)
Saito T., Yamada H., Yamamoto K.: On the database ELT9 of Handprinted characters in JIS Chinese characters and Its Analysis (in Japanese). IECEJ Trans. Vol. J. 68-D(4), 757–764 (1985)
http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/, “ICDAR2009 Handwriting Segmentation Contest”
Basu S., Chaudhuri C., Kundu M., Nasipuri M., Basu D.K.: Text line extraction from multi-skewed handwritten documents. Pattern Recognit. 40(6), 1825–1839 (2007)
Gonzalez R.C., Woods R.E.: Digital Image Processing. Prentice-Hall, India (1992)
Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Line and word segmentation of handwritten documents. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 247–252. August 19-21, Canada (2008)
Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line detection in handwritten documents. Pattern Recognit. 41(12), 3758–3772 (2008)
Yin, F., Liu, C.: Handwritten text line segmentation by clustering with distance metric learning. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 229–234, August 19–21, Canada (2008)
Du, X., Pan, W., Bui, T.D.: Text line segmentation in handwritten documents using Mumford-Shah model. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 253–258. August 19–21, Canada (2008)
Li Y., Zheng Y., Doermann D.: Script-independent text line segmentation in freestyle handwritten documents. IEEE Trans. PAMI 30(8), 1313–1329 (2008)
Roy, P.P., Pal, U., Llados, J.: Morphology based handwritten Line segmentation using foreground and background information. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 241–246, August 19–21, Canada (2008)
Louloudis G., Gatos B., Pratikakis I., Halatsis C.: Text line and word segmentation of handwritten documents. Pattern Recognit. 42(12), 3169–3183 (2009)
Yin F., Liu C.: Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognit. 42(12), 3146–3157 (2009)
Statistical Summaries, Ethnologue, 2005. Retrieved 2007-03-03
Languages spoken by more than 10 million people. Encarta Encyclopedia. Retrieved 2007-03-03, (2007)
Basilios Gatos, Nikolaos Stamatopoulos, Georgios Louloudis. ICDAR 2009 Handwriting Segmentation Contest. ICDAR 2009, pp. 1393–1397
Sarkar, R., Basu, S., Das, N., Mollah, A.F., Kundu, M., Nasipuri, M.: Line extraction from unconstrained handwritten document pages using piece-wise water-flow technique. In: Proceedings (CD) of 4th Indian International Conference on Artificial Intelligence (IICAI), pp. 1861–1872. Tumkur, India, 16–18 Dec (2009)
Khandelwal, A., Choudhury, P., Sarkar, R., Basu, S., Nasipuri, M., Das, N.: Text line segmentation for unconstrained handwritten document images using neighborhood connected component analysis. In: Proceedings of International Conference on PReMI, pp. 369–374, Dec (2009)
Bhattacharya U., Chaudhuri B.B.: Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)
Luthy, F., Varga, T., Bunke, H.: Using hidden Markov models as a tool for handwritten text line segmentation. In: Proceedings of Ninth International Conference on Document Analysis and Recognition, pp. 8–12. Curitiba, Brazil (2007)
Huang, C., Srihari, S.: Word segmentation of off-line handwritten documents. In: Proceedings of the Document Recognition and Retrieval (DRR) XV, IST/SPIE Annual Symposium, San Jose, CA, USA, January (2008)
Hull J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
Al-Ohali Y., Cheriet M., Suen C.: Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 36, 111–121 (2003)
Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida, T.: Tegaki Suji Database ‘IPTP CD-ROM1’ no ichi bunseki (in Japanese). Autumn Meeting of IEICE, Vol. D-309, Sept. (1994)
http://www.computerjagat.org/, “Computer Jagat”
Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: Word level script identification from Bangla and Devanagri handwritten texts mixed with Roman script. J. Comput. 2(2), 103–108 (Feb 2010). ISSN 2151-9617
Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A two-stage approach for segmentation of handwritten Bangla word images. In: Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 403–408. Montreal, Canada (2008)
Ikeda, H., Ogawa, Y., Koga, M., Nishimura, H., Sako, H., Fujisawa, H.: A Recognition Method for Touching Japanese Handwritten Characters. In: Proceedings of ICDAR, pp. 641-644. Bangalore, India (1999)
Yi-Kai C., Jhing-Fa W.: Segmentation of single or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Trans. PAMI 22(11), 1304–1317 (2000)
Lu Y.: Machine printed character segmentation-an overview. Pattern Recognit. 28(1), 67–80 (1995)
Cite this article
Sarkar, R., Das, N., Basu, S. et al. CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image. IJDAR 15, 71–83 (2012). https://doi.org/10.1007/s10032-011-0148-6
DOI: https://doi.org/10.1007/s10032-011-0148-6