PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification

Multimedia Tools and Applications

Abstract

Without publicly available datasets, methods cannot be compared fairly or reliably; this holds in particular for handwritten document recognition (HDR). Within HDR, recognition of documents in Indic scripts is still at an early stage compared with scripts such as Roman and Arabic. In this paper, we present PHDIndic_11, a page-level handwritten document image dataset of 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 comprises 1458 document text-pages written by 463 individuals from various parts of India. We also report benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be used in many other document image analysis applications, such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, separation of handwritten and machine-printed text, and writer identification.

Author information

Corresponding author

Correspondence to Kaushik Roy.

About this article

Cite this article

Obaidullah, S.M., Halder, C., Santosh, K.C. et al. PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification. Multimed Tools Appl 77, 1643–1678 (2018). https://doi.org/10.1007/s11042-017-4373-y

  • DOI: https://doi.org/10.1007/s11042-017-4373-y
