PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification

Obaidullah, Sk Md; Halder, Chayan; Santosh, K. C.; Das, Nibaran; Roy, Kaushik

doi:10.1007/s11042-017-4373-y

Published: 18 January 2017

Volume 77, pages 1643–1678, (2018)

Multimedia Tools and Applications

Abstract

Without a publicly available dataset, specifically in handwritten document recognition (HDR), we cannot make a fair and reliable comparison between methods. Within HDR, document recognition for Indic scripts is still at an early stage compared with other scripts such as Roman and Arabic. In this paper, we present a page-level handwritten document image dataset (PHDIndic_11) of 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 is composed of 1458 document text-pages written by 463 individuals from various parts of India. Further, we report benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be used in many other applications of document image analysis, such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, separation of handwritten and machine-printed text, and writer identification.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Aliah University, Kolkata, India
Sk Md Obaidullah
Department of Computer Science, West Bengal State University, Kolkata, India
Chayan Halder & Kaushik Roy
Department of Computer Science, University of South Dakota, Vermillion, SD, USA
K. C. Santosh
Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
Nibaran Das

Corresponding author

Correspondence to Kaushik Roy.

About this article

Cite this article

Obaidullah, S.M., Halder, C., Santosh, K.C. et al. PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification. Multimed Tools Appl 77, 1643–1678 (2018). https://doi.org/10.1007/s11042-017-4373-y

Received: 10 April 2016
Revised: 24 October 2016
Accepted: 09 January 2017
Published: 18 January 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11042-017-4373-y
