Automatic Indic script identification from handwritten documents: page, block, line and word-level approach

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Script identification has been a well-studied problem in the literature over the last decade, and several methods for automatic script identification have been reported. All of these methods treat a document at one level only (page, block, line or word), but no experimental or empirical conclusion has been offered on how to choose a particular level of work. To address this, we have carried out a multi-level script identification experiment, i.e., the same document is considered at four different levels, namely page, block, line and word. Two types of features are considered, script dependent and script independent, and both are computed at each level to categorize the different scripts. The experiment is conducted on a newly created handwritten multi-script and multi-level dataset, where on average 5 blocks, 7.5 lines and 15 words are generated from a single page (440 pages, 2200 blocks, 3300 lines and 6600 words in total). Finally, we draw conclusions on two major issues: (1) finding an optimal level of work, i.e., page, block, line or word level, and (2) providing a qualitative measure of the feature set at the particular level of work considered.

Author information

Corresponding author

Correspondence to K. C. Santosh.

About this article

Cite this article

Obaidullah, S.M., Santosh, K.C., Halder, C. et al. Automatic Indic script identification from handwritten documents: page, block, line and word-level approach. Int. J. Mach. Learn. & Cyber. 10, 87–106 (2019). https://doi.org/10.1007/s13042-017-0702-8

  • DOI: https://doi.org/10.1007/s13042-017-0702-8
