Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification

S. K. Obaidullah, K. C. Santosh, Chayan Halder, Nibaran Das, Kaushik Roy

Source Title: International Journal of Computer Vision and Image Processing (IJCVIP), 7(2)

ISSN: 2155-6997|EISSN: 2155-6989|EISBN13: 9781522514404|DOI: 10.4018/IJCVIP.2017040106

MLA

Obaidullah, S. K., et al. "Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification." IJCVIP vol.7, no.2 2017: pp.81-94. http://doi.org/10.4018/IJCVIP.2017040106

APA

Obaidullah, S. K., Santosh, K. C., Halder, C., Das, N., & Roy, K. (2017). Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification. International Journal of Computer Vision and Image Processing (IJCVIP), 7(2), 81-94. http://doi.org/10.4018/IJCVIP.2017040106

Chicago

Obaidullah, S. K., et al. "Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification," International Journal of Computer Vision and Image Processing (IJCVIP) 7, no.2: 81-94. http://doi.org/10.4018/IJCVIP.2017040106

Abstract

Document analysis research suffers from a lack of public datasets. Without a publicly available dataset, one cannot make a fair comparison with state-of-the-art methods. To bridge this gap, the authors propose a word-level document image dataset covering 13 Indic languages written in 11 official scripts. It comprises 39K words, equally distributed at 3K words per language. For baseline results, five classifiers are used: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA), simple logistic (SL), a library for linear classification (LibLINEAR), and Bayesian network (BayesNet). They are paired with three state-of-the-art features, spatial energy (SE), wavelet energy (WE), and the Radon transform (RT), including their possible combinations. The authors observed that MLP provides better results when all features are used, achieving bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
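To make the idea of a wavelet energy feature concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the wavelet family, decomposition depth, or normalization, so the single-level Haar transform, the per-band energy fraction, and the function names `haar_dwt2` and `wavelet_energy_features` are all assumptions chosen for illustration.

```python
import numpy as np


def haar_dwt2(image):
    """One level of an orthonormal 2-D Haar transform (illustrative assumption).

    Returns four sub-bands: the approximation band followed by three detail bands.
    """
    img = np.asarray(image, dtype=float)
    # Crop to even dimensions so pixels pair up cleanly.
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]

    # Low-pass and high-pass filtering across adjacent column pairs.
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)

    # Repeat the same filtering across adjacent row pairs.
    approx = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    detail_a = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    detail_b = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    detail_c = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return approx, detail_a, detail_b, detail_c


def wavelet_energy_features(image):
    """Fraction of total signal energy in each sub-band (hypothetical feature)."""
    bands = haar_dwt2(image)
    energies = np.array([np.sum(band ** 2) for band in bands])
    total = energies.sum()
    if total == 0:
        # An all-zero image carries no energy; return a zero vector.
        return np.zeros(len(bands))
    return energies / total
```

Under these assumptions, a word image would yield a four-element vector that could be concatenated with other descriptors before being passed to a classifier such as an MLP. A deeper decomposition or a different wavelet would change the vector's length and values.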
