research-article

Curvature feature distribution based classification of Indian scripts from document images

Authors:
Gaurav Sharma, Indian Institute of Technology, Delhi
Ritu Garg, Indian Institute of Technology, Delhi
Santanu Chaudhury, Indian Institute of Technology, Delhi

MOCR '09: Proceedings of the International Workshop on Multilingual OCR, July 2009, Article No.: 3, Pages 1–6, https://doi.org/10.1145/1577802.1577806

Published: 25 July 2009

ABSTRACT

We present a framework for classifying text document images by script. We address Indian scripts, a domain with high inter-script similarity. Indian scripts have characteristic curvature distributions that help discriminate between them visually. We use edge-direction-based features to capture the distribution of curvature, and a recently proposed feature selection algorithm to obtain the most discriminating curvature features. We automatically form a hierarchy based on statistical distances between the script models. The hierarchy lets us group similar scripts at one level and then focus on classifying among those similar scripts at the next level, which improves accuracy. We report experiments and results on a large set of about 3400 images.
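
As a rough illustration of what an edge-direction feature can look like, the sketch below computes a magnitude-weighted histogram of gradient orientations for a grayscale image. It is not the paper's feature extractor, whose details are not given in this abstract. The function name `edge_direction_histogram`, the bin count, and the central-difference gradient are all assumptions chosen for a minimal, dependency-free example.

```python
import math


def edge_direction_histogram(image, num_bins=8, min_magnitude=1e-6):
    """Return a normalized histogram of gradient orientations.

    `image` is a list of equal-length rows of numbers (grayscale values).
    Orientations are folded into [0, pi), so gradients of opposite sign
    fall into the same bin. All names here are hypothetical.
    """
    rows, cols = len(image), len(image[0])
    hist = [0.0] * num_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the two gradient components.
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude < min_magnitude:
                continue  # flat region, no edge evidence at this pixel
            angle = math.atan2(gy, gx) % math.pi  # fold into [0, pi)
            bin_index = min(int(angle / math.pi * num_bins), num_bins - 1)
            hist[bin_index] += magnitude  # stronger edges count more
    total = sum(hist)
    # Normalize so images of different sizes are comparable.
    return [h / total for h in hist] if total > 0 else hist


if __name__ == "__main__":
    # A step in intensity from left to right gives a purely horizontal gradient.
    step = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
    print(edge_direction_histogram(step))
```

A script dominated by straight strokes would concentrate its mass in a few bins, while a script with many curved strokes would spread it more evenly. That is one plausible reading of how a direction distribution can stand in for curvature.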

Index Terms

1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Machine learning

Published in
MOCR '09: Proceedings of the International Workshop on Multilingual OCR
July 2009
139 pages
ISBN: 9781605586984
DOI: 10.1145/1577802
General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (BBN Technologies)
Program Chairs: Santanu Chaudhury (IIT Delhi), Daniel Lopresti (Lehigh University)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Indic script image identification
statistical modeling
text document image classification system
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 17 of 34 submissions, 50%