research-article

Managing multilingual OCR project using XML

Authors:
Gaurav Harit (IIT, Kharagpur)
K. J. Jinesh (IIIT, Hyderabad)
Ritu Garg (IIT, Delhi)
C. V. Jawahar (IIIT, Hyderabad)
Santanu Chaudhury (IIT, Delhi)

MOCR '09: Proceedings of the International Workshop on Multilingual OCR, July 2009, Article No.: 18, Pages 1–10, https://doi.org/10.1145/1577802.1577822

Published: 25 July 2009

ABSTRACT

This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. We present some of the software and data management strategies designed for the project, which aimed to develop OCR for 11 scripts of Indian origin for which mature OCR technology was not available.

Index Terms

Managing multilingual OCR project using XML
1. Information systems
  1. Information storage systems
    1. Record storage systems
      1. Record storage alternatives
2. Software and its engineering

Published in
MOCR '09: Proceedings of the International Workshop on Multilingual OCR
July 2009
139 pages
ISBN: 9781605586984
DOI: 10.1145/1577802
General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (BBN Technologies)
Program Chairs: Santanu Chaudhury (IIT Delhi), Daniel Lopresti (Lehigh University)
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OCR markup
XML data representation
XML representation
concept based search
ground-truth data management
multilingual Indian OCR
multilingual OCR
Acceptance Rates
Overall Acceptance Rate: 17 of 34 submissions, 50%