ABSTRACT
This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. We present some of the software and data management strategies designed for a project aimed at developing OCR for 11 scripts of Indian origin, for which mature OCR technology was not available.
Index Terms
- Managing multilingual OCR project using XML