Document image ground truth generation from electronic text | IEEE Conference Publication | IEEE Xplore

Abstract:

The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed an approach that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows enhanced metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded and used for training and evaluating OCR systems. We briefly survey related work and describe our system.
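
The abstract does not give implementation details, so the following is only a minimal sketch of one step it names: deriving word-level ground truth from character-level records. It assumes the metafile has already been parsed into per-character entries holding the character, its bounding box, and its font. The `Glyph` and `Word` records and the `group_words` helper are hypothetical names introduced for illustration, not part of the described system.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in image pixels; an assumed convention.
Box = Tuple[int, int, int, int]


@dataclass
class Glyph:
    """One rendered character (hypothetical record, not the paper's format)."""
    char: str
    box: Box
    font: str


@dataclass
class Word:
    """Word-level ground truth: content, location, and font."""
    text: str
    box: Box
    font: str


def union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_words(glyphs: List[Glyph]) -> List[Word]:
    """Merge runs of consecutive non-space glyphs into words."""
    words: List[Word] = []
    current: List[Glyph] = []
    # A trailing whitespace sentinel flushes the final word.
    for g in glyphs + [Glyph(" ", (0, 0, 0, 0), "")]:
        if g.char.isspace():
            if current:
                box = current[0].box
                for c in current[1:]:
                    box = union(box, c.box)
                text = "".join(c.char for c in current)
                # Assumes one font per word; takes it from the first glyph.
                words.append(Word(text, box, current[0].font))
                current = []
        else:
            current.append(g)
    return words


glyphs = [
    Glyph("O", (10, 5, 20, 25), "Arial"),
    Glyph("C", (21, 5, 30, 25), "Arial"),
    Glyph("R", (31, 5, 41, 25), "Arial"),
    Glyph(" ", (42, 5, 46, 25), "Arial"),
    Glyph("o", (47, 11, 55, 25), "Arial"),
    Glyph("k", (56, 4, 63, 25), "Arial"),
]
words = group_words(glyphs)
```

Line and zone records could be built the same way by taking the union of word boxes. How the actual system segments lines and zones is not stated in the abstract.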
Date of Conference: 26 August 2004
Date Added to IEEE Xplore: 20 September 2004
Print ISBN: 0-7695-2128-2
Print ISSN: 1051-4651
Conference Location: Cambridge, UK
