1 April 1998 Automated conversion of structured documents into SGML
Janusz Wnek, Robert J. Price
Proceedings Volume 3305, Document Recognition V; (1998) https://doi.org/10.1117/12.304626
Event: Photonics West '98 Electronic Imaging, 1998, San Jose, CA, United States
Abstract
Intelligent document understanding (IDU) systems convert scanned document pages into an electronic format that preserves layout and logical document structure in addition to document content. Most experimental IDU systems, however, cannot fully exploit their recognition results. In this paper we present an integrated IDU system that processes documents all the way from recognition to full utilization using Standard Generalized Markup Language (SGML). The standardization and widespread use of SGML-based tools provide the means for filling the gap between document recognition and seamless document reuse. The conversion process involves OCR of a multipage document, document structure analysis, processing of tabular data and mathematical expressions, and generation of the final SGML description. Document structure analysis is reduced here to parsing OCR results and recreating document structure by performing fuzzy searches for standard phrases and format analysis. Tabular data processing uses OCR results with positional data, horizontal lines, and heuristic rules to determine cell boundaries and contents. Recognition of mathematical expressions involves OCR on an extended symbol set and equation structure recognition via transformations on a tree representation. The transformations are ordered and involve connecting separated symbols, context-sensitive OCR correction, extraction of horizontally aligned subexpressions, subscript and superscript processing, and general processing of symbols detected above or below the target symbol.
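As an illustration of the "fuzzy searches for standard phrases" step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes OCR output arrives as a list of text lines and uses Python's standard difflib similarity ratio as a stand-in for whatever matching method the paper actually uses. The phrase list, the 0.8 threshold, and the function names best_phrase and tag_lines are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical list of standard section phrases; the paper does not give its own list.
STANDARD_PHRASES = ["abstract", "introduction", "references", "conclusions"]


def best_phrase(line, phrases=STANDARD_PHRASES, threshold=0.8):
    """Return (phrase, score) for the closest standard phrase, or None.

    The threshold is an assumed value; it tolerates a few OCR character errors.
    """
    text = line.strip().lower()
    best = None
    for phrase in phrases:
        # ratio() is 2*M/T, where M is the number of matched characters
        # and T is the combined length of both strings.
        score = SequenceMatcher(None, text, phrase).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (phrase, score)
    return best


def tag_lines(ocr_lines):
    """Pair each OCR line with the matched standard phrase, or None."""
    tagged = []
    for line in ocr_lines:
        match = best_phrase(line)
        tagged.append((line, match[0] if match else None))
    return tagged
```

Under these assumptions, a misrecognized heading such as "Abstrnct" or "lntroduction" still maps to its standard phrase, while an ordinary body sentence falls below the threshold and is left untagged. A real system would presumably combine such matches with the format analysis the abstract also mentions.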
© (1998) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Janusz Wnek and Robert J. Price "Automated conversion of structured documents into SGML", Proc. SPIE 3305, Document Recognition V, (1 April 1998); https://doi.org/10.1117/12.304626
CITATIONS
Cited by 1 scholarly publication.
KEYWORDS
Optical character recognition
Data conversion
Chemical elements
Data processing
Image processing
Visualization
Fuzzy logic
