Abstract
This paper presents a syntactic method for logical structure analysis and generation, aimed at creating Web documents. The method transforms multi-page document images with hierarchical structure into an XML document. To produce a logical structure more accurately and quickly than previous methods, whose basic units are text lines, the proposed method takes hierarchically structured text regions as input. Furthermore, we define a document model that efficiently describes the geometric characteristics and logical structure information of a document class. Experimental results on 372 images scanned from a technical journal show that the method performs logical structure analysis successfully. In particular, because the method generates XML documents as the result of structural analysis, it enhances document reusability and platform independence.
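The final step described in the abstract, turning a hierarchy of labeled text regions into an XML document, can be illustrated with a minimal sketch. It assumes the logical labels have already been assigned; the `Region` class, the label names, and the `to_xml` function are hypothetical and are not the chapter's document model or parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A text region with a logical label (hypothetical structure)."""
    label: str                      # logical label, e.g. "title" or "section"
    text: str = ""                  # recognized text; empty for pure containers
    children: List["Region"] = field(default_factory=list)


def to_xml(region: Region) -> ET.Element:
    """Recursively map a labeled region tree onto XML elements."""
    element = ET.Element(region.label)
    if region.text:
        element.text = region.text
    for child in region.children:
        element.append(to_xml(child))
    return element


# A small, made-up region hierarchy standing in for the analysis result.
document = Region("article", children=[
    Region("title", "Structure Analysis and Generation for Internet Documents"),
    Region("section", children=[
        Region("heading", "Abstract"),
        Region("paragraph", "This paper presents a syntactic method."),
    ]),
])

xml_string = ET.tostring(to_xml(document), encoding="unicode")
print(xml_string)
```

Because the region tree and the XML tree share the same shape, the mapping is a plain recursive traversal; the difficult part of the method, assigning the labels from geometric characteristics, is outside this sketch.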
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Lee, K.H., Choy, Y.C., Cho, SB. (2003). Structure Analysis and Generation for Internet Documents. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_1
DOI: https://doi.org/10.1007/978-3-7908-1772-0_1
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-2519-0
Online ISBN: 978-3-7908-1772-0
eBook Packages: Springer Book Archive