Structure Analysis and Generation for Internet Documents

Intelligent Exploration of the Web

Part of the book series: Studies in Fuzziness and Soft Computing ((STUDFUZZ,volume 111))

Abstract

This paper presents a syntactic method for logical structure analysis and generation, aimed at creating Web documents. The method transforms multi-page document images with hierarchical structure into an XML document. Previous approaches take text lines as their basic units; to produce a logical structure more accurately and quickly, the proposed method instead takes hierarchically structured text regions as input. We also define a document model that efficiently describes the geometric characteristics and logical structure of a document class. Experimental results on 372 images scanned from a technical journal show that the method performs logical structure analysis successfully. In particular, because the method generates XML documents as the result of structural analysis, it enhances document reusability and platform independence.
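
As a rough illustration of the final step described in the abstract (emitting XML from a hierarchy of labeled text regions), the following minimal Python sketch may help. It is not the authors' method: the `Region` class, the element names (`article`, `title`, `section`, `heading`, `paragraph`), and the sample text are all hypothetical, and the sketch assumes the logical labels have already been assigned by some earlier analysis stage.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Region:
    """Hypothetical text region: a logical label, optional text, child regions."""
    label: str
    text: str = ""
    children: List["Region"] = field(default_factory=list)


def region_to_element(region: Region) -> ET.Element:
    """Recursively map a labeled region tree onto an XML element tree."""
    elem = ET.Element(region.label)
    if region.text:
        elem.text = region.text
    for child in region.children:
        elem.append(region_to_element(child))
    return elem


def regions_to_xml(root: Region) -> str:
    """Serialize the region hierarchy as an XML string (no XML declaration)."""
    return ET.tostring(region_to_element(root), encoding="unicode")


# Illustrative input only: labels and text are made up for this sketch.
doc = Region("article", children=[
    Region("title", "A Sample Title"),
    Region("section", children=[
        Region("heading", "Introduction"),
        Region("paragraph", "First paragraph."),
    ]),
])

xml_text = regions_to_xml(doc)
print(xml_text)
```

Because the nesting of the output mirrors the nesting of the input regions, the hierarchy recovered during analysis carries over directly into the XML, which is one plausible reading of why hierarchical regions are a convenient input unit.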

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Lee, K.H., Choy, Y.C., Cho, SB. (2003). Structure Analysis and Generation for Internet Documents. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_1

  • DOI: https://doi.org/10.1007/978-3-7908-1772-0_1

  • Publisher Name: Physica, Heidelberg

  • Print ISBN: 978-3-7908-2519-0

  • Online ISBN: 978-3-7908-1772-0

  • eBook Packages: Springer Book Archive
