DOI: 10.1145/1998076.1998079
research-article

Structure extraction from PDF-based book documents

Published: 13 June 2011

ABSTRACT

PDF documents have become a dominant knowledge repository in both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting structure information have yet to be fully explored, and the lack of effective methods hinders reuse of the information in PDF documents. To improve the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each task. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
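
As a rough illustration of the bipartite-graph idea in the abstract, figure and caption association can be sketched as a minimum-cost one-to-one matching between figures and captions. The abstract does not specify the edge weights or the solver, so everything here is an assumption: the cost is the Euclidean distance between hypothetical center points, and the function `optimal_matching` uses exhaustive search rather than whatever algorithm the paper employs.

```python
from itertools import permutations
from math import hypot


def optimal_matching(cost):
    """Return (total_cost, assignment) minimizing sum(cost[i][assignment[i]]).

    `cost` must be a square matrix: row i is a figure, column j a caption.
    This is an exhaustive search over all one-to-one assignments, used here
    only for illustration; it is not claimed to be the paper's solver.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical center points (x, y) of two figures and two captions.
figures = [(0, 0), (0, 3)]
captions = [(0, 1), (0, -2)]

# Assumed edge weight: Euclidean distance between center points.
cost = [[hypot(fx - cx, fy - cy) for (cx, cy) in captions]
        for (fx, fy) in figures]

total, assignment = optimal_matching(cost)
for fig_index, cap_index in enumerate(assignment):
    print(f"figure {fig_index} -> caption {cap_index}")
print("total cost:", total)
```

In this toy layout, pairing each figure with its nearest caption in turn would give figure 0 the caption at distance 1 and leave figure 1 with the caption at distance 5, for a total of 6. The global matching instead pairs figure 0 with caption 1 and figure 1 with caption 0, for a total of 4, which is the kind of gain a global optimum offers over greedy pairing.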


    Published in

      JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
      June 2011
      500 pages
      ISBN: 9781450307444
      DOI: 10.1145/1998076

      Copyright © 2011 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 415 of 1,482 submissions, 28%
