DOI: 10.1145/1998076.1998079
research-article

Structure extraction from PDF-based book documents

Published: 13 June 2011

Abstract

PDF has become a dominant format for knowledge repositories in both academia and industry, largely because PDF documents are convenient to print and exchange. However, methods for automatically extracting structure information have yet to be fully explored, and the lack of effective methods hinders reuse of the information held in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each task. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
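
To make the bipartite-graph formulation concrete, the following is a minimal sketch of minimum-cost bipartite matching applied to figure and caption association. It is not the paper's implementation: the abstract does not specify the OM algorithm or the edge weights, so the sketch assumes the standard Hungarian (Kuhn-Munkres) method, and the function name `optimal_matching` and the cost values are hypothetical.

```python
from typing import List, Sequence


def optimal_matching(cost: Sequence[Sequence[float]]) -> List[int]:
    """Minimum-cost assignment on a bipartite graph (Hungarian method).

    cost[i][j] is the cost of pairing left node i with right node j.
    Requires len(cost) <= len(cost[0]). Returns match with match[i] = j.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many right nodes as left nodes")
    inf = float("inf")
    u = [0.0] * (n + 1)   # potentials of left nodes (1-based)
    v = [0.0] * (m + 1)   # potentials of right nodes (1-based)
    p = [0] * (m + 1)     # p[j] = left node matched to right node j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that one more edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free right node: path found
                break
        while j0:            # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


# Hypothetical costs: rows are figures, columns are candidate captions,
# and each entry stands for a layout distance between the two blocks.
figure_caption_cost = [
    [1, 2, 9],
    [2, 100, 9],
    [9, 9, 3],
]
pairs = optimal_matching(figure_caption_cost)
total = sum(figure_caption_cost[i][j] for i, j in enumerate(pairs))
```

With these assumed costs, a greedy pass that gives each figure its nearest unused caption in row order reaches a total of 19, while the matching above reaches 7. That gap is the reason a globally optimal matching is attractive for association tasks of this kind.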

    Published In

    JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
    June 2011
    500 pages
    ISBN:9781450307444
    DOI:10.1145/1998076

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bipartite graph
    2. electronic book
    3. layout analysis
    4. structure extraction

    Qualifiers

    • Research-article

    Conference

    JCDL '11
    JCDL '11: Joint Conference on Digital Libraries
    June 13 - 17, 2011
    Ottawa, Ontario, Canada

    Acceptance Rates

    Overall Acceptance Rate 415 of 1,482 submissions, 28%

    Article Metrics

    • Downloads (Last 12 months): 37
    • Downloads (Last 6 weeks): 6
    Reflects downloads up to 25 Feb 2025

    Cited By

    • (2023) Algorithms for extracting lines, paragraphs with their properties in PDF documents. E3S Web of Conferences 389 (08024). DOI: 10.1051/e3sconf/202338908024. Online publication date: 31-May-2023
    • (2021) Knowledge models from PDF textbooks. New Review of Hypermedia and Multimedia (1-49). DOI: 10.1080/13614568.2021.1889692. Online publication date: 28-Feb-2021
    • (2021) Automatic Text Extraction from Digital Brochures: Achieving Competitiveness for Mauritius Supermarkets. Soft Computing and its Engineering Applications (234-248). DOI: 10.1007/978-981-16-0708-0_20. Online publication date: 5-Mar-2021
    • (2021) Boosting training for PDF malware classifier via active learning. International Journal of Intelligent Systems. DOI: 10.1002/int.22451. Online publication date: 16-May-2021
    • (2020) Order out of Chaos. Proceedings of the ACM Symposium on Document Engineering 2020 (1-10). DOI: 10.1145/3395027.3419585. Online publication date: 29-Sep-2020
    • (2019) A Collaborative Framework for Structure Identification over Print Documents. Proceedings of the Workshop on Human-In-the-Loop Data Analytics (1-8). DOI: 10.1145/3328519.3329131. Online publication date: 5-Jul-2019
    • (2018) Extracting Learning Outcomes Using Machine Learning and White Space Analysis. Proceedings of the 4th EAI International Conference on Smart Objects and Technologies for Social Good (7-12). DOI: 10.1145/3284869.3284879. Online publication date: 28-Nov-2018
    • (2017) A INFLUÊNCIA DA TIPOGRAFIA NA USABIILIDADE: UMA REVISÃO SISTEMÁTICA PRELIMINAR DA LITERATURA. Blucher Design Proceedings (1987-1996). DOI: 10.5151/16ergodesign-0209. Online publication date: Aug-2017
    • (2017) A survey on scholarly data. Information Processing and Management: an International Journal 53:4 (923-944). DOI: 10.1016/j.ipm.2017.03.006. Online publication date: 1-Jul-2017
    • (2017) Layout-Aware Semi-automatic Information Extraction for Pharmaceutical Documents. Data Integration in the Life Sciences (71-85). DOI: 10.1007/978-3-319-69751-2_8. Online publication date: 24-Oct-2017
