
Structure extraction from PDF-based book documents

Authors:

Yongtao Wang

JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Pages 11 - 20

https://doi.org/10.1145/1998076.1998079

Published: 13 June 2011

Abstract

PDF documents have become a dominant knowledge repository in both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting structure information have yet to be fully explored, and the lack of effective methods hinders the reuse of information in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each task. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
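
To make the bipartite-graph idea concrete, here is a minimal sketch of figure and caption association posed as an assignment problem. It is not the authors' implementation: the abstract does not say how edge weights are computed or which matching algorithm is used, so the cost values, the function names `optimal_matching` and `greedy_matching`, and the exhaustive search over permutations are all assumptions made for illustration. Exhaustive search finds the global optimum but is only practical for a handful of nodes; a polynomial-time method such as the Hungarian algorithm would be the usual choice at scale.

```python
from itertools import permutations


def optimal_matching(cost):
    """Return (total_cost, pairs) for the cheapest one-to-one assignment.

    cost[i][j] is the (hypothetical) weight of the edge between left node i,
    e.g. a figure, and right node j, e.g. a caption candidate.
    Assumes len(cost) <= len(cost[0]). Exhaustive, so small inputs only.
    """
    n_left = len(cost)
    n_right = len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every way of giving each left node a distinct right node.
    for cols in permutations(range(n_right), n_left):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best_total:
            best_total = total
            best_pairs = list(enumerate(cols))
    return best_total, best_pairs


def greedy_matching(cost):
    """Baseline: each left node, in order, takes its cheapest unused right node."""
    used = set()
    pairs = []
    total = 0
    for i, row in enumerate(cost):
        j = min((j for j in range(len(row)) if j not in used), key=lambda j: row[j])
        used.add(j)
        pairs.append((i, j))
        total += row[j]
    return total, pairs


# Made-up costs: rows are two figures, columns are two caption candidates.
cost = [
    [1, 2],
    [2, 10],
]
print(optimal_matching(cost))  # (4, [(0, 1), (1, 0)])
print(greedy_matching(cost))   # (11, [(0, 0), (1, 1)])
```

In this toy case the greedy baseline lets the first figure take its locally cheapest caption and forces the second into an expensive pairing, while the global search accepts a slightly worse first pairing for a much lower total. That difference is the motivation for seeking a global optimum rather than matching elements one at a time.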

Index Terms

Structure extraction from PDF-based book documents
1. Information systems
  1. Information retrieval

Published In

JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

June 2011

500 pages

ISBN:9781450307444

DOI:10.1145/1998076

General Chair: Glen Newton (Carleton University, Canada)

Program Chairs: Michael Wright (UCAR/NCAR, USA), Lillian Cassel (Villanova University, USA)

Copyright © 2011 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

JCDL '11: Joint Conference on Digital Libraries

June 13 - 17, 2011

Ottawa, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
