ABSTRACT
PDF documents have become a dominant knowledge repository in both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting their structural information remain underexplored, and the lack of effective methods hinders reuse of the information in PDF documents. To improve the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. This paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each task. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
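To make the bipartite-graph formulation concrete, the following is a minimal sketch of figure and caption association posed as a minimum-cost matching. It is an illustration under stated assumptions, not the paper's implementation: the bounding-box format `(x0, y0, x1, y1)`, the center-distance cost, and the function names `center`, `cost`, and `optimal_matching` are all hypothetical, and the exhaustive search stands in for a proper optimal matching algorithm (such as Kuhn-Munkres), which would be needed for anything beyond a handful of elements.

```python
from itertools import permutations
from math import hypot


def center(box):
    """Center point of a bounding box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def cost(figure, caption):
    """Assumed edge weight: Euclidean distance between box centers."""
    (fx, fy), (cx, cy) = center(figure), center(caption)
    return hypot(fx - cx, fy - cy)


def optimal_matching(figures, captions):
    """Assign each figure a distinct caption so the total cost is minimal.

    Brute force over all assignments; assumes len(figures) <= len(captions)
    and that both lists are small.
    """
    best_pairs, best_total = None, float("inf")
    for perm in permutations(range(len(captions)), len(figures)):
        total = sum(cost(figures[i], captions[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_pairs, best_total = list(enumerate(perm)), total
    return best_pairs, best_total


# Two figures stacked on a page, with captions listed in a different order.
figures = [(0, 0, 100, 100), (0, 200, 100, 300)]
captions = [(0, 310, 100, 330), (0, 110, 100, 130)]
pairs, total = optimal_matching(figures, captions)
print(pairs, total)
```

Each returned pair is `(figure_index, caption_index)`. Because the total cost is minimized over all assignments at once, the result is a global optimum for the chosen cost, rather than the outcome of pairing each figure greedily with its nearest caption.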
Index Terms
- Structure extraction from PDF-based book documents