A Study on Information Extraction from PDF Files

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3930)

Abstract

Portable Document Format (PDF) is increasingly recognized as a common format for electronic documents. Managing and indexing PDF files first requires extracting information from them. This paper describes an approach to that extraction. The key idea is to transform the text parsed from PDF files into semi-structured information by injecting additional uniform tags. An extensible rule set is built on these tags and other knowledge. Guided by the rules, a pattern-matching algorithm based on a tree model is applied to obtain the required information. An experiment showed that the method was effective.
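
The pipeline outlined in the abstract (tag injection, a rule set over tags, tree-based pattern matching) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag names, the heuristics in `TAGGERS`, the rule format in `RULES`, and every function name are hypothetical choices made only for illustration, and the input is assumed to be plain text lines already parsed from a PDF by some external tool.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the (hypothetical) tagged document tree."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


# Hypothetical tagging heuristics: (tag, regex) pairs tried in order.
TAGGERS = [
    ("email", re.compile(r"^\S+@\S+\.\S+$")),
    ("heading", re.compile(r"^(Abstract|References|\d+(\.\d+)*\s+\S.*)$")),
]

# Hypothetical rules: field -> (pattern over sibling nodes, index to capture).
# Each pattern element is (tag, optional regex the node text must fully match).
RULES = {
    "title": ([("title", None)], 0),
    "email": ([("email", None)], 0),
    "abstract": ([("heading", r"Abstract"), ("para", None)], 1),
}


def tag_lines(lines):
    """Inject a uniform tag around each non-empty line, building a flat tree."""
    root = Node("doc")
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not root.children:
            tag = "title"  # assumption: the first non-empty line is the title
        else:
            tag = "para"   # default when no heuristic matches
            for name, pattern in TAGGERS:
                if pattern.match(line):
                    tag = name
                    break
        root.children.append(Node(tag, line))
    return root


def match_rule(root, pattern, capture):
    """Slide the pattern over the root's children; return the captured text."""
    kids = root.children
    for i in range(len(kids) - len(pattern) + 1):
        window = kids[i:i + len(pattern)]
        if all(node.tag == tag and (rx is None or re.fullmatch(rx, node.text))
               for node, (tag, rx) in zip(window, pattern)):
            return window[capture].text
    return None


def extract(lines):
    """Apply every rule to the tagged tree and collect the extracted fields."""
    root = tag_lines(lines)
    return {name: match_rule(root, pat, cap) for name, (pat, cap) in RULES.items()}
```

Calling `extract` on a list of text lines returns a dictionary with one entry per rule, or `None` for a field whose pattern does not match. Keeping the rules in a plain dictionary mirrors the idea of an extensible rule set: supporting a new field means adding an entry, not changing the matcher.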

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yuan, F., Liu, B., Yu, G. (2006). A Study on Information Extraction from PDF Files. In: Yeung, D.S., Liu, ZQ., Wang, XZ., Yan, H. (eds) Advances in Machine Learning and Cybernetics. Lecture Notes in Computer Science, vol 3930. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11739685_27

  • DOI: https://doi.org/10.1007/11739685_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33584-9

  • Online ISBN: 978-3-540-33585-6

  • eBook Packages: Computer Science, Computer Science (R0)
