A Study on Information Extraction from PDF Files

Yuan, Fang; Liu, Bo; Yu, Ge

doi:10.1007/11739685_27

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3930)

Abstract

Portable Document Format (PDF) is increasingly recognized as a common format for electronic documents. Managing and indexing PDF files first requires extracting information from them. This paper describes an approach to that extraction. The key idea is to transform the text parsed from PDF files into semi-structured information by injecting additional uniform tags. An extensible rule set is built on these tags and other knowledge. Guided by the rules, a pattern-matching algorithm based on a tree model is applied to obtain the required information. A further experiment showed that the method was effective.
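
The pipeline outlined in the abstract (tag the parsed text, build a tree, match rules against it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the tag set, the rule syntax, or the tree model, so the names `TAG_RULES`, `Node`, `tag_line`, `build_tree`, and `match`, the regular expressions, and the sample lines are all hypothetical. The sketch also assumes the text has already been extracted from the PDF into a list of lines.

```python
import re
from dataclasses import dataclass, field

# Hypothetical tagging rules: (tag name, regex tested against one text line).
# The first rule that matches wins; "TEXT" is the fallback tag.
TAG_RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("HEADING", re.compile(r"^(abstract|references|\d+(\.\d+)*\s+\S.*)$",
                           re.IGNORECASE)),
]

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def tag_line(line):
    """Assign a uniform tag to one line of parsed text."""
    for tag, pattern in TAG_RULES:
        if pattern.search(line):
            return tag
    return "TEXT"

def build_tree(lines):
    """Group tagged lines under the most recent HEADING line."""
    root = Node("DOC")
    current = Node("SECTION", "front matter")
    root.children.append(current)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        tag = tag_line(line)
        if tag == "HEADING":
            current = Node("SECTION", line)
            root.children.append(current)
        else:
            current.children.append(Node(tag, line))
    return root

def match(node, path):
    """Follow a tag path from `node` and return the texts it reaches.
    Each step is (tag, text_regex or None)."""
    if not path:
        return [node.text]
    (tag, text_re), rest = path[0], path[1:]
    found = []
    for child in node.children:
        if child.tag == tag and (
            text_re is None or re.search(text_re, child.text, re.IGNORECASE)
        ):
            found.extend(match(child, rest))
    return found

# Made-up sample lines standing in for text parsed from a PDF.
sample = [
    "A Sample Title",
    "author@example.org",
    "Abstract",
    "PDF is a common document format.",
    "1 Introduction",
    "Extraction comes before indexing.",
]
tree = build_tree(sample)
abstract_text = match(tree, [("SECTION", r"^abstract$"), ("TEXT", None)])
emails = match(tree, [("SECTION", r"^front matter$"), ("EMAIL", None)])
```

In this sketch a rule is just a path of tags with optional text constraints, so extending the rule set means adding entries to `TAG_RULES` or writing new paths; a real system would presumably also draw on layout cues such as font size and position, which plain text lines do not carry.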

Author information

Authors and Affiliations

College of Mathematics and Computer Science, Hebei University, Baoding, Hebei, 071002, P.R. China
Fang Yuan & Bo Liu
College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, P.R. China
Fang Yuan & Ge Yu

Editor information

Editors and Affiliations

Department of Computing, Hong Kong Polytechnic University, P.O. Box, Hong Kong, China
Daniel S. Yeung
School of Creative Media, City University of Hong Kong, China
Zhi-Qiang Liu
Department of Mathematics and Computer Science, Hebei University, 071002, Baoding, Hebei, P.R. China
Xi-Zhao Wang
School of Electrical and Information Engineering, University of Sydney, 2006, NSW, Australia
Hong Yan

About this paper

Cite this paper

Yuan, F., Liu, B., Yu, G. (2006). A Study on Information Extraction from PDF Files. In: Yeung, D.S., Liu, ZQ., Wang, XZ., Yan, H. (eds) Advances in Machine Learning and Cybernetics. Lecture Notes in Computer Science, vol 3930. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11739685_27

DOI: https://doi.org/10.1007/11739685_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33584-9
Online ISBN: 978-3-540-33585-6
eBook Packages: Computer Science, Computer Science (R0)
