Header Metadata Extraction from Semi-structured Documents Using Template Matching

Huang, Zewu; Jin, Hai; Yuan, Pingpeng; Han, Zongfen

doi:10.1007/11915072_84

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4278)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for extracting header metadata from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with format information. The templates guide a finite state automaton (FSA) to extract the header metadata of papers. The test results indicate that our approach extracts metadata effectively, incurs no training cost, and is applicable to some special situations. This approach can assist automatic index creation in fields such as digital libraries, information retrieval, and data mining.
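The abstract does not give the template format or the automaton's transition rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each header line has already been extracted from the PDF as a string with format attributes, that a template is an ordered list of fields with a format predicate each, and that the automaton has one state per field and may only move forward. The names `Line`, `Field`, `TEMPLATE`, `extract_header`, and the font-size thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Line:
    """One text line with format attributes (assumed already extracted from the PDF)."""
    text: str
    font_size: float
    bold: bool = False


@dataclass
class Field:
    """One template entry: a metadata field and the format its lines must have."""
    name: str
    matches: Callable[[Line], bool]


# Hypothetical template for one paper layout; the thresholds are illustrative only.
TEMPLATE: List[Field] = [
    Field("title", lambda ln: ln.font_size >= 14),
    Field("authors", lambda ln: 10 <= ln.font_size < 14 and not ln.bold),
    Field("affiliation", lambda ln: ln.font_size < 10),
]


def is_abstract_heading(ln: Line) -> bool:
    """Assumed stop condition: the header ends where the abstract begins."""
    return ln.text.strip().lower().startswith("abstract")


def extract_header(lines: List[Line], template: List[Field]) -> Dict[str, str]:
    """Run a forward-only automaton whose states are the template fields."""
    collected: Dict[str, List[str]] = {f.name: [] for f in template}
    state = 0  # index of the current field in the template
    for ln in lines:
        if is_abstract_heading(ln):
            break
        # Stay in the current state if it accepts the line; otherwise
        # advance to the first later state that does.
        nxt = state
        while nxt < len(template) and not template[nxt].matches(ln):
            nxt += 1
        if nxt == len(template):
            continue  # no state accepts this line; skip it
        state = nxt
        collected[template[state].name].append(ln.text)
    return {name: " ".join(parts) for name, parts in collected.items()}


# Sample input built from this paper's own header; font sizes are made up.
SAMPLE = [
    Line("Header Metadata Extraction from Semi-structured Documents", 16, True),
    Line("Using Template Matching", 16, True),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 11),
    Line("Cluster and Grid Computing Lab, Huazhong University of Science and Technology", 9),
    Line("Abstract. With the recent proliferation of documents ...", 9, True),
]

header = extract_header(SAMPLE, TEMPLATE)
```

In this sketch, supporting a different layout means writing another template rather than retraining a model, which is one way to read the abstract's claim that the method has no training cost.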

This paper is supported by the National 973 Key Basic Research Program under grant No.2003CB317003, and the Cultivation Fund of the Key Scientific and Technical Innovation Project, Ministry of Education of China under grant No.705034.

An erratum to this chapter can be found at http://dx.doi.org/10.1007/11915072_109.

Author information

Authors and Affiliations

Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, 430074, China
Zewu Huang, Hai Jin, Pingpeng Yuan & Zongfen Han

Editor information

Editors and Affiliations

STARLab, Vrije Universiteit Brussel (VUB), Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, 3001, Melbourne, VIC, Australia
Zahir Tari
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, 28660, Boadilla del Monte, Madrid, Spain
Pilar Herrero

About this paper

Cite this paper

Huang, Z., Jin, H., Yuan, P., Han, Z. (2006). Header Metadata Extraction from Semi-structured Documents Using Template Matching. In: Meersman, R., Tari, Z., Herrero, P. (eds) On the Move to Meaningful Internet Systems 2006: OTM 2006 Workshops. OTM 2006. Lecture Notes in Computer Science, vol 4278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11915072_84

DOI: https://doi.org/10.1007/11915072_84
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48273-4
Online ISBN: 978-3-540-48276-5
eBook Packages: Computer Science, Computer Science (R0)