Metadata Extraction from Bibliographies Using Bigram HMM

Yin, Ping; Zhang, Ming; Deng, ZhiHong; Yang, DongQing

doi:10.1007/978-3-540-30544-6_33

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3334)

Abstract

In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good approach for organizing and retrieving these useful resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is meaningful and has been widely studied. In this paper, we utilize a bigram HMM (Hidden Markov Model) for automatic extraction of metadata (i.e. title, author, date, journal, pages, etc.) from bibliographies with various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers both the bigram sequential relation between words and their position information in text fields. We have evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best result and seems to be the most promising candidate for metadata extraction from bibliographies.
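
The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the emission probability of a word is a linear interpolation of a bigram estimate P(word | field, previous word) and the usual unigram estimate P(word | field), and that the first word of a citation falls back to the unigram estimate. The class name BigramHMM, the interpolation weight lam, the add-alpha smoothing, and the toy training data are all hypothetical; the position information mentioned in the abstract is not modelled here.

```python
import math
from collections import Counter, defaultdict


class BigramHMM:
    """Toy HMM tagger whose emissions also depend on the previous word (assumed design)."""

    def __init__(self, lam=0.5, alpha=0.1):
        self.lam = lam      # assumed weight of the bigram emission estimate
        self.alpha = alpha  # assumed add-alpha smoothing constant
        self.start = Counter()            # field -> count as first field
        self.trans = defaultdict(Counter)  # field -> next-field counts
        self.uni = defaultdict(Counter)    # field -> word counts
        self.bi = defaultdict(Counter)     # (field, previous word) -> word counts
        self.vocab, self.states = set(), set()

    def fit(self, sequences):
        """Count statistics from sequences of (word, field) pairs."""
        for seq in sequences:
            prev_state = prev_word = None
            for word, state in seq:
                self.vocab.add(word)
                self.states.add(state)
                self.uni[state][word] += 1
                if prev_state is None:
                    self.start[state] += 1
                else:
                    self.trans[prev_state][state] += 1
                    self.bi[(state, prev_word)][word] += 1
                prev_state, prev_word = state, word
        return self

    def _p(self, counter, key, size):
        # Add-alpha smoothed relative frequency over `size` possible outcomes.
        return (counter[key] + self.alpha) / (sum(counter.values()) + self.alpha * size)

    def _emit(self, state, word, prev_word):
        v = len(self.vocab) + 1  # one extra slot for unseen words
        p_uni = self._p(self.uni[state], word, v)
        if prev_word is None:    # first word: unigram estimate only
            return math.log(p_uni)
        p_bi = self._p(self.bi[(state, prev_word)], word, v)
        return math.log(self.lam * p_bi + (1 - self.lam) * p_uni)

    def decode(self, words):
        """Viterbi decoding: most likely field label for each word."""
        if not words:
            return []
        states, n = sorted(self.states), len(self.states)
        score = {s: math.log(self._p(self.start, s, n)) + self._emit(s, words[0], None)
                 for s in states}
        back = []
        for i in range(1, len(words)):
            new_score, ptr = {}, {}
            for s in states:
                best = max(states,
                           key=lambda r: score[r] + math.log(self._p(self.trans[r], s, n)))
                new_score[s] = (score[best]
                                + math.log(self._p(self.trans[best], s, n))
                                + self._emit(s, words[i], words[i - 1]))
                ptr[s] = best
            score = new_score
            back.append(ptr)
        path = [max(states, key=score.get)]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return path[::-1]


# Hypothetical toy training data: tokenised citations with field labels.
train = [
    [("Smith", "author"), ("J", "author"), ("Hidden", "title"),
     ("Markov", "title"), ("Models", "title"), ("1999", "date")],
    [("Lee", "author"), ("K", "author"), ("Markov", "title"),
     ("Chains", "title"), ("2001", "date")],
]
model = BigramHMM().fit(train)
path = model.decode(["Smith", "J", "Hidden", "Markov", "Models", "1999"])
print(path)
```

Setting lam to 0 in this sketch reduces it to an ordinary HMM that uses only per-field word frequencies, which is the baseline the abstract contrasts against.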

Author information

Authors and Affiliations

School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Ping Yin, Ming Zhang, ZhiHong Deng & DongQing Yang

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, Shanghai, P.R. China
Zhaoneng Chen
Department of Management Information Systems, Eller College of Management, The University of Arizona, 85721, AZ, USA
Hsinchun Chen
Shanghai Library, Shanghai, P.R. China
Qihao Miao
BASICS, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Yuxi Fu
Digital Library Research Laboratory, Virginia Tech, USA
Edward Fox
School of Computer Engineering, Nanyang Technological University,
Ee-peng Lim

About this paper

Cite this paper

Yin, P., Zhang, M., Deng, Z., Yang, D. (2004). Metadata Extraction from Bibliographies Using Bigram HMM. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, Ep. (eds) Digital Libraries: International Collaboration and Cross-Fertilization. ICADL 2004. Lecture Notes in Computer Science, vol 3334. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30544-6_33

DOI: https://doi.org/10.1007/978-3-540-30544-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24030-3
Online ISBN: 978-3-540-30544-6
eBook Packages: Computer Science, Computer Science (R0)
