Metadata Extraction from Bibliographies Using Bigram HMM

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3334)

Abstract

In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good approach for organizing and retrieving these resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is meaningful and has been widely studied. In this paper, we use a bigram HMM (Hidden Markov Model) to automatically extract metadata (e.g., title, author, date, journal, and pages) from bibliographies in various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers both the bigram sequential relation between words and position information within text fields. We evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best results and seems to be the most promising candidate for metadata extraction from bibliographies.
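
The abstract does not give the model's formulas, so the following is only a minimal sketch of how a bigram-style HMM tagger for reference strings could look, not the authors' implementation. It rests on several assumptions of ours: each hidden state is a metadata field; the emission probability of a word is conditioned on the previous word when both lie in the same field, and falls back to a unigram estimate for the first word of a field (a crude stand-in for position information); probabilities use add-one smoothing; and the interpolation weight `lam = 0.7` is arbitrary. The names `train`, `log_trans`, `log_emit`, `viterbi`, and the `<s>` start marker are hypothetical, and the training data is a toy example.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical start-of-reference marker


def train(tagged_refs):
    """Count transitions, unigram emissions, and within-field bigram emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # prev_field -> field -> count
    uni = defaultdict(lambda: defaultdict(int))    # field -> word -> count
    bi = defaultdict(lambda: defaultdict(int))     # (field, prev_word) -> word -> count
    vocab = set()
    for ref in tagged_refs:
        prev_field, prev_word = START, None
        for word, field in ref:
            trans[prev_field][field] += 1
            uni[field][word] += 1
            if field == prev_field:  # bigrams are counted only inside one field
                bi[(field, prev_word)][word] += 1
            vocab.add(word)
            prev_field, prev_word = field, word
    return {"trans": trans, "uni": uni, "bi": bi,
            "fields": sorted(uni),
            "vocab_size": len(vocab) + 1}  # +1 reserves mass for unknown words


def log_trans(model, prev_field, field):
    """Add-one smoothed log P(field | prev_field)."""
    counts = model["trans"].get(prev_field, {})
    return math.log((counts.get(field, 0) + 1)
                    / (sum(counts.values()) + len(model["fields"])))


def log_emit(model, field, word, prev_word, same_field, lam=0.7):
    """Log emission: unigram at a field start, bigram/unigram mix inside a field."""
    counts = model["uni"][field]
    p_uni = (counts.get(word, 0) + 1) / (sum(counts.values()) + model["vocab_size"])
    if not same_field:
        return math.log(p_uni)
    following = model["bi"].get((field, prev_word), {})
    total = sum(following.values())
    p_bi = following.get(word, 0) / total if total else 0.0
    return math.log(lam * p_bi + (1 - lam) * p_uni)


def viterbi(model, words):
    """Return the most probable field label for each word."""
    if not words:
        return []
    fields = model["fields"]
    # score[f] = best log-probability of any labelling that ends in field f
    score = {f: log_trans(model, START, f)
                + log_emit(model, f, words[0], None, False)
             for f in fields}
    back = []  # back[t-1][f] = best predecessor of field f at position t
    for t in range(1, len(words)):
        new_score, pointers = {}, {}
        for f in fields:
            candidates = [(score[p] + log_trans(model, p, f)
                           + log_emit(model, f, words[t], words[t - 1], p == f), p)
                          for p in fields]
            new_score[f], pointers[f] = max(candidates)
        score = new_score
        back.append(pointers)
    path = [max(fields, key=lambda f: score[f])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training data: two hand-labelled, pre-tokenized references.
training = [
    [("Rabiner", "author"), ("L", "author"), ("A", "title"), ("tutorial", "title"),
     ("on", "title"), ("hidden", "title"), ("markov", "title"), ("models", "title"),
     ("1989", "date")],
    [("Leek", "author"), ("T", "author"), ("Information", "title"),
     ("extraction", "title"), ("using", "title"), ("hidden", "title"),
     ("markov", "title"), ("models", "title"), ("1997", "date")],
]
model = train(training)
print(viterbi(model, ["Leek", "T", "hidden", "markov", "models", "1997"]))
```

On this toy data the decoded labels are author, author, title, title, title, date. Because the previous word is observed rather than hidden, conditioning the emission on it does not enlarge the state space, so ordinary Viterbi decoding still applies. A real system would need proper tokenization, more fields, and a much larger labelled corpus.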

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yin, P., Zhang, M., Deng, Z., Yang, D. (2004). Metadata Extraction from Bibliographies Using Bigram HMM. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, Ep. (eds) Digital Libraries: International Collaboration and Cross-Fertilization. ICADL 2004. Lecture Notes in Computer Science, vol 3334. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30544-6_33

  • DOI: https://doi.org/10.1007/978-3-540-30544-6_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24030-3

  • Online ISBN: 978-3-540-30544-6

  • eBook Packages: Computer Science, Computer Science (R0)
