
An Improved Hidden Markov Model for Literature Metadata Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6215)

Abstract

In this paper, we propose an improved Hidden Markov Model (HMM) for extracting metadata from academic literature. We built a dataset of 458 papers from the VLDB conferences that includes the visual features of text blocks. Our approach rests on the assumption that text blocks on the same line share the same state (information type), which holds in more than 98% of cases. Consequently, the probability of a transition between identical states is much higher within a line than across lines. Based on this observation, we add an additional state transition matrix to the HMM and modify the Viterbi algorithm accordingly. Experiments show that our extraction accuracy exceeds that of any existing work.
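The abstract does not give the details of the modified Viterbi algorithm, so the following is only a minimal sketch of one plausible reading: decoding uses one transition matrix for consecutive text blocks on the same line and another for blocks separated by a line break. The function name `viterbi_two_matrices`, the `same_line` flags, and all parameter names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def viterbi_two_matrices(obs, same_line, start, trans_same, trans_diff, emit):
    """Most likely state path when the transition matrix depends on line breaks.

    Hypothetical interface (not taken from the paper):
    obs        : observation index for each text block
    same_line  : same_line[t] is True if block t is on the same line as
                 block t-1 (same_line[0] is ignored)
    start      : (S,) initial state probabilities
    trans_same : (S, S) transition matrix used within a line
    trans_diff : (S, S) transition matrix used across a line break
    emit       : (S, O) emission probabilities
    """
    eps = 1e-12  # guards against log(0)
    log_start = np.log(np.asarray(start, dtype=float) + eps)
    log_same = np.log(np.asarray(trans_same, dtype=float) + eps)
    log_diff = np.log(np.asarray(trans_diff, dtype=float) + eps)
    log_emit = np.log(np.asarray(emit, dtype=float) + eps)

    n_blocks, n_states = len(obs), len(log_start)
    score = np.empty((n_blocks, n_states))
    back = np.zeros((n_blocks, n_states), dtype=int)

    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_blocks):
        # The only change from standard Viterbi: pick the matrix per step.
        log_trans = log_same if same_line[t] else log_diff
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: state i -> j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(n_blocks - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Under this reading, the reported 98% figure would correspond to a within-line matrix that is close to diagonal, so the decoder rarely changes state inside a line, while the cross-line matrix remains free to model changes of information type at line breaks.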




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cui, BG., Chen, X. (2010). An Improved Hidden Markov Model for Literature Metadata Extraction. In: Huang, DS., Zhao, Z., Bevilacqua, V., Figueroa, J.C. (eds) Advanced Intelligent Computing Theories and Applications. ICIC 2010. Lecture Notes in Computer Science, vol 6215. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14922-1_26


  • DOI: https://doi.org/10.1007/978-3-642-14922-1_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14921-4

  • Online ISBN: 978-3-642-14922-1

  • eBook Packages: Computer Science, Computer Science (R0)
