
Two-Tier Machine Learning Using Conditional Random Fields with Constraints

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2013)

Abstract

This paper presents a novel two-tier machine learning approach for locating bibliographic references in HTML and separating them into fields. First, it is demonstrated how Conditional Random Fields (CRFs) with constraints can be used to split bibliographic references into fields, e.g. authors and title. For this purpose, a unique feature set, constraints, and a method for automatic keyword extraction are introduced. This reference-tagging CRF, together with Part Of Speech (POS) analysis and Named Entity Recognition (NER), forms the first tier; the outputs of these three components are then used to locate the bibliographic reference section itself. To do so, the documents are split into blocks, which are then classified. For this classification task, a Support Vector Machine (SVM) approach is compared with a CRF-based one. We demonstrate that this two-tier approach achieves very good results, while the reference tagging approach is able to compete with other state-of-the-art approaches.
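The two-tier data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained CRF, POS, and NER components of the first tier and the SVM or CRF of the second tier are replaced by simple rule-based stand-ins, and all names, labels, regular expressions, and the threshold are hypothetical choices made only for illustration.

```python
import re
from typing import Dict, List

# Hypothetical label set; the paper's actual field inventory is not given here.
FIELD_LABELS = ("year", "pages", "name_like", "other")


def tier1_tag(token: str) -> str:
    """Rule-based stand-in for the first-tier taggers (reference CRF, POS, NER)."""
    if re.fullmatch(r"\(?(19|20)\d{2}\)?[.,:]?", token):
        return "year"
    if re.fullmatch(r"\d+[-–]\d+[.,]?", token):
        return "pages"
    if re.fullmatch(r"[A-Z][a-z]+,|([A-Z]\.)+[,:]?", token):
        return "name_like"
    return "other"


def block_features(block: str) -> Dict[str, float]:
    """Summarise the first-tier output of one block as label fractions."""
    tokens = block.split()
    counts = {label: 0 for label in FIELD_LABELS}
    for token in tokens:
        counts[tier1_tag(token)] += 1
    total = max(len(tokens), 1)  # avoid division by zero for empty blocks
    return {label: counts[label] / total for label in FIELD_LABELS}


def tier2_is_reference_block(features: Dict[str, float],
                             threshold: float = 0.3) -> bool:
    """Threshold stand-in for the second-tier classifier (SVM or CRF in the paper)."""
    field_share = features["year"] + features["pages"] + features["name_like"]
    return field_share >= threshold


def locate_reference_blocks(blocks: List[str]) -> List[int]:
    """Return the indices of blocks that the second tier flags as references."""
    return [i for i, block in enumerate(blocks)
            if tier2_is_reference_block(block_features(block))]
```

Calling `locate_reference_blocks` on a list of text blocks returns the positions of those blocks whose share of reference-like tokens reaches the assumed threshold. The point of the sketch is only the layering: the second tier never sees raw text, only features derived from the first tier's output.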

Author information

Corresponding author

Correspondence to Sebastian Lindner.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lindner, S. (2015). Two-Tier Machine Learning Using Conditional Random Fields with Constraints. In: Fred, A., Dietz, J., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2013. Communications in Computer and Information Science, vol 454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46549-3_6

  • DOI: https://doi.org/10.1007/978-3-662-46549-3_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46548-6

  • Online ISBN: 978-3-662-46549-3

  • eBook Packages: Computer Science, Computer Science (R0)
