Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2017)

Abstract

The extraction of individual reference strings from the reference section of scientific publications is an important step in the citation extraction pipeline. Current approaches divide this task into two steps: they first detect the reference section areas and then group the text lines in those areas into reference strings. We propose a classification model that considers every line in a publication as a potential part of a reference string. Applying line-based conditional random fields, rather than constructing the graphical model from individual words, lets dependencies and patterns that are typical of reference sections serve as strong features while reducing the overall complexity of the model. We evaluated our novel approach, RefExt, against several state-of-the-art tools (CERMINE, GROBID, and ParsCit) using a gold standard that consists of 100 German-language full-text publications from the social sciences. The evaluation demonstrates that RefExt outperforms state-of-the-art tools that rely on the identification of reference section areas.
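
To make the line-based view concrete, here is a minimal Python sketch of two pieces such a pipeline might contain: per-line features for a sequence labeler, and the step that groups labeled lines into reference strings. It is an illustration only, not the RefExt implementation. The label names (`B-REF`, `I-REF`, `O`), the function names, and the chosen features are all assumptions made for this example, and the CRF that would predict the labels is left out.

```python
import re
from typing import Dict, List

# Hypothetical label scheme (an assumption for this sketch):
# "B-REF" = first line of a reference string,
# "I-REF" = continuation line, "O" = any other line.


def line_features(line: str) -> Dict[str, bool]:
    """Compute a few illustrative features for one text line."""
    stripped = line.strip()
    return {
        # A four-digit year such as 1998 or 2005 appears in the line.
        "has_year": bool(re.search(r"\b(1[89]|20)\d{2}\b", stripped)),
        "starts_with_capital": stripped[:1].isupper(),
        "ends_with_period": stripped.endswith("."),
        # Two numbers joined by a hyphen or en dash, e.g. a page range.
        "has_page_range": bool(re.search(r"\d+\s*[-–]\s*\d+", stripped)),
    }


def group_reference_strings(lines: List[str], labels: List[str]) -> List[str]:
    """Merge consecutive B-REF/I-REF lines into reference strings."""
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")
    references: List[str] = []
    current: List[str] = []
    for line, label in zip(lines, labels):
        if label == "B-REF":
            # A new reference starts; close the previous one if open.
            if current:
                references.append(" ".join(current))
            current = [line.strip()]
        elif label == "I-REF" and current:
            current.append(line.strip())
        else:
            # "O" lines (and I-REF lines with no open reference) end
            # the current reference and are not collected themselves.
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


if __name__ == "__main__":
    lines = [
        "Introduction text.",
        "Moed, H.F.: Citation Analysis in Research",
        "Evaluation. Springer (2005)",
        "Page 12",
        "Koller, D.: Probabilistic Graphical Models. MIT Press (2009)",
    ]
    labels = ["O", "B-REF", "I-REF", "O", "B-REF"]
    for ref in group_reference_strings(lines, labels):
        print(ref)
```

Because every line of the publication receives a label, no separate reference section detection step is needed in this sketch: lines outside a reference simply carry the `O` label.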

Notes

  1. https://west.uni-koblenz.de/en/research/excite.

  2. https://www.zotero.org/styles/.

  3. This approach was described in a preprint by Körner [3].

  4. https://github.com/exciteproject/refext.

  5. https://github.com/exciteproject/amsd2017.

  6. http://www.ssoar.info/.

  7. https://github.com/exciteproject/ssoar-gold-standard.

  8. https://www.crossref.org/labs/pdfextract.

  9. http://www.ssoar.info/ssoar/handle/document/32521 and http://www.ssoar.info/ssoar/handle/document/43525.

  10. Shown as the first example in Fig. 1.

  11. http://www.hait.tu-dresden.de/td/home.asp.

  12. http://suedosteuropaeische-hefte.org/.

References

  1. Hienert, D., Sawitzki, F., Mayr, P.: Digital library research in action-supporting information retrieval in Sowiport. D-Lib Mag. 21(3/4) (2015)

  2. Moed, H.F.: Citation Analysis in Research Evaluation, vol. 9. Springer, Dordrecht (2005)

  3. Körner, M.: Reference String Extraction Using Line-Based Conditional Random Fields. ArXiv e-prints (2017)

  4. Peng, F., McCallum, A.: Information extraction from research papers using conditional random fields. Inf. Process. Manage. 42(4), 963–979 (2006)

  5. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: FLUX-CiM: flexible unsupervised extraction of citation metadata. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 215–224. ACM (2007)

  6. Groza, T., Grimnes, G.A., Handschuh, S.: Reference information extraction and processing using conditional random fields. Inf. Technol. Libr. (Online) 31(2), 6 (2012)

  7. Lopez, P.: GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, pp. 473–474. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04346-8_62

  8. Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: an open-source CRF reference string parsing package. In: Proceedings of LREC, vol. 2008, pp. 661–667 (2008)

  9. Wu, J., Williams, K., Chen, H.H., Khabsa, M., Caragea, C., Ororbia, A., Jordan, D., Giles, C.L.: CiteSeerX: AI in a digital library search engine. In: AAAI, pp. 2930–2937 (2014)

  10. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, Ł.: CERMINE: automatic extraction of structured metadata from scientific literature. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 317–335 (2015)

  11. Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML, vol. 1, pp. 282–289 (2001)

  12. Houngbo, H., Mercer, R.E.: Method mention extraction from scientific research papers. In: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8–15 December 2012, Mumbai, India, pp. 1211–1222 (2012)

  13. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)

  14. Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the 2013 ACM symposium on Document engineering, pp. 177–180. ACM (2013)

  15. McCallum, A.K.: MALLET: a machine learning for language toolkit (2002)

Acknowledgements

This work has been funded by Deutsche Forschungsgemeinschaft (DFG) as part of the project “Extraction of Citations from PDF Documents (EXCITE)” under grant numbers MA 3964/8-1 and STA 572/14-1. We would like to thank Dominika Tkaczyk for her support regarding the CERMINE tool as well as Alexandra Bormann, Jan Hübner, and Daniel Kostić for contributing to the gold standard that was used in this research.

Author information

Corresponding author

Correspondence to Martin Körner.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., Staab, S. (2017). Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova, M., et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_15

  • DOI: https://doi.org/10.1007/978-3-319-67162-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67161-1

  • Online ISBN: 978-3-319-67162-8

  • eBook Packages: Computer Science, Computer Science (R0)
