Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications

Körner, Martin; Ghavimi, Behnam; Mayr, Philipp; Hartmann, Heinrich; Staab, Steffen

doi:10.1007/978-3-319-67162-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 767)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

The extraction of individual reference strings from the reference section of scientific publications is an important step in the citation extraction pipeline. Current approaches divide this task into two steps by first detecting the reference section areas and then grouping the text lines in such areas into reference strings. We propose a classification model that considers every line in a publication as a potential part of a reference string. By applying line-based conditional random fields rather than constructing the graphical model based on individual words, dependencies and patterns that are typical in reference sections provide strong features while the overall complexity of the model is reduced. We evaluated our novel approach RefExt against various state-of-the-art tools (CERMINE, GROBID, and ParsCit) and a gold standard which consists of 100 German language full text publications from the social sciences. The evaluation demonstrates that we are able to outperform state-of-the-art tools which rely on the identification of reference section areas.
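The line-based idea in the abstract can be illustrated with a minimal sketch. It is not the RefExt implementation: the label names (`B-REF`, `I-REF`, `O`), the feature set, and both function names are assumptions chosen for illustration, and the per-line labels are supplied by hand where a trained line-based CRF would predict them. The sketch shows only how each line becomes one observation with its own features and how a labeled line sequence is grouped into reference strings.

```python
import re
from typing import Dict, List


def line_features(line: str) -> Dict[str, object]:
    """Compute a few hypothetical features for one text line.

    In a line-based model each line, not each word, is one observation.
    These features are illustrative and not the feature set of the paper.
    """
    stripped = line.strip()
    return {
        "starts_with_capital": stripped[:1].isupper(),
        "contains_year": bool(re.search(r"\b(19|20)\d{2}\b", stripped)),
        "ends_with_period": stripped.endswith("."),
        "num_commas": stripped.count(","),
        "length": len(stripped),
    }


def group_reference_strings(lines: List[str], labels: List[str]) -> List[str]:
    """Merge lines into reference strings using per-line labels.

    Assumed label scheme: B-REF starts a reference string, I-REF continues
    the current one, and O marks any line outside a reference string.
    """
    references: List[str] = []
    current: List[str] = []
    for line, label in zip(lines, labels):
        if label == "B-REF":
            # A new reference begins; close the previous one if any.
            if current:
                references.append(" ".join(current))
            current = [line.strip()]
        elif label == "I-REF" and current:
            current.append(line.strip())
        else:
            # O (or a stray I-REF with no open reference) ends the current one.
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


# Hand-labeled example lines; a trained model would predict these labels.
example_lines = [
    "Literatur",
    "Moed, H.F.: Citation Analysis in Research",
    "Evaluation, vol. 9. Springer, Dordrecht (2005)",
    "Koller, D., Friedman, N.: Probabilistic Graphical Models. MIT Press (2009)",
    "Anhang",
]
example_labels = ["O", "B-REF", "I-REF", "B-REF", "O"]
example_references = group_reference_strings(example_lines, example_labels)
```

Because every line of the publication receives a label, no separate step for detecting the reference section area is needed in this sketch; lines outside any reference string are simply labeled `O`.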

Notes

1. https://west.uni-koblenz.de/en/research/excite.
2. https://www.zotero.org/styles/.
3. This approach was described in a preprint by Körner [3].
4. https://github.com/exciteproject/refext.
5. https://github.com/exciteproject/amsd2017.
6. http://www.ssoar.info/.
7. https://github.com/exciteproject/ssoar-gold-standard.
8. https://www.crossref.org/labs/pdfextract.
9. http://www.ssoar.info/ssoar/handle/document/32521 and http://www.ssoar.info/ssoar/handle/document/43525.
10. Shown as the first example in Fig. 1.
11. http://www.hait.tu-dresden.de/td/home.asp.
12. http://suedosteuropaeische-hefte.org/.

References

1. Hienert, D., Sawitzki, F., Mayr, P.: Digital library research in action-supporting information retrieval in Sowiport. D-Lib Mag. 21(3/4) (2015)
2. Moed, H.F.: Citation Analysis in Research Evaluation, vol. 9. Springer, Dordrecht (2005)
3. Körner, M.: Reference String Extraction Using Line-Based Conditional Random Fields. ArXiv e-prints (2017)
4. Peng, F., McCallum, A.: Information extraction from research papers using conditional random fields. Inf. Process. Manage. 42(4), 963–979 (2006)
5. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: FLUX-CiM: flexible unsupervised extraction of citation metadata. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 215–224. ACM (2007)
6. Groza, T., Grimnes, G.A., Handschuh, S.: Reference information extraction and processing using conditional random fields. Inf. Technol. Libr. (Online) 31(2), 6 (2012)
7. Lopez, P.: GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, pp. 473–474. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04346-8_62
8. Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: an open-source CRF reference string parsing package. In: Proceedings of LREC, vol. 2008, pp. 661–667 (2008)
9. Wu, J., Williams, K., Chen, H.H., Khabsa, M., Caragea, C., Ororbia, A., Jordan, D., Giles, C.L.: CiteSeerX: AI in a digital library search engine. In: AAAI, pp. 2930–2937 (2014)
10. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, Ł.: CERMINE: automatic extraction of structured metadata from scientific literature. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 317–335 (2015)
11. Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML, vol. 1, pp. 282–289 (2001)
12. Houngbo, H., Mercer, R.E.: Method mention extraction from scientific research papers. In: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8–15 December 2012, Mumbai, India, pp. 1211–1222 (2012)
13. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
14. Constantin, A., Pettifer, S., Voronkov, A.: PDFX: fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the 2013 ACM symposium on Document engineering, pp. 177–180. ACM (2013)
15. McCallum, A.K.: MALLET: a machine learning for language toolkit (2002)

Acknowledgements

This work has been funded by Deutsche Forschungsgemeinschaft (DFG) as part of the project “Extraction of Citations from PDF Documents (EXCITE)” under grant numbers MA 3964/8-1 and STA 572/14-1. We would like to thank Dominika Tkaczyk for her support regarding the CERMINE tool as well as Alexandra Bormann, Jan Hübner, and Daniel Kostić for contributing to the gold standard that was used in this research.

Author information

Authors and Affiliations

Institute for Web Science and Technologies, University of Koblenz-Landau, Koblenz, Germany
Martin Körner & Steffen Staab
GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
Behnam Ghavimi & Philipp Mayr
Independent, Munich, Germany
Heinrich Hartmann

Authors

Martin Körner
Behnam Ghavimi
Philipp Mayr
Heinrich Hartmann
Steffen Staab

Corresponding author

Correspondence to Martin Körner.

Editor information

Editors and Affiliations

Riga Technical University, Riga, Latvia
Mārīte Kirikova
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Johann Gamper
Institute of Computing Science, Poznan University of Technology, Poznan, Poland
Robert Wrembel
Université Lumière Lyon 2, Lyon, France
Jérôme Darmont
University of Bologna, Bologna, Italy
Stefano Rizzi

About this paper

Cite this paper

Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., Staab, S. (2017). Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova, M., et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_15

DOI: https://doi.org/10.1007/978-3-319-67162-8_15
Published: 09 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67161-1
Online ISBN: 978-3-319-67162-8
eBook Packages: Computer Science, Computer Science (R0)