Reconstructing the Logical Structure of a Scientific Publication Using Machine Learning

Klampfl, Stefan; Kern, Roman

doi:10.1007/978-3-319-46565-4_20

Part of the book series: Communications in Computer and Information Science (CCIS, volume 641)

Included in the following conference series:

Semantic Web Evaluation Challenge

Abstract

Semantic enrichment of scientific publications has an increasing impact on scholarly communication. This document describes our contribution to the Semantic Publishing Challenge 2016, which aims at investigating novel approaches for improving scholarly publishing through semantic technologies. We participated in Task 2 of this challenge, which requires the extraction of information from the content of a paper given as a PDF. The extracted information allows answering queries about the paper's internal organisation and the context in which it was written. We build upon our contribution to the previous edition of the challenge, where we categorised meta-data, such as authors and affiliations, and extracted funding information. Here we use unsupervised machine learning techniques to extend the analysis of the logical structure of the document so as to identify section titles and the captions of figures and tables. Furthermore, we employ clustering techniques to create the hierarchical table of contents of the article. Our system is modular in nature and allows the different stages to be trained separately on different training sets.
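
The abstract does not say which features or clustering algorithm are used to build the table of contents. As a rough illustration only, the sketch below assumes that each detected section title comes with a font size, groups similar font sizes with a simple greedy one-dimensional clustering, and treats larger fonts as higher heading levels. The names `Heading`, `TocNode`, `cluster_font_sizes`, `build_toc`, and the `tolerance` parameter are all hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    text: str
    font_size: float  # assumed feature; the paper's real features are not stated here


@dataclass
class TocNode:
    title: str
    level: int
    children: list = field(default_factory=list)


def cluster_font_sizes(sizes, tolerance=0.5):
    """Greedy 1-D clustering: walk the distinct sizes from largest to smallest
    and start a new cluster whenever the gap to the previous size exceeds
    `tolerance` (an assumed threshold)."""
    clusters = []
    for size in sorted(set(sizes), reverse=True):
        if clusters and clusters[-1][-1] - size <= tolerance:
            clusters[-1].append(size)
        else:
            clusters.append([size])
    return clusters


def build_toc(headings, tolerance=0.5):
    """Assign one heading level per font-size cluster (largest font = level 1)
    and nest the headings, in reading order, using a stack of open sections."""
    clusters = cluster_font_sizes([h.font_size for h in headings], tolerance)
    level_of = {
        size: level
        for level, cluster in enumerate(clusters, start=1)
        for size in cluster
    }
    root = TocNode("ROOT", 0)
    stack = [root]
    for h in headings:
        level = level_of[h.font_size]
        # Close every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = TocNode(h.text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Toy input, invented for illustration.
example = [
    Heading("1 Introduction", 14.0),
    Heading("2 Method", 14.0),
    Heading("2.1 Features", 12.0),
    Heading("2.2 Clustering", 12.1),
    Heading("3 Results", 14.0),
]
toc = build_toc(example)
```

In this toy input the 12.0 and 12.1 point headings fall into the same cluster, so both become subsections of "2 Method". A real system would likely combine several layout and text features rather than font size alone.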

Notes

1. https://github.com/ceurws/lod/wiki/SemPub2016
2. http://code-annotator.know-center.tugraz.at
3. https://svn.know-center.tugraz.at/opensource/projects/code/trunk (user: Anonymous, empty password)
4. http://pdfbox.apache.org/
5. https://jena.apache.org/
6. https://developers.google.com/maps/documentation/geocoding/start

Acknowledgements

The presented work was in part developed within the CODE project (grant no. 296150) and within the EEXCESS project (grant no. 600601) funded by the EU FP7, as well as the TEAM IAPP project (grant no. 251514) within the FP7 People Programme. The Know-Center is funded within the Austrian COMET Program – Competence Centers for Excellent Technologies – under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

Author information

Authors and Affiliations

Know-Center GmbH, Inffeldgasse 13, 8010, Graz, Austria
Stefan Klampfl & Roman Kern

Corresponding author

Correspondence to Stefan Klampfl.

Editor information

Editors and Affiliations

IT Systems Engineering, Hasso-Plattner Institute, Potsdam, Germany
Harald Sack
Leibniz Universität Hannover, Hannover, Germany
Stefan Dietze
Elsevier B.V., Amsterdam, The Netherlands
Anna Tordai
Universität Bonn, Bonn, Germany
Christoph Lange

About this paper

Cite this paper

Klampfl, S., Kern, R. (2016). Reconstructing the Logical Structure of a Scientific Publication Using Machine Learning. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_20

DOI: https://doi.org/10.1007/978-3-319-46565-4_20
Published: 09 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46564-7
Online ISBN: 978-3-319-46565-4
eBook Packages: Computer Science, Computer Science (R0)
