Dr. Inventor Framework: Extracting Structured Information from Scientific Publications

Ronzano, Francesco; Saggion, Horacio

doi:10.1007/978-3-319-24282-8_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9356)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Although research communities and publishing houses are putting increasing effort into delivering scientific articles as structured texts, a considerable part of the online scientific literature is still available in layout-oriented formats such as PDF, which lack explicit structural or semantic information. As a consequence, bootstrapping the textual analysis of scientific papers is often a time-consuming activity. We present the first version of the Dr. Inventor Framework, a publicly available collection of scientific text mining components that helps to prevent or at least mitigate this problem. By integrating and customizing several text mining tools and online services, the Dr. Inventor Framework can analyze scientific publications in both plain text and PDF format, making core aspects of their structure and semantics explicit and easily accessible. The facilities implemented by the Framework include the extraction of structured textual content, the discursive characterization of sentences, the identification of the structural elements of both paper headers and bibliographic entries, and the generation of graph-based representations of text excerpts. The Framework is distributed as a Java library. We describe in detail the scientific text mining facilities included in the Framework and present two use cases in which the Framework is exploited, respectively, to boost scientific creativity and to generate RDF graphs from scientific publications.

The work described in this paper has been funded by the European Project Dr. Inventor (FP7-ICT-2013.8.1 - Grant no: 611383).

Notes

1. http://jats.nlm.nih.gov/.
2. http://www.elsevier.com/author-schemas/elsevier-xml-dtds-and-transport-schemas.
3. http://cs.unibo.it/save-sd/rash/documentation/index.html.
4. The latest release of the Framework, together with the related documentation, can be downloaded at: http://backingdata.org/dri/library/.
5. https://gate.ac.uk/sale/tao/splitch7.html.
6. http://pdfbox.apache.org/.
7. http://pdf2xml.sourceforge.net/ and http://sourceforge.net/projects/pdf2xml/.
8. http://poppler.freedesktop.org/.
9. https://code.google.com/p/lapdftext/.
10. http://cermine.ceon.pl/.
11. http://wing.comp.nus.edu.sg/parsCit/.
12. http://pdfx.cs.man.ac.uk/.
13. https://gate.ac.uk/sale/tao/splitch6.html#chap:annie.
14. https://gate.ac.uk/sale/tao/splitch8.html.
15. http://freecite.library.brown.edu/welcome.
16. https://hpi.de/naumann/projects/repeatability/datasets/cora-dataset.html.
17. http://search.crossref.org/help/api.
18. http://www.bibsonomy.org/help/doc/api.html.
19. https://code.google.com/p/mate-tools/.
20. http://www.cs.waikato.ac.nz/ml/weka/.
21. https://github.com/ceurws/lod/wiki/SemPub2015.
22. http://ceur-ws.org/.

Author information

Authors and Affiliations

TALN Research Group, Universitat Pompeu Fabra, C/Tanger 122, 08018, Barcelona, Spain
Francesco Ronzano & Horacio Saggion

Corresponding author

Correspondence to Francesco Ronzano.

Editor information

Editors and Affiliations

University of Ottawa, Ottawa, Ontario, Canada
Nathalie Japkowicz
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
Stan Matwin

About this paper

Cite this paper

Ronzano, F., Saggion, H. (2015). Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. In: Japkowicz, N., Matwin, S. (eds) Discovery Science. DS 2015. Lecture Notes in Computer Science, vol 9356. Springer, Cham. https://doi.org/10.1007/978-3-319-24282-8_18

DOI: https://doi.org/10.1007/978-3-319-24282-8_18
Published: 25 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24281-1
Online ISBN: 978-3-319-24282-8
eBook Packages: Computer Science, Computer Science (R0)
