Abstract
Although research communities and publishing houses are putting increasing effort into delivering scientific articles as structured texts, a considerable part of on-line scientific literature is still available in layout-oriented data formats, like PDF, that lack any explicit structural or semantic information. As a consequence, bootstrapping the textual analysis of scientific papers is often a time-consuming activity. We present the first version of the Dr. Inventor Framework, a publicly available collection of scientific text mining components that helps to prevent or at least mitigate this problem. Thanks to the integration and customization of several text mining tools and on-line services, the Dr. Inventor Framework is able to analyze scientific publications in both plain text and PDF format, making core aspects of their structure and semantics explicit and easily accessible. The facilities implemented by the Framework include the extraction of structured textual contents, the discursive characterization of sentences, the identification of the structural elements of both paper headers and bibliographic entries, and the generation of graph-based representations of text excerpts. The Framework is distributed as a Java library. We describe in detail the scientific mining facilities included in the Framework and present two use cases in which the Framework is exploited, respectively, to boost scientific creativity and to generate RDF graphs from scientific publications.
The work described in this paper has been funded by the European Project Dr. Inventor (FP7-ICT-2013.8.1 - Grant no: 611383).
Notes
- 4. The latest release of the Framework, together with the related documentation, can be downloaded at: http://backingdata.org/dri/library/.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ronzano, F., Saggion, H. (2015). Dr. Inventor Framework: Extracting Structured Information from Scientific Publications. In: Japkowicz, N., Matwin, S. (eds) Discovery Science. DS 2015. Lecture Notes in Computer Science, vol 9356. Springer, Cham. https://doi.org/10.1007/978-3-319-24282-8_18
DOI: https://doi.org/10.1007/978-3-319-24282-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24281-1
Online ISBN: 978-3-319-24282-8
eBook Packages: Computer Science, Computer Science (R0)