Authors:
Muhammad Suryani 1,2; Steffen Hahne 1; Christian Beth 1; Klaus Wallmann 2 and Matthias Renz 1
Affiliations:
1 Institute of Informatik, Christian-Albrechts-Universität zu Kiel, Kiel, Germany; 2 GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
Keyword(s):
Information Extraction, Data Acquisition, Research Data Management, Scientific Publication, Marine Science.
Abstract:
Researchers encapsulate their findings in publications, generally distributed as PDFs, a format designed primarily for platform-independent viewing and printing that does not support editing or automatic data extraction. These documents are a rich source of information in any domain, but that information is spread across text, tables and figures. Manually extracting it from these components is prohibitively tedious and calls for an automatic approach, which could provide valuable data to the research community while also helping to manage the increasing number of publications. Previous approaches focused on extracting individual components from scientific publications, i.e. metadata, text or tables, but did not target these data components collectively. This paper proposes a Data Acquisition Framework (DAF), the most comprehensive framework to our knowledge. The DAF extracts enhanced metadata, segmented text, and the captions and content of tables and figures. Through rigorous evaluation on two distinct datasets from the marine science and chemical domains, we show that the DAF outperforms the baseline PDFDataExtractor. We also provide an illustrative example to underscore the DAF’s adaptability in research data management.