Extracting Experiment Statistics,
Conditions, and Topics from Scientific Papers with STEREO
(pp252-277)
S. Epp, M. Hoffmann,
N. Lell, M. Mohr, and
A. Scherp
doi:
https://doi.org/10.26421/JDI3.2-4
Abstract:
We address the problem of extracting
reports of statistics along with information about the experiment
conditions and experiment topics from scientific publications. A
common writing style for statistical results follows the recommendations
of the American Psychological Association (APA).
In practice, writing styles vary, as reports do not fully follow
APA style
or parameters are not reported despite being mandatory. In addition,
the statistics are not reported in isolation but in context of
experiment conditions investigated and the general experiment topic.
We address these challenges by proposing a flexible pipeline STEREO
based on wrapper induction and unsupervised aspect detection to
extract experiment statistics, conditions, and topics. Thus, in
contrast to existing rule-based tools like
statcheck
with a
pre-defined
set of rules, we learn rules via induction. Hierarchical wrapper
induction is applied to learn rules to extract the reported
statistics. The challenge here is to apply wrapper induction to an
information extraction task without formatting landmarks such as
those that can be exploited in HTML pages. The result of step 1 is a set of
extracted statistic reports together with sentences in which the
reports were found. This is used as input to step 2 of STEREO, which
has two parts. We extract experiment conditions using a
grammar-based wrapper. Furthermore, we identify the experiment topic
using an unsupervised attention-based aspect extraction approach
adapted to our problem domain. We applied our pipeline to the more than
100,000 documents in the CORD-19 dataset.
It required only 0.25%
of the CORD-19 corpus (about 500
documents) to learn statistics extraction rules that cover
95%
of the sentences in CORD-19. The statistic extraction has
100% precision on APA-conform statistics, which is identical to
that of statcheck.
In addition, STEREO can extract non-APA
writing styles with 95%
precision, which
statcheck
does not support. Extracting non-APA
conform statistics is important, as they make up more than
99% of all 113k extracted statistics. In 46% of cases, we could
extract the correct conditions from APA-conform reports (30%
for non-APA).
The best model for topic extraction achieves a precision of
75% on statistics reported in APA style (73% for non-APA
conform). We conclude that STEREO is a good foundation for automatic
statistic extraction and future developments for scientific paper
analysis. In particular, the extraction of non-APA
conform reports is important and enables applications such as giving
feedback to authors about what is missing and could be changed.
Finally, STEREO complements existing
metadata
extraction tools and can be integrated in a general scientific paper
analysis pipeline.
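
To illustrate what an APA-conform statistic report looks like to a
rule-based extractor, the following minimal Python sketch matches t-test
reports such as "t(28) = 2.20, p = .03" with one hand-written regular
expression. The pattern and the names APA_T_TEST and extract_t_tests are
assumptions made for this example only; they are neither the rules STEREO
learns via wrapper induction nor the rules used by statcheck.

```python
import re

# Hypothetical, hand-written pattern for one APA-style report type (t-test).
# Assumption: illustration only; STEREO learns its rules instead of fixing them.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # t(df)
    r"\s*=\s*(?P<t>-?\d*\.?\d+)"               # = test statistic
    r"\s*,\s*p\s*(?P<op>[<>=])\s*(?P<p>\d*\.\d+)"  # , p </=/> value
)


def extract_t_tests(sentence):
    """Return one dict per APA-style t-test report found in the sentence."""
    return [
        {
            "df": float(m["df"]),
            "t": float(m["t"]),
            "op": m["op"],
            "p": float(m["p"]),
        }
        for m in APA_T_TEST.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract_t_tests("The groups differed, t(28) = 2.20, p = .03."))
```

A sentence that reports the same test in a different form (for example
with the degrees of freedom omitted) yields an empty list under this single
rule, which is the kind of gap that learned rules for non-APA writing
styles are meant to close.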
Key words:
structured data extraction,
scientific paper analysis, meta-research