ISSN: 2577-610X


 

Journal of Data Intelligence  ISSN: 2577-610X      published since 2020
Vol.3 No.2 May, 2022 

Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers with STEREO (pp252-277)
        
S. Epp, M. Hoffmann, N. Lell, M. Mohr, and A. Scherp
doi: https://doi.org/10.26421/JDI3.2-4

Abstract: We address the problem of extracting reports of statistics, along with information about the experiment conditions and experiment topics, from scientific publications. A common writing style for statistical results follows the recommendations of the American Psychological Association (APA). In practice, writing styles vary: reports do not fully follow APA style, or parameters are omitted despite being mandatory. In addition, statistics are not reported in isolation but in the context of the experiment conditions investigated and the general experiment topic. We address these challenges by proposing STEREO, a flexible pipeline based on wrapper induction and unsupervised aspect detection, to extract experiment statistics, conditions, and topics. Thus, in contrast to existing rule-based tools like statcheck, which rely on a pre-defined set of rules, we learn rules via induction. Hierarchical wrapper induction is applied to learn rules for extracting the reported statistics. The challenge here is to apply wrapper induction to an information extraction task without the formatting landmarks that can be exploited in HTML pages. The result of step 1 is a set of extracted statistic reports together with the sentences in which the reports were found. This set is the input to step 2 of STEREO, which has two parts. First, we extract experiment conditions using a grammar-based wrapper. Second, we identify the experiment topic using an unsupervised attention-based aspect extraction approach adapted to our problem domain. We applied our pipeline to the over 100,000 documents in the CORD-19 dataset. Only 0.25% of the CORD-19 corpus (about 500 documents) was required to learn statistics extraction rules that cover 95% of the sentences in CORD-19. The statistic extraction has 100% precision on APA-conform statistics, which is identical to statcheck. In addition, STEREO can extract non-APA writing styles with 95% precision, which statcheck does not support.
Extracting non-APA-conform statistics is important because they make up more than 99% of all 113k extracted statistics. We extracted the correct conditions in 46% of APA-conform reports (30% for non-APA). The best model for topic extraction achieves a precision of 75% on statistics reported in APA style (73% for non-APA conform). We conclude that STEREO is a good foundation for automatic statistic extraction and for future developments in scientific paper analysis. In particular, the extraction of non-APA-conform reports is important and enables applications such as giving authors feedback about what is missing and what could be changed. Finally, STEREO complements existing metadata extraction tools and can be integrated into a general scientific paper analysis pipeline.
Key words:
structured data extraction, scientific paper analysis, meta-research
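
To make the notion of an APA-conform statistic report concrete, the following is a minimal Python sketch of a single hand-written extraction rule for t-test reports such as "t(48) = 2.31, p = .025". It illustrates the pre-defined, rule-based style that the abstract contrasts with STEREO; it is not one of the rules STEREO learns by wrapper induction, and the pattern, the function name extract_t_tests, and the example sentences are assumptions made for illustration only.

```python
import re

# Hypothetical hand-written rule for one APA-style report type (t-test),
# e.g. "t(48) = 2.31, p = .025". This is an illustrative assumption, not a
# rule taken from STEREO or statcheck.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<value>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<op>[<>=])\s*(?P<p>0?\.\d+)"
)


def extract_t_tests(sentence):
    """Return one dict per APA-style t-test report found in the sentence."""
    results = []
    for m in APA_T_TEST.finditer(sentence):
        results.append({
            "df": float(m.group("df")),    # degrees of freedom
            "t": float(m.group("value")),  # test statistic
            "p_op": m.group("op"),         # comparison operator for p
            "p": float(m.group("p")),      # reported p-value
        })
    return results


if __name__ == "__main__":
    apa = "Group A scored higher than group B, t(48) = 2.31, p = .025."
    non_apa = "Group A scored higher than group B (t = 2.31; p < 0.05)."
    print(extract_t_tests(apa))      # one report is found
    print(extract_t_tests(non_apa))  # the fixed rule finds nothing
```

The second example sentence shows why a fixed rule set is limiting: a report that deviates from the expected APA layout is simply missed, which is the gap the abstract says learned rules are meant to close.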