Published April 27, 2022 | Version 1.0
Dataset Open

DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resource Entity Extraction Using Clinical Trials Literature

  • 1. University of Geneva

Description

Datasets

  1. DISTANT-CTO is a weakly labelled dataset of sentences annotated with 'Intervention' and 'Comparator' entities. The dataset was obtained using the candidate generation approach described in "DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resource Entity Extraction Using Clinical Trials Literature".
    1. distantcto_high_conf.txt    - ds conf 1.0 (full dataset)
    2. extraction1_pos_posnegtrail_conf09.txt - ds conf 0.9 (partial dataset)
  2. The physio test set comprises 153 PICO-annotated randomized controlled trial abstracts from Physiotherapy and Rehabilitation. It was used as an additional benchmark to evaluate how well the weakly annotated dataset and the NER model generalize to this sub-domain.

 

Utility

The dataset can be used as training input for 'Intervention' named-entity recognition (NER) models.

 

Availability

This directory includes extraction1_pos_posnegtrail_conf09.txt. This text file contains all the weak annotations (source intervention terms mapped onto target sentences) from clinicaltrials.org (CTO) with a confidence score of 0.9 or above.

The directory also includes ‘physio_sent_annot2POS_posnegtrail.txt’. This file contains manually annotated (Intervention entity) data from the physiotherapy and rehabilitation domain. Its structure is roughly similar to the one described in the ‘Description for long targets’ section. ‘Participant’ and ‘Outcome’ annotations have been removed from this file.

Files

Files (4.2 GB)

md5:66f0686803fc5f9577e81801c4937e36 (3.8 GB)
md5:e95e3984a9b46e340b90aeed262e12cc (360.2 MB)
md5:a0b6f478ce1896b1a531ed8b823a2752 (1.1 MB)
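
Because the largest file is several gigabytes, it may be worth checking a download against the MD5 digests listed above before using it. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of_file`, `matches_listed_checksum`) are illustrative and not part of the dataset. Since the listing does not pair each digest with a file name, the sketch only checks that a file's digest is one of the three listed values.

```python
import hashlib
import sys

# MD5 digests copied from the file listing (without the "md5:" prefix).
# The listing does not say which digest belongs to which file,
# so membership in this set is the only thing checked here.
LISTED_MD5 = {
    "66f0686803fc5f9577e81801c4937e36",
    "e95e3984a9b46e340b90aeed262e12cc",
    "a0b6f478ce1896b1a531ed8b823a2752",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read chunk by chunk so multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 digest is one of the listed digests."""
    return md5_of_file(path) in LISTED_MD5


if __name__ == "__main__":
    # Usage: python verify_md5.py <downloaded file> [<downloaded file> ...]
    for file_path in sys.argv[1:]:
        status = "OK" if matches_listed_checksum(file_path) else "MISMATCH"
        print(f"{status}  {md5_of_file(file_path)}  {file_path}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.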