Learning to extract adverse drug reaction events from electronic health records in Spanish
Introduction
Text mining in the clinical domain has emerged as a field of interest during the last decade, with several attempts in the literature aimed at easing the reading of reports in English (Chaplin, Meloni, Eisen, Jolayemi, Banigbe, Adeola, Wen, Nieva, Chang, Okonkwo, & Kanki, 2015; Rink, Harabagiu, & Roberts, 2011; Toldo, Bhattacharya, & Gurulingappa, 2012). The biomedical events targeted vary widely and include binary protein-protein interaction (Wong, 2001), biomolecular event extraction (Jin-Dong et al., 2011), drug-drug interaction extraction (Segura-Bedmar, Martínez, & Herrero-Zazo, 2013), and cause-effect event extraction (Björne, Ginter, Pyysalo, Tsujii, & Salakoski, 2010; Mihaila, Ohta, Pyysalo, & Ananiadou, 2013).
We focus on the detection of Adverse Drug Reactions (ADRs), that is, cases in which a drug prescribed to combat a disease causes other, new diseases. Fig. 1a shows some adverse reactions originally written in Spanish (“MEG”, “vómitos”, “diarrea”) caused by a drug (“Actira”). For ADR detection in particular, there are solutions based on rules, on machine learning techniques, or on combinations of both, with encouraging results in different contexts:
- Medical literature (SriJyothsna, Aditya, Saipradeep, Govindakrishnan, & Rajgopal, 2014; Xu & Wang, 2013): e.g. scientific journals. These texts tend to be grammatically correct and without misspellings.
- Social media (Nikfarjam & Gonzalez, 2011; SriJyothsna, Aditya, Saipradeep, Govindakrishnan, & Rajgopal, 2014): e.g. blogs and tweets related to health, usually written by non-experts.
- Electronic health records (EHRs) (Karlsson, Zhao, Asker, & Boström, 2013; Sohn, Kocher, Chute, & Savova, 2011): e.g. clinical reports. They use neither a fully formal linguistic register nor a lay register. They can contain abbreviations, typos, or grammatically incorrect sentences.
The aim of this work is to automatically highlight ADRs in EHRs in order to alleviate the workload of the various hospital services (pharmacy, documentation, etc.) that have to read these reports. Processing these texts is a real challenge for the rapid detection of ADRs and, in consequence, for the safety of the patient.
We present a hybrid ADR extraction system to cope with this task. It entails two stages in sequence: 1) the first stage carries out, among other things, the annotation of entities such as drugs and substances (from now on both are referred to as drug), and also disorders and findings (referred to as disease); 2) the second stage determines whether a given (drug, disease) pair of entities represents an ADR event. Note that we are interested in highlighting events involving (drug, disease) pairs where the drug caused the disease. The final system should present the ADRs marked in a friendly front-end. To this end, we represent the text in the framework provided by the Brat rapid annotation tool (Stenetorp et al., 2012). Fig. 1 shows examples, represented in Brat, of some cause-effect events manually tagged by experts.
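The two-stage design can be illustrated with a minimal sketch. It assumes the first stage has already produced typed entities for one document; the `Entity` class, the label names "drug" and "disease", and the `toy_classifier` lookup are hypothetical stand-ins for illustration, not the system's actual components.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str  # surface form as written in the report
    kind: str  # "drug" or "disease" (label names are assumptions)


Pair = Tuple[Entity, Entity]


def candidate_pairs(entities: List[Entity]) -> List[Pair]:
    """Build every (drug, disease) combination found in one document."""
    drugs = [e for e in entities if e.kind == "drug"]
    diseases = [e for e in entities if e.kind == "disease"]
    return list(product(drugs, diseases))


def extract_adrs(
    entities: List[Entity], is_adr: Callable[[Entity, Entity], bool]
) -> List[Pair]:
    """Second stage: keep only the pairs the classifier accepts as ADR events."""
    return [(d, s) for d, s in candidate_pairs(entities) if is_adr(d, s)]


# Entities taken from the Fig. 1a example.
entities = [
    Entity("Actira", "drug"),
    Entity("MEG", "disease"),
    Entity("vómitos", "disease"),
    Entity("diarrea", "disease"),
]

# Hypothetical stand-in for a trained classifier: a lookup of the pairs
# that the experts tagged as ADRs in Fig. 1a.
known_adrs = {("Actira", "MEG"), ("Actira", "vómitos"), ("Actira", "diarrea")}


def toy_classifier(drug: Entity, disease: Entity) -> bool:
    return (drug.text, disease.text) in known_adrs


adrs = extract_adrs(entities, toy_classifier)
for drug, disease in adrs:
    print(f"{drug.text} -> {disease.text}")
```

In a real setting the lookup would be replaced by a learned model, since the same (drug, disease) pair can be an ADR in one context and a treatment in another.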
ADRs differ from medication errors: in a medication error the drug is used inappropriately, so the situation is preventable, whereas ADRs are hardly preventable (MSC, 2006). Thus, a drug is prescribed to combat a disease but, in some situations, it can cause unexpected side effects in specific patients. In this work we aim to differentiate between (drug, disease) pairs that represent an ADR and those pairs with positive or neutral consequences for patients. Detecting ADRs in EHRs is hard because the same pair might represent both ADR and non-ADR events. As an example, Fig. 1 presents the pair (AC × FA, betabloqueantes) twice in the same document. AC × FA is an abbreviation of “Arritmia cardíaca por fibrilación auricular”, meaning “atrial fibrillation”, and “betabloqueantes” are “beta blockers”, a drug family. In one case (depicted in Fig. 1b) that pair represents a treatment; in the same document, however, the same pair also appears as an ADR (see Fig. 1c). Six percent of the drug-disease entity pairs trigger an adverse drug reaction, a result in accord with similar estimates for other health systems. For example, Henriksson, Kvist, Dalianis, and Duneld (2015) state that ADRs are responsible for approximately 35% of hospital admissions world-wide, and that they suffer heavily from under-reporting. The ENEAS report, written by the Spanish Ministry of Health (MSC, 2006), examined twenty-four hospitals in Spain to determine the impact and preventability of ADRs, concluding that 42% of the adverse effects are avoidable.
Hospital personnel use prescription management systems that help to avoid the most typical and frequent drug-disease pairs causing ADRs. However, these lists are not especially useful when analyzing EHRs, because the ADRs that we intend to discover are precisely those that do not belong to the list of typical ADRs, which are filtered out before the drug is prescribed according to each patient’s characteristics. As Henriksson et al. (2015) pointed out, ADRs are heavily under-reported in EHRs most of the time, except in the few cases when they are the main cause of disease. Additionally, the main difficulty of the present task is to discover ADRs that are expressed in multiple ways in EHR texts, potentially with big differences with respect to the standard naming of drugs, diseases and ADR triggers. We are dealing with discharge EHRs written by around 400 different doctors. These records are not written with the aid of a template, so they do not follow a pre-determined structure, and this, by itself, entails a challenge. Most recent works cope with event extraction within the same sentence, that is, with intra-sentence events. By contrast, in this work we found that around 22% of the ADR events in our EHRs occur between medical entities that are in different sentences, and some of them are far from each other.
Fig. 2 presents an example of an inter-sentence event: the disease “INTOXICACION” (intoxication) is caused by the drug “Sintrom”, and each entity is in a different sentence. The pair (“INTOXICACION”, “ACENOCUMAROL”) corresponds to an ADR where the drug and the disease are in the same sentence, whereas the pair (“INTOXICACION”, “Sintrom”) is an ADR with the drug and the disease in different sentences. Note that Acenocumarol (in English, Acenocoumarol) is the active ingredient of the pharmacological product called Sintrom. ADRs between entities that are in different sentences are explained in depth in Section 4.1.
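The distinction between intra-sentence and inter-sentence events can be made concrete with a small sketch. The `Mention` type and the sentence indexes below are assumptions chosen only for illustration; the actual sentence positions in Fig. 2 are not reproduced here.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    text: str
    sentence: int  # hypothetical 0-based index of the containing sentence


def sentence_distance(drug: Mention, disease: Mention) -> int:
    """Number of sentence boundaries separating the two mentions."""
    return abs(drug.sentence - disease.sentence)


def is_inter_sentence(drug: Mention, disease: Mention) -> bool:
    """True when drug and disease are located in different sentences."""
    return sentence_distance(drug, disease) > 0


def inter_sentence_share(events: List[Tuple[Mention, Mention]]) -> float:
    """Fraction of events whose two entities lie in different sentences."""
    if not events:
        return 0.0
    return sum(is_inter_sentence(d, s) for d, s in events) / len(events)


# Illustrative positions only: the two entities of the first pair share a
# sentence, while "Sintrom" is placed in a later sentence.
intoxicacion = Mention("INTOXICACION", 0)
acenocumarol = Mention("ACENOCUMAROL", 0)
sintrom = Mention("Sintrom", 2)

events = [(acenocumarol, intoxicacion), (sintrom, intoxicacion)]
for drug, disease in events:
    scope = "inter" if is_inter_sentence(drug, disease) else "intra"
    print(f"({disease.text}, {drug.text}): {scope}-sentence")
```

A system restricted to intra-sentence candidates would only ever consider the first pair, which is why the sentence distance is worth tracking explicitly.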
Related work
Since 2008, various European projects have tackled the early detection of adverse drug reactions, for example EU-ADR (van Mulligen et al., 2012), PSIP (Beuscart, McNair, & Brender, 2009) and EHR4CR (Moor et al., 2015). Wang, Hripcsak, Markatou, and Friedman (2009) were among the pioneering researchers in discovering adverse events in EHRs. They integrated a natural language processing system to identify medical entities and, after that, co-occurrence statistics to
Methods
The proposed system (depicted in Fig. 3) deals with two subproblems: first, relevant medical entities are identified; and then (drug, disease) pairs are explored, annotating those pairs classified as an ADR. Named entity recognition within the biomedical domain for Spanish is accomplished by means of a linguistic analyser. Once the analyser has detected the drug and disease entities, all possible (drug, disease) pairs in a document represent a candidate relation that the system will classify,
Data: qualitative and quantitative description
Given the entire set of manually annotated documents, 25.8% were randomly selected without replacement to produce the test set. The resulting partition is presented in Table 1. The table shows the number of documents, entities recognised, ADR events and the negative/positive ADR event-ratio.
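A random partition of this kind, together with the negative/positive ratio reported per partition, could be computed roughly as follows. This is a sketch under stated assumptions: the document identifiers, the corpus size of 1000, the fixed seed and the toy label counts are all hypothetical, and only the 25.8% fraction comes from the text.

```python
import random
from typing import List, Sequence, Tuple


def split_documents(
    doc_ids: Sequence[str], test_fraction: float = 0.258, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Draw the test set at random without replacement; the rest is training."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    n_test = round(len(doc_ids) * test_fraction)
    test = set(rng.sample(list(doc_ids), n_test))
    train = [d for d in doc_ids if d not in test]
    return train, sorted(test)


def negative_positive_ratio(labels: Sequence[bool]) -> float:
    """Negative events per positive event (True marks an ADR event)."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("ratio undefined: no positive events")
    return negatives / positives


# Hypothetical corpus of 1000 document identifiers.
doc_ids = [f"doc{i:03d}" for i in range(1000)]
train_ids, test_ids = split_documents(doc_ids)
print(len(train_ids), len(test_ids))

# Toy labels: 6 positive and 94 negative candidate pairs.
toy_labels = [True] * 6 + [False] * 94
print(round(negative_positive_ratio(toy_labels), 2))
```

Splitting at the document level, rather than at the level of candidate pairs, keeps all pairs from one report in the same partition.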
Fig. 6 shows the distribution of positive (in blue) and negative (in red) ADRs in a set of 50 randomly chosen files. This figure highlights the unequal distribution of the classes: there is a notably
Concluding remarks
This work deals with text mining within the medical domain and, in particular, with ADR event extraction. The challenge lies in the skewed nature of the task: 94% of the instances belong to the negative class, while the key class is the positive one. The proposed system focuses on real EHRs in Spanish. To our knowledge, this is one of the first works on medical event extraction for Spanish, and certainly the first published one dealing with EHRs in Spanish. We work with a corpus that has been
Acknowledgements
The authors would like to thank the personnel of Pharmacy and Pharmacovigilance services of the Galdakao-Usansolo Hospital. This work was partially funded by the Spanish Ministry of Science and Innovation EXTRECM: TIN2013-46616-C2-1-R, TADEEP: TIN2015-70214-P and the Basque Government (DETEAMI: Ministry of Health 2014111003, IXA Research Group of type A (2010-2015), Predoctoral Grant: PRE_2015_1_0211).
References (45)
- et al. Scale-up of networked HIV treatment in Nigeria: Creation of an integrated electronic medical records system. International Journal of Medical Informatics (2015).
- et al. IxaMed: Applying Freeling and a perceptron sequential tagger at the shared task on analyzing clinical texts. International workshop on semantic evaluation, task 7: Analysis of clinical text (2014).
- SNOMED CT Starter Guide. February 2014. Technical Report (2014).
- et al. Biocause: Annotating and analysing causality in the biomedical domain. BMC Bioinformatics (2013).
- et al. Semantic Services in Freeling 2.1: WordNet and UKB. Global wordnet conference (2010).
- et al. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. JAMIA (2009).
- et al. Training cost-sensitive neural networks with methods addressing the class imbalance problem. Knowledge and Data Engineering, IEEE Transactions on (2006).
- et al. Extraction of adverse drug effects from clinical records. Proceedings of medinfo (2010).
- et al. Patient safety through intelligent procedures in medication: the PSIP project. Studies in Health Technology and Informatics (2009).
- Pattern recognition and machine learning (2006).