1 Introduction

A large number of medical examinations is carried out every day at hospitals and emergency rooms. For every visit, several administrative documents must be filled in. One of these is the discharge letter (DiL), a document issued to the patient at the time of discharge from a hospital. It summarises the information contained in the medical record - of which it is an integral part - and contains advice on any checks or therapies to be carried out. The information contained in the document is therefore intended to be useful to the doctor who will follow the patient in the future.

To be correctly classified according to international standards, each visit must be associated with at least one International Statistical Classification of Diseases and Related Health Problems code (ICD [27]). ICD-10 is a medical classification taxonomy created by the World Health Organization. It contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease.

Responsibility for the correct completion of the discharge letter lies with the doctor in charge of the discharge. The ICD code must be assigned manually by the attending physician, who, however, is not always able to enter the data, mainly due to the fast-paced environment of hospitals and first aid centers. As a result, the code is often missing or entered as a placeholder.

We would like to investigate the following three research questions:

  • Q1: How can machine learning techniques support the clinical staff in classifying the text data and help avoid human errors?

  • Q2: Can such a system provide explanations to users to support transparency and trustworthiness?

  • Q3: Is there a way to discover lexical similarities and relationships between medical terms?

1.1 Motivating Example

To elaborate on our research questions, we next introduce a concrete example that allows the reader to understand the problem and shows why and for whom the system is useful. We investigate the scenario of a particular Italian medical center, where the number of visits without an ICD code is very high (\(\frac{1}{3}\) of visits). Assigning the right ICD code to a DiL is a crucial and beneficial task for the hospital, for many practical reasons: obtaining reimbursements from the regional health service provider, assessing fund allocation, and so on. If the code is not assigned, it is necessary to manually read the DiLs, which contain all the information necessary to trace the specific ICD code of the visit. This process is error-prone and time-consuming; however, it can be improved by applying text mining techniques such as text classification.

This particular medical center treats only a subset of possible diseases, namely those specific to Italy and to the expertise of the center. As this is an application paper, we focus on this specific domain for our analyses.

1.2 Contribution

The contribution of this project is the following:

  • Realisation of an XAI-based healthcare prototype for classifying and explaining Discharge Letter (DiL) data, answering research questions Q1 and Q2. A demonstration video of the prototype is also provided.

  • The XAI module helps healthcare operators understand the rationale behind the classification process, and speeds up the classification process as a whole.

  • The trained Italian word embeddings, specific to the healthcare domain, support word similarity and classification tasks and address research question Q3.

We begin by discussing the technical background in Sect. 2. Then, we present the “eXplainable Discharge Letters" (eXDiL) system in Sect. 3 and its performance in Sect. 4. Section 5 concludes the paper with a discussion of pros and cons, conclusions, and the future work we intend to carry out.

2 Background and Related Work

In this section we introduce some background notions on word embeddings and XAI, as well as previous work in this research area.

Diagnosis code assignment is a well-known classification problem. In recent years, it has been tackled with rule-based methods as well as statistical and machine learning approaches.

Among the first to approach this task were [16] and [7], using rule-based methods. Such methods are not easy to create and require extensive domain expertise to formulate the appropriate classification rules. However, the interpretability of this kind of model is the highest, as one can trace exactly how a prediction was made.

In [17], the authors used machine learning methods such as Support Vector Machine (SVM) and Bayesian Ridge Regression (BRR). They include only the five most frequent ICD codes, and their classification performance is not very high. The upside of machine learning methods in classification tasks is that less domain expertise is needed, and the performance is in many cases higher than that of rule-based systems.

More recent works include [26], who use a novel Convolutional Neural Network (CNN) model with attention. They select a subset of the 50 most frequent codes and perform multilabel classification. They also conduct a human evaluation of the attention explanations. In [2], the authors use multiple models: SVM, Continuous Bag of Words (CBOW), CNN and a hierarchical model, HA-GRU. The performance of more complex and deep models is superior to that of a model such as SVM. They interpret their results with an attention mechanism. Unlike the other works presented, they include all the labels present in the dataset, using both the full 5-character codes and rolled-up codes at 3 characters. The latter two works both use the publicly available MIMIC II and III datasets for training and testing their models.

2.1 Word Embeddings

Vector representations of words belong to the family of neural language models [3]: each word of a given lexicon is mapped to a unique vector in an N-dimensional space (for a fixed N).

In our application, the words are drawn from the text content of the DiLs. Here, an important contribution comes from the Word2Vec algorithm [22, 23], which computes vector representations of words by looking at the contexts where these words are used. Intuitively, given a word w and its context k (i.e., the m words in the neighbourhood of w), it uses k as a feature for predicting the word w. This task can be expressed as a machine learning problem, where the representation of the m context words is fed into a neural network trained to predict the representation of w, according to the Continuous Bag of Words (CBOW) model proposed by [22].

Consider two different words \(w_1\) and \(w_2\) having very similar contexts, \(k_1\) and \(k_2\) (e.g., synonyms are likely to have similar though not identical contexts). A neural network builds an internal (abstract) representation of the input data in each internal layer. If the two output words have similar input contexts (namely, \(k_1\) and \(k_2\)), then the network is driven to learn similar internal representations for the output words \(w_1\) and \(w_2\). For more details, see [23].

After training Word2Vec on the lexicon, words with similar meanings are mapped to nearby positions in the vector space. For example, “powerful” and “strong” are close to each other, whereas “powerful” and “Paris” are farther away. The word vector differences are also meaningful. For example, word vectors can be used to answer analogy questions using simple vector algebra: “King” - “man” + “woman” \( \approx \) “Queen” [24].
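As an illustration, the following minimal sketch shows how such a model can be trained and queried with the gensim library; the toy corpus and hyperparameters are illustrative, not the eXDiL configuration.

```python
# Minimal Word2Vec sketch with gensim; corpus and hyperparameters
# are illustrative stand-ins for the real DiL data.
from gensim.models import Word2Vec

# Each sentence is a list of pre-processed tokens.
sentences = [
    ["lumbar", "pain", "patient", "osteoporosis"],
    ["low", "back", "pain", "osteoporosis", "treatment"],
    # ... the full corpus of tokenized DiLs would go here
]

# sg=0 selects the CBOW architecture described above.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Nearest neighbours in the embedding space.
print(model.wv.most_similar("pain", topn=3))

# Analogy queries via vector algebra, e.g. king - man + woman:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```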

As one might note, this approach represents a single word in the N-dimensional space, while our task also requires a vector representation of documents (i.e., the DiLs) rather than words alone. We therefore apply the Doc2Vec approach [15], an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and whole documents. As a consequence, a vector is now the N-dimensional representation of a document.
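A minimal sketch of this step, again with gensim and illustrative data:

```python
# Doc2Vec sketch with gensim: each DiL becomes one fixed-length vector.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

letters = [
    "lumbar pain in patient with osteoporosis",
    "vasomotor rhinitis with nasal obstruction",
]
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(letters)]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

# Vector for an unseen letter, usable for similarity search among DiLs.
vec = model.infer_vector("low back pain with osteoporosis".split())
print(model.dv.most_similar([vec], topn=1))
```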

In our framework, the use of word embeddings allows computing the similarities between DiLs, improving the effectiveness of the results.

2.2 XAI

There is a growing interest in the use of AI algorithms in many real-life scenarios and applications. However, many AI algorithms - as in the case of machine learning - rarely provide explanations or justifications that allow users to understand what the system really learnt, and this might affect the reliability of the algorithms’ outcomes when these are used for making decisions. In order to engender trust in AI, humans must understand what an AI system is trying to achieve, and which criteria guided its decision. A way to overcome this problem requires that the underlying AI process produce explanations that are transparent and comprehensible to the final user, so that she/he can consider the outcome generated by the system as believable and make decisions accordingly. Not surprisingly, an aspect that still plays a key role in machine learning is the quality of the data used for training the model. In essence, we may argue that the well-known principle “garbage-in, garbage-out" that characterizes the data quality research field also applies to machine learning, and AI in general, which is used to evaluate data quality on big data (see, e.g., [1, 4, 20, 21]) and to perform cleaning tasks as well (see, e.g., [6, 18, 19]).

Given the success and spread of AI systems, all these concerns are becoming quite relevant, enabling a wide branch of AI to emerge whose aim is making AI algorithms explainable in order to obtain improved trustability and transparency, known as Explainable AI (XAI). Though some research on explainable AI had already been published before DARPA launched its call for XAI in 2016 (see, e.g., [28, 31]), the XAI program [5, 8, 25] effectively encouraged a large number of researchers to take up this challenge. In the last couple of years, several publications have appeared that investigate how to explain the different areas of AI, such as machine learning [12, 30], robotics and autonomous systems [11], constraint reasoning [10], and AI planning [9], just to cite a few. Furthermore, as recently argued in [5], a key element of an AI system is the ability to explain its decisions, recommendations, predictions or actions, as well as the process through which they are made. Hence, explanation is closely related to the concept of interpretability: systems are interpretable if their operations can be understood by a human, either through introspection or through a produced explanation.

3 The eXDiL System

The “eXplainable Discharge Letters" (eXDiL) system intends to help the clinical staff identify the main ICD code related to each visit, in order to significantly lighten the human workload. It acts at the operational level of the hospital, supporting operators in assigning the correct ICD code to each visit (i.e., as an operational-level information system). We propose a two-step system for semi-real-time classification with a human-in-the-loop approach. A visual representation of the workflow can be seen in Fig. 4.

Fig. 1. A representation of the eXDiL workflow, highlighting the main modules.

In order to let the reader easily understand the workflow, we also present an operative example in this section. A video walkthrough of this example is available at https://youtu.be/u0UJnp4RyQQ.

Working Example. Let us consider the following DiL: “Reason for visit: Low back pain in patient with osteoporosis not under drug treatment. Diagnosis: lumbar pain in patient with osteoporosis, deformation of L1 with lowering of the superior endplate as a residual finding, anterograde slipping of L3-L4 and L5-S1 with reduction in amplitude of the interbody spaces.” The true ICD class of this example is Dorsalgia. Assigning the correct ICD class to the letter is crucial to allow the hospital to be reimbursed according to the health service actually provided. To this end, in the following we discuss how eXDiL works (Fig. 1).

3.1 Step 1: ICD Prediction

After the doctor finishes writing the letter, the system uses the text data to automatically classify the most common ICDs, so as to perform a first screening. If the classification is successful, the clinician can accept or reject the suggestion.

In the prototype, the doctor types a reason for the visit and a diagnosis in free-text format, or chooses an example from a predefined list. The eXDiL system then attempts to classify the input using the workflow described in Fig. 4.

Otherwise, if the reason for the visit is not found among the most common cases, the most relevant chapter is proposed, but the single ICD code is not provided. The doctor can use this suggestion to manually input the correct code.
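One possible realisation of this dispatch logic is a confidence threshold on the single-ICD classifier; the following sketch is purely hypothetical (the threshold, function and classifier names are illustrative, not part of eXDiL):

```python
# Hypothetical two-step dispatch: fall back to the chapter-level
# prediction when the single-ICD classifier is not confident enough.
def suggest(letter_text, single_clf, chapter_clf, threshold=0.8):
    probs = single_clf.predict_proba([letter_text])[0]
    if probs.max() >= threshold:
        # Step 1 succeeded: propose a single ICD code.
        return ("icd", single_clf.classes_[probs.argmax()])
    # Fallback: propose only the most relevant chapter.
    return ("chapter", chapter_clf.predict([letter_text])[0])
```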

3.2 Step 2: Local eXplanation

After the prediction, a visual explanation of the result is displayed in order to make the clinician aware of the main reasons why certain classes are assigned to certain visits. Moreover, this part is fundamental to establish a relationship of trust between the clinician and the algorithm.

Using the LIME [30] approach, we propose a first visualization that quantifies how much each term counts in favour of or against the aforementioned classification (see Fig. 2).

This visualization is accompanied by a second one, in which the terms in question are highlighted directly in the text, so as to easily identify their context of use and determine whether or not they are significant according to the judgment of the domain expert, who in this case is the doctor of reference.
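To make the mechanism concrete, the following sketch shows how such an explanation can be obtained with the lime package; the toy classifier, training texts and class names are illustrative stand-ins for the real eXDiL model.

```python
# LIME explanation sketch for a text classifier, following [30].
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy classifier standing in for the eXDiL single-ICD model.
texts = ["lumbar pain osteoporosis", "nasal obstruction rhinitis"]
labels = ["Dorsalgia", "Rhinitis"]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=sorted(set(labels)))
exp = explainer.explain_instance(
    "low back pain in patient with osteoporosis",
    pipeline.predict_proba,  # callable returning class probabilities
    num_features=10,         # top terms for/against the prediction
)
# Each pair is (term, weight): positive weights support the predicted
# class, negative weights count against it.
print(exp.as_list())
```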

Fig. 2. Example of explainable output from the eXDiL prototype.

3.3 Step 3: Word Similarity

In addition, a Word2Vec [22,23,24] model is used to suggest words similar to those already inserted, in order to make the clinician more aware of similar cases. A word cloud generated by the model is shown on screen to provide immediate cues: the size of each suggested word is proportional to its degree of similarity. The word cloud is accompanied by a more detailed outline of the similarity of the suggested words w.r.t. the starting one. An example of the visualization can be seen in Fig. 3.

Fig. 3. Example of most similar words from the eXDiL prototype.
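A minimal sketch of how this view can be produced, assuming the trained gensim Word2Vec model from above and the wordcloud package (the query word and output path are illustrative):

```python
# Word-similarity view: nearest neighbours of a query term rendered
# as a word cloud whose font sizes track the similarity scores.
# `model` is a trained gensim Word2Vec model (e.g. the sketch above).
from wordcloud import WordCloud

neighbours = dict(model.wv.most_similar("osteoporosi", topn=20))

cloud = WordCloud(background_color="white")
cloud.generate_from_frequencies(neighbours)  # similarity -> font size
cloud.to_file("similar_words.png")
```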

Fig. 4. Semi-real-time ML-assisted ICD classification workflow.

4 Experimental Results

4.1 Dataset Description and Characteristics

The dataset is composed of 168,925 individual visits, collected from 2011 to 2018 in an Italian hospital. Notice that the ICD taxonomy is built to include every disease, including eradicated ones. For this reason, and due to the massive number of existing classes (around 10,000), the classification task should concentrate only on the diseases that are most likely to be treated by a given hospital.

Following the example of [14], the codes are truncated at the three-character level; for example, code M54.1 (Radiculopathy) is converted to M54 (Dorsalgia). We also chose to restrict the dataset to the 5 most common specialities, described in Table 1. For each speciality, the two most common ICD classes were chosen for the single-ICD classifier, as shown in Table 2.
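The roll-up itself is a simple string truncation; a minimal sketch:

```python
# Rolling ICD-10 codes up to the three-character level, as in [14]:
# the category before the dot identifies the broader disease class.
def roll_up(icd_code: str) -> str:
    """e.g. 'M54.1' (Radiculopathy) -> 'M54' (Dorsalgia)."""
    return icd_code.split(".")[0][:3]

assert roll_up("M54.1") == "M54"
```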

Table 1. Count of selected ICD chapters.
Table 2. Count of most frequent ICD classes.

After filtering for only the five chosen specialities and applying the pre-processing pipeline, 82,439 letters remain (49% of the total data). For the single-ICD prediction task, the two most common ICD classes were chosen for each speciality, obtaining 30,086 letters (36% of the 5 specialities, 18% of the total data).

For each visit, the DiL describes many features of the patient, such as: reason for the visit, free-text doctor diagnosis, medical history, therapy and clinical tests to be carried out, specialization of the practitioner, and other information related to the DiL, such as clinical indications, allergies, and follow-up instructions.

Out of these features, the reason for the visit and the free-text doctor diagnosis were chosen to make a prediction, as they are the most informative for this task.

4.2 Classification Pipelines

Free-text data cannot be used as-is. A pre-processing step is required in order to clean the data, according to the following steps: (i) fix character encoding issues, (ii) remove punctuation, isolated numbers and stop words, (iii) lower casing, (iv) remove domain-specific common words, such as “visit" and “control". After pre-processing, we can use the text data to train the classifiers. The mean word count of each note is 15.6 words, with a high standard deviation of 28.7. The longest note in the dataset contains 1124 words. The dataset was then separated into two sets: the training set, with 67% of the data, and the test set, with 33% of the data.
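A minimal sketch of this pipeline, with illustrative stop-word and domain-word lists standing in for the real ones:

```python
# Sketch of the four pre-processing steps (i)-(iv); the word lists
# are illustrative subsets, not the ones actually used in eXDiL.
import re
import unicodedata

STOP_WORDS = {"il", "la", "di", "e", "in", "con"}   # (ii) Italian stop words
DOMAIN_WORDS = {"visita", "controllo"}              # (iv) "visit", "control"

def preprocess(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # (i) normalise odd encodings
    text = text.lower()                         # (iii) lower casing
    text = re.sub(r"[^\w\s]", " ", text)        # (ii) strip punctuation
    tokens = [t for t in text.split()
              if not t.isdigit()                # (ii) drop isolated numbers
              and t not in STOP_WORDS | DOMAIN_WORDS]
    return " ".join(tokens)

print(preprocess("Visita di controllo: lombalgia in paziente con osteoporosi."))
```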

In order to create a classification model, a text representation, a classifier and a set of hyperparameters must be chosen. We considered the following text representations: (i) Bag of Words (BoW), (ii) Tf-Idf with only unigrams, (iii) Tf-Idf with bigrams.
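These representations map directly onto standard scikit-learn vectorizers; a sketch (the exact n-gram settings are our assumption):

```python
# The three text representations as scikit-learn vectorizers.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

representations = {
    "BoW": CountVectorizer(),                               # (i) raw counts
    "TfIdf-unigrams": TfidfVectorizer(ngram_range=(1, 1)),  # (ii)
    "TfIdf-bigrams": TfidfVectorizer(ngram_range=(1, 2)),   # (iii)
}
```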

We also considered the Word2Vec Skipgram, Continuous Bag of Words (CBoW) and GloVe [29] text representations. However, these embeddings did not provide sufficient performance on the classification tasks. As such, they were not included in the results, and the embeddings were used exclusively for the word similarity task.

The following models and hyperparameter sets were considered:

  1. SVC with C \(\in \) {1, 0}

  2. Random Forest with number of estimators \(\in \) {10, 100}

  3. Naive Bayes with \(\alpha \in \) {0.1, 0.01, 0.001}

  4. Logistic regression with mode \(\in \) {multinomial, sag} \(\times \) C \(\in \) {1, 10}

A 5-fold cross-validated grid search over the text representations, models and hyperparameters was conducted in order to find the best models for the classification of single and chapter ICDs. Also, in order to test the hypothesis that the system will improve with additional data from human input, we started by training the models with only 50% of the available training data, then increased to 67%, and finally used all the training data. The test set was not changed between tests to ensure the consistency of the results.
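A sketch of one cell of this search, for a single model family (the other grids in the list above are handled analogously; variable names are illustrative):

```python
# Cross-validated grid search over representation and model settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vec", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],  # representation choice
    "clf__C": [1, 10],                     # regularisation strength
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_micro")
# search.fit(train_texts, train_labels)
# print(search.best_params_, search.best_score_)
```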

Similarly to [2], the micro-averaged F1 score was chosen as the main metric for model evaluation.
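For reference, the micro average pools the per-class true positives (TP), false positives (FP) and false negatives (FN) over all classes \(c\) before computing precision and recall:

\[ P_{\mu } = \frac{\sum _c TP_c}{\sum _c (TP_c + FP_c)}, \qquad R_{\mu } = \frac{\sum _c TP_c}{\sum _c (TP_c + FN_c)}, \qquad F1_{\mu } = \frac{2\, P_{\mu } R_{\mu }}{P_{\mu } + R_{\mu }}. \]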

4.3 Evaluation of Single ICD Model

Figure 5 shows the performance of the models on the test set. Each point in the violin plot represents the F1 score of a tuple (model type, text representation, hyperparameters). The y axis represents the F1 score, restricted to the range [0.7, 1.0], while the x axis represents the ICD class. For a complete list of the ICD classes, see Table 2. The average performance increases slightly when increasing the available training data. This suggests that increasing the training set size might lead to better performance; however, the impact of this increase might not be statistically significant.

Fig. 5. Visualization of single-ICD classification model performance.

The best model out of all the possible combinations is a logistic regression with mode = OVR and C = 1 on the BoW representation. This model reaches an average F1 score of 0.952. The highest percentage of errors is between the ICD classes “vasomotor and allergic rhinitis” and “asthma”. This might be induced by the semantic similarity between the two concepts, and also by the fact that a patient might be affected by both conditions at the same time.

4.4 Evaluation of Chapter ICD Model

At this coarser level of classification, the performance is higher than at the three-character level. The classes are semantically distant from one another, and this distance is reflected in the more accurate results of the classifiers. In Fig. 6 we show the results for each trained classifier. In order to properly show the differences between classes, we restricted the y axis to [0.8, 1.0]. In this case, the models’ performance neither decreases nor increases with different amounts of training data.

Fig. 6. Visualization of chapter-ICD classification model performance.

The best performing model is an SVC classifier with C = 1 on the Tf-Idf bigram text representation. This model reaches an average F1 score of 0.983.

5 Discussion and Conclusion

Considering the pros, the resulting classifiers achieve excellent performance at both the chapter and single-ICD levels. In particular, the chapter level can be trusted with high confidence (F1 micro = 0.983). Increasing the collected data seems to improve performance by a small amount; however, the increase does not appear to be significant. The system may therefore help the doctor save time and better classify the DiLs.

Evaluating the cons, a major issue is that this procedure cannot be applied to all ICD classes at once. It is advisable to first choose specific specialties, since without filtering, classifying most of the over 10,000 classes would be infeasible with our dataset. Therefore, the scope must be restricted to certain specialties, and, as shown in [14], the granularity should be set at the three-character level, as in our case it is not possible to distinguish accurately between the subtle differences at the four-character level.

In conclusion, we have shown that eXDiL is an accurate XAI system for classifying hospital discharge letters. Future work can be conducted to improve and assess the whole procedure and finally bring it to life in a real-world environment, providing a hopefully useful service. Firstly, the system has to be tested in the hospital to check its real-world usefulness. Secondly, it remains to be understood whether the “human in the loop" works as it has been conceived.

To date, eXDiL has been trained on Italian DiLs provided by an Italian hospital. We have been working on applying eXDiL to the well-known MIMIC-III benchmark [13], a widely used dataset. It comprises a larger amount of data, with more labels and variety; importantly, it is in English, meaning it would yield a classifier with a broader use case.

DEMO. A demonstration video of the system has also been provided.