Keywords

1 Introduction

The developing methods of electronic health record (EHR) analysis is a relevant scientific problem because EHR contains critical information about the treatment process. In this regard, the automation of EHR texts processing could significantly expand and improve the diseases classification, the treatment duration, and cost prediction as well as prediction other important events. However, the difficulties of EHR analysis are associated with the semi-structuring, specific stylistic and grammatical features, abbreviations, clinician terms, etc. Pattern-based methods are often used to solving this problem [1]. These methods are characterized by high accuracy but they do not provide high flexibility and identification of various combinations of speech turns which limits their application. On the other hand, the machine learning (ML) based methods show high flexibility in the analysis of EHR and have a relatively good accuracy [2,3,4].

This research is demonstrating the capabilities of using ML-based methods of text mining to extend EHR analysis. The experimental study is focused on data extraction and classification of EHR of cardiovascular patients in Alamazov Centre (one of the leading cardiological centers in Russia). The proposed approach enables improvement of the accuracy of events and parameters predicting in the course of patient treatment.

2 Related Work

The text-mining methods are widely used in a various domains [5,6,7,8,9]. Also, there are a lot of applications in healthcare. For example, a task of extraction of associations of diseases and genes associated with them is considered in [10]. The proposed approach is based on the method of named object recognition (NOR) using a dictionary of names and synonyms. Also, to analyze text on associations the analysis of the EHR semantics was applied to identify the links between medical terms with disease and treatment. For example, semantic analysis based on the ML method to identify clinical terms is presented in [11]. There are known problems of determining the sequence of events (treatment, examination, operation) and identifying cause-and-effect relationships in semi-structured EHR in the healthcare domain. The semantic analysis allows defining the relationship between events and causes. For example, solving this problem is possible with the semantic analysis using name recognition, Bayesian statistics, and logical rules to identify the risk factors [12]. It is possible to extract cause-and-effect relationships from EHR with using text mining based on ML methods. For example, a random forest is used to classify the descriptions of allergic reactions to drugs. Such method provides sufficient accuracy (90%) in comparison with the pattern based method (82%) [13].

Considering healthcare in Russia, it worth to mention that the Russian language has own features. To analyze EHR in the Russian language, the pattern based methods [1] are used more often. The pattern based methods demonstrate good accuracy in EHR analysis, but it does not give flexibility comparable to ML methods.

3 EHR Analysis: Cardiovascular Patients

Within the study, the analysis of diagnoses performed to identify implicit connections between patients EHR that do not fit into the basic clinician categories. The analyzed EHR dataset contains information about 8865 cardiovascular patients, different age categories from 25 to 96 years and various outcomes of treatment (died or not), which is not reflected in the texts of diagnoses. The diagnoses were taken from electronic medical records of the patients of the Almazov Centre treated from 2010 to 2017. The general study group (n = 8865) included several categories of the patients:

  • Patients with the acute coronary syndrome (ACS), admitted to hospital urgently by Ambulance or by transfer from other hospitals (n = 4955).

  • Patients with a stable course of coronary heart disease, which is a routine performed angioplasty and stenting of coronary arteries (n = 2760).

  • Patients with heart rhythm disorders, which was performed by the arrhythmic interventions in emergency and planned order or cardioversion (n = 619).

  • Other patients with emergency cardiac conditions requiring treatment in intensive care units and surgery patients (n = 531).

The study analyzed cases of medical care, which accumulated a sufficient amount of text data (5500 patients) containing a detailed description of clinical diagnosis (underlying disease, associated conditions, complications, comorbidities with detailed characteristics and stages, the name and date of invasive interventions and surgical operations, ICD-10 codes). Analyzed EHRs include only manual entry of the text (without templates). The comparison of the selected group with the general database of cardiovascular patients by average indicators (gender, age, indicators, analyses, death rate, the average length of stay, etc.) do not reveal differences. It can be concluded that the selected cases are representative.

EHR preprocessing involves bringing the text to a more readable form because the diagnoses contain an abundance of abbreviations and various stylistic methods of writing medical texts. The preprocessing is to find and unify terms and acronyms. Regular expressions and other procedural methods collected in the Python script enabled unification of terms within the texts with replacement of various occurrence of the terms with a single one. The complete list contains more than 1000 regular expressions in 12 main groups: (1) abbreviation of diseases and synonyms; (2) ranks, stages of disease, functional classes; (3) removing unnecessary words; (4) general condition of the patient; (5) drugs; (6) locations and type of pain; (7) breathing and dyspnea; (8) edemas; (9) urogenital system; (10) blood pressure; (11) numerals; (12) removing special characters.

Further EHR processing is based on the methods of texts vectorization and bringing them to binary values and identify groups of the patients. This stage includes converting of texts to the matrix of token (keywords) counts using CountVectorizer from scikit-learnFootnote 1 library. Within the analysis 1060 keywords were used. Top keywords include: disease (6064); ischemic (5912); heart (5463); acute (3607); coronary (2477); heart attack (2301); myocardium (1739) heart (1502); angina (1265); hypertonic (1084).

The developed pre-processing routines were generalized to enable reusing of the script in various analysis scenarios.

On the next step ML methods were hired to get deeper understanding of patient variability. The reducing of dimensions of the binary vectors using the t-SNE [14] method from the scikit-learn package discovers the 10 clusters (Fig. 1). The configuration of the t-SNE method to obtain clusters of patients eliminates the random initialization of the model and the coefficient of perplexity 30 which gives reasonable distinction between clusters. t-SNE method gives clearer cluster boundaries even in 2d space, while, for example, principal component analysis (PCA) widely used for dimension reduction shows a weak distinction between clusters with 2 principal components. As a result, t-SNE was selected for further analysis and interpretation.

Fig. 1.
figure 1

The clusters of diagnoses descriptions.

From a practical point of view, interpretation of different clusters can have different meanings. Class of the patient can be considered as a predictor of an event (e.g., re-hospitalization). To get deeper insight on clusters interpretation it is needed to analyze the patient data and individual characteristics in depth. The obtained 10 classes of patients during the analysis of individual parameters of patients and episodes revealed various meaning of classes. For example, Class 5 and Class 10 differ in the duration of treatment comparing to other classes (Fig. 2a). Also, the correlation analysis showed correlations by class variable (C1–C10, Boolean) with individual characteristics of patients and episodes: age; ischemic chronic disease (IHD); essential hypertension (EH); angina (Ang); Q-wave infarction (QInf); treatment duration in intensive care (TIC); treatment duration in hospital (THosp); hemoglobin minimum (HGB); troponin maximum (Trop); glucose maximum (Gluc); leukocytes maximum (Lkc); thrombocytes maximum (PLT); Repetition (Rep); cholesterol (Ch) (Fig. 2b).

Fig. 2.
figure 2

EHR analysis (a) treatment duration of the patients in discovered classes; (b) correlation matrix of text classes with individual characteristics of the patients

E.g., there is notable correlation between C1 and duration of treatment in the hospital; C3 and presence of hypertension in patients; C4 and myocardial infarction, C5 and repeatability, duration of treatment in intensive care, angina; C8 and cholesterol, maximum leukocyte level, maximum troponin level, q heart attack, angina pectoris, and duration; C10 and leukocytes, angina pectoris, and duration.

It is possible to expand the idea that a combination of text characteristics associated with repeated hospitalizations has been found. E.g., repetition of hospitalization is an essential indicator of value-based medicine related to high cost of treatment.

In general, the diagnosis is based on individual symptoms, syndromes, laboratory and instrumental studies, as well as the results of assessing the dynamics of these manifestations. During the observation of the patient, a clinician forms an idea of the nature of the disease and formulates a diagnosis. The presence of the disease implies the probable presence of certain risk factors that contributed to its development. The role of well-known pathogenetic mechanisms in its occurrence and progression, characteristics of its course, the need for and a specific effect of therapy, the probable prognosis, is considered as a single logical structure of the patient with a particular disease. The classes obtained earlier were analyzed with using the clinical method associated with indication of most frequent tokens in EHR texts:

  • Class 1. (n = 592) Patients with ACS in the duration of hospital stay is slightly above average (14 K/d), in-hospital mortality of 1.8%. Tokens: acute, coronary.

  • Class 2. Lighter, sharper, more heavy routine, or massive intervention in stable patients. Hospitalization 10 days, mortality of 1.2%. Tokens: ischemic, chronic, rheumatic.

  • Class 3. Mostly scheduled patients with significant heart failure complicating various cardiac diseases, duration of admission is above average and relatively high mortality (2.7%). Tokens: atherosclerosis, mitral.

  • Class 4. Patients with the most severe myocardial infarction, including transmural (with Q), with prolonged hospitalization (15/d), relatively high mortality of 3.1%. Tokens: ischemic, acute, infarction.

  • Class 5. These are planned subsequent stages of revascularization in ACS or planned interventions (including stenting, reconstructive vascular surgery) with a favorable course, but requiring a subsequent step of early hospital rehabilitation, almost no mortality, the duration of treatment is small (mortality 0.4%, bed-day 7). Tokens: ischemic, hypertension, angina.

  • Class 6. Planned stenting or patients with unstable angina, mortality is almost there, stay short light patients. Tokens: ischemic, unstable.

  • Class 7. A variety of emergency patients with stay above average, but almost without lethality and severe emergency pathology with a favorable prognosis. Tokens: coronary, acute, syndrome.

  • Class 8. Young patients with severe transmural heart attacks, with pronounced risk factors, even longer hospitalization than in 4 Class, but with a more favorable prognosis. Tokens: infarction, myocardium, acute.

  • Class 9. Patients with ACS, hospitalized at a later stage of the disease. Tokens: coronary, ischemic, acute.

  • Class 10. The most stable planned patients. Tokens: heart, failure, cardiac.

Based on this, one can draw conclusions about the impact of certain words in the diagnosis on the length of stay and the probability of death (for example, ischemic, action, infarction to increase the mortality rate). In the same time, this classification does not correspond to the nosological forms or type of operation or category of patients. It reflects the severity of the patient’s condition, the seriousness of intervention, unique features, and overall treatment strategy. Also, it may coincide in patients with various diseases and vary in patients with the same diseases (depending on the characteristics of the patient and the course of the disease). The classification does not coincide with the principles of construction of medical nosological classifications, closer to the prognostic scales, but more fully and multifaceted (with characteristics’ combinations). From the cost analysis or predict future outcomes should be more accurate than the tradition nosological classification and the scale. Clinical statistical groups (flexible classes) must form not on nosology but on clusters with cost accounting.

4 Discussion and Conclusion

Medical processes have many aspects to be considered. For example, the diagnosis of anamnesis and patient characteristics may be different from pathway or event classification. So one need to understand how the different classifications may be compared from the point of view of medical knowledge. For this purpose, it is necessary to have accurate predictive models. From medical point of view, this approach does not allow to plan the patient’s treatment strategy for the long term, but enables assessing the characteristics of the episode of care and outcomes in the short term. It is relevant for a large medical center where a lot of patients receiving different treatment, which may have a similar set of random and unrelated characteristics. The interpretation of these combinations to a certain extent conditional, in fact, is the search for matches between clusters and groups selected by the clinical approach.

Using ML methods for data processing enables selection of combinations of features to be divided into classes outside the logical framework of the clinical presentation of the patient. It needs extraordinary efforts to integrate this information into the structure of the representation of the patient, with similar interpretation, some non-obvious characteristics of the groups identified in the analysis may be lost. The logical structure of the clinical presentation of the patient is the basis for the formation of treatment strategies and clinical decision-making in the conditions of data shortage.

The current study is still in progress. It includes development of a generalized toolbox for classifying, processing, and analyzing of EHR text data. Based on the early results of the analysis of texts, it can be concluded that the proposed approach to clustering of text data is useful. Future work includes several directions of development. For example, having information about which group a patient belongs to, it may be possible to efficiently solve the problem of prognosis (required tests, duration or type of disease), as well as to improve the model of prediction of treatment processes.