Extracting a statistical data matrix from electronic patient records

https://doi.org/10.1016/S0169-2607(00)00130-9

Abstract

This paper describes the processing and transformation of medical data from a clinical database into a statistical data matrix. Precise extraction and linking tools must be available if the desired data are to be processed for statistical purposes. We show that flexible mechanisms are required for the different types of users, such as physicians and statisticians. Our retrieval tools use logical queries based on operands and operators. The paper describes the method and the application of the operators, with which the desired matrix is created through a process of selection and linking. Examples with a Kaplan–Meier function and time-dependent covariables demonstrate how our model is useful for different user groups.

Introduction

Electronic patient records have made it possible to collect clinical data as a routine operation. Data collected in this way can be used for research, but the characteristics of these data have to be taken into account [1], [2]. We recently described the retrieval system ArchiMed [3], which permits joint evaluation of study and routine data. In this paper we explain the query method using operands and operators and discuss the functions required to obtain the desired statistical matrix.

Many researchers and decision-makers are interested in the statistical analysis of clinical data. On the one hand, some work with macro-data, i.e. data that has been transformed into statistical tables, usually summarised in statistical databases. Techniques for interaction with tables are normally required for the analysis of this aggregated data [4].

Our work, on the other hand, focuses on evaluation with direct access to micro-data. This data is in the form of elementary clinical events. We analyse the steps for statistical procedures starting from patient records to form tables (matrices).

A medical retrieval system has to provide correct access to the data necessary to perform the desired statistical analysis. The analysis of data from patient records can serve a variety of purposes. It can be considered from a clinical perspective, from a ward perspective (e.g. for process control) or from an aggregate perspective (e.g. for quality control) [5].

It is essential that the clinical analysis system is accessible to different user groups. The system should be easy to use and should allow clinical researchers to carry out simple statistical analyses by themselves. In addition, the export of patient data in the form of a data matrix for use in statistical software packages should also be supported. Different ‘user cubes’ may be defined as a function of clinical experience, statistics, application and frequency of use of a system [4].

The tools for data extraction and analysis for clinical workers must be integrated within a single environment [6]. All functions, formulated extractions, links and applied statistical analysis must be capable of repeated use. Because of the heterogeneous nature of clinical IT structures, the application should be able to run on different platforms.

It should be possible to jointly analyse study variables and those derived from daily routine. It should also be possible to analyse patient records and variables from different information systems, e.g. mass data such as laboratory readings from laboratory information systems. Proposals for standardising patient records exist but the information obtained is formulated as complex objects [7]. For statistical procedures the data usually must be in matrix form.

There are very few clinical forms in use with well structured table format (patients versus variables). Most study forms have more complex structures. A typical example is the use of master and follow-up forms.
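The gap between a master/follow-up structure and a patients-versus-variables table can be illustrated with a minimal Python sketch. The form layouts, field names and values below are hypothetical and are not taken from ArchiMed; the sketch only shows two plausible ways of flattening such forms into lines of a matrix.

```python
# Hypothetical master form: one record per patient.
master = {
    1: {"sex": "f", "transplant_year": 1997},
    2: {"sex": "m", "transplant_year": 1998},
}

# Hypothetical follow-up form: any number of records per patient.
followups = [
    {"patient_id": 1, "visit": 1, "creatinine": 1.4},
    {"patient_id": 1, "visit": 2, "creatinine": 1.9},
    {"patient_id": 2, "visit": 1, "creatinine": 1.1},
]


def long_matrix(master, followups):
    """One line per follow-up record, with the master values repeated."""
    return [
        {
            "patient_id": f["patient_id"],
            **master[f["patient_id"]],
            "visit": f["visit"],
            "creatinine": f["creatinine"],
        }
        for f in followups
    ]


def wide_matrix(master, followups, variable):
    """One line per patient, with one column per visit for `variable`."""
    visits = sorted({f["visit"] for f in followups})
    rows = []
    for pid, m in sorted(master.items()):
        row = {"patient_id": pid, **m}
        for v in visits:
            values = [
                f[variable]
                for f in followups
                if f["patient_id"] == pid and f["visit"] == v
            ]
            # A missing follow-up becomes an explicit missing value.
            row[f"{variable}_{v}"] = values[0] if values else None
        rows.append(row)
    return rows


long_rows = long_matrix(master, followups)
wide_rows = wide_matrix(master, followups, "creatinine")
```

Which of the two layouts is appropriate depends on the statistical procedure: the long layout suits repeated-measures analyses, while the wide layout yields exactly one line per patient.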

Precise tools are required to obtain the desired values in the data matrix. Programs for the preparation of statistical matrices have existed for a long time. The main procedures used from the outset were group definition, selection, and synchronisation of the individual courses of disease [8].

We regard powerful time functions as an essential requirement. Time functions that take the course of disease into account, such as ‘all patients with kidney transplants and rejection within 3 months’, are frequently used to form patient cohorts. Time relations such as ‘closest weight in time’ are required to avoid undesired line combinations when selecting values and linking variables to form lines. We have therefore concentrated considerable energy on the development of time functions in queries.
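The two kinds of time function mentioned here can be sketched in a few lines of Python. This is an illustration only: the event tuples, the function names `followed_within` and `closest_in_time`, and the approximation of ‘3 months’ as 90 days are assumptions, not the operators of the system described in this paper.

```python
from datetime import date, timedelta

# Hypothetical elementary events: (patient_id, event_type, date, value).
events = [
    (1, "transplant", date(1998, 1, 10), None),
    (1, "rejection", date(1998, 2, 20), None),
    (1, "weight", date(1998, 1, 5), 72.0),
    (1, "weight", date(1998, 3, 1), 70.5),
    (2, "transplant", date(1998, 4, 2), None),
    (2, "rejection", date(1998, 9, 15), None),
    (2, "weight", date(1998, 4, 1), 81.0),
]


def events_of(patient_id, event_type):
    """All events of one type for one patient."""
    return [e for e in events if e[0] == patient_id and e[1] == event_type]


def followed_within(patient_id, first_type, second_type, max_days):
    """True if a second_type event lies 0..max_days after a first_type event."""
    for _, _, d1, _ in events_of(patient_id, first_type):
        for _, _, d2, _ in events_of(patient_id, second_type):
            if timedelta(0) <= d2 - d1 <= timedelta(days=max_days):
                return True
    return False


def closest_in_time(patient_id, event_type, reference_date):
    """The event of the given type nearest to reference_date, or None."""
    candidates = events_of(patient_id, event_type)
    if not candidates:
        return None
    return min(candidates, key=lambda e: abs(e[2] - reference_date))


# Cohort: transplant followed by rejection within 90 days (assumed '3 months').
patient_ids = sorted({e[0] for e in events})
cohort = [
    p for p in patient_ids
    if followed_within(p, "transplant", "rejection", 90)
]

# One line per cohort patient: the weight closest to the transplant date.
matrix = []
for p in cohort:
    transplant_date = events_of(p, "transplant")[0][2]
    weight = closest_in_time(p, "weight", transplant_date)
    matrix.append((p, transplant_date, weight[3] if weight else None))
```

Selecting a single closest value per patient is what prevents the undesired line combinations: without it, linking two weights to one transplant would produce two lines for the same patient.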

In Section 2 we show that complex operations are required even with simple data constellations if the desired lines are to be obtained. In Section 3 we present our method of operands and operators for the generation of queries. In Section 4 we describe the functions required for the various data manipulations. In Section 5 we demonstrate our approach on the basis of kidney transplant patients. First we draw a Kaplan–Meier plot and then we examine the effect of time-dependent co-variables.

Background: joining variables from patient records

In normal hospital practice patient data flow continuously into clinical databases. They are created through ongoing documentation of routine data from hospital information systems, through automatic data transfer from electronic analysis equipment (e.g. laboratory information systems) and also through controlled scientific studies. Data from clinical studies ought to be capable of processing together with routine data, with account taken, of course, of confidentiality and data protection

Query method

A specific query method has been implemented in the ArchiMed system, which consists of documentation and analysis modules [9]. It has been installed at university hospitals in Vienna (1997) and Graz (1998).

Evaluation steps

Before data from large clinical data records can be statistically analysed several steps are usually required. In [3] we described three main steps when evaluating clinical data: ‘cohort formation’, ‘selection of variables’ and ‘execution of statistical procedures’. The query method described in Section 3 is used for the first two steps.

With cohort formation the patients or documents meeting defined criteria (e.g. all patients with a certain operation and complications as a result) can be

Examples

In the following two examples we show typical steps for extracting data from a patient record to fit a specific statistical procedure. Depending on the documentation form, different operations are necessary to obtain the desired matrix. In the first example we use a simple data constellation and a predefined statistical function. In the second example a more complex data situation and statistical procedure are examined.

We will consider kidney transplant patients to demonstrate our approach.
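To make concrete what a Kaplan–Meier function expects from the data matrix, the following is a minimal sketch of the standard product-limit estimator in plain Python. It is not the predefined function of the system described here, and the follow-up times and event flags are invented for illustration. The matrix needs one line per patient with two columns: the observed time and a flag stating whether the event (e.g. graft failure) occurred or the observation was censored.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate: list of (time, survival) at each event time.

    durations -- follow-up time per patient (e.g. months since transplant)
    observed  -- True if the event occurred, False if the patient was censored
    """
    event_times = sorted({t for t, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events that happen exactly at time t.
        events_at_t = sum(
            1 for d, o in zip(durations, observed) if d == t and o
        )
        survival *= 1 - events_at_t / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data matrix columns for five patients.
durations = [5, 8, 8, 12, 15]
observed = [True, False, True, True, False]

curve = kaplan_meier(durations, observed)
```

With these five invented patients the estimate drops at months 5, 8 and 12; the censored patients contribute to the number at risk but never cause a drop themselves.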

Experience

We have been gathering experience with ArchiMed for 3 years as a system for analysing study and routine data together. The program generates SQL statements from logical conditions. The method of operators and operands has been in use even longer in WAREL [15]. In that system the operators are still provided by programs (PL/1). This has the advantage that functions that are not possible in SQL (e.g. aggregate functions such as ‘monotonically falling’) can be implemented.
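The idea of generating SQL statements from logical conditions built out of operands and operators can be sketched as follows. This is a hedged illustration, not the ArchiMed generator: the `events` table, its columns, the classes `Operand` and `Operator`, and the mapping of AND/OR to INTERSECT/UNION on patient sets are all assumptions made for the example. SQLite from the Python standard library stands in for the clinical database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Operand:
    """Hypothetical elementary condition on one variable."""
    variable: str
    comparison: str
    value: object


@dataclass
class Operator:
    """Hypothetical logical operator linking two sub-conditions."""
    name: str  # "AND" or "OR"
    left: object
    right: object


COMPARISONS = {"=", "<>", "<", "<=", ">", ">="}
SET_OPERATORS = {"AND": "INTERSECT", "OR": "UNION"}


def to_sql(node):
    """Translate a condition tree into (sql, parameters) selecting patient ids."""
    if isinstance(node, Operand):
        if node.comparison not in COMPARISONS:
            raise ValueError(f"unsupported comparison: {node.comparison}")
        sql = (
            "SELECT patient_id FROM events "
            f"WHERE variable = ? AND value {node.comparison} ?"
        )
        return sql, [node.variable, node.value]
    left_sql, left_params = to_sql(node.left)
    right_sql, right_params = to_sql(node.right)
    sql = (
        f"SELECT patient_id FROM ({left_sql}) "
        f"{SET_OPERATORS[node.name]} "
        f"SELECT patient_id FROM ({right_sql})"
    )
    return sql, left_params + right_params


# Assumed schema and invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (patient_id INTEGER, variable TEXT, value)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "operation", "kidney transplant"),
        (1, "creatinine", 2.1),
        (2, "operation", "kidney transplant"),
        (2, "creatinine", 1.1),
        (3, "creatinine", 2.5),
    ],
)

query = Operator(
    "AND",
    Operand("operation", "=", "kidney transplant"),
    Operand("creatinine", ">", 1.5),
)
sql, params = to_sql(query)
patients = sorted(row[0] for row in conn.execute(sql, params))
```

Values are passed as bound parameters rather than pasted into the statement, and the comparison symbol is checked against a fixed set, so a user-formulated condition cannot inject arbitrary SQL.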

In our experience users

Acknowledgements

The authors thank Wolfgang Dorda for his valuable suggestions.
