An automated data verification approach for improving data quality in a clinical registry

https://doi.org/10.1016/j.cmpb.2019.01.012

Highlights

  • Proposed and implemented an automated data verification approach for registry data quality assessment and improvement.

  • Paper-based documents and electronic medical records are used to verify registry data automatically.

  • Machine learning enhanced optical character recognition is used to recognize paper-based documents more accurately.

  • The automated approach is more accurate and efficient than the traditional manual approach at identifying incomplete and incorrect data in a registry study.

Abstract

Background and Objective

The quality of data is crucial for clinical registry studies because it affects their credibility. In the regular practice of most such studies, a vulnerability arises when researchers record data on paper-based case report forms (CRFs) and then transcribe them into registry databases. To ensure data quality, the data in the registry must be verified. However, traditional manual data verification methods are time-consuming, labor-intensive and of limited effect. As paper-based CRFs and electronic medical records (EMRs) are two sources for verification, we propose an automated data verification approach based on optical character recognition (OCR) and information retrieval techniques to identify data errors in a registry more efficiently.

Methods

The automated verification approach is developed in three steps. First, we analyze the scanned images of paper-based CRFs with machine learning enhanced OCR to recognize checkbox marks and handwriting. Then, we retrieve the related patient information from the EMRs using natural language processing (NLP) techniques. Finally, we compare the information retrieved in the previous two steps with the data in the registry and synthesize the results accordingly. The proposed automated method has been applied in a Chinese registry study, and the differences between the automated and manual approaches have been evaluated.
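The third step, comparing the values recovered from the CRF and the EMR with the registry and synthesizing a result, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names, the status labels and the rule for combining the two sources are all hypothetical.

```python
from typing import Dict, Optional


def verify_field(registry: Optional[str], crf: Optional[str], emr: Optional[str]) -> str:
    """Classify one registry field against two reference sources (hypothetical rule)."""
    # Reference values that are actually available (None = not recognized or not found).
    references = [v for v in (crf, emr) if v is not None]
    if registry is None:
        # The registry is empty although a source holds a value: incomplete data.
        return "incomplete" if references else "unverifiable"
    if not references:
        return "unverifiable"
    if all(registry == v for v in references):
        return "consistent"
    if any(registry == v for v in references):
        # The two sources disagree with each other; leave the decision to a person.
        return "needs_review"
    return "incorrect"


def verify_record(
    registry: Dict[str, Optional[str]],
    crf: Dict[str, Optional[str]],
    emr: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Apply verify_field to every field present in the registry record."""
    return {
        name: verify_field(value, crf.get(name), emr.get(name))
        for name, value in registry.items()
    }


# Hypothetical record: field names and values are invented for illustration only.
registry_record = {"age": "72", "diabetes": "yes", "smoker": None, "stent_count": "2"}
crf_record = {"age": "27", "diabetes": "yes", "smoker": "no", "stent_count": "2"}
emr_record = {"age": "27", "diabetes": "yes", "stent_count": "3"}

report = verify_record(registry_record, crf_record, emr_record)
print(report)
```

In this sketch a field is flagged as incorrect only when every available source contradicts the registry; when the sources disagree with each other, the field is routed to manual review instead of being decided automatically.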

Results

The automated approach has been implemented in the Chinese Coronary Artery Disease Registry. For CRF data recognition, the accuracies of recognizing checkbox marks and handwriting are 0.93 and 0.74, respectively. For EMR data extraction, the accuracy of information retrieval from textual electronic medical records is 0.97. The accuracy, recall and time consumption of the automated approach are 0.93, 0.96 and 0.5 h, respectively, all better than the corresponding values of the manual approach, which are 0.92, 0.71 and 7.5 h.
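The abstract does not spell out how accuracy and recall are computed. Assuming the standard confusion-matrix definitions, where a "positive" is a registry value flagged as erroneous, the metrics could be calculated as in the sketch below. The counts are invented solely to make the arithmetic easy to follow and are not taken from the study; only the two time figures come from the text above.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all checked values that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """Share of the true data errors that the verification actually found."""
    return tp / (tp + fn)


# Hypothetical counts (NOT from the study): 100 checked values, 10 of them true errors.
tp, fp, tn, fn = 8, 2, 88, 2

example_accuracy = accuracy(tp, fp, tn, fn)  # (8 + 88) / 100
example_recall = recall(tp, fn)              # 8 / (8 + 2)

# Time figures reported in the abstract: 7.5 h manual versus 0.5 h automated.
manual_hours, automated_hours = 7.5, 0.5
speedup = manual_hours / automated_hours

print(example_accuracy, example_recall, speedup)
```

Under these definitions, recall is the figure that shows how many real errors a verifier misses, which is where the reported gap between the two approaches (0.96 versus 0.71) is largest.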

Conclusions

Compared to the manual data verification approach, the automated approach improves the recall of identifying data errors and has a higher accuracy, while consuming far less time. The results show that the automated approach is more effective and efficient for identifying incomplete and incorrect data in a registry. The proposed approach has the potential to improve the quality of registry data.

Introduction

A clinical registry is defined as a collection of information about individuals, usually for a specific public health purpose and focused around a specific diagnosis or condition [1]. It provides first-hand information about people with certain conditions and increases our understanding of that condition or disease [2]. The quality of data collected in registry studies is crucial, as it affects the credibility of the registry study, and it is emphasized by clinical researchers [3]. The International Conference on Harmonization (ICH) guideline E6 on Good Clinical Practice (GCP) requires that data in clinical trials be accurate, complete and verifiable [4] to ensure patient safety and data quality [5]. For that reason, researchers continually work to improve the quality of registry data.

Although electronic case report forms (eCRFs), which guarantee the quality of collected data to a certain degree, are used in some registry studies, paper-based CRFs are still used in many others because they better fit the busy clinical environment and the traditional working habits of clinicians, especially in developing countries such as China and India [6], [7]. Transcribing data from paper-based CRFs to eCRFs is error-prone. In this situation, although systematic audit mechanisms employed in eCRFs, such as range checks and data dictionaries, are effective in assessing data plausibility, they cannot guarantee consistency between registry data and the original data sources.
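The gap between plausibility and consistency can be illustrated with a small sketch. The data dictionary, field names and values below are hypothetical and are not taken from the registry; the point is only that a transposed value can pass a range check while still disagreeing with the paper form.

```python
# Hypothetical data dictionary: an allowed range or a set of allowed values per field.
DATA_DICTIONARY = {
    "age": {"type": "range", "min": 18, "max": 120},
    "sex": {"type": "choice", "values": {"male", "female"}},
}


def is_plausible(field: str, value) -> bool:
    """Return True if the value satisfies the field's range or choice rule."""
    rule = DATA_DICTIONARY[field]
    if rule["type"] == "range":
        return rule["min"] <= value <= rule["max"]
    return value in rule["values"]


paper_crf_age = 72  # value written on the paper form (invented)
registry_age = 27   # digits transposed during transcription (invented)

passes_range_check = is_plausible("age", registry_age)  # plausible, so no alert is raised
matches_source = registry_age == paper_crf_age          # but inconsistent with the source

print(passes_range_check, matches_source)
```

Only a comparison against the original source, which is what the verification approach in this paper automates, would reveal this kind of error.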

Some approaches have been proposed to remedy this problem, but none has proven ideal. Double data entry is one approach used in the data collection procedure. It consists of two steps, an initial entry step and a verification step, each performed by a separate data entry clinician. However, it is time-consuming, labor-intensive [8], [9] and ineffective for detecting recording errors. Another, controversial, approach is source data verification (SDV), a verification of the conformity of the data recorded in case report forms with source data [10]. In 1988, the Food & Drug Administration (FDA) recommended SDV as one of the most effective ways to ensure data quality in its Guideline for Monitoring of Clinical Investigations [11]. However, traditional SDV uses only the original paper-based CRFs as the supporting data source. This limits its findings to errors introduced when transcribing data from paper-based documents to eCRFs and neglects errors in the original data source itself. In addition, SDV takes a great deal of time because almost all of the verification work is manual.

Since the original data are recorded on paper-based CRFs, verifying registry data against them is usually done manually. However, optical character recognition (OCR) can turn paper-based information into digital data [12]. It would save a great amount of time and reduce human error if the paper-based CRFs could be verified automatically. In addition, data sources other than the original paper-based documents, such as EMR data [13], could be used as supporting sources to verify registry data. EMRs usually record information on the clinical diagnoses and treatments that take place in hospital. Depending on the goal of the study, the data from an EMR system may partially or completely overlap with the data from the registry study and could therefore be used as a reference for registry data verification. Data verification could be further improved if multiple data sources are involved.

In this paper, based on paper-based CRFs and EMR data sources, we propose an automated data verification approach using machine learning enhanced OCR and NLP techniques for improving data quality. The remainder of this paper is organized as follows. In the methods section, we describe the verification approach and evaluation experiment. In the results section, we illustrate the accuracy of the proposed approach and its efficiency. Then we discuss the strengths and limitations of this study.

Developing an automated verification approach

There are three steps in developing an automated verification approach in a registry study: recognition of paper-based CRFs, EMR data extraction, and implementation of the automated verification procedure. The framework of this approach is shown in Fig. 1.

Firstly, in part a of Fig. 1, the data recognition algorithm recognizes the scanned images of paper-based CRFs and stores the recognition results in a database for comparison in the next step. Meanwhile, in part b, the data extraction algorithm

Results of data preparation

Except for 2 Chinese text fields, we recognized 54 data fields (96%) in the OCR procedure. In the training procedure, the accuracy of checkbox recognition is 0.84 without machine learning and rises to 0.93 with machine learning enhancement. For hand-written numbers, the accuracy is 0.74.
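The text above does not say how checkboxes are classified without machine learning. One plausible non-learning baseline, shown here purely as an assumption, is to threshold the fraction of dark pixels inside the checkbox region. The 0.5 darkness cutoff, the 0.15 fill ratio and the toy 4x4 images are invented for illustration.

```python
from typing import List

Region = List[List[float]]  # grayscale pixels, 0.0 = black, 1.0 = white


def dark_pixel_ratio(region: Region, dark_below: float = 0.5) -> float:
    """Fraction of pixels in the region that are darker than `dark_below`."""
    pixels = [p for row in region for p in row]
    if not pixels:
        raise ValueError("empty region")
    return sum(1 for p in pixels if p < dark_below) / len(pixels)


def is_checked(region: Region, min_fill: float = 0.15) -> bool:
    """Rule-based guess: the box counts as ticked if enough of its area is dark."""
    return dark_pixel_ratio(region) >= min_fill


# Toy images (invented): an empty box and a box with a diagonal tick mark.
empty_box = [[1.0] * 4 for _ in range(4)]
ticked_box = [
    [1.0, 1.0, 1.0, 0.1],
    [1.0, 1.0, 0.2, 1.0],
    [0.1, 0.2, 1.0, 1.0],
    [1.0, 0.0, 1.0, 1.0],
]

print(is_checked(empty_box), is_checked(ticked_box))
```

A fixed threshold of this kind can be confused by stray marks, crossed-out ticks or faint strokes, which is the sort of case where a trained classifier might plausibly do better than a hand-set rule.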

For EMR data extraction, we analyzed the correspondence between registry data and EMR data and listed the results in Table 2 below. Only medication history information has a perfect

Discussion

Due to restrictions on applying an electronic data capture system and the traditional habit of recording data in paper documents, many clinical researchers still use paper-based CRFs to collect data and then transcribe them into an eCRF system. However, data transcription is vulnerable to missing and incorrect data. Traditional approaches to guaranteeing data quality are time-consuming, labor-intensive and of limited effect. There is an urgent need to develop an

Conclusions

In this study, an automated method of data verification using data from paper-based CRFs and EMRs was developed and applied to the Chinese Coronary Artery Disease Registry to reveal data quality problems and improve the quality of registry data. Compared to the manual data verification approach, the automated approach has a higher recall in identifying data errors and a higher accuracy. At the same time, the time consumed is far less. The results suggest that the automated approach is more effective

Conflict of interest

The authors do not have financial and personal relationships with other people or organizations that could inappropriately influence (bias) their work.

Acknowledgments

The authors would like to thank the clinical experts and clinicians participating in the evaluation experiment.

Funding

This work was supported by the National Key R&D Program of China [grant number 2016YFC1300300].

References (25)

  • Electronic Data Capture in Clinical Trials | Applied Clinical Trials n.d....
  • J.R. Andersen et al. Impact of source data verification on data quality in clinical trials: an empirical post hoc analysis of three phase 3 randomized clinical trials. Br. J. Clin. Pharmacol. (2015)