An automated data verification approach for improving data quality in a clinical registry
Introduction
A clinical registry is defined as a collection of information about individuals, usually collected for a specific public health purpose and focused on a specific diagnosis or condition [1]. It provides first-hand information about people with certain conditions and increases our understanding of those conditions [2]. The quality of the data collected in a registry study is crucial because it affects the credibility of the study, and it is therefore emphasized by clinical researchers [3]. The International Conference on Harmonization (ICH) guideline E6 on Good Clinical Practice (GCP) requires that data in clinical trials be accurate, complete and verifiable [4] to ensure patient safety and data quality [5]. For that reason, researchers are continually working to improve the quality of registry data.
Although electronic case report forms (eCRFs), which guarantee the quality of collected data to a certain degree, are used in some registry studies, paper-based CRFs remain common because they better fit the busy clinical environment and the traditional working habits of clinicians, especially in developing countries such as China and India [6], [7]. The transcription of data from paper-based CRFs to eCRFs is error-prone. Moreover, while systematic audit processes employed in eCRFs, such as range checks and data dictionaries, are effective at assessing data plausibility, they cannot guarantee consistency between registry data and the original data sources.
Several approaches have been proposed to remedy this problem, but none has proven ideal. Double data entry is one such approach in the data collection procedure. It consists of two steps, an initial entry step and a verification step, each performed by a separate data entry clinician. However, it is time-consuming, labor-intensive [8], [9] and ineffective at detecting recording errors. Another, more controversial, approach is source data verification (SDV), a check of the conformity of the data recorded in case report forms with the source data [10]. In 1988, the Food and Drug Administration (FDA) recommended SDV as one of the most effective ways to ensure data quality in its Guideline for Monitoring of Clinical Investigations [11]. However, traditional SDV uses only the original paper-based CRFs as the supporting data source; this constrains the findings of verification to the transcription of data from paper documents to eCRFs and neglects errors in the original data source itself. Moreover, SDV is very time-consuming because almost all of the verification work is manual.
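The double-entry verification step described above amounts to a field-by-field comparison of two independent entries of the same form. A minimal sketch of that comparison (the field names and records below are hypothetical, not from the study):

```python
def compare_entries(entry1: dict, entry2: dict) -> dict:
    """Return the fields whose values differ between two independent data entries."""
    discrepancies = {}
    for field in entry1.keys() | entry2.keys():
        v1, v2 = entry1.get(field), entry2.get(field)
        if v1 != v2:
            discrepancies[field] = (v1, v2)
    return discrepancies

first_pass = {"age": "63", "systolic_bp": "135", "diagnosis": "CAD"}
second_pass = {"age": "63", "systolic_bp": "153", "diagnosis": "CAD"}
print(compare_entries(first_pass, second_pass))  # {'systolic_bp': ('135', '153')}
```

Note that such a comparison catches only transcription disagreements between the two clerks; an error already present on the paper form passes through undetected, which is exactly the limitation noted above.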
Since the original data are recorded on paper-based CRFs, verifying registry data against them has traditionally been done manually. However, Optical Character Recognition (OCR) can turn paper-based information into digital data [12]. Verifying paper-based CRFs automatically would save a great amount of time and reduce human error. In addition, data sources other than the original paper documents, such as data from electronic medical records (EMRs), could serve as supporting sources for verifying registry data [13]. EMRs usually record the clinical diagnoses and treatments delivered in hospital. Depending on the goal of the study, the data in an EMR system may partially or completely overlap with the registry data and can therefore be used as a reference for registry data verification. Data verification could be further improved when multiple data sources are involved.
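Because OCR output may contain minor character-level errors, cross-checking a recognized CRF value against the corresponding EMR field is tolerant matching rather than exact matching. A hedged sketch using the Python standard library's `difflib` (the similarity threshold and example values are illustrative assumptions, not the study's actual rules):

```python
from difflib import SequenceMatcher

def values_agree(ocr_value: str, emr_value: str, threshold: float = 0.9) -> bool:
    """Treat two field values as consistent if their similarity ratio clears a threshold."""
    a, b = ocr_value.strip().lower(), emr_value.strip().lower()
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Minor spacing/case differences from OCR should not raise a query.
print(values_agree("Aspirin 100mg", "aspirin 100 mg"))  # True
```

In practice the threshold would be tuned per field type: numeric fields usually demand exact agreement, while free-text fields tolerate small edit distances.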
In this paper, based on paper-based CRF and EMR data sources, we propose an automated data verification approach that uses machine-learning-enhanced OCR and natural language processing (NLP) techniques to improve data quality. The remainder of this paper is organized as follows. In the Methods section, we describe the verification approach and the evaluation experiment. In the Results section, we report the accuracy and efficiency of the proposed approach. We then discuss the strengths and limitations of this study.
Section snippets
Developing an automated verification approach
Developing an automated verification approach in a registry study involves three steps: recognition of paper-based CRFs, EMR data extraction, and implementation of the automated verification procedure. The framework of this approach is shown in Fig. 1.
Firstly, in part a of Fig. 1, the data recognition algorithm recognizes the scanned images of paper-based CRFs and stores the recognition results in a database for comparison in the next step. Meanwhile, in part b, the data extraction algorithm…
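The comparison step of the framework — checking recognized CRF values against extracted EMR values and flagging mismatches for manual review — can be sketched as follows (the field names, normalization rule, and example records are hypothetical, not the study's actual implementation):

```python
def verify(crf_record: dict, emr_record: dict) -> list:
    """Compare OCR-recognized CRF fields against EMR fields present in both sources.

    Returns a list of (field, crf_value, emr_value) tuples that need manual review.
    """
    queries = []
    shared_fields = crf_record.keys() & emr_record.keys()
    for field in sorted(shared_fields):
        crf_value = str(crf_record[field]).strip().lower()
        emr_value = str(emr_record[field]).strip().lower()
        if crf_value != emr_value:
            queries.append((field, crf_record[field], emr_record[field]))
    return queries

crf = {"age": "63", "ldl": "2.8", "medication": "aspirin"}
emr = {"age": "63", "ldl": "3.1", "medication": "aspirin"}
print(verify(crf, emr))  # [('ldl', '2.8', '3.1')]
```

Fields absent from the EMR are simply skipped here; in a real pipeline they would fall back to verification against the scanned CRF alone.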
Results of data preparation
Except for 2 Chinese free-text fields, we recognized 54 data fields (96%) in the OCR procedure. In the training procedure, the accuracy of checkbox recognition was 0.84 without the machine learning technique and rose to 0.93 when enhanced by machine learning. For handwritten numbers, the accuracy was 0.74.
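Checkbox recognition of the kind evaluated above often starts from a simple fill-ratio baseline — count the dark pixels inside the box region and apply a threshold — with a learned classifier layered on top to handle stray marks and scan noise. A toy sketch of the baseline rule only (the threshold value and binary-image representation are illustrative assumptions):

```python
def checkbox_checked(box_pixels: list, threshold: float = 0.2) -> bool:
    """Classify a checkbox as checked if the fraction of dark pixels exceeds a threshold.

    box_pixels: 2D list of 0 (white) / 1 (dark) values cropped to the box interior.
    """
    total = sum(len(row) for row in box_pixels)
    dark = sum(sum(row) for row in box_pixels)
    return dark / total > threshold

empty_box = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ticked_box = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(checkbox_checked(empty_box), checkbox_checked(ticked_box))  # False True
```

A fixed threshold is exactly what a trained classifier improves on, which is consistent with the accuracy gain from 0.84 to 0.93 reported above when machine learning was added.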
For EMR data extraction, we analyzed the correspondence between registry data and EMR data and list the results in Table 2 below. Only medication history information has a perfect…
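The correspondence analysis above can be represented as a simple field map from registry fields to their EMR counterparts, from which EMR coverage follows directly (the field names below are hypothetical examples, not the study's actual data dictionary):

```python
# Hypothetical registry-to-EMR field map; None marks fields with no EMR counterpart.
FIELD_MAP = {
    "medication_history": "emr_medication_orders",
    "discharge_diagnosis": "emr_discharge_diagnosis",
    "smoking_status": None,
    "family_history": None,
}

def emr_coverage(field_map: dict) -> float:
    """Fraction of registry fields that have a corresponding EMR field."""
    mapped = sum(1 for target in field_map.values() if target is not None)
    return mapped / len(field_map)

print(emr_coverage(FIELD_MAP))  # 0.5
```

Fields without an EMR counterpart can only be verified against the scanned CRF, so coverage of the field map bounds how much of the registry the EMR source can cross-check.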
Discussion
Owing to the restrictions on applying an electronic data capture system and the traditional habit of recording data in paper documents, many clinical researchers still use paper-based CRFs to collect data and then transcribe them into an eCRF system. However, data transcription is vulnerable to missing and incorrect data. Traditional approaches to guaranteeing data quality are time-consuming, labor-intensive and of limited effect. There is an urgent need to develop an…
Conclusions
In this study, an automated data verification method using data from paper-based CRFs and EMRs was developed and applied to the Chinese Coronary Artery Diseases Registry to reveal data quality problems and improve the quality of registry data. Compared with the manual data verification approach, the automated approach achieved higher recall of identified data errors and higher accuracy, while consuming far less time. The results suggest that the automated approach is more effective…
Conflict of interest
The authors do not have financial or personal relationships with other people or organizations that could inappropriately influence (bias) their work.
Acknowledgments
The authors would like to thank the clinical experts and clinicians participating in the evaluation experiment.
Funding
This work was supported by the National Key R&D Program of China [grant number 2016YFC1300300].
References (25)
- et al., A quantifiable alternative to double data entry, Control Clin. Trials (2000)
- et al., Double data entry: what value, what price?, Control Clin. Trials (1998)
- et al., A novel drop-fall algorithm based on digital features for touching digit segmentation (2016)
- et al., Extracting important information from Chinese Operation Notes with natural language processing methods, J. Biomed. Inform. (2014)
- Natl Inst Health NIH
- et al., Evaluation and implementation of public health registries, Public Health Rep. (1991)
- et al., Quality assurance and quality control in longitudinal studies, Epidemiol. Rev. (1998)
- E6: Note for Guidance on Good Clinical Practice (2002)
- et al., The value of source data verification in a cancer clinical trial, PLoS ONE (2012)
- Paperless clinical trials: myth or reality, Indian J. Pharmacol. (2015)
- Impact of source data verification on data quality in clinical trials: an empirical post hoc analysis of three phase 3 randomized clinical trials, Br. J. Clin. Pharmacol.