Efficient learning from big data for cancer risk modeling: A case study with melanoma

https://doi.org/10.1016/j.compbiomed.2019.04.039

Abstract

Background

Building cancer risk models from real-world data requires overcoming challenges in data preprocessing, efficient representation, and computational performance. We present a case study of a cloud-based approach to learning from de-identified electronic health record data and demonstrate its effectiveness for melanoma risk prediction.

Methods

We used a hybrid distributed and non-distributed approach to computing in the cloud: distributed processing with Apache Spark for data preprocessing and labeling, and non-distributed processing for machine learning model training with scikit-learn. Moreover, we explored the effects of sampling the training dataset to improve computational performance. Risk factors were evaluated using regression weights as well as tree SHAP values.

Results

Among 4,061,172 patients who did not have melanoma through the 2016 calendar year, 10,129 were diagnosed with melanoma within one year. A gradient-boosted classifier achieved the best predictive performance with cross-validation (AUC = 0.799, Sensitivity = 0.753, Specificity = 0.688). Compared to a model built on the original data, a dataset two orders of magnitude smaller could achieve statistically similar or better performance with less than 1% of the training time and cost.

Conclusions

We produced a model that can effectively predict melanoma risk for a diverse dermatology population in the U.S. by using hybrid computing infrastructure and data sampling. For this de-identified clinical dataset, sampling approaches significantly shortened the time for model building while retaining predictive accuracy, allowing for more rapid machine learning model experimentation on familiar computing machinery. A large number of risk factors (>300) were required to produce the best model.

Introduction

In 2018, there were an estimated 1,735,350 new cases of cancer in the U.S. [1]. Nevertheless, it is economically infeasible to screen over 320 million people in the U.S. for all types of cancer, and some cancers do not have a screening test or have not shown any improvements in detection from such a test. In this context, cancer risk models could provide immense value for informing screening guidelines by facilitating the identification and close follow-up of high-risk patients. In addition, they would enable more shared decision making between physicians and patients by providing evidence-based estimates of disease risk and prognosis [2].

A literature search retrieved several reports of predictive models for cancer risk [3]. We identified several shortcomings:

  • Availability of structured clinical data: Structured data points regarding patient history and encounters are limited. Many data-capture systems record free-text notes that are difficult to standardize across several patient charts. Data sharing among healthcare providers is lacking, limiting holistic views of patient history.
  • Old data: Most studies were published five or more years after the end of the study period. This results in stale models that might not reflect the current state of diagnosis and treatment.
  • Advanced modeling methods: Researchers often only use one or two familiar algorithms, possibly because of a lack of experience with various tools or limitations in computing power.

In their review, Usher-Smith et al. found several case–control studies that compared melanoma populations to non-melanoma populations and built discriminators between them [4]. As with many cancers, screening for melanoma is important given the poor survival rate of late-stage patients (five-year survival rate: 20% in patients with distant metastases vs. >99% in early-stage patients). Moreover, the incidence of melanoma is growing, with 96,000 new cases expected in 2019 [5]. Thus, a risk prediction model could flag high-risk patients so that they can be enrolled in screening programs to detect melanoma early. In addition to learning from tabular data, there have been recent advancements in computer vision models to detect cancerous lesions from skin images [6,7].

Electronic health record (EHR) systems have been rapidly adopted in the U.S. over the last decade, largely because of requirements introduced by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 [8]. In addition, the 21st Century Cures Act provided funding for increasing interoperability among EHR systems [9]. While there are structured coding vocabularies for some clinical information, many systems collect clinical narrative and procedure notes through dictation or free typing. Hence, the complexity of clinical information and differences among EHR systems make it difficult to build natural language processing systems that synthesize structured data across multiple practices or hospitals [10].

If a large quantity of consistent patient data can be collected for a predictive model, computational challenges arise when transforming the data and training a machine learning algorithm. First, data elements must be extracted from the EHR system and transformed into a tabular format to be passed to a machine learning model. The size of the dataset and complexity of the machine learning algorithm can subsequently introduce computational challenges. Cloud computing, in which infrastructure is accessed through the internet, enables users to launch machines of varying size with prebuilt libraries for machine learning algorithms. This technology can be utilized to evaluate a wide range of algorithms to produce the most accurate model. When dealing with big data, or data that cannot be processed through traditional architectures, predictive accuracy is not the only consideration when choosing classifiers and machine learning techniques; computational complexity and cost must also be factored into the selection process.
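
As a minimal sketch of the tabular transformation step described above, the following Python snippet pivots long-format records (one row per patient event) into one feature row per patient. The record layout, the feature codes, and the function name `to_feature_table` are hypothetical and chosen only for illustration; they are not taken from the study's pipeline.

```python
from collections import defaultdict

# Hypothetical long-format records: (patient_id, feature_name) pairs.
# Real EHR extracts would carry many more fields; these are illustrative only.
records = [
    ("p1", "dx:actinic_keratosis"),
    ("p1", "rx:topical_steroid"),
    ("p2", "dx:actinic_keratosis"),
    ("p1", "dx:actinic_keratosis"),
    ("p3", "hx:family_melanoma"),
]


def to_feature_table(rows):
    """Pivot (patient_id, feature) events into per-patient count vectors."""
    counts = defaultdict(lambda: defaultdict(int))
    for patient_id, feature in rows:
        counts[patient_id][feature] += 1

    # Fix a column order so that every patient row has the same layout.
    columns = sorted({feature for _, feature in rows})
    patients = sorted(counts)
    table = [[counts[p][c] for c in columns] for p in patients]
    return patients, columns, table


patients, columns, table = to_feature_table(records)
for patient_id, row in zip(patients, table):
    print(patient_id, dict(zip(columns, row)))
```

Each row of `table` can then be passed to a classifier as one training example. At the scale discussed here, the same group-and-pivot logic would more plausibly run in a distributed engine than in a single Python process.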

Here, we present a cloud-based approach to learning from big data and demonstrate its effectiveness on melanoma risk prediction from EHR system data. We evaluated methods for practical cost savings while maintaining model accuracy by using various types of computing infrastructures and data sampling techniques. Clinical utility of the models was assessed by examining the selected features and their impact on predicting melanoma risk for this population.
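
One common way to shrink an imbalanced training set is random undersampling of the majority class. The snippet below is a sketch of that general technique, not necessarily the sampling procedure used in this study. The function name `undersample_majority`, the `ratio` parameter, and the toy labels are assumptions made for illustration.

```python
import random


def undersample_majority(labels, ratio=1.0, seed=0):
    """Return sorted indices keeping all positives and a random subset of negatives.

    `ratio` is the number of negatives kept per positive (hypothetical parameter).
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)


# Toy labels: 5 positives among 1,000 examples (illustrative, not study data).
labels = [1 if i % 200 == 0 else 0 for i in range(1000)]
kept = undersample_majority(labels, ratio=4.0)
print(len(labels), "->", len(kept), "examples")
print("positives kept:", sum(labels[i] for i in kept))
```

When a sampled set like this is used for training, evaluation is usually still carried out on data with the original class distribution so that the reported metrics reflect real-world prevalence.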

Data

We used the Modernizing Analytics for Melanoma (MAMEL) dataset for the experiments [11]. The data were collected from a mobile-first, structured data-input, and cloud-based dermatology-specific EHR system and de-identified in accordance with HIPAA [12]. De-identified data were available from over 100 million dermatology visits throughout the U.S. recorded from January 1, 2011 to December 31, 2017.

The models in this study were built to predict melanoma diagnosis within 12 months of a given

Population

There were a total of 4,061,172 patients in the MAMEL dataset who met the inclusion criteria, 10,129 of whom were diagnosed with melanoma within one year (Table 1). Compared to the “no melanoma” class, the “melanoma” class had a lower proportion of females (39.74% vs. 59.33%) and higher proportions of white race (75.26% vs. 69.57%) and family history of melanoma (13.69% vs. 11.97%).
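
The class imbalance implied by these counts can be worked out directly. The short calculation below uses only the two totals reported above; the variable names are illustrative.

```python
total_patients = 4_061_172   # patients meeting the inclusion criteria
melanoma_cases = 10_129      # diagnosed with melanoma within one year

non_cases = total_patients - melanoma_cases
prevalence = melanoma_cases / total_patients
imbalance_ratio = non_cases / melanoma_cases

print(f"non-cases: {non_cases:,}")
print(f"one-year prevalence: {prevalence:.4%}")
print(f"negatives per positive: {imbalance_ratio:.0f}")
```

The positive class makes up roughly 0.25% of the population, or about 400 negatives for every positive, which is why accuracy alone is uninformative for this task and why measures such as AUC, sensitivity, and specificity are reported instead.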

Performance

Table 2 outlines the sizes of the original dataset and each sampled dataset as well as the average performance

Discussion

The results of the present study offer several perspectives on the intersection of risk models, EHR systems, and big data. Datasets for specific biomedical and health applications can be small because of limited data sharing between institutions, strict inclusion criteria, and a lack of structured clinical data. Risk models are often built with data collected from individual healthcare or academic institutions. While large centers can attract patients from different geographical areas, the

Conclusion

We described a case study of learning from big data to produce an effective melanoma risk prediction model based on data collected from a large representative dermatology EHR system covering millions of patients across the U.S. Our study provides a reference framework for machine learning studies using large, high-dimensional, and imbalanced EHR data. We used a distributed processing infrastructure for collecting and formatting the data as well as a non-distributed infrastructure for machine

Author contributions

ANR and TMK created the experimental designs and facilitated data acquisition. ANR performed the data processing, experiments, and drafted the initial manuscript. TMK provided manuscript edits and guidance for analysis and interpretation of the results.

Conflict of interest statement

None declared.

Acknowledgements

We would like to thank the anonymous reviewers and editor for their insightful feedback and helpful comments. Also, we thank the reviewers at the Data Mining and Machine Learning Laboratory at Florida Atlantic University. The icons in Fig. 2 were made by smalllikeart on www.flaticon.com.

References (33)

  • A.N. Richter et al.

    A review of statistical and machine learning methods for modeling cancer risk using structured clinical data

    Artif. Intell. Med.

    (2018)
  • National Cancer Institute

    Cancer Statistics

    (2018)
  • E.W. Steyerberg

    Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating

    (2009)
  • J.A. Usher-Smith et al.

    Risk prediction models for melanoma: a systematic review

    Canc. Epidemiol. Biomark. Prevent.

    (2014)
  • American Cancer Society

    Cancer Facts & Figures 2019

    (2019)
  • A. Romero-Lopez et al.

    Skin lesion classification from dermoscopic images using deep learning techniques

  • A. Esteva et al.

    Dermatologist-level classification of skin cancer with deep neural networks

    Nature

    (2017)
  • A.K. Jha

    Meaningful use of electronic health records: the road ahead

    JAMA

    (2010)
  • K.L. Hudson et al.

    The 21st century cures act — a view from the NIH

    N. Engl. J. Med.

    (2017)
  • S. Doan et al.

    Natural language processing in biomedicine: a unified system architecture overview

    Clin. Bioinformat.

    (2014)
  • A.N. Richter et al.

    Modernizing Analytics for melanoma with a large-scale research dataset

  • Methods for De-identification of PHI | HHS.gov

    HHS.gov.

  • A. Avati et al.

    Improving palliative care with deep learning

  • M. Zaharia et al.

    Apache spark: a unified engine for big data processing

    Commun. ACM

    (2016)
  • C.-C. Chang et al.

    LIBSVM: a library for support vector machines

    ACM Transact. Intell. Sys. Technol.

    (2011)
  • J. Van Hulse et al.

    Experimental perspectives on learning from imbalanced data
