Hazards of data leakage in machine learning: a study on classification of breast cancer using deep neural networks

Ravi K. Samala; Heang-Ping Chan; Lubomir Hadjiiski; Sathvik Koneru

doi:10.1117/12.2549313

16 March 2020

Proceedings Volume 11314, Medical Imaging 2020: Computer-Aided Diagnosis; 1131416 (2020) https://doi.org/10.1117/12.2549313
Event: SPIE Medical Imaging, 2020, Houston, Texas, United States

Abstract

With the renewed interest in developing machine learning methods for medical imaging using deep-learning approaches, it is essential to reexamine data leakage. In this study, we simulated data leakage in the form of feature leakage, where a classifier was trained on the training set, but the feature selection was influenced by the performance on the validation set. A pre-trained deep-learning convolutional neural network (DCNN) without fine-tuning was used as a feature extractor for malignant and benign mass classification in mammography. A feature selection algorithm was run in wrapper mode with a cost function tuned to follow the performance metric on the validation set. A linear discriminant analysis (LDA) classifier was trained to classify masses on mammographic patches. Mammograms from 1,882 patient cases with 4,577 unique patches were partitioned by patient into 3,222 patches for training and 508 for validation, while 847 were sequestered as an unseen independent test set to evaluate the generalization error. The effects of the finite sample size on data leakage were studied by varying the training and validation set sizes from 10% to 100% of the available sets. The area under the receiver operating characteristic curve (AUC) was used as the performance metric. The results show that the performance on the validation set could be overestimated, with AUCs of 0.75 to 0.99 for various sample sizes, whereas the independent test performance could realistically only reach an AUC of 0.72. The analysis indicates that deep learning methods risk a large inflation in estimated performance, and that proper housekeeping rules should be followed when designing and developing deep learning methods in medical imaging.
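The feature leakage described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' pipeline and makes several loud assumptions: the features are pure Gaussian noise rather than DCNN outputs, so any apparent performance is an artifact; a difference-of-class-means linear score stands in for the LDA classifier; and a greedy forward search stands in for the wrapper-mode feature selection. All function names, set sizes, and the seed are hypothetical. What it keeps from the study design is the essential flaw: the classifier is fit on the training set only, but the choice of features is steered by the AUC on the validation set.

```python
import random


def auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_noise_set(rng, n_samples, n_features):
    """Assumption: pure-noise features with balanced labels (no real signal)."""
    labels = [i % 2 for i in range(n_samples)]
    feats = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_samples)]
    return feats, labels


def fit_weights(feats, labels, subset):
    """Fit on training data only: per-feature difference of class means.

    This is a simplified stand-in for LDA, not LDA itself.
    """
    weights = {}
    for j in subset:
        pos = [x[j] for x, y in zip(feats, labels) if y == 1]
        neg = [x[j] for x, y in zip(feats, labels) if y == 0]
        weights[j] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return weights


def predict(feats, weights):
    """Linear score over the selected features."""
    return [sum(w * x[j] for j, w in weights.items()) for x in feats]


def leaky_forward_selection(train, val, n_features, n_select):
    """Greedy forward selection whose criterion is the VALIDATION AUC.

    This is the leak: the validation set influences which features are kept.
    """
    selected = []
    for _ in range(n_select):
        best_auc, best_j = -1.0, None
        for j in range(n_features):
            if j in selected:
                continue
            weights = fit_weights(train[0], train[1], selected + [j])
            candidate_auc = auc(predict(val[0], weights), val[1])
            if candidate_auc > best_auc:
                best_auc, best_j = candidate_auc, j
        selected.append(best_j)
    return selected


rng = random.Random(0)  # fixed seed so the run is repeatable
N_FEATURES = 200
train = make_noise_set(rng, 100, N_FEATURES)
val = make_noise_set(rng, 40, N_FEATURES)
test = make_noise_set(rng, 400, N_FEATURES)  # sequestered, never used in selection

selected = leaky_forward_selection(train, val, N_FEATURES, n_select=5)
weights = fit_weights(train[0], train[1], selected)
val_auc = auc(predict(val[0], weights), val[1])
test_auc = auc(predict(test[0], weights), test[1])
print(f"selected features: {selected}")
print(f"validation AUC: {val_auc:.2f}, test AUC: {test_auc:.2f}")
```

Because the features carry no signal, the sequestered test AUC should sit near chance, while the validation AUC is inflated by the selection step alone. A leak-free variant would choose features using training data only (for example by cross-validation within the training set) and would consult the test set once, after all design decisions are fixed.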

Citation

Ravi K. Samala, Heang-Ping Chan, Lubomir Hadjiiski, and Sathvik Koneru "Hazards of data leakage in machine learning: a study on classification of breast cancer using deep neural networks", Proc. SPIE 11314, Medical Imaging 2020: Computer-Aided Diagnosis, 1131416 (16 March 2020); https://doi.org/10.1117/12.2549313
