
1 Introduction

1.1 Lung Cancer

Lung cancer is the leading cause of cancer-related death in the US, and despite significant advances in treatment options, the five-year survival rate remains low at 18.6% [7]. This is partially explained by the fact that lung cancer has often progressed to an advanced stage by the time it becomes clinically apparent in most patients. Multiple randomized studies investigating the survival benefit of lung cancer screening have been performed. Screening with plain chest radiographs and sputum cytology has proven questionably useful, but the pivotal National Lung Screening Trial demonstrated a significant survival benefit of screening with low-dose spiral computed tomography (CT) in high-risk patients. This study spurred a change in the United States Preventive Services Task Force screening guidelines, which now recommend annual low-dose CT scans for patients at high risk of lung cancer. Under these guidelines, 9 million Americans will receive a screening CT each year, resulting in 2.2 million positive test results [8], all of which will require further evaluation. After a positive screening CT, the next step is a diagnostic, full-resolution CT scan. If the radiologist finds this CT scan concerning, a CT-guided biopsy is performed.

1.2 Lung Biopsies

CT-guided lung biopsies are used to diagnose malignancy with pathologic certainty in patients with suspected lung cancer. Tissue is needed to confirm a diagnosis of lung cancer as well as to guide the application of targeted therapies. Although often necessary, CT-guided lung biopsies carry a risk of complications, including pneumothorax, hemoptysis, pain, air embolism, and, in rare cases, death. If lung nodules could be accurately classified as malignant or benign using exclusively non-invasive imaging data, biopsies could be avoided, sparing patients without malignancies the risk and cost of the procedure.

1.3 Machine Learning in Lung Cancer Diagnosis

For nearly 30 years, physicians have sought to enhance their ability to accurately classify pulmonary nodules using predictive models [2]. In a study published in 1993, a Bayesian classifier was trained to classify solitary pulmonary nodules as benign or malignant based on hand-extracted clinical and radiographic features [4, 5]. Even with a dataset of limited size, the investigators were able to train a model with an area under the curve (AUC) of 0.71. Various studies have also employed radiomics approaches, using handcrafted imaging features such as texture and entropy to predict malignancy, with one such study achieving an AUC of 0.79 [6].

Recent advances in computer vision, in particular the popularization of the convolutional neural network (CNN), have resulted in corresponding advances in automated medical image analysis. In the field of lung nodule analysis specifically, multiple groups have presented impressive results. Chon et al. demonstrated that a deep learning approach employing a U-net architecture could accurately detect pulmonary nodules from CT images [1]. Interest in pulmonary nodule classification peaked in 2017 when the popular “Kaggle Data Science Bowl” focused on this topic. A public dataset from the Lung Image Database Consortium (LIDC) was made available, and many groups submitted impressive solutions leveraging a variety of network architectures, including U-net, AlexNet, and others [10].

All of these studies, however, relied on datasets without true ground truth labels. That is, the label of malignancy vs. non-malignancy was based on subjective ratings by radiologists, not on biopsy-confirmed disease. Because of this, all of these models are limited in performance to what an actual radiologist can achieve today. A natural extension of these methods would be application to a dataset with ground truth labels provided by biopsies.

1.4 Transfer Learning

Because a large open dataset containing both ground truth pathology data and CT imaging data does not yet exist, we turned to transfer learning as a possible solution [9]. In transfer learning, a network is trained on one dataset and then fine-tuned using another dataset. In our case, we chose to train a network on an open dataset from the Lung Image Database Consortium containing CT imaging data from subjects with and without lung nodules, along with subjective radiologist ratings of suspicious nodules. We then fine-tuned this network to predict pathologically-confirmed lung cancer, using a smaller dataset of patients with pathologically confirmed lung cancer diagnoses.

2 Methods

2.1 Dataset

Our dataset consists of 796 patients who underwent CT-guided lung biopsy at a single institution between 2012 and 2017 to evaluate suspicious pulmonary nodules. All patients had a pathology-confirmed diagnosis (from the CT-guided biopsy) and high-resolution CT imaging acquired immediately prior to biopsy. For this proof-of-concept study, lesion location was manually determined for a subset of 86 patients, using the biopsy-guidance CT scan as a reference. The median nodule size was 2.1 cm.

A random selection of 65 patients was used as the training set. The remaining 21 patients were reserved solely for testing and performance assessment.

In the training set, 72% of subjects had biopsies showing malignancy, while 28% of patients were shown to have benign disease. In the testing set, 72% were malignant, while 28% were benign.
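A stratified random split that preserves this class balance could be implemented as follows; this is only a sketch using scikit-learn, with placeholder patient identifiers and labels, not our exact selection code.

```python
# Sketch of a stratified 65/21 patient-level split preserving the ~72%/28%
# malignant/benign balance (placeholder data; not our exact selection code).
from sklearn.model_selection import train_test_split

patient_ids = list(range(86))          # placeholder patient identifiers
labels = [1] * 62 + [0] * 24           # 1 = malignant, 0 = benign (~72% / 28%)

train_ids, test_ids, train_labels, test_labels = train_test_split(
    patient_ids, labels,
    test_size=21,                      # 21 patients reserved for testing
    stratify=labels,                   # keep the class ratio in both sets
    random_state=0)
```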

Given the 3D location of each nodule, the images were resampled to 1 mm × 1 mm × 1 mm resolution and a 64 × 64 × 64 voxel volume centered on the nodule was cropped. The cropped volumes were visually inspected to ensure inclusion of the full nodule. Examples of cropped volumes are shown in Fig. 1 below.

Fig. 1. Example 2D slice of a thoracic CT scan showing malignant and benign (non-malignant) lesions.
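A minimal sketch of this resampling and cropping step is shown below; the helper name and the choice of linear interpolation are illustrative assumptions rather than our exact pipeline.

```python
# Resamples a CT volume to 1 mm isotropic spacing and crops a 64x64x64 cube
# centered on the annotated nodule location (illustrative sketch).
import numpy as np
from scipy.ndimage import zoom

def crop_nodule(volume, spacing_mm, center_voxel, cube_size=64):
    """volume: 3D numpy array; spacing_mm: (z, y, x) voxel spacing in mm;
    center_voxel: nodule center in voxel coordinates of the original volume."""
    # Resample to 1 mm x 1 mm x 1 mm resolution (linear interpolation).
    resampled = zoom(volume, zoom=spacing_mm, order=1)
    # Nodule center expressed in the resampled (1 mm) grid.
    center = np.round(np.asarray(center_voxel) * np.asarray(spacing_mm)).astype(int)

    half = cube_size // 2
    # Pad so the crop never runs off the edge of the scan.
    padded = np.pad(resampled, half, mode="constant", constant_values=resampled.min())
    c = center + half
    return padded[c[0]-half:c[0]+half, c[1]-half:c[1]+half, c[2]-half:c[2]+half]
```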

2.2 Network Construction

Rather than training the network without a priori knowledge, which risks overfitting on small datasets, we employed transfer learning, taking the initial layers of our network from an existing neural network trained on a distinct but similar dataset. In our case, we started from an existing 3D-CNN for nodule detection in thoracic CT data that was trained on an open dataset from the LIDC [3]. Taking the output of the final batch pooling layer of the existing network, we added three new untrained layers (spatial pooling, dense, and softmax) and re-trained the network on our training set, employing dropout and batch normalization. This final 3D-CNN was evaluated using the test set. The network schematic is shown in Fig. 2.

Fig. 2. A diagram showing our transfer learning methodology.
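The transfer-learning wiring can be sketched as follows in tf.keras; the model file, layer name, and dense-layer width are placeholders, and a simple global average pooling stands in here for the weighted spatial pooling described below.

```python
# Illustrative transfer-learning wiring (placeholder names; not our exact code).
import tensorflow as tf

# Pretrained 3D nodule-detection CNN (e.g. trained on LIDC [3]); we keep it up
# to its final pooling layer and treat that output as a 3D feature map.
base = tf.keras.models.load_model("lidc_nodule_detector.h5")
base_features = base.get_layer("final_pool").output   # e.g. (D, H, W, 128)

# New, untrained head: spatial pooling -> dense -> softmax, with dropout.
x = tf.keras.layers.GlobalAveragePooling3D()(base_features)  # stand-in pooling
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)          # width is an assumption
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)

model = tf.keras.Model(inputs=base.input, outputs=outputs)
```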

To reduce the model’s parameter count, we applied a simple weighted spatial pooling to the pretrained feature map. A voxel-wise importance map is regressed with a conv/ReLU/batch-norm/sigmoid stack and used to compute a spatially weighted average of the feature map, yielding a single feature vector of 128 features. This layer is followed by a dense layer, a softmax classification layer, and a standard binary cross-entropy loss. Our dataset contains many more malignant cases than benign ones; to counter this class imbalance, a 10:1 (benign to malignant) class weight is applied in the loss function. We use a standard Adam optimizer with a learning rate of 1e−3. Two dropout layers with a drop rate of 0.2 (keep rate 0.8) are added to the network to counter overfitting. We use a batch size of 10 and run training for 2000 steps, which was empirically sufficient for the network to reach convergence.
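A possible realization of this pooling head and training configuration is sketched below; the shapes, label encoding, and array names are assumptions, and model refers to the fine-tuning network from the previous sketch, with this layer substituted for the stand-in pooling.

```python
# Sketch of the importance-weighted spatial pooling head and training setup.
import tensorflow as tf

class WeightedSpatialPooling(tf.keras.layers.Layer):
    """Regresses a voxel-wise importance map (conv -> ReLU -> batch norm ->
    sigmoid) and uses it to spatially average the 128-channel feature map."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv3D(filters=1, kernel_size=1)
        self.bn = tf.keras.layers.BatchNormalization()

    def call(self, feats, training=False):               # feats: (B, D, H, W, 128)
        w = tf.nn.relu(self.conv(feats))
        w = tf.sigmoid(self.bn(w, training=training))     # (B, D, H, W, 1) importance map
        pooled = tf.reduce_sum(feats * w, axis=[1, 2, 3])
        return pooled / (tf.reduce_sum(w, axis=[1, 2, 3]) + 1e-6)   # (B, 128)

# Training configuration mirroring the text: Adam (lr = 1e-3), cross-entropy
# loss with a 10:1 benign-to-malignant class weight, batch size 10, ~2000 steps.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy")
model.fit(train_cubes, train_labels,                      # 64^3 crops and 0/1 labels
          class_weight={0: 10.0, 1: 1.0},                 # 0 = benign, 1 = malignant
          batch_size=10, epochs=300)    # ~7 steps/epoch x 300 epochs ~= 2000 steps
```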

Fig. 3. Receiver operating characteristic curve for the fine-tuned network.

Fig. 4. Examples of misclassified nodules. (a), (b) are true malignant nodules labeled by the network as benign; (c), (d) are true benign nodules labeled by the network as malignant.

3 Results

When validated on the held-out test set, our classifier achieved an AUC of 0.70 and a classification accuracy of 71%. The receiver operating characteristic curve is shown in Fig. 3. Examples of incorrectly classified nodules are shown in Fig. 4.
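For reference, these metrics can be computed from the network's test-set predictions as sketched below, assuming the model and held-out arrays from Sect. 2.2; this is not our exact evaluation code.

```python
# Minimal evaluation sketch (assumes `model`, `test_cubes`, and `test_labels`
# from the training sketch in Sect. 2.2).
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score

probs = model.predict(test_cubes)[:, 1]        # predicted probability of malignancy
preds = (probs >= 0.5).astype(int)             # threshold at 0.5 for accuracy

auc = roc_auc_score(test_labels, probs)        # reported: 0.70
acc = accuracy_score(test_labels, preds)       # reported: 71%
fpr, tpr, thresholds = roc_curve(test_labels, probs)   # points for the ROC curve (Fig. 3)
print(f"AUC = {auc:.2f}, accuracy = {acc:.0%}")
```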

4 Conclusion

Machine learning-based image analysis methods have the potential to significantly enhance radiology workflows, reduce the occurrence of missed diagnoses and false positives, and improve survival rates for lung cancer patients. However, creation of larger, more comprehensive medical image datasets is required before clinically acceptable models can be trained. In this proof-of-concept study, we demonstrate that a network trained on a publicly available dataset can be fine-tuned, even with a small number of subjects, to a more specific classification task. Although the performance of our model does not reach the state of the art in terms of classification accuracy or AUC, we believe that the inclusion of ground truth labels based on pathology is novel and an important step toward clinical adoption of lung cancer CAD software.

We are currently extracting additional imaging and pathology data from our larger dataset, and a more complete analysis of the full 796 patients is planned in the near future. We hope that the larger sample size will allow us to further fine-tune our existing network and achieve meaningful gains in accuracy and AUC. Ultimately, single-institution datasets will not lead to optimal classifier performance. Given the importance of diagnosing lung cancer at an early stage and the new screening guidelines, we strongly advise the medical community to begin construction of a comprehensive open dataset consisting of pathology and imaging data. Furthermore, radiologists do not make clinical diagnoses based solely on imaging data; they often correlate imaging findings with additional clinical data such as the patient's smoking history, demographics, and co-morbid conditions. We hope to expand our dataset to include additional clinical features from the patient's electronic medical record, with the goal of creating a workflow-integrated CAD solution for lung cancer screening.