Comparative evaluation of support vector machines for computer aided diagnosis of lung cancer in CT based on a multi-dimensional data set

https://doi.org/10.1016/j.cmpb.2013.04.016

Abstract

Lung cancer is one of the most common forms of cancer, resulting in over a million deaths per year worldwide. This paper investigates support vector machine (SVM) classification for lung cancer and presents a systematic quantitative evaluation against Boosting, Decision trees, k-nearest neighbor, LASSO regressions, neural networks and random forests. A large database of 5984 regions of interest (ROIs) and 488 input features (including textural features, patient characteristics, and morphological features) was used to train the classifiers and evaluate their performance. Classifier performance was evaluated with a tenfold cross-validation framework, the receiver operating characteristic (ROC) curve, and the Matthews correlation coefficient. The areas under the curve (AUC) of SVM, Boosting, Decision trees, k-nearest neighbor, LASSO, neural networks, and random forests were 0.94, 0.86, 0.73, 0.72, 0.91, 0.92, and 0.85, respectively. The results showed that SVM classification offered significantly increased classification performance compared to the reference methods. This scheme may be used in the future as an auxiliary tool to differentiate between benign and malignant solitary pulmonary nodules (SPNs) in CT images.
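The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy labels and scores, the 0.5 decision threshold, and the shuffled round-robin fold assignment are all assumptions made for illustration. The sketch computes the Matthews correlation coefficient from confusion-matrix counts, the AUC through its rank-based (Mann-Whitney) formulation, and a tenfold partition of 5984 sample indices.

```python
import math
import random


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = malignant, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle indices, deal them into k folds, and yield (train, test) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Toy labels and classifier scores (hypothetical, not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc(y_true, scores)
splits = list(kfold_indices(5984, k=10))  # 5984 = number of ROIs in the abstract
```

On the toy data the confusion matrix has 3 true positives, 3 true negatives, 1 false positive and 1 false negative, giving an MCC of 0.5; the positives outscore the negatives in 15 of 16 pairs, giving an AUC of 0.9375. In a real evaluation each `(train, test)` pair would be used to fit a classifier on the training indices and score it on the held-out fold.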

Introduction

As the leading cause of cancer-related mortality, lung cancer is responsible for approximately 1.38 million deaths annually worldwide. Despite recent advances in medicine and technology, the prognosis for lung cancer remains poor, with the 5-year survival rate approaching only 10% in most countries [1]. Yet if the cancer can be detected and diagnosed in its early stages, the 10-year survival rate could be greatly improved [2]. However, it is difficult to diagnose lung cancer efficiently: more than 80% of patients are diagnosed with locally advanced or metastatic disease.

Currently, the diagnosis of lung cancer primarily relies on digital computed tomography (CT). In CT images, lung cancer usually appears as solitary pulmonary nodules (SPNs). By definition, an SPN is a single, spherical, well-circumscribed radiographic opacity that measures ≤ 3 cm in diameter and is surrounded completely by aerated lung, with no associated atelectasis, hilar enlargement, or pleural effusion. However, the SPNs of lung cancer share similarities with those of several benign diseases, such as tuberculosis, inflammatory pseudotumor, hamartoma, and aspergillosis [3]. A meta-analysis [4] found that CT has a pooled sensitivity of 0.57 (95% confidence interval, 0.49–0.66) and a pooled specificity of 0.82 (95% confidence interval, 0.77–0.86) for lung cancer.
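For readers less familiar with these two measures, the following short Python sketch shows how sensitivity and specificity are obtained from confusion-matrix counts. The function name and the counts are hypothetical: the counts were chosen only so that the result matches the pooled values quoted above, and they are not data from the meta-analysis [4].

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Assumes at least one malignant case (tp + fn > 0) and one benign case (tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)  # fraction of malignant cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of benign cases correctly cleared
    return sensitivity, specificity


# Hypothetical counts for 100 malignant and 100 benign cases (illustration only).
sens, spec = sensitivity_specificity(tp=57, fn=43, tn=82, fp=18)
```

With these counts, 43 of every 100 malignant cases would be missed, which illustrates why an auxiliary tool that raises sensitivity without lowering specificity is of interest.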

To improve the accuracy and efficiency of CT scans in the diagnosis of lung cancer, a number of research groups are developing computer-aided diagnosis (CAD) schemes as auxiliary tools, including image segmentation and textural analysis. Murphy et al. [5] extracted textural features and used k-nearest-neighbor classification to detect pulmonary nodules in chest CT. Wang et al. [6] used the gray level co-occurrence matrix and a multi-level model to predict pulmonary nodules. Lee et al. [7] used a two-step approach combining feature selection (216 textural features) and a classifier (linear discriminant analysis) to evaluate pulmonary nodules. Sousa et al. [8] explored six stages of texture extraction for the automatic detection of lung nodules in CT images using support vector machines. Kim et al. [9] used 11 shape features and 13 textural features, with support vector machine and Bayesian classifiers, to improve the differentiation of obstructive lung diseases in high-resolution computed tomography (HRCT) images. Tan et al. [10] proposed textural features and a decision tree (C4.5)-AdaBoost classifier for distinguishing normal from tuberculosis cases in lung computed tomography. Some research groups [11], [12] compared the support vector machine classifier with other algorithms for classification based on textural features, and found that support vector machines outperformed the other classifiers.

Previous studies offer no standard way to describe lung nodules (that is, how to extract texture and other features), nor a standard choice of prediction model. Yet both the description of lung nodules and the choice of prediction model have a direct effect on the performance of computer-aided diagnosis. In this study, support vector machines, an established methodology used widely in various fields, and six other prediction models were built using a multi-dimensional set of textural features extracted by a Curvelet transformation from regions of interest (ROIs) in CT images, together with patient demographic characteristics and morphological features. The aim of this study was to determine which prediction model is most suitable for CT texture analysis and whether this scheme can differentiate between benign and malignant SPNs in CT images.

Image collection

This study was performed with ethics approval (Ethics Committee of Xuanwu Hospital, Capital Medical University, Approval Document No. [2011] 01). The study is cross-sectional, and the CT images were collected at 4 hospitals between 2009 and 2011. Patient inclusion and exclusion were decided on the basis of the final diagnoses. Malignant cases were confirmed by either surgical removal or biopsy of the lesion, and benign cases by either pathological analysis or a 2-year follow-up. The

Results

The demographic parameters of the benign and malignant cases are described in Table 2. The parameters used in the study are described in Table 3.

These 476 textural features, together with 3 demographic parameters of the patient and 9 morphological measurements of ROIs, were used as input data to establish prediction models. Accuracy based on cross-evaluation and MCC of SVM, Boosting, Decision trees, k-nearest neighbor, LASSO regressions, neural networks and random

Discussions

Prediction models have been used effectively in other research fields, such as gene classification, and comparative results have been reported in many fields. However, there is no agreement on which model performs better in the analysis of SPNs in CT images, and different research papers have chosen different classifiers [5], [6], [7], [8], [29], [30]. In the present study, we have observed that SVMs outperform other classifiers, suggesting that SVMs may be a more appropriate prediction model in

Author contributions

XG conceptualized and designed the study. Data analysis was done by TS and XG. They are also the guarantors of integrity of the entire study. Clinical study was done by PL. The manuscript was prepared by all authors.

Acknowledgements

Supported by the Natural Science Fund of China (Serial Number: 81172772); the Natural Science Fund of Beijing (Serial Number: 4112015, 7131002); and National S&T Major Project (Serial Number: 2012ZX10005009-003)
