1 Introduction

Chronic obstructive pulmonary disease (COPD), the third leading cause of death in the U.S., is a heterogeneous group of lung disorders, including a range of patterns of emphysema, chronic bronchitis, and non-emphysematous obstruction due to small-airway disease [7]. CT plays an important role in assessment of COPD. While quantitative image analysis techniques such as lung densitometry are able to detect and quantify emphysema [13], visual assessment has remained necessary for complete characterization of emphysema patterns [7].

To facilitate data comparison in research and clinical trials, and to improve diagnostic accuracy, the Fleischner Society developed a structured system for visual classification of phenotypic abnormalities on CT in subjects with COPD [6]. The system provides a six-point ordinal scale (absent, trace, mild, moderate, confluent, and advanced destructive) for grading the severity of centrilobular emphysema, the prototypical form of emphysema seen in cigarette smokers. This approach was applied in the COPDGene study, where baseline CT scans were retrospectively scored by trained analysts. Visual scores for presence and severity of centrilobular emphysema were found to be significantly associated with genetic loci previously associated with COPD [5] and with increased mortality risk [7]. Visual assessment required substantial effort, however, with multiple analysts working nearly four years to score approximately ten thousand scans.

The purpose of this study was to assess the feasibility of a deep learning model for automatic emphysema classification on CT. We hypothesized that the Fleischner system, a structured scale developed by domain experts, would provide an effective supervisory signal for training. Volumetric CT scans are difficult to process at full resolution because of the memory constraints of graphics processing units (GPUs). To develop a model capable of predicting subject-level scores from such volumes, we combined a convolutional neural network (CNN) architecture with a long short-term memory (LSTM) layer. LSTMs are a type of recurrent neural network capable of learning dependencies in sequence data, and have been used with some success to classify data such as frame sequences from video clips [1]. The combined architecture enabled efficient prediction using features extracted from axial image sequences sampled over a lung volume. We tested the approach by comparing predictions with visual scores and clinical parameters including pulmonary function tests and Global Initiative for Chronic Obstructive Lung Disease (GOLD) stage.

2 Methods

2.1 Study Population

COPDGene is a prospective, multicenter investigation focused on the genetic epidemiology of COPD (ClinicalTrials.gov: NCT00608764). All subjects underwent volumetric inspiratory and expiratory CT using a standardized protocol [9]. Inspiratory CT scans from 8213 subjects were included in this work. The cohort was partitioned so that scans and visual scores from 2500 subjects were used for training and data from the remaining 5713 subjects were reserved for testing. The distribution of emphysema scores in the training data was n = 774, 435, 473, 431, 275, 112 for Fleischner scores 0, 1, 2, 3, 4, and 5, respectively. Additional variables including GOLD stage, Forced Expiratory Volume in the first second percent predicted (FEV1\({\%}\)pred), FEV1/Forced Vital Capacity ratio (FEV1/FVC), St. George Respiratory Questionnaire (SGRQ) and smoking history (pack-years) were retrieved from the study database for the test partition. The SGRQ is a respiratory health-related quality of life questionnaire where higher scores correspond to greater impairment [15]. A quantitative CT emphysema score, calculated as the percentage of lung voxels with intensity less than −950 Hounsfield Units (LAA-950), was also used for comparison in test subjects.
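The LAA-950 definition above can be illustrated with a minimal NumPy sketch. It assumes the CT volume is already expressed in Hounsfield Units and that a binary lung mask is available; the function name `laa950` and its arguments are hypothetical and are not taken from the study software.

```python
import numpy as np


def laa950(ct_hu, lung_mask, threshold=-950.0):
    """Percentage of lung voxels with intensity below `threshold` (in HU).

    Assumptions (not from the paper): `ct_hu` holds Hounsfield Units and
    `lung_mask` is a same-shaped array that is nonzero inside the lungs.
    """
    ct_hu = np.asarray(ct_hu)
    lung_voxels = ct_hu[np.asarray(lung_mask).astype(bool)]
    if lung_voxels.size == 0:
        raise ValueError("lung mask is empty")
    # Strict inequality matches "intensity less than -950 HU".
    return 100.0 * np.count_nonzero(lung_voxels < threshold) / lung_voxels.size
```

Voxels outside the mask are ignored, so air surrounding the subject does not inflate the score.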

2.2 Visual Scoring

Each CT study had been retrospectively visually scored by two trained analysts using the scale 0 = absent, 1 = trace, 2 = mild, 3 = moderate, 4 = confluent, and 5 = advanced destructive [7]. Analyst agreement was assessed periodically throughout the study as good to excellent (weighted \(\kappa \) range 0.71–0.80). Analyst discordances larger than one point in score were adjudicated by a thoracic radiologist. Mean analyst score rounded down to the nearest integer was recorded as the final value.
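The scoring rule above can be written as a short sketch. How an adjudicated reading was combined with the analyst scores is not described here, so the hypothetical helper `consensus_score` only computes the rounded-down mean and flags cases that would need adjudication.

```python
def consensus_score(score_a, score_b):
    """Return (final_score, needs_adjudication) for two analyst scores (0-5).

    The final score is the mean of the two scores rounded down. Pairs that
    differ by more than one point are flagged; resolving them is left to
    the adjudicating radiologist and is not modelled here.
    """
    for s in (score_a, score_b):
        if s not in range(6):
            raise ValueError("scores must be integers from 0 to 5")
    needs_adjudication = abs(score_a - score_b) > 1
    # Floor division of the sum equals the mean rounded down.
    return (score_a + score_b) // 2, needs_adjudication
```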

2.3 Classification Algorithm

In an initial step, lung segmentation was performed on all CT scans using an automatic lung segmentation function included in an open source library [11, 12]. Using Python and PyTorch [8], we implemented a combined CNN-LSTM network architecture [1] designed to predict visual centrilobular emphysema score from a sequence of 25 axial images. Axial images were sampled at evenly spaced intervals over the height of each lung segmentation volume, excluding the uppermost and lowermost 5 mm. Figure 1 represents the architecture of the CNN-LSTM model. Briefly, the CNN portion of the network comprises four blocks, each consisting of a two-dimensional (2D) convolution, rectified linear unit (ReLU) activation and max pooling. The four 2D convolutional layers have 32 6\(\,\times \,\)6, 96 3\(\,\times \,\)3, 256 3\(\,\times \,\)3 and 384 3\(\,\times \,\)3 filters, respectively. The first two max pooling layers have stride 3 and the last two have stride 4. Each input image is fed through the CNN portion separately to extract features, which are concatenated into a sequence and passed to the LSTM layer followed by a dense layer before output. The loss function is negative log likelihood.
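The architecture described above might be sketched in PyTorch as follows. This is not the authors' implementation: the filter counts, kernel sizes and pooling strides follow the text, but the pooling kernel size (set equal to the stride), the absence of padding, the 512 × 512 input size, the LSTM hidden size of 128, the use of the last time step for classification, and the class name `CnnLstm` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CnnLstm(nn.Module):
    """Sketch of a CNN-LSTM that maps a sequence of axial slices to 6 classes."""

    def __init__(self, image_size=512, hidden_size=128, num_classes=6):
        super().__init__()

        def block(c_in, c_out, kernel, pool):
            # Conv -> ReLU -> max pool; pool kernel = stride is an assumption.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=kernel),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=pool, stride=pool),
            )

        self.cnn = nn.Sequential(
            block(1, 32, 6, 3),     # 32 filters of 6x6, pooling stride 3
            block(32, 96, 3, 3),    # 96 filters of 3x3, pooling stride 3
            block(96, 256, 3, 4),   # 256 filters of 3x3, pooling stride 4
            block(256, 384, 3, 4),  # 384 filters of 3x3, pooling stride 4
            nn.Flatten(),
        )
        # Infer the per-slice feature length from a dummy forward pass.
        with torch.no_grad():
            dummy = torch.zeros(1, 1, image_size, image_size)
            n_features = self.cnn(dummy).shape[1]
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, n_slices, height, width), one channel per slice.
        b, t, h, w = x.shape
        # Every slice passes through the same CNN independently.
        feats = self.cnn(x.reshape(b * t, 1, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        # Classify from the LSTM output at the final slice (assumption).
        return F.log_softmax(self.fc(out[:, -1]), dim=1)
```

Because the model returns log-probabilities, it pairs with `F.nll_loss(log_probs, labels)`, where `labels` holds integer scores 0 to 5. Sharing one small CNN across all slices keeps the number of convolutional weights independent of sequence length.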

The model was trained using CT scans and visual centrilobular emphysema scores, expressed as integers 0–5, on 2500 subjects. Some data augmentation was used, including in-plane image translations and offsets in sampling of axial images.
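The slice sampling and the sampling-offset augmentation might look like the sketch below. The function `sample_slice_indices`, its arguments and the clipping behaviour are hypothetical; the text specifies only 25 evenly spaced axial images, a 5 mm exclusion at each end of the lung volume, and the use of sampling offsets during augmentation.

```python
import numpy as np


def sample_slice_indices(z_first, z_last, spacing_mm,
                         n_slices=25, margin_mm=5.0, offset=0):
    """Pick evenly spaced axial slice indices inside the lung extent.

    z_first, z_last: first and last slice index containing lung mask.
    spacing_mm: slice spacing along z (assumed constant).
    offset: integer shift applied to all indices as augmentation (assumption).
    """
    # Convert the 5 mm exclusion zone to a whole number of slices.
    margin = int(np.ceil(margin_mm / spacing_mm))
    lo, hi = z_first + margin, z_last - margin
    if hi - lo + 1 < n_slices:
        raise ValueError("lung extent too short for the requested slices")
    idx = np.linspace(lo, hi, n_slices).round().astype(int) + offset
    # Keep shifted indices inside the allowed range.
    return np.clip(idx, lo, hi)
```

During training, `offset` could be drawn at random for each epoch so that the network sees slightly different slices from the same subject.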

Fig. 1. The network combines CNN and LSTM architectures. 25 axial images sampled over the lung volume are separately processed by the CNN to extract features. These are concatenated into a sequence and passed to the LSTM layer, which learns representations of sequences that are useful for classification.

2.4 Statistical Analysis

Weighted \(\kappa \) statistics between analysts' visual scores and automatic scores were computed. Descriptive statistics between CNN-LSTM emphysema scores and demographic and clinical parameters were computed. One-way analysis of variance (ANOVA) was used to test for significant differences in FEV1\({\%}\)pred, FEV1/FVC, SGRQ, LAA-950 and smoking history values stratified by emphysema severity scores. Chi-square tests of independence were used to compare centrilobular emphysema scores with GOLD stage. Statistical calculations were performed using R version 3.4.4 (2018-03-15). A p-value of <0.05 was considered statistically significant.
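For readers unfamiliar with the weighted \(\kappa \) statistic, a minimal NumPy sketch is given below. The analyses in this study were performed in R; this Python function, its name `weighted_kappa`, and the choice between linear and quadratic disagreement weights are illustrative assumptions, since the weighting scheme is not stated.

```python
import numpy as np


def weighted_kappa(rater_a, rater_b, n_classes=6, weights="linear"):
    """Cohen's weighted kappa for two sets of integer ordinal scores."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    # Observed joint distribution of the two ratings (confusion matrix).
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (a, b), 1)
    observed /= observed.sum()
    # Joint distribution expected by chance from the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Disagreement weights grow with the distance between categories.
    i, j = np.indices((n_classes, n_classes))
    dist = np.abs(i - j) / (n_classes - 1)
    w = dist if weights == "linear" else dist ** 2
    return 1.0 - (w * observed).sum() / (w * expected).sum()
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; unlike unweighted \(\kappa \), near misses on the ordinal scale are penalized less than distant ones.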

3 Results

Agreement between visual emphysema scores and those generated automatically by the CNN-LSTM was moderate (weighted \(\kappa \) = 0.60, p < 0.0001). Table 1 shows a confusion matrix comparing visual and automatic emphysema scores.

Table 2 shows a comparison between visual emphysema score and clinical variables. ANOVA showed that more severe visual emphysema scores were significantly associated with diminished FEV1\(\%\)pred and FEV1/FVC as well as increased SGRQ, LAA-950 and smoking pack-years (p < 0.0001). The chi-square test of independence examining the relationship between visual score and GOLD stage was significant, \(\chi ^2\) (df = 25, n = 5713) = 2716.4, p < 0.0001. Similarly, Table 3 shows a comparison between CNN-LSTM emphysema scores and clinical variables. ANOVA showed that more severe emphysema grades predicted by the CNN-LSTM were associated with more severe clinical measures. The chi-square test of independence examining the relationship between CNN-LSTM score and GOLD stage was also significant, \(\chi ^2\) (df = 25, n = 5713) = 3203.4, p < 0.0001.

Table 1. Confusion matrix comparing visual emphysema scores with those predicted by the CNN-LSTM. Fleischner centrilobular emphysema scores are: 0 = absent, 1 = trace, 2 = mild centrilobular, 3 = moderate centrilobular, 4 = confluent, 5 = advanced destructive.
Table 2. Clinical variables according to visual emphysema score. FEV1\(\%\)pred, FEV1/FVC, SGRQ, LAA-950 and pack-years are means (s.d.). GOLD data are number of subjects. Preserved Ratio Impaired Spirometry (PRISm) is defined as FEV1/FVC ratio \(\ge \) 0.7 with FEV1\(\%\)pred < 80.
Table 3. Clinical variables according to CNN-LSTM emphysema score. FEV1\(\%\)pred, FEV1/FVC, SGRQ, LAA-950 and pack-years are means (s.d.). GOLD data are number of subjects.

4 Discussion

The Fleischner Society visual scoring system is a clinically significant index of COPD severity that is associated with mortality risk. We have developed a CNN-LSTM model that is capable of automatic classification of centrilobular emphysema pattern on CT according to the Fleischner scale. In a group of 5713 test subjects, automatically scored emphysema patterns showed moderate agreement with visual scores and significant associations with GOLD stage, pulmonary function, SGRQ, LAA-950 and smoking pack-years. While agreement between visual and automatic emphysema scores in this test set is somewhat modest, similar statistical associations are seen when comparing visual and CNN-LSTM emphysema scores with other clinical parameters, suggesting that the CNN-LSTM learns to classify clinically significant features.

Deep learning has become the dominant approach in medical image analysis [16] for applications ranging from segmentation [10] and detection [14] to diagnosis [2]. One challenge when designing CNN architectures is to manage the total number of model parameters considering memory constraints of available GPUs. This is generally the case in applications using volumetric chest CT. Other researchers have shown that limiting model input to a small number (1–4) of relevant “canonical” slices, selected in pre-processing using anatomy detection methods, can be effective [3, 4]. In the present study we instead used an LSTM architecture, which made it possible to limit the number of convolutional weights while still using 25 full-resolution axial images as input.

This study has some limitations. COPDGene uses a well-defined CT protocol and study images are very consistent. Performing both training and testing on images from this cohort may produce optimistic results. We also did not perform systematic comparisons of different CNN architectures. Future work will incorporate mortality and longitudinal assessment for further clinical validation.

5 Conclusions

A combined CNN-LSTM architecture provides an efficient model for subject-level prediction and can be trained to perform automatic classification of centrilobular emphysema on CT.