1 Introduction

Thanks to the development of dedicated image acquisition methods, MRI has become a new modality allowing for the spatial analysis of time-dependent physiological changes of the lung, e.g., ventilation [4]. The ventilation maps are computed from a time-resolved image series by detecting the corresponding ventilation frequencies in the image data after the images have been spatially aligned [3]. An example is shown in Fig. 1. The automatic labeling of defect areas in these maps is a complex task for naïve methods such as thresholding because the value range depends on the relative signal intensity of the MR images. More sophisticated approaches, e.g., supervised segmentation methods, require accurately labeled ground truth data, which is difficult to obtain, even for clinical experts.

Fig. 1.

(a) MRI acquisition of a patient with cystic fibrosis, (b) corresponding ventilation maps, and (c) assumed impaired areas.

Compared to supervised methods, unsupervised methods do not require labeled training data. There are several unsupervised learning methods for medical image segmentation. An unsupervised lesion detection method using an adversarial autoencoder was presented in [5]. The method does not use any labels during training but requires a training set consisting only of healthy subjects. A weakly supervised method for the segmentation of brain tumors was presented in [1], using only one binary label per image (healthy, non-healthy). The methods presented in [7, 8] use class-level labels of the image content as weak labels in order to create a semantic object segmentation.

In this paper, we present a first proof-of-concept for the segmentation of defect areas in the ventilation maps using a weakly supervised learning strategy based on the Lung Clearance Index (LCI). The LCI is a single value that expresses the overall health of the lung and corresponds to the global ventilation inhomogeneities [9]. Unlike binary or class labels, the LCI is an unbounded value, where a value of 6–7 indicates a healthy lung. A higher LCI correlates with more ventilation inhomogeneities. We designed a network that is able to perform a pixel-wise segmentation using only the LCI value as a weak label during training. Furthermore, we use self-supervised regularization to prevent the network from learning a non-meaningful segmentation. The performance of our method is evaluated by a rating of the segmentation results by 5 human experts. Initial results show that over \(60\%\) of the segmentation results are rated as very good or perfect.

2 Method

In this section, we describe our network design and training procedure for segmenting the defect areas of the lung using only the LCI value as a continuous label. For this, we use the relation

$$\begin{aligned} \frac{\text {defect lung volume}}{\text {total lung volume}} \propto LCI, \end{aligned}$$

which has been proven to be valid [6].

2.1 Lung Clearance Index

The Lung Clearance Index (LCI) is a global measure of the ventilation inhomogeneities of the lung that result from respiratory diseases such as cystic fibrosis or chronic obstructive pulmonary disease [9]. It is given in lung volume turnovers (TO), i.e., the number of breathing cycles that are necessary to clear the lung of a previously inhaled tracer gas. A common method for determining the LCI is the Nitrogen Multiple Breath Washout (N\({}_{2}\)-MBW). Here, the patient breathes \(100\,\%\) oxygen, and the remaining nitrogen concentration of the previously breathed ambient air is measured. The test stops once the N\({}_{2}\) concentration falls below \(2.5\,\%\) of the starting concentration. The LCI is then defined by

$$\begin{aligned} LCI&= \frac{V_{\text {CE}}}{FRC}&FRC= \frac{\text {expired volume N}_2}{c^{\text {N}_2}_{\text {start}} - c^{\text {N}_2}_{\text {end}}}, \end{aligned}$$
(1)

where \(V_{\text {CE}}\) is the cumulative expired volume over the complete measurement and FRC is the functional residual capacity. The LCI for a healthy lung is 6–7 TO. For a more detailed description of the LCI, we refer the reader to [9]. Here, we normalize the LCI to the range [0, 1] by assuming a minimum LCI value of 5 and a maximum LCI value of 20.
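
As a minimal illustration of Eq. (1) and this normalization, a Python sketch could look as follows (variable names and the clamping of out-of-range LCI values are our own illustrative choices, not part of the original method):

```python
# Minimal sketch of Eq. (1) and the LCI normalization described above.
# Variable names and the clamping behavior are illustrative assumptions.

def lung_clearance_index(v_ce, expired_n2_volume, c_n2_start, c_n2_end):
    """LCI in turnovers (TO): cumulative expired volume divided by the FRC."""
    frc = expired_n2_volume / (c_n2_start - c_n2_end)
    return v_ce / frc

def normalize_lci(lci_to, lci_min=5.0, lci_max=20.0):
    """Map an LCI given in TO linearly to [0, 1] using the assumed bounds."""
    lci_clamped = min(max(lci_to, lci_min), lci_max)
    return (lci_clamped - lci_min) / (lci_max - lci_min)

# Example: an LCI of 6.5 TO (healthy range) is mapped to 0.1.
print(normalize_lci(6.5))  # 0.1
```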

2.2 Ventilation Maps

For the calculation of the ventilation maps, a dynamic 2D time series is acquired during free breathing with the ultra-fast balanced steady-state free precession pulse sequence (uf-bSSFP) [4]. The respiratory motion is compensated with a pairwise non-rigid image registration [10], and an automated lung segmentation is performed [2]. In the final step, the ventilation maps are calculated using a matrix pencil decomposition as described in [3]. To cover the full chest volume, 8–12 ventilation maps with a thickness of 12 mm at different coronal positions are computed. All maps are stacked together to obtain the final image volume of size \(N\,\times \,256\,\times \,256\), where N is the number of acquired slices. This image volume is used as input for the network. We need to process all slices at once because the LCI value relates to the complete lung and not to a single slice.
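
A minimal sketch of the last step, the stacking of the per-slice ventilation maps into the network input volume, is given below (the registration, lung segmentation, and matrix pencil decomposition from [10], [2], and [3] are assumed to have been applied already; function and variable names are ours):

```python
import numpy as np

def stack_ventilation_maps(slice_maps):
    """Stack N coronal 256x256 ventilation maps into an (N, 256, 256) volume."""
    volume = np.stack(slice_maps, axis=0).astype(np.float32)
    assert volume.shape[1:] == (256, 256), "each map must be 256x256"
    return volume

# Example with 10 coronal slices (random data as a stand-in):
maps = [np.random.rand(256, 256) for _ in range(10)]
print(stack_ventilation_maps(maps).shape)  # (10, 256, 256)
```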

2.3 Network Model

An overview of the complete network structure is shown in Fig. 2. Our network model consists of two major parts. The first part is the global autoencoder (GAE), whose input is the image volume described above. As the ventilation maps have a slice thickness of 12 mm, neighboring slices can show substantially different content. To account for this, all operations are performed on a single slice only and not across slices. All convolutional layers use a kernel size of \(1\times 3\times 3\). The conv-down block reduces the spatial dimension by using a strided convolution with a stride of \(1\times 2\times 2\). The conv-up block increases the spatial resolution by a factor of 2 using bilinear upsampling followed by a convolutional layer. In contrast to the classical autoencoder approach, the number of feature channels is not reduced in the decoder of the GAE. The input to the second part of the network, the local autoencoder (LAE), consists of overlapping patches of the input volume. For each coronal slice, M 2D patches of size \(16\times 16\) with a stride of 1 are generated. We consider only patches whose region overlaps with the lung mask. The size of the latent variable of the LAE is set to 64. In contrast to the GAE, the number of feature channels is reduced in the decoder part of the LAE. For the final pixel-wise defect map, all patches of the input volume are encoded using the encoder of the LAE, reshaped, and concatenated with the output of the GAE. The embedding of each patch is placed at the center position of the patch in the image volume. In this way, we obtain a feature map with the spatial dimensions of the input volume and 128 feature channels. The final network output, the pixel-wise defect segmentation \(s_{\text {d}}\), is obtained after two convolutional layers and has a final size of \(1\times N\times 256\times 256\).
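
To make the layer configuration concrete, a minimal PyTorch sketch of the per-slice building blocks is given below. Channel counts, activations, and the use of trilinear upsampling with a unit scale factor along the slice axis (which acts as bilinear upsampling per slice) are our assumptions; the original implementation may differ.

```python
import torch
import torch.nn as nn

class ConvDown(nn.Module):
    """Strided 1x3x3 convolution: halves H and W, keeps the slice axis."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                              stride=(1, 2, 2), padding=(0, 1, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))

class ConvUp(nn.Module):
    """Upsample H and W by 2 (per slice), followed by a 1x3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear',
                              align_corners=False)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                              padding=(0, 1, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))

class DefectHead(nn.Module):
    """Two convolutional layers mapping the 128-channel feature map to s_d."""
    def __init__(self, in_ch=128, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                               padding=(0, 1, 1))
        self.conv2 = nn.Conv3d(mid_ch, 1, kernel_size=(1, 3, 3),
                               padding=(0, 1, 1))

    def forward(self, feats):                # feats: (B, 128, N, 256, 256)
        x = torch.relu(self.conv1(feats))
        return torch.sigmoid(self.conv2(x))  # s_d in [0, 1], (B, 1, N, 256, 256)
```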

Fig. 2.

Model for the weakly supervised segmentation of defect areas in the lung.

2.4 Network Training

In order to train the parameters of our network with the LCI as a weak label, we define the LCI loss as

$$\begin{aligned} \mathcal {L}_{\text {LCI}}(s_{\text {d}}, s_{\text {l}}, LCI) = \left( \frac{\sum _{x \in \mathcal {X}} s_{\text {d}}(x)}{\sum _{x \in \mathcal {X}} s_{\text {l}}(x)} - LCI\right) ^2, \end{aligned}$$
(2)

where \({\mathcal {X} \subset \mathbb {R}^3}\) is the domain of the lung mask \(s_{\text {l}}\), i.e., the set of positions with \({s_{\text {l}}(x) = 1}\). The output of our network, i.e., the segmentation of the defect areas of the input ventilation map, is defined as \({s_{\text {d}}:\mathcal {X} \rightarrow [0, 1]}\). However, the LCI loss is based on the ratio of the defect lung volume to the whole lung volume and is thus not a pixel-wise loss. This can lead to a segmentation result that minimizes the LCI loss but does not correspond to the defect areas. We therefore use self-supervised regularization during training to prevent the network from overfitting on the LCI loss and learning a non-meaningful segmentation. As shown in Fig. 2, our network contains different paths, some of which are only used for the self-supervised regularization during training (dotted) and some of which are used during both training and inference (solid). We use the decoder part of the LAE only during training for the reconstruction of the image patches. In this way, we force the network to learn a patch embedding that is strongly related to the image content. The image reconstruction loss for the LAE is defined as

$$\begin{aligned} \mathcal {L}_{\text {LAE}}(I_{\text {patch}}, \tilde{I}_{\text {patch}}) = \frac{1}{|\mathcal {X}|} \sum _{x \in \mathcal {X}} \frac{1}{|\mathcal {P}_x|} \sum _{p \in \mathcal {P}_x} \left( I_{\text {patch}}(x, p) - \tilde{I}_{\text {patch}}(x, p)\right) ^2, \end{aligned}$$
(3)

where \({\mathcal {P}_x \subset \mathbb {R}^2}\) is the image patch domain at location x of the input image. For the feature regularization of the GAE, the same approach is used, but here the output of the GAE is concatenated with the embedding of the LAE before the image is reconstructed. The global reconstruction loss is defined as

$$\begin{aligned} \mathcal {L}_{\text {GAE}}(I, \tilde{I}) = \frac{1}{|\mathcal {X}|} \sum _{x \in \mathcal {X}} \left( I(x) - \tilde{I}(x)\right) ^2. \end{aligned}$$
(4)

The final loss for the training of the network is then given as

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {LCI}} + \mathcal {L}_{\text {LAE}} + \mathcal {L}_{\text {GAE}}. \end{aligned}$$
(5)
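
A minimal PyTorch sketch of the three loss terms and their combination is shown below. The tensor shapes, the small eps term for numerical stability, and the restriction of the global reconstruction loss to the lung mask via boolean indexing are our assumptions about how Eqs. (2)–(5) translate to code.

```python
import torch
import torch.nn.functional as F

def lci_loss(s_d, s_l, lci, eps=1e-8):
    """Eq. (2): squared difference between the predicted defect ratio and the
    normalized LCI. s_d, s_l: (B, 1, N, H, W); lci: (B,)."""
    dims = (1, 2, 3, 4)
    defect_ratio = (s_d * s_l).sum(dim=dims) / (s_l.sum(dim=dims) + eps)
    return ((defect_ratio - lci) ** 2).mean()

def lae_loss(patches, patches_rec):
    """Eq. (3): mean squared error over all reconstructed 16x16 lung patches."""
    return F.mse_loss(patches_rec, patches)

def gae_loss(volume, volume_rec, s_l):
    """Eq. (4): mean squared error over the lung domain of the input volume."""
    mask = s_l.bool()
    return F.mse_loss(volume_rec[mask], volume[mask])

def total_loss(l_lci, l_lae, l_gae):
    """Eq. (5): unweighted sum of the three loss terms."""
    return l_lci + l_lae + l_gae
```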

3 Experiments and Results

For our experiments, we use a data set of 35 subjects with 2 examinations per subject on average. We divided the data set into a training set of 28 subjects and a test set of 7 subjects. In each examination, 8–12 dynamic 2D image series were acquired. All examinations were performed on a 1.5 T whole-body MR scanner (MAGNETOM Aera, Siemens Healthineers, Erlangen, Germany) using a 12-channel thorax and a 24-channel spine receiver coil array. Each subject was scanned during free breathing with the uf-bSSFP sequence [4]. The main pulse sequence parameters were as follows: field of view \(375\times 375\) to \(425 \times 425\) mm\({}^2\), slice thickness 12 mm, \(\mathrm{TE/TR/TA} = 0.67\,\mathrm{ms}/1.46\,\mathrm{ms}/119\,\mathrm{ms}\), flip angle \(60^\circ \), bandwidth 2056 Hz/pixel, matrix \(128\times 128\) (interpolated to \(256\times 256\)), 160 coronal images, 3.33 images/s, parallel imaging (GRAPPA) factor of 2. The ventilation maps and the lung segmentation were then computed as described in Sect. 2.2.

Because the ventilation maps contain only relative values, resulting from the relative signal intensities of the MR images, we normalized each slice independently. We performed a z-normalization, followed by clipping of the values to \([-4, 4]\) and a linear transformation to [0, 1]. In the final step, a histogram stretching was performed using the 2nd and 98th percentiles of the intensities. The normalization was applied only to intensity values inside the lung.
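
A minimal sketch of this per-slice normalization is given below. We assume the z-normalization statistics and the percentiles are computed over the lung voxels of each slice only, consistent with the restriction of the normalization to the lung; function names and the eps term are ours.

```python
import numpy as np

def normalize_slice(vent_slice, lung_mask, eps=1e-8):
    """z-normalize, clip to [-4, 4], rescale to [0, 1], then stretch the
    histogram between the 2nd and 98th percentiles (lung voxels only)."""
    out = vent_slice.astype(np.float32)
    vals = out[lung_mask > 0]
    vals = (vals - vals.mean()) / (vals.std() + eps)          # z-normalization
    vals = (np.clip(vals, -4.0, 4.0) + 4.0) / 8.0             # clip and map to [0, 1]
    p2, p98 = np.percentile(vals, (2, 98))                    # histogram stretching
    vals = np.clip((vals - p2) / (p98 - p2 + eps), 0.0, 1.0)
    out[lung_mask > 0] = vals
    return out
```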

Fig. 3.

Result of the human expert rating of the presented defect segmentation.

For the evaluation, we applied our method to the 7 subjects of the test set. Because obtaining pixel-wise manually labeled ground truth remains difficult, we evaluated our method by having 5 human experts rate the final defect segmentations. The rating scheme is defined as the percentage of correct defect segmentation for a given ventilation map: bad: 0%–20%, partially OK: 21%–40%, good: 41%–60%, very good: 61%–80%, perfect: 81%–100%. For the rating, both the lung ventilation map and the morphological image were available. Each rater scored all 2D segmentation results of the test set, 132 images in total. Raters 1 and 2 are senior resident radiologists with 20 and 10 years of experience, respectively. Rater 3 has over 5 years of research experience in the field of MRI lung imaging. Rater 4 is a chest fellow (radiologist) with 4 years of experience in chest imaging. Rater 5 is a physician with 2 years of experience in thorax imaging. The results of the human expert evaluation, presented in Fig. 3, show that over \(60\%\) of the segmentation results are rated as very good or perfect. Selected segmentation results of different subjects with different LCIs are shown in Fig. 4. The results demonstrate that our method is able to segment most of the impaired areas with ventilation defects using only the LCI for training. However, in some areas an incorrect segmentation is observed in close proximity to blood vessels, as shown in Fig. 4. This might be explained by the fact that non-ventilated blood vessels show characteristics similar to those of impaired areas in the ventilation maps.

Fig. 4.

Selected defect segmentation results (red) for different subjects with different LCI values. Top row: morphological image with defect segmentation outline. Middle row: ventilation maps with defect segmentation outline. Bottom row: ventilation maps with defect segmentation overlay. Green shows the outline of the given lung segmentation [2]. Incorrect defect segmentations are highlighted with white circles. (Color figure online)

4 Conclusion

We presented a first proof-of-concept of our weakly supervised learning strategy for the segmentation of defect areas in MRI ventilation maps. We designed a network that is able to generate a pixel-wise segmentation of lung defects in MRI ventilation maps using the LCI value as a weak label. Our network model consists of two major parts for global and local feature extraction. Both feature sets are then combined to estimate the final segmentation. Furthermore, we use self-supervised regularization to prevent the network from learning a non-meaningful segmentation. We evaluated the performance of our method with a rating of the segmentation results by 5 human experts. The results show that over \(60\%\) of all segmentations are rated as very good or perfect. However, we observed that our method tends to over-segment, especially in regions where vessels are located. This over-segmentation could be reduced by incorporating a vessel segmentation, either as a post-processing step or in the training procedure; we leave this for future work.