1 Introduction

Around 50,000 deaths attributed to pneumonia are reported every year in the US alone [1]. Chest X-rays are currently the most frequently adopted imaging modality for detecting pneumonia [21]. The large increase in imaging studies, the scarcity of radiologists and the associated expense, and intra-/inter-rater variability have accelerated the development and adoption of automated image-based disease classification methods. In the last decade, deep learning methods have achieved great success in the classification of natural and medical images. Locating discriminatory regions in images, along with the predicted class, renders deep models’ decisions more interpretable and trustworthy. Localization of disease-indicator regions (i.e., radiomic biosignatures) is particularly important for medical applications, since it reveals whether the machine diagnosis was based on the presence or absence of disease rather than on some unique yet unintuitive and unrelated pattern that happens to be exhibited among the training examples.

1.1 Related Work

The past few years have witnessed numerous advances in deep learning methods for localizing objects and detecting discriminatory regions in images.

Multiple Instance Learning and Region-Based Methods. Bency et al. [4] and Teh et al. [15] applied region-proposal and beam-search based methods to localize objects in natural (i.e., non-medical) images. Training such hybrid localization-classification models requires large amounts of bounding-box level image annotations, which can suffer from rater variability and can be prohibitively expensive or time consuming to obtain. Several existing methods [5, 7, 13, 14, 17] formulate weakly-supervised localization as a multiple instance learning (MIL) problem. However, as with region-proposal based methods, it is difficult to find an optimal window size.

Attention/Activation Based Methods. Similarly to previous works [8, 18, 22], Wei et al. [20] proposed an activation map based framework to produce tight bounding boxes around objects. However, in the context of object localization, there might be erroneously detected regions (false positives) or activations that spread over unrealistically wide ranges, because saliency maps [11] are usually noisy.

Several works have attempted to smooth or regularize the saliency or attention maps [12] and to prevent important features from being neglected due to saturation [3, 10]. To produce class-discriminative ‘explanation maps’, the gradient-weighted class activation mapping method GradCAM [9] was used to capture the importance of particular features for a target class. However, GradCAM and similar approaches are applied after a model is trained, i.e., there is no explicit spatial enforcement during training, and GradCAM requires class labels during inference. To remove the dependency on class labels at test time, Fan et al. [6] trained a masking mechanism simultaneously with a classification network to localize objects. However, their masking mechanism is based on super-pixels, which might miss fine details. Zolna et al. [23] reformulated the same problem as a min-max game. It is not clear to us how to weigh the regularization term in their proposed loss function, and we have concerns about the scalability of the method, as it needs to preserve many copies of the model with different parameters in order to produce a different mask from each set of parameters. In this paper, instead of keeping different parameters of a model, we propose to perform variational online mask sampling from a normal distribution using a single model.

1.2 Contribution

The focus of the current study is to develop a method to localize radiologic presentations of pneumonia in unseen chest X-ray images while training on data with only image-level labels. The key idea of the proposed approach is to learn a low-dimensional latent parametric probability distribution (regularized by the Kullback-Leibler divergence from a standard normal distribution) that encodes the input data and is not only discriminative but also spatially selective, so that it disregards irrelevant background in the input images. To this end, we propose InfoMask, a variational model with a learnt attention mechanism and a sparsity-promoting masking operation.

In this paper we make the following contributions: (a) we propose to produce online variational masks during training without the need for class labels during inference; (b) we propose a weakly supervised localization method without requiring any choice of window/bag size (which is necessary in competing multiple instance learning formulations); (c) we introduce a masking mechanism applied to the latent variational representation to filter non-discriminatory information; and (d) we propose minimizing mutual information between the input and latent variational attention maps and increasing the mutual information between the masked latent representation and class labels.

2 Method

Given a training set of input images \(\mathbf {x}\) and corresponding image-level labels y, our goal is to learn the parameters \(\varvec{ \theta }\) of a class-predictive model \(\hat{y}=g(\mathbf {x};\varvec{ \theta })\) that not only has high classification accuracy but also localizes the discriminative regions with minimal inclusion of irrelevant pixels. The localization is represented via a binary mask \(\varvec{M}\) and \(\varvec{ \theta }\) is learnt by maximizing \(p(y|\varvec{x},\varvec{M};\varvec{ \theta })\).

To this end, InfoMask learns to encode a bottleneck random variable \(\varvec{Z}\) that (i) captures minimal information about the input random variable \(\varvec{X}\), hence minimizes the encoding of irrelevant information in the input, and (ii) holds maximal information about the distribution of the target label variable Y. Consequently, inspired by Alemi et al. [2] and the information bottleneck [16], we aim at maximizing

$$\begin{aligned} J( \varvec{ \theta } ) = I ( \varvec{Z} , Y ; \varvec{ \theta } ) - \alpha I ( \varvec{Z},\mathbf {X} ; \varvec{ \theta } ) \end{aligned}$$
(1)

where \(I(A,B)\) is the mutual information between random variables A and B, and \(\alpha \) is a scalar weight.
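To make the quantities in Eq. (1) concrete, the following minimal sketch computes mutual information for small discrete joint distributions and combines two such values into the objective. It is only an illustration under simplifying assumptions: the variables are discrete, and the names `mutual_information`, `bottleneck_objective`, `independent`, and `copy` are hypothetical and do not come from the paper, which works with continuous latent variables.

```python
import math


def mutual_information(joint):
    """Return I(A, B) in nats for a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]          # marginal over b
    p_b = [sum(col) for col in zip(*joint)]    # marginal over a
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:                     # 0 * log(0) is taken as 0
                mi += p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
    return mi


def bottleneck_objective(i_zy, i_zx, alpha):
    """Eq. (1): J = I(Z, Y) - alpha * I(Z, X)."""
    return i_zy - alpha * i_zx


# Two toy joint distributions over binary variables (illustrative only).
independent = [[0.25, 0.25], [0.25, 0.25]]  # A tells nothing about B
copy = [[0.5, 0.0], [0.0, 0.5]]             # A determines B exactly

print(mutual_information(independent))
print(mutual_information(copy))
print(bottleneck_objective(mutual_information(copy),
                           mutual_information(copy), alpha=0.1))
```

In this toy setting the independent table has zero mutual information and the copy table has \(\log 2\) nats, which illustrates the trade-off in Eq. (1): a representation that simply copies its input is penalized through the second term.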

We model \(\varvec{z}\sim \mathcal {N}(\varvec{\mu _z},\varvec{\sigma _z})\) and learn to generate its mean and variance using convolutional layers, i.e., \(\varvec{\mu }_z=f_{e}^{\varvec{\mu }}(\varvec{x})\) and \(\varvec{\sigma }_z=f_{e}^{\varvec{\sigma }}(\varvec{x})\), and rewrite \(I(\varvec{Z},Y;\varvec{\theta })\) (and similarly \(I(\varvec{Z},\varvec{X};\varvec{\theta })\)), for each element of \(\varvec{Z}\), as:

$$\begin{aligned} I(\varvec{Z},Y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) = \int p(\varvec{z},y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) \log \frac{p(\varvec{z},y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z})}{p(\varvec{z};\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) p(y;\varvec{\theta })} d\varvec{z}\, dy \end{aligned}$$
(2)

To sample \(\varvec{Z}\), we apply the reparameterization trick and write \(\varvec{z} = f(\varvec{x},\epsilon )=\varvec{\mu _z} + \varvec{\sigma _z} \varvec{\epsilon } = \varvec{a}_{e}^{\varvec{\mu }}(\varvec{x}) + \varvec{a}_{e}^{\varvec{\sigma }}(\varvec{x}) \epsilon \), where \(\varvec{a}_{e}\) is a deterministic function which outputs both \(\varvec{\mu }\) and \(\varvec{\sigma }\) and \(\varvec{\epsilon } \sim \mathcal {N}(0,1)\). We regularize the distribution by penalizing the Kullback-Leibler (KL) divergence from a standard normal distribution. The final loss function which we aim to minimize is given by

$$\begin{aligned} L = \frac{ 1 }{ N } \sum _{ n = 1 } ^ { N } \mathbb {E }_{ \varvec{\epsilon } \sim p ( \epsilon ) } \left[ - \log q \left( y_{ n } | \varvec{a} \left( \varvec{x}_{ n } , \epsilon \right) \right) \right] + \alpha \mathrm { KL } [ p ( \varvec{Z} | \varvec{x}_{ n } ) , r ( \varvec{Z} ) ] \end{aligned}$$
(3)

where N is the number of training examples, \(q(\cdot )\) is the variational approximation of the label posterior, and \(r(\varvec{Z})\) is the variational approximation of the marginal distribution of \(\varvec{Z}\), here a standard normal distribution. In our case, \(\varvec{Z}\) is not computed directly from the input but rather by sampling from an attention map \(\varvec{A}\), from which \(\varvec{\mu }_z\) and \(\varvec{\sigma }_z\) are derived. To explicitly force the model to generate more focused attention maps, we apply the following masking function \(\varvec{M}\) with threshold \(\tau \), which localizes the discriminative areas of \(\varvec{Z}\):

$$\begin{aligned} \varvec{M} = R \left( \tilde{\varvec{z}} - \tau \right) \,\,\, \text {where} \quad \tilde{\varvec{z}} = { (1 + \exp \left( - \varvec{z} \right) )^{-1}} \end{aligned}$$
(4)

where R is a ReLU function with upper bound of 1, i.e., \(R(v)=\max (0,\min (v,1))\). The block diagram of the proposed method is shown in Fig. 1.
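The sampling step and the masking function of Eq. (4) can be sketched as follows. This is a minimal illustration in plain Python, assuming that the maps are flattened into lists and that \(\varvec{\sigma }_z\) is used as a standard deviation; the function names (`sample_latent`, `bounded_relu`, `mask`) and the example values are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def bounded_relu(v):
    """R(v) = max(0, min(v, 1)), the ReLU with an upper bound of 1."""
    return max(0.0, min(v, 1.0))


def mask(z, tau):
    """Eq. (4): M = R(sigmoid(z) - tau), applied elementwise."""
    return [bounded_relu(sigmoid(zi) - tau) for zi in z]


rng = random.Random(0)            # fixed seed, for reproducibility only
mu = [-3.0, 0.0, 3.0]             # illustrative per-pixel means
sigma = [0.1, 0.1, 0.1]           # illustrative per-pixel deviations
z = sample_latent(mu, sigma, rng)
print(mask(z, tau=0.5))
```

With a threshold \(\tau \in [0,1]\), every entry whose sigmoid activation falls below \(\tau \) is set exactly to zero, which is what makes the operation sparsity-promoting; the remaining entries lie in \([0, 1-\tau ]\).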

Fig. 1. Architecture and components of the proposed model. The input \(\varvec{X}\) is encoded via \(p(\varvec{z}|\varvec{X};\varvec{\theta })\). \(\varvec{A}\) refers to an attention map computed from the last layer of the encoder using \(1\times 1\) convolution and ReLU. Note that \(\varvec{A}\) is upsampled to the size of \(\varvec{X}\). \(\varvec{M}\) is the masked latent matrix. \(\varvec{X}\), \(\varvec{A}\), \(\varvec{\mu }\), \(\varvec{\sigma }\), \(\varvec{Z}\), and \(\varvec{\epsilon }\) are of size \(W \times H\).
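As an illustration of the loss in Eq. (3), the sketch below evaluates it for a tiny batch using a single Monte Carlo sample per example and the well-known closed form of the KL divergence between a diagonal Gaussian and a standard normal. Several points are assumptions rather than details from the paper: \(\varvec{\sigma }\) is treated as a standard deviation, the KL term is averaged over the batch together with the classification term, and the names `kl_to_standard_normal`, `negative_log_likelihood`, and `infomask_style_loss` are hypothetical.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, sigma^2) || N(0, 1)], summed over elements.

    Per element: 0.5 * (mu^2 + sigma^2 - 1) - log(sigma).
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def negative_log_likelihood(probs, label):
    """-log q(y | z) for predicted class probabilities and an integer label."""
    return -math.log(probs[label])


def infomask_style_loss(batch, alpha):
    """Average over examples of NLL + alpha * KL, following Eq. (3).

    Each batch item is (probs, label, mu, sigma), where probs are the
    class probabilities predicted from one sampled latent z.
    """
    total = 0.0
    for probs, label, mu, sigma in batch:
        total += negative_log_likelihood(probs, label)
        total += alpha * kl_to_standard_normal(mu, sigma)
    return total / len(batch)


# Illustrative values only: two examples, two classes, three latent pixels.
batch = [
    ([0.9, 0.1], 0, [0.0, 0.5, -0.5], [1.0, 0.8, 1.2]),
    ([0.3, 0.7], 1, [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
]
print(infomask_style_loss(batch, alpha=0.01))
```

The KL term vanishes only when every latent element has zero mean and unit deviation, so a larger \(\alpha \) pushes the latent map towards carrying less information about the input.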

3 Data

For evaluation, we used the NIH ChestX-ray8 dataset [19], which comprises 112,120 X-ray images from 30,805 unique patients with corresponding disease labels. We used 20,547, 2,568, and 2,569 images for the pneumonia detection task as the training, validation, and test sets, respectively. For training and for evaluating the test classification accuracy, we used only image-level labels. To evaluate the localization performance on test images, we used ground truth bounding boxes manually placed around the diseased areas.

4 Experiments and Results

We adopt a simple architecture as a baseline, i.e., an encoder (\( p ( \varvec{z} | \varvec{X} ; \varvec{\theta } )\)) of the form [conv(64, \(3\, \times \, 3\), relu), conv(64, \(3\, \times \,3\), relu), maxpooling(\(2\, \times \,2\)), conv(128, \(3\, \times \,3\), relu), conv(128, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), conv(256, \(3 \times 3\), relu), conv(16, \(3 \times 3\), relu)] and a classification block (\(p ( Y | \varvec{z} ; \varvec{\theta } )\)) of [conv(128, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), conv(64, \(3 \times 3\), relu), conv(64, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), global average pooling, softmax].
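The layer lists above can be written down as a small specification and tallied, as in the following sketch. It is a rough bookkeeping aid rather than the authors' implementation, and it rests on assumptions that are not stated in the text: a single-channel (grayscale) input, a single-channel latent map fed to the classification block, one bias per output channel, and no parameters for the \(1\times 1\) attention convolution or any final dense layer. The names `ENCODER`, `CLASSIFIER`, `conv_params`, and `summarize` are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution (assumes one bias per filter)."""
    return k * k * c_in * c_out + c_out


# ("conv", output_channels) or ("pool", window), in the order given in the text.
ENCODER = [("conv", 64), ("conv", 64), ("pool", 2),
           ("conv", 128), ("conv", 128), ("pool", 2),
           ("conv", 256), ("conv", 16)]
CLASSIFIER = [("conv", 128), ("pool", 2),
              ("conv", 64), ("conv", 64), ("pool", 2)]


def summarize(layers, c_in):
    """Return (parameter count, total downsampling factor, output channels)."""
    params, downsample, channels = 0, 1, c_in
    for kind, value in layers:
        if kind == "conv":
            params += conv_params(channels, value)
            channels = value
        else:  # max pooling halves each spatial dimension for a 2 x 2 window
            downsample *= value
    return params, downsample, channels


# Assumed input channel counts: 1 for the X-ray, 1 for the masked latent map.
print(summarize(ENCODER, c_in=1))
print(summarize(CLASSIFIER, c_in=1))
```

Under these assumptions the encoder reduces each spatial dimension by a factor of four, which is consistent with the attention map being upsampled back to the input size.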

Table 1. Disease localization performance evaluation of the proposed InfoMask vs. competing methods. IoP, FPR, and FNR represent the localization performance while Acc. and AUC show the classification performance.

Fig. 2. Kernel density estimation of different measures for disease localization.

Fig. 3. Examples of pneumonia localization for various methods. GT bounding boxes shown in yellow.

We then compare our proposed InfoMask to four competing disease localization methods: (i) GradCAM, gradient-weighted class activation mapping + baseline, i.e., during inference, we replace \(\varvec{M}\) in Fig. 1 with GradCAM; (ii) FeatureMask, masking the latent representation without KL divergence regularization + baseline; (iii) RegL1, L1 regularization over the generated masks instead of KL regularization; (iv) CheXCAM, GradCAM applied to the last layer of CheXNet [8]. Even though each patient could have multiple disease classes at the same time, we focus only on pneumonia detection (vs. normal) to analyze whether our method is able to localize only the target regions in a complex environment where other diseases might also be present. Note that the results reported next (Table 1 and Figs. 2, 3, and 4) are based on thresholded masks, where the threshold is optimized for each method to minimize the localization error over the validation set. To select the best epoch based on the validation set, we first select the N checkpoints that produce the highest classification accuracy and then select, among them, the epoch with the highest localization score.

As the detected thresholded masks could potentially have largely diverse patterns (e.g., from sparse disjoint localizations scattered over the whole image to large connected components, to anything in between), computing a single representative bounding box, as is provided by the ground truth bounding box annotations, is not straightforward. Therefore, we replace the intersection over union (IoU) quality metric, commonly used for evaluating bounding box predictions, with a proposed intersection over predicted area (IoP), which reflects what percentage of the predicted area is inside the ground truth bounding box. As a small predicted area inside the box can lead to a high score, we also compute false positive and false negative rates (FPR and FNR, respectively) to measure over- and under-predicted areas.

As reported in Table 1, the proposed InfoMask outperforms the competing methods by a large margin on IoP (at least 10% better) and obtains the second best FPR (only 1.5% higher than the lowest FPR). Examining the FPR values, it can be inferred that GradCAM tends to highlight larger areas of the input outside of the ground truth bounding boxes. From the FNR column, we note that RegL1 generates smaller areas inside the boxes. Although the focus of the current study is not to improve classification accuracy, our proposed method achieves only slightly lower classification accuracy (\({<}2\%\)) with only 10% of the parameters of CheXNet (700,000 vs. 7,000,000). The kernel density estimation plots in Fig. 2 support the quantitative results for the test images. Note how InfoMask obtains higher densities at larger IoP values (note: green curve in (a) for IoP \(\in [0.5,1]\)), a smaller FPR density (green peak in (b) for FPR \(\in [0,0.1]\)), and ranks second best (behind CheXCAM) for FNR values.

For a better interpretation of Table 1, we visualize a few samples of the attention maps and masked ones in Fig. 3, along with the ground truth (GT) bounding boxes in yellow. In Fig. 4, we visualize a few mean and variance samples computed for test images. As shown, there is less variance in the areas where the model is confident about the absence of disease signs. InfoMask was able to localize pneumonia in images with different intensity distributions without using any bounding-box level annotation. As can be seen, FeatureMask and RegL1 produce scattered attention maps that cover only small portions of the GT bounding boxes. Among all methods, the proposed InfoMask generates contiguous attention areas with the most agreement with the ground truth boxes.
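The localization measures discussed above can be sketched for a single image as follows. IoP is implemented as described in the text (the fraction of the predicted area that lies inside the ground truth box); the pixel-wise definitions used here for FPR and FNR are the standard ones and are an assumption, since the exact formulas are not spelled out. The function name `localization_scores`, the end-exclusive box convention, and the toy masks are hypothetical.

```python
def localization_scores(pred, box):
    """Return (IoP, FPR, FNR) for a binary mask and a ground truth box.

    pred: 2D list of 0/1 values (thresholded mask).
    box:  (row0, col0, row1, col1), end-exclusive (assumed convention).
    """
    r0, c0, r1, c1 = box
    tp = fp = fn = tn = 0
    for r, row in enumerate(pred):
        for c, p in enumerate(row):
            inside = r0 <= r < r1 and c0 <= c < c1
            if p and inside:
                tp += 1      # predicted pixel inside the box
            elif p:
                fp += 1      # predicted pixel outside the box
            elif inside:
                fn += 1      # missed pixel inside the box
            else:
                tn += 1      # correctly ignored background pixel
    iop = tp / (tp + fp) if tp + fp else 0.0   # intersection over predicted area
    fpr = fp / (fp + tn) if fp + tn else 0.0   # over-prediction outside the box
    fnr = fn / (fn + tp) if fn + tp else 0.0   # under-prediction inside the box
    return iop, fpr, fnr


box = (1, 1, 3, 3)                 # a 2 x 2 ground truth box
pred = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]
tiny = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(localization_scores(pred, box))
print(localization_scores(tiny, box))
```

The second mask contains a single pixel inside the box and therefore reaches a perfect IoP while missing most of the box, which is the situation that the FNR is meant to expose.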

5 Conclusion

We proposed InfoMask, a method to localize disease-discriminatory regions that is trained with only image-level labels. Owing to the regularized variational latent representation with an attention mechanism, InfoMask generates contiguous and focused localization masks with higher agreement with ground truth annotations than competing methods (e.g., the widely used GradCAM) without resorting to any bounding-box level annotations. Future work will aim to improve both the classification and localization objectives by using stronger classification backbone models.

Fig. 4. A few localization samples of InfoMask with mean (\(\mu \)) and variance (\(\sigma \)) maps.