Abstract
The scarcity of richly annotated medical images is limiting supervised deep learning based solutions to medical image analysis tasks, such as localizing discriminatory radiomic disease signatures. Therefore, it is desirable to leverage unsupervised and weakly supervised models. Most recent weakly supervised localization methods apply attention maps or region proposals in a multiple instance learning formulation. Attention maps can be noisy, leading to erroneously highlighted regions, while for multiple instance learning approaches it is not simple to decide on an optimal window/bag size. In this paper, we propose a learned spatial masking mechanism to filter out irrelevant background signals from attention maps. The proposed method minimizes mutual information between a masked variational representation and the input while maximizing the information between the masked representation and class labels. This results in more accurate localization of discriminatory regions. We tested the proposed model on the ChestX-ray8 dataset to localize pneumonia from chest X-ray images without using any pixel-level or bounding-box annotations.
1 Introduction
Around 50,000 deaths attributed to pneumonia are reported every year in the US alone [1]. Chest X-rays are currently the most frequently adopted imaging modality for detecting pneumonia [21]. The large increase in imaging studies, the scarcity of radiologists, the associated expense, and intra-/inter-rater variability have accelerated the development and adoption of automated image-based disease classification methods. In the last decade, deep learning methods have achieved great success in the classification of natural and medical images. Locating discriminatory regions in images, along with the predicted class, renders deep models’ decisions more interpretable and trustworthy. Localization of disease-indicator regions (i.e., radiomic biosignatures) is particularly important for medical applications since it reveals whether the machine diagnosis was based on the presence or absence of disease rather than on some unique yet unintuitive and unrelated pattern that happens to be exhibited among the training examples.
1.1 Related Work
The past few years have witnessed numerous advances in deep learning methods for localizing objects and detecting discriminatory regions in images.
Multiple Instance Learning and Region-Based Methods. Bency et al. [4] and Teh et al. [15] applied region-proposal and beam search based methods to localize objects in natural (i.e., non-medical) images. Training such hybrid localization-classification models requires large amounts of bounding-box level image annotations, which can suffer from rater variability and can be prohibitively expensive or time consuming. Several existing methods [5, 7, 13, 14, 17] formulate weakly supervised localization as a multiple instance learning (MIL) problem. However, as with region-proposal based methods, it is difficult to find an optimal window size.
Attention/Activation Based Methods. Similarly to previous works [8, 18, 22], Wei et al. [20] proposed an activation map based framework to produce tight bounding boxes around objects. However, in the context of object localization, there might be erroneously detected regions (false positives) or regions/activations which spread over unrealistically wide ranges. This is the case because saliency maps [11] are usually noisy.
Several works have attempted to smooth or regularize the saliency or attention maps [12] and prevent important features from being neglected due to saturation [3, 10]. To produce class-discriminative ‘explanation maps’, the gradient-weighted class activation mapping method GradCAM [9] was used to capture the importance of particular features for a target class. However, GradCAM and similar approaches are applied after a model is trained, i.e., there is no explicit spatial enforcement during training, and GradCAM requires class labels during inference. To remove the dependency on class labels at test time, Fan et al. [6] trained a masking mechanism simultaneously with a classification network to localize objects. However, their masking mechanism is based on super-pixels, which might miss fine details. Zolna et al. [23] reformulated the same problem as a min-max game. It is not clear to us how to weigh the regularization term in their proposed loss function, and we have concerns about the scalability of the method, as it needs to preserve many copies of the model with different parameters to produce different masks using each set of parameters. In this paper, instead of keeping different parameters of a model, we propose to perform variational online mask sampling from a normal distribution using a single model.
1.2 Contribution
The focus of the current study is to develop a method to localize radiologic presentations of pneumonia in unseen chest X-ray images while training on data with only image-level labels. The key idea of the proposed approach is to learn a low-dimensional latent parametric probability distribution (regularized by the Kullback-Leibler divergence from a standard normal distribution) that encodes the input data and is not only discriminative but also spatially selective, so that irrelevant background in the input images is disregarded. To this end, we propose InfoMask, a variational model with a learnt attention mechanism and a sparsity-promoting masking operation.
In this paper we make the following contributions: (a) we propose to produce online variational masks during training without the need for class labels during inference; (b) we propose a weakly supervised localization method without requiring any choice of window/bag size (which is necessary in competing multiple instance learning formulations); (c) we introduce a masking mechanism applied to the latent variational representation to filter non-discriminatory information; and (d) we propose minimizing mutual information between the input and latent variational attention maps and increasing the mutual information between the masked latent representation and class labels.
2 Method
Given a training set of input images \(\mathbf {x}\) and corresponding image-level labels y, our goal is to learn the parameters \(\varvec{ \theta }\) of a class-predictive model \(\hat{y}=g(\mathbf {x};\varvec{ \theta })\) that not only has high classification accuracy but also localizes the discriminative regions with minimal inclusion of irrelevant pixels. The localization is represented via a binary mask \(\varvec{M}\) and \(\varvec{ \theta }\) is learnt by maximizing \(p(y|\varvec{x},\varvec{M};\varvec{ \theta })\).
To this end, InfoMask learns to encode a bottleneck random variable \(\varvec{Z}\) that (i) captures minimal information about the input random variable \(\varvec{X}\), hence minimizes the encoding of irrelevant information in the input, and (ii) holds maximal information about the distribution of the target label variable Y. Consequently, inspired by Alemi et al. [2] and the information bottleneck [16], we aim at maximizing
\[ I(\varvec{Z},Y;\varvec{\theta }) - \alpha \, I(\varvec{Z},\varvec{X};\varvec{\theta }), \]
where I(A, B) is the mutual information between random variables A and B, and \(\alpha \) is a scalar weight.
We model \(\varvec{z}\sim \mathcal {N}(\varvec{\mu _z},\varvec{\sigma _z})\) and learn to generate its mean and variance using convolutional layers, i.e., \(\varvec{\mu }_z=f_{e}^{\varvec{\mu }}(\varvec{x})\) and \(\varvec{\sigma }_z=f_{e}^{\varvec{\sigma }}(\varvec{x})\), and rewrite \(I(\varvec{Z},Y;\varvec{\theta })\) (and similarly \(I(\varvec{Z},\varvec{X};\varvec{\theta })\)), for each element of \(\varvec{Z}\), as:
\[ I(\varvec{Z},Y;\varvec{\theta }) = \int p(z,y|\varvec{\theta }) \log \frac{p(z,y|\varvec{\theta })}{p(z|\varvec{\theta })\, p(y|\varvec{\theta })} \, dz \, dy. \]
To sample \(\varvec{Z}\), we apply the reparameterization trick and write \(\varvec{z} = f(\varvec{x},\epsilon )=\varvec{\mu _z} + \varvec{\sigma _z} \varvec{\epsilon } = \varvec{a}_{e}^{\varvec{\mu }}(\varvec{x}) + \varvec{a}_{e}^{\varvec{\sigma }}(\varvec{x}) \epsilon \), where \(\varvec{a}_{e}\) is a deterministic function which outputs both \(\varvec{\mu }\) and \(\varvec{\sigma }\) and \(\varvec{\epsilon } \sim \mathcal {N}(0,1)\). We regularize the distribution by penalizing the Kullback-Leibler (KL) divergence from a standard normal distribution. The final loss function which we aim to minimize is given by
\[ \mathcal {L} = \frac{1}{N}\sum _{n=1}^{N} \mathbb {E}_{\varvec{\epsilon } \sim p(\varvec{\epsilon })}\left[ -\log q\left( y_n | f(\varvec{x}_n,\varvec{\epsilon })\right) \right] + \alpha \, \mathrm {KL}\left[ p(\varvec{Z}|\varvec{x}_n),\, r(\varvec{Z})\right] , \]
where N is the number of training examples, \(q(\cdot )\) is the variational approximation to the label posterior given the latent code, and \(r(\varvec{Z})\) is the variational approximation to the marginal distribution of \(\varvec{Z}\). In our case, \(\varvec{Z}\) is not computed directly from the input but rather by sampling an attention map \(\varvec{A}\) from which \(\varvec{\mu }_z\) and \(\varvec{\sigma }_z\) are derived. To explicitly encourage the model to generate more focused attention maps, we apply the following masking function \(\varvec{M}\) with threshold \(\tau \) that localizes the discriminative areas of \(\varvec{Z}\).
where R is a ReLU function with upper bound of 1, i.e., \(R(v)=\max (0,\min (v,1))\). The block diagram of the proposed method is shown in Fig. 1.
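The bounded ReLU defined above can be illustrated with a small, dependency-free sketch. The function `apply_threshold_mask` and the way it combines \(R\) with the threshold \(\tau \) are assumptions made only for illustration; they are not claimed to reproduce the paper's exact masking function.

```python
def bounded_relu(v):
    """R(v) = max(0, min(v, 1)): a ReLU clipped at an upper bound of 1."""
    return max(0.0, min(v, 1.0))


def apply_threshold_mask(z, tau):
    """Hypothetical masking step (an assumption, not the paper's formula):
    clip each latent value with R, then zero out entries not exceeding tau."""
    masked = []
    for row in z:
        clipped = [bounded_relu(v) for v in row]
        masked.append([c if c > tau else 0.0 for c in clipped])
    return masked


# Illustrative 2 x 2 latent map and threshold (made-up values).
z = [[-0.5, 0.2], [0.7, 1.8]]
masked_z = apply_threshold_mask(z, tau=0.5)
print(masked_z)  # [[0.0, 0.0], [0.7, 1.0]]
```

Only the two entries whose clipped response exceeds the threshold survive, which is the kind of sparsity a masking operation of this type is meant to promote.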
Fig. 1. Architecture and components of the proposed model. The input \(\varvec{X}\) is encoded via \(p(\varvec{z}|\varvec{X};\varvec{\theta })\). \(\varvec{A}\) refers to an attention map computed from the last layer of the encoder using \(1\times 1\) convolution and ReLU. Note that \(\varvec{A}\) is upsampled to the size of \(\varvec{X}\). \(\varvec{M}\) is the masked latent matrix. \(\varvec{X}\), \(\varvec{A}\), \(\varvec{\mu }\), \(\varvec{\sigma }\), \(\varvec{Z}\), and \(\varvec{\epsilon }\) are of size \(W \times H\).
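The sampling and regularization steps of this section can be sketched in plain Python. This is a minimal illustration under stated assumptions: \(\varvec{\sigma }_z\) is treated as a per-element standard deviation, the values of `mu`, `sigma`, `alpha`, and `nll` are made up, and the way the two loss terms are combined follows the deep variational information bottleneck of Alemi et al. [2] rather than any released implementation.

```python
import math
import random


def sample_z(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over elements."""
    return sum(
        0.5 * (s * s + m * m - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )


rng = random.Random(0)       # fixed seed so the sketch is repeatable
mu = [0.0, 1.0, -0.5]        # illustrative per-element means
sigma = [1.0, 0.5, 2.0]      # illustrative per-element standard deviations

z = sample_z(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)

alpha = 0.01                 # illustrative weight on the KL term (assumption)
nll = 0.35                   # placeholder classification loss (assumption)
loss = nll + alpha * kl
print(kl, loss)
```

Because the noise enters only through `eps`, the sample stays a deterministic function of `mu` and `sigma`, which is what allows gradients to flow to the encoder in an actual training setup.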
3 Data
For evaluation, we used the NIH ChestX-ray8 Dataset [19], which comprises 112,120 X-ray images from 30,805 unique patients with corresponding disease labels. We used 20,547, 2,568, and 2,569 with pneumonia images as train, validation, and test sets, respectively. For training and to evaluate the test classification accuracy, we only used image-level labels. To evaluate the localization performance on test images, we used ground truth bounding boxes manually placed around the diseased areas.
4 Experiments and Results
We adopt a simple architecture as a baseline, i.e., an encoder (\( p ( \varvec{z} | \varvec{X} ; \varvec{\theta } )\)) of the form [conv(64, \(3 \times 3\), relu), conv(64, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), conv(128, \(3 \times 3\), relu), conv(128, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), conv(256, \(3 \times 3\), relu), conv(16, \(3 \times 3\), relu)] and a classification block (\(p ( Y | \varvec{z} ; \varvec{\theta } )\)) of [conv(128, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), conv(64, \(3 \times 3\), relu), conv(64, \(3 \times 3\), relu), maxpooling(\(2 \times 2\)), global average pooling, softmax].
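As a rough sanity check on the size of this baseline, the convolutional weights of the listed layers can be counted by hand. The sketch below assumes a single-channel input, a bias term in every convolution, and a classification block that consumes the 16-channel encoder output directly; pooling layers carry no parameters, and the \(1\times 1\) attention convolution and any layer behind the softmax are left out. None of these details are confirmed by the text.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Number of weights (plus optional biases) in a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def count_params(in_channels, widths):
    """Sum parameters over a chain of 3 x 3 convolutions with given widths."""
    total, c_in = 0, in_channels
    for c_out in widths:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total, c_in


ENCODER_WIDTHS = [64, 64, 128, 128, 256, 16]   # conv layers of the encoder
CLASSIFIER_WIDTHS = [128, 64, 64]              # conv layers of the classifier

# Assumption: grayscale X-ray input, hence one input channel.
enc_params, latent_channels = count_params(1, ENCODER_WIDTHS)
clf_params, _ = count_params(latent_channels, CLASSIFIER_WIDTHS)
print(enc_params, clf_params, enc_params + clf_params)
# 591056 129280 720336
```

Under these assumptions the total comes to roughly 0.72 million parameters, the same order of magnitude as the approximately 700,000 parameters quoted in the results.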
We then compare our proposed InfoMask to four competing disease localization methods: (i) GradCAM, gradient-weighted class activation mapping + baseline, i.e., during inference, we replace \(\varvec{M}\) in Fig. 1 with GradCAM; (ii) FeatureMask, masking the latent representation without KL divergence regularization + baseline; (iii) RegL1, L1 regularization over the generated masks instead of KL regularization; (iv) CheXCAM, GradCAM applied to the last layer of CheXNet [8]. Even though each patient could have multiple disease classes at the same time, we focus only on pneumonia detection (vs. normal) to analyze whether our method is able to localize only the target regions in a complex environment where other diseases might also be present. Note that the results (Table 1 and Figs. 2, 3, and 4) reported next are based on the thresholded masks using the best threshold value, i.e., the value optimized, for each method, to minimize localization error over the validation set. To select the best epoch based on the validation set, we first select the N checkpoints that produce the highest classification accuracy and then select the epoch with the highest localization score among them. As the detected thresholded masks could potentially have largely diverse patterns (e.g., from sparse disjoint localizations scattered over the whole image to large connected components, to anything in between), computing a single representative bounding box, as is provided by ground truth bounding box annotations, is not straightforward. Therefore, we replace the intersection over union (IoU) quality metric, commonly used for evaluating bounding box predictions, with a proposed intersection over predicted area (IoP), which reflects what percentage of the predicted area is inside the ground truth bounding box.
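The IoP definition above can be made concrete with a short sketch for a binary predicted mask and a single ground truth box. Pixel-wise false positive and false negative rates are included as one plausible reading of the companion FPR and FNR measures, which the text does not define precisely; the function name, the box convention, and the toy mask are all assumptions.

```python
def localization_metrics(pred_mask, box):
    """Return (IoP, FPR, FNR) for a binary mask and one bounding box.

    pred_mask: 2D list of 0/1 values.
    box: (row0, col0, row1, col1), end-exclusive (assumed convention).
    """
    r0, c0, r1, c1 = box
    tp = fp = fn = tn = 0
    for r, row in enumerate(pred_mask):
        for c, p in enumerate(row):
            inside = r0 <= r < r1 and c0 <= c < c1
            if p and inside:
                tp += 1          # predicted and inside the box
            elif p:
                fp += 1          # predicted but outside the box
            elif inside:
                fn += 1          # inside the box but not predicted
            else:
                tn += 1
    iop = tp / (tp + fp) if tp + fp else 0.0   # share of prediction in box
    fpr = fp / (fp + tn) if fp + tn else 0.0   # over-predicted area
    fnr = fn / (fn + tp) if fn + tp else 0.0   # under-predicted area
    return iop, fpr, fnr


# Toy 4 x 4 example: the box covers the central 2 x 2 block.
pred = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
iop, fpr, fnr = localization_metrics(pred, box=(1, 1, 3, 3))
print(round(iop, 3), round(fpr, 3), round(fnr, 3))  # 0.667 0.083 0.5
```

In this toy case two of the three predicted pixels fall inside the box, so IoP is 2/3, while half of the box is left uncovered, which shows up only in the FNR.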
As a small predicted area inside the box can lead to a high score, we also compute the false positive and false negative rates (FPR and FNR, respectively) to measure over- and under-predicted areas. As reported in Table 1, the proposed InfoMask outperforms the competing methods by a large margin on IoP (at least 10% better) and obtains the second best FPR (only 1.5% higher than the lowest FPR). Examining the FPR values, it can be inferred that GradCAM tends to highlight larger areas of the input outside of the ground truth bounding boxes. From the FNR column, we note that RegL1 generates smaller areas inside the boxes. Although the focus of the current study is not to improve classification accuracy, our proposed method achieves only slightly lower classification accuracy (a drop of \({<}2\%\)) with only 10% of the parameters of CheXNet (700,000 vs. 7,000,000).
The kernel density estimation plots in Fig. 2 support the quantitative results for the test images. Note how InfoMask obtains higher densities at larger IoP values (green curve in (a) for IoP \(\in [0.5,1]\)), a density concentrated at smaller FPR values (green peak in (b) for FPR \(\in [0,0.1]\)), and is second best (behind CheXCAM) for FNR values. For a better interpretation of Table 1, we visualize a few samples of the attention maps and their masked versions in Fig. 3, along with the ground truth (GT) bounding boxes in yellow. In Fig. 4, we visualize a few mean and variance samples computed for test images. As shown, there is less variance in the areas where the model is confident about the absence of disease signs. As visualized, InfoMask was able to localize pneumonia in images with different intensity distributions without using any bounding-box level annotation. As can be seen, FeatureMask and RegL1 produce scattered attention maps that cover only small portions of the GT bounding boxes. Among all methods, the proposed InfoMask generates contiguous attention areas that agree most with the ground truth boxes.
5 Conclusion
We proposed InfoMask, a method to localize disease-discriminatory regions that is trained with only image-level labels. Owing to the regularized variational latent representation with an attention mechanism, InfoMask generates contiguous and focused localization masks with higher agreement with ground truth annotations than competing methods (e.g., the widely used GradCAM) without resorting to any bounding-box level annotations. Future work will aim to improve both the classification and localization objectives by using stronger classification backbone models.
References
1. Centers for disease control and prevention. https://www.cdc.gov/pneumonia/prevention.html. Accessed 25 Mar 2019
2. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
3. Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Towards better understanding of gradient-based attribution methods for deep neural networks. In: ICLR 2018 (2018)
4. Bency, A.J., Kwon, H., Lee, H., Karthikeyan, S., Manjunath, B.S.: Weakly supervised localization using deep feature maps. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 714–731. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_43
5. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854 (2016)
6. Fan, L.: Adversarial localization network. In: NIPS 2017 Workshop on Learning with Limited Labeled Data (2017)
7. Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: Advances in Neural Information Processing Systems, pp. 1189–1197 (2010)
8. Rajpurkar, P., et al.: CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
9. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)
10. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3145–3153. JMLR.org (2017)
11. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
12. Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)
13. Song, H.O., Girshick, R., Jegelka, S., Mairal, J., Harchaoui, Z., Darrell, T.: On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024 (2014)
14. Song, H.O., Lee, Y.J., Jegelka, S., Darrell, T.: Weakly-supervised discovery of visual pattern configurations. In: Advances in Neural Information Processing Systems, pp. 1637–1645 (2014)
15. Teh, E.W., Rochan, M., Wang, Y.: Attention networks for weakly supervised object localization. In: BMVC, pp. 1–11 (2016)
16. Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: The 37th Annual Allerton Conference on Communications, Control, and Computing, pp. 368–377 (1999)
17. Wang, C., Ren, W., Huang, K., Tan, T.: Weakly supervised object localization with latent category learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 431–445. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_28
18. Wang, X., et al.: Weakly supervised learning for whole slide lung cancer image classification. Med. Imaging Deep Learn. (2018)
19. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
20. Wei, Y., et al.: TS2C: tight box mining with surrounding segmentation context for weakly supervised object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 434–450 (2018)
21. WHO: Standardization of interpretation of chest radiographs for the diagnosis of pneumonia in children
22. Yan, C., Yao, J., Li, R., Xu, Z., Huang, J.: Weakly supervised deep learning for thoracic disease classification and localization on chest X-rays. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 103–110. ACM (2018)
23. Zolna, K., Geras, K.J., Cho, K.: Classifier-agnostic saliency map extraction. arXiv preprint arXiv:1805.08249 (2018)
Acknowledgement
We thank Joseph Paul Cohen for his insightful discussions and comments.
© 2019 Springer Nature Switzerland AG
Cite this paper
Taghanaki, S.A. et al. (2019). InfoMask: Masked Variational Latent Representation to Localize Chest Disease. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science(), vol 11769. Springer, Cham. https://doi.org/10.1007/978-3-030-32226-7_82
DOI: https://doi.org/10.1007/978-3-030-32226-7_82
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32225-0
Online ISBN: 978-3-030-32226-7