InfoMask: Masked Variational Latent Representation to Localize Chest Disease

Taghanaki, Saeid Asgari; Havaei, Mohammad; Berthier, Tess; Dutil, Francis; Di Jorio, Lisa; Hamarneh, Ghassan; Bengio, Yoshua

doi:10.1007/978-3-030-32226-7_82

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11769)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

The scarcity of richly annotated medical images is limiting supervised deep learning based solutions to medical image analysis tasks, such as localizing discriminatory radiomic disease signatures. Therefore, it is desirable to leverage unsupervised and weakly supervised models. Most recent weakly supervised localization methods apply attention maps or region proposals in a multiple instance learning formulation. While attention maps can be noisy, leading to erroneously highlighted regions, it is not simple to decide on an optimal window/bag size for multiple instance learning approaches. In this paper, we propose a learned spatial masking mechanism to filter out irrelevant background signals from attention maps. The proposed method minimizes mutual information between a masked variational representation and the input while maximizing the information between the masked representation and class labels. This results in more accurate localization of discriminatory regions. We tested the proposed model on the ChestX-ray8 dataset to localize pneumonia from chest X-ray images without using any pixel-level or bounding-box annotations.

1 Introduction

Around 50,000 deaths attributed to pneumonia are reported every year in the US alone [1]. Chest X-rays are currently the most frequently adopted imaging modality for detecting pneumonia [21]. The large increase in imaging studies, the scarcity of radiologists and the associated expense, and intra-/inter-rater variability have accelerated the development and adoption of automated image-based disease classification methods. In the last decade, deep learning methods have achieved great success in the classification of natural and medical images. Locating discriminatory regions in images, along with the predicted class, renders deep models’ decisions more interpretable and trustworthy. Localization of disease-indicator regions (i.e., radiomic biosignatures) is particularly important for medical applications since it reveals whether the machine diagnosis was based on the presence or absence of disease and not biased towards some unique yet unintuitive and unrelated pattern that happens to be exhibited among the training examples.

1.1 Related Work

The past few years have witnessed numerous advances in deep learning methods for localizing objects and detecting discriminatory regions in images.

Multiple Instance Learning and Region-Based Methods. Bency et al. [4] and Teh et al. [15] applied region-proposal and beam-search based methods to localize objects in natural (i.e., non-medical) images. Training such hybrid localization-classification models requires large amounts of bounding-box level image annotations, which can suffer from rater variability and can be prohibitively expensive or time-consuming to obtain. Several existing methods [5, 7, 13, 14, 17] formulate weakly supervised localization as a multiple instance learning (MIL) problem. However, as with region-proposal based methods, it is difficult to find an optimal window size.

Attention/Activation Based Methods. Similarly to previous works [8, 18, 22], Wei et al. [20] proposed an activation map based framework to produce tight bounding boxes around objects. However, in the context of object localization, there might be erroneously detected regions (false positives) or regions/activations which spread over unrealistically wide ranges. This is the case because saliency maps [11] are usually noisy.

Several works have attempted to smooth or regularize the saliency or attention maps [12] and prevent important features from being neglected due to saturation [3, 10]. To produce class-discriminative ‘explanation maps’, the gradient-weighted class activation mapping method GradCAM [9] was used to capture the importance of particular features for a target class. However, GradCAM and similar approaches are applied after a model is trained, i.e., there is no explicit spatial enforcement during training, and GradCAM requires class labels during inference. To remove the dependency on class labels at test time, Fan et al. [6] trained a masking mechanism simultaneously with a classification network to localize objects. However, their masking mechanism is based on super-pixels, which might miss fine details. Zolna et al. [23] reformulated the same problem as a min-max game. It is not clear to us how to weigh the regularization term in their proposed loss function, and we have concerns about the scalability of the method, as it needs to preserve many copies of the model with different parameters in order to produce a different mask from each set of parameters. In this paper, instead of keeping different parameters of a model, we propose to perform variational online mask sampling from a normal distribution using a single model.

1.2 Contribution

The focus of the current study is to develop a method to localize radiologic presentations of pneumonia in unseen chest X-ray images while training on data with only image-level labels. The key idea of the proposed approach is to learn a low-dimensional latent parametric probability distribution (regularized by Kullback-Leibler divergence from a standard normal distribution) that encodes the input data and is not only discriminative but also spatially selective, so that it disregards irrelevant background in the input images. To this end, we propose InfoMask, a variational model with a learnt attention mechanism and a sparsity-promoting masking operation.

In this paper we make the following contributions: (a) we propose to produce online variational masks during training without the need for class labels during inference; (b) we propose a weakly supervised localization method without requiring any choice of window/bag size (which is necessary in competing multiple instance learning formulations); (c) we introduce a masking mechanism applied to the latent variational representation to filter non-discriminatory information; and (d) we propose minimizing mutual information between the input and latent variational attention maps and increasing the mutual information between the masked latent representation and class labels.

2 Method

Given a training set of input images $\mathbf {x}$ and corresponding image-level labels y, our goal is to learn the parameters $\varvec{ \theta }$ of a class-predictive model $\hat{y}=g(\mathbf {x};\varvec{ \theta })$ that not only has high classification accuracy but also localizes the discriminative regions with minimal inclusion of irrelevant pixels. The localization is represented via a binary mask $\varvec{M}$ and $\varvec{ \theta }$ is learnt by maximizing $p(y|\varvec{x},\varvec{M};\varvec{ \theta })$.

To this end, InfoMask learns to encode a bottleneck random variable $\varvec{Z}$ that (i) captures minimal information about the input random variable $\varvec{X}$, hence minimizes the encoding of irrelevant information in the input, and (ii) holds maximal information about the distribution of the target label variable Y. Consequently, inspired by Alemi et al. [2] and the information bottleneck [16], we aim at maximizing

$$\begin{aligned} J( \varvec{ \theta } ) = I ( \varvec{Z} , Y ; \varvec{ \theta } ) - \alpha I ( \varvec{Z},\mathbf {X} ; \varvec{ \theta } ) \end{aligned}$$

(1)

where I(A, B) is the mutual information between random variables A and B, and $\alpha $ is a scalar weight.
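
As a toy illustration of the trade-off in Eq. (1), the sketch below computes mutual information exactly for small discrete joint probability tables and combines two such values into the objective. This is only meant to make the two terms concrete: the variables in the paper are continuous and the terms are handled variationally, not by exact summation. The function names, the example tables, and the value of `alpha` are assumptions chosen for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information I(A, B) in nats for a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]            # marginal over rows
    p_b = [sum(col) for col in zip(*joint)]      # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_ab in enumerate(row):
            if p_ab > 0.0:                       # 0 * log(0) is taken as 0
                mi += p_ab * math.log(p_ab / (p_a[i] * p_b[j]))
    return mi


def bottleneck_objective(joint_zy, joint_zx, alpha):
    """J = I(Z, Y) - alpha * I(Z, X), the quantity to be maximized."""
    return mutual_information(joint_zy) - alpha * mutual_information(joint_zx)


# Hypothetical toy tables: Z fully determines Y, while Z and X are independent.
z_and_y = [[0.5, 0.0], [0.0, 0.5]]
z_and_x = [[0.25, 0.25], [0.25, 0.25]]

i_zy = mutual_information(z_and_y)   # equals log(2)
i_zx = mutual_information(z_and_x)   # equals 0
objective = bottleneck_objective(z_and_y, z_and_x, alpha=0.1)
```

In this toy case the representation keeps everything about the label and nothing else about the input, which is the ideal outcome the objective rewards.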

We model $\varvec{z}\sim \mathcal {N}(\varvec{\mu _z},\varvec{\sigma _z})$ and learn to generate its mean and variance using convolutional layers, i.e., $\varvec{\mu }_z=f_{e}^{\varvec{\mu }}(\varvec{x})$ and $\varvec{\sigma }_z=f_{e}^{\varvec{\sigma }}(\varvec{x})$, and rewrite $I(\varvec{Z},Y;\varvec{\theta })$ (and similarly $I(\varvec{Z},\varvec{X};\varvec{\theta })$), for each element of $\varvec{Z}$, as:

$$\begin{aligned} I(\varvec{Z},Y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) = \int p(\varvec{z},y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) \log \frac{p(\varvec{z},y;\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z})}{p(\varvec{z};\varvec{\theta },\varvec{\mu _z},\varvec{\sigma _z}) p(y;\varvec{\theta })} d\varvec{z}\, dy \end{aligned}$$

(2)

To sample $\varvec{Z}$, we apply the reparameterization trick and write $\varvec{z} = f(\varvec{x},\epsilon )=\varvec{\mu _z} + \varvec{\sigma _z} \varvec{\epsilon } = \varvec{a}_{e}^{\varvec{\mu }}(\varvec{x}) + \varvec{a}_{e}^{\varvec{\sigma }}(\varvec{x}) \epsilon $, where $\varvec{a}_{e}$ is a deterministic function which outputs both $\varvec{\mu }$ and $\varvec{\sigma }$ and $\varvec{\epsilon } \sim \mathcal {N}(0,1)$. We regularize the distribution by penalizing the Kullback-Leibler (KL) divergence from a standard normal distribution. The final loss function which we aim to minimize is given by

$$\begin{aligned} L = \frac{ 1 }{ N } \sum _{ n = 1 } ^ { N } \mathbb {E }_{ \varvec{\epsilon } \sim p ( \epsilon ) } \left[ - \log q \left( y_{ n } | \varvec{a} \left( \varvec{x}_{ n } , \epsilon \right) \right) \right] + \alpha \mathrm { KL } [ p ( \varvec{Z} | \varvec{x}_{ n } ) , r ( \varvec{Z} ) ] \end{aligned}$$

(3)

where N is the number of training examples, $q(\cdot )$ is the variational approximation of the conditional distribution of the label given the latent representation, and $r(\varvec{Z})$ is the variational approximation of the marginal distribution of $\varvec{Z}$ (here, the standard normal distribution used for the KL regularization). In our case, $\varvec{Z}$ is not computed directly from the input but rather by sampling an attention map $\varvec{A}$ from which $\varvec{\mu }_z$ and $\varvec{\sigma }_z$ are derived. To explicitly force the model to generate more focused attention maps, we apply the following masking function $\varvec{M}$ with threshold $\tau $, which localizes the discriminative areas of $\varvec{Z}$:

$$\begin{aligned} \varvec{M} = R \left( \tilde{\varvec{z}} - \tau \right) \,\,\, \text {where} \quad \tilde{\varvec{z}} = { (1 + \exp \left( - \varvec{z} \right) )^{-1}} \end{aligned}$$

(4)

where R is a ReLU function with upper bound of 1, i.e., $R(v)=\max (0,\min (v,1))$. The block diagram of the proposed method is shown in Fig. 1.
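
The sampling, regularization, and masking steps of Eqs. (3) and (4) can be sketched element by element as follows. This is a minimal pure-Python illustration, not the authors' implementation: it treats $\varvec{\sigma }_z$ as a standard deviation, uses the closed-form KL divergence between a diagonal Gaussian and a standard normal, and operates on flat lists instead of feature maps. The function names and the values of `tau` and `alpha` are assumptions.

```python
import math
import random


def sample_z(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, sigma^2) || N(0, 1)], summed over elements.

    Assumes sigma holds strictly positive standard deviations.
    """
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def mask(z, tau):
    """Eq. (4): M = R(sigmoid(z) - tau) with R(v) = max(0, min(v, 1))."""
    return [max(0.0, min(sigmoid(v) - tau, 1.0)) for v in z]


def per_example_loss(q_true_class, mu, sigma, alpha):
    """One summand of Eq. (3) for a single sampled epsilon.

    q_true_class is the predicted probability of the correct label.
    """
    return -math.log(q_true_class) + alpha * kl_to_standard_normal(mu, sigma)


rng = random.Random(0)                      # fixed seed for repeatability
mu, sigma = [2.0, -1.0, 0.0], [0.5, 0.5, 1.0]
z = sample_z(mu, sigma, rng)
m = mask(z, tau=0.5)                        # tau = 0.5 is a hypothetical value
loss = per_example_loss(0.9, mu, sigma, alpha=0.01)
```

Because the sigmoid output lies strictly between 0 and 1, the mask values produced this way fall in the range from 0 to $1-\tau $ for a non-negative threshold, so elements whose squashed activation is below $\tau $ are zeroed out entirely.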

3 Data

For evaluation, we used the NIH ChestX-ray8 dataset [19], which comprises 112,120 X-ray images from 30,805 unique patients with corresponding disease labels. We used 20,547, 2,568, and 2,569 with pneumonia images as training, validation, and test sets, respectively. For training and for evaluating the test classification accuracy, we used only image-level labels. To evaluate the localization performance on test images, we used ground truth bounding boxes manually placed around the diseased areas.

4 Experiments and Results

We adopt a simple architecture as a baseline, i.e., an encoder ($p(\varvec{z} | \varvec{X}; \varvec{\theta })$) of the form [conv(64, $3 \times 3$, relu), conv(64, $3 \times 3$, relu), maxpooling($2 \times 2$), conv(128, $3 \times 3$, relu), conv(128, $3 \times 3$, relu), maxpooling($2 \times 2$), conv(256, $3 \times 3$, relu), conv(16, $3 \times 3$, relu)] and a classification block ($p(Y | \varvec{z}; \varvec{\theta })$) of [conv(128, $3 \times 3$, relu), maxpooling($2 \times 2$), conv(64, $3 \times 3$, relu), conv(64, $3 \times 3$, relu), maxpooling($2 \times 2$), global average pooling, softmax].
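
To get a feel for the size of this baseline, the layer lists above can be turned into a small parameter-count calculation. The sketch below is a back-of-the-envelope estimate under assumptions that the text does not state: a single-channel (grayscale) input, a bias term in every convolution, a 16-channel latent map fed directly into the classification block, and no parameters in the pooling or softmax stages. The names `ENCODER`, `CLASSIFIER`, `conv_params`, and `downsampling` are illustrative.

```python
# Layer lists transcribed from the text: ("conv", out_channels) or ("pool",).
ENCODER = [("conv", 64), ("conv", 64), ("pool",),
           ("conv", 128), ("conv", 128), ("pool",),
           ("conv", 256), ("conv", 16)]
CLASSIFIER = [("conv", 128), ("pool",),
              ("conv", 64), ("conv", 64), ("pool",)]


def conv_params(layers, in_channels, kernel=3):
    """Return (weights + biases of all conv layers, final channel count)."""
    total = 0
    for layer in layers:
        if layer[0] == "conv":
            out_channels = layer[1]
            total += kernel * kernel * in_channels * out_channels + out_channels
            in_channels = out_channels
    return total, in_channels


def downsampling(layers, pool=2):
    """Overall spatial reduction factor contributed by the pooling layers."""
    factor = 1
    for layer in layers:
        if layer[0] == "pool":
            factor *= pool
    return factor


# Assumption: grayscale chest X-ray input, hence in_channels=1.
encoder_params, z_channels = conv_params(ENCODER, in_channels=1)
classifier_params, _ = conv_params(CLASSIFIER, in_channels=z_channels)
total_params = encoder_params + classifier_params
latent_stride = downsampling(ENCODER)
```

Under these assumptions the convolutions hold 720,336 parameters in total, and the latent map is four times smaller than the input along each spatial axis if the convolutions preserve spatial size.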

Table 1. Disease localization performance evaluation of the proposed InfoMask vs. competing methods. IoP, FPR, and FNR represent the localization performance while Acc. and AUC show the classification performance.

We then compare our proposed InfoMask to four competing disease localization methods: (i) GradCAM, gradient-weighted class activation mapping + baseline, i.e., during inference, we replace $\varvec{M}$ in Fig. 1 with GradCAM; (ii) FeatureMask, masking the latent representation without KL divergence regularization + baseline; (iii) RegL1, L1 regularization over the generated masks instead of KL regularization; (iv) CheXCAM, GradCAM applied to the last layer of CheXNet [8]. Even though each patient could have multiple disease classes at the same time, we focus only on pneumonia detection (vs. normal) to analyze whether our method can localize only the target regions in a complex environment where other diseases might also be present.

Note that the results reported next (Table 1 and Figs. 2, 3, and 4) are based on thresholded masks, where the threshold value is optimized separately for each method to minimize the localization error over the validation set. To select the best epoch based on the validation set, we first select the N checkpoints with the highest classification accuracy and then select, among them, the epoch with the highest localization score. As the detected thresholded masks could potentially have largely diverse patterns (e.g., from sparse disjoint localizations scattered over the whole image, to large connected components, to anything in between), computing a single representative bounding box, as is provided by ground truth bounding box annotations, is not straightforward. Therefore, we replace the intersection over union (IoU) quality metric, commonly used for evaluating bounding box predictions, with a proposed intersection over predicted area (IoP), which reflects what percentage of the predicted area is inside the ground truth bounding box. As small predicted areas inside the box can lead to a high score, we also compute false positive and false negative rates (FPR and FNR, respectively) to measure over- and under-predicted areas.

As reported in Table 1, the proposed InfoMask outperforms the competing methods by a large margin on IoP (at least 10% better), and obtains the second best FPR (only 1.5% higher than the lowest FPR). Examining the FPR values, it can be inferred that GradCAM tends to highlight larger areas of the input outside of the ground truth bounding boxes. From the FNR column, we note that RegL1 generates smaller areas inside the boxes. Although the focus of the current study is not to improve classification accuracy, our proposed method achieves only slightly lower classification accuracy (a difference of ${<}2\%$) with only 10% of the parameters of CheXNet (700,000 vs. 7,000,000). The kernel density estimation plots in Fig. 2 support the quantitative results for the test images. Note how InfoMask obtains higher densities at larger IoP values (green curve in (a) for IoP $\in [0.5,1]$) and a smaller FPR density (green peak in (b) for FPR $\in [0,0.1]$), and is second best (behind CheXCAM) for FNR values.

For a better interpretation of Table 1, we visualized a few samples of the attention maps and the masked ones in Fig. 3, along with the ground truth (GT) bounding boxes in yellow. In Fig. 4, we visualized a few mean and variance samples computed for test images. As shown, there is less variance in the areas where the model is confident about the absence of disease signs. As visualized, InfoMask was able to localize pneumonia in images with different intensity distributions without using any bounding-box level annotation. As can be seen, FeatureMask and RegL1 produce scattered attention maps that cover only small portions of the GT bounding boxes. Among all methods, the proposed InfoMask generates contiguous attention areas with the most agreement with the ground truth boxes.
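
The three localization scores can be illustrated on a thresholded binary mask and a ground truth box as follows. IoP follows the definition given in the text (the fraction of the predicted area that lies inside the box). The text does not spell out the formulas for FPR and FNR, so this sketch assumes the usual pixel-level definitions, FP / (FP + TN) and FN / (FN + TP), with the box interior treated as the positive region. The function name, the box convention, and the toy mask are assumptions.

```python
def localization_scores(pred, box):
    """Return (IoP, FPR, FNR) for a binary mask and a ground truth box.

    pred: list of rows, each a list of 0/1 values (thresholded mask).
    box:  (row_start, col_start, row_end, col_end), end indices exclusive.
    """
    r0, c0, r1, c1 = box
    tp = fp = fn = tn = 0
    for r, row in enumerate(pred):
        for c, value in enumerate(row):
            inside = r0 <= r < r1 and c0 <= c < c1
            if value and inside:
                tp += 1          # predicted pixel inside the box
            elif value:
                fp += 1          # predicted pixel outside the box
            elif inside:
                fn += 1          # box pixel that was not predicted
            else:
                tn += 1          # background correctly left empty
    iop = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return iop, fpr, fnr


# Toy 4x4 mask: two predicted pixels inside a 2x2 box, one stray pixel outside.
toy_mask = [[1, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 1]]
iop, fpr, fnr = localization_scores(toy_mask, box=(0, 0, 2, 2))
```

In the toy case, two of the three predicted pixels fall inside the box, so IoP is 2/3, while half of the box is left uncovered, so FNR is 0.5. This shows why IoP alone can look favorable for a small prediction and is reported together with the two error rates.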

5 Conclusion

We proposed InfoMask, a method to localize disease-discriminatory regions trained with only image-level labels. Owing to the regularized variational latent representation with an attention mechanism, InfoMask generates contiguous and focused localization masks with higher agreement with ground truth annotations than competing methods (e.g., widely used GradCAM) without resorting to any bounding-box level annotations. A direction for future work aims at improving both classification and localization objectives by using stronger classification backbone models.

References

1. Centers for Disease Control and Prevention. https://www.cdc.gov/pneumonia/prevention.html. Accessed 25 Mar 2019
2. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
3. Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Towards better understanding of gradient-based attribution methods for deep neural networks. In: ICLR 2018 (2018)
4. Bency, A.J., Kwon, H., Lee, H., Karthikeyan, S., Manjunath, B.S.: Weakly supervised localization using deep feature maps. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 714–731. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_43
5. Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854 (2016)
6. Fan, L.: Adversarial localization network. In: NIPS 2017 Workshop on Learning with Limited Labeled Data (2017)
7. Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: Advances in Neural Information Processing Systems, pp. 1189–1197 (2010)
8. Rajpurkar, P., et al.: CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
9. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)
10. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3145–3153. JMLR.org (2017)
11. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
12. Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)
13. Song, H.O., Girshick, R., Jegelka, S., Mairal, J., Harchaoui, Z., Darrell, T.: On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024 (2014)
14. Song, H.O., Lee, Y.J., Jegelka, S., Darrell, T.: Weakly-supervised discovery of visual pattern configurations. In: Advances in Neural Information Processing Systems, pp. 1637–1645 (2014)
15. Teh, E.W., Rochan, M., Wang, Y.: Attention networks for weakly supervised object localization. In: BMVC, pp. 1–11 (2016)
16. Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: The 37th Annual Allerton Conference on Communications, Control, and Computing, pp. 368–377 (1999)
17. Wang, C., Ren, W., Huang, K., Tan, T.: Weakly supervised object localization with latent category learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 431–445. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_28
18. Wang, X., et al.: Weakly supervised learning for whole slide lung cancer image classification. Med. Imaging Deep Learn. (2018)
19. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
20. Wei, Y., et al.: TS2C: tight box mining with surrounding segmentation context for weakly supervised object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 434–450 (2018)
21. WHO: Standardization of interpretation of chest radiographs for the diagnosis of pneumonia in children
22. Yan, C., Yao, J., Li, R., Xu, Z., Huang, J.: Weakly supervised deep learning for thoracic disease classification and localization on chest X-rays. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 103–110. ACM (2018)
23. Zolna, K., Geras, K.J., Cho, K.: Classifier-agnostic saliency map extraction. arXiv preprint arXiv:1805.08249 (2018)

Acknowledgement

We thank Joseph Paul Cohen for his insightful discussions and comments.

Author information

Authors and Affiliations

MILA, Université de Montréal, Montreal, Canada
Saeid Asgari Taghanaki & Yoshua Bengio
Imagia Inc., Montreal, Canada
Saeid Asgari Taghanaki, Mohammad Havaei, Tess Berthier, Francis Dutil & Lisa Di Jorio
School of Computing Science, Simon Fraser University, Burnaby, Canada
Saeid Asgari Taghanaki & Ghassan Hamarneh

Corresponding author

Correspondence to Saeid Asgari Taghanaki.

Editor information

Editors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Dinggang Shen
University of Georgia, Athens, GA, USA
Tianming Liu
Western University, London, ON, Canada
Terry M. Peters
Yale University, New Haven, CT, USA
Lawrence H. Staib
University of Strasbourg, Illkirch, France
Caroline Essert
United Imaging Intelligence, Shanghai, China
Sean Zhou
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Pew-Thian Yap
Western University, London, ON, Canada
Ali Khan

About this paper

Cite this paper

Taghanaki, S.A. et al. (2019). InfoMask: Masked Variational Latent Representation to Localize Chest Disease. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science, vol 11769. Springer, Cham. https://doi.org/10.1007/978-3-030-32226-7_82

DOI: https://doi.org/10.1007/978-3-030-32226-7_82
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32225-0
Online ISBN: 978-3-030-32226-7
eBook Packages: Computer Science, Computer Science (R0)
