Abstract
This work tackles the problem of generating a medical report for multi-image panels. We apply our solution to the Renal Direct Immunofluorescence (RDIF) assay, which requires a pathologist to generate a report based on observations across eight different whole slide images (WSI) in concert with existing clinical features. To this end, we propose a novel attention-based multi-modal generative recurrent neural network (RNN) architecture capable of dynamically sampling image data concurrently across the RDIF panel. The proposed methodology incorporates text from the clinical notes of the requesting physician to regulate the output of the network so that it aligns with the overall clinical context. In addition, we find that regularizing the attention weights is important for the word generation process: without regularization, the network can effectively ignore the attention mechanism by assigning equal weights to all panel members. We therefore propose two regularizers that encourage efficient use of the attention mechanism. Experiments on our novel collection of RDIF WSIs, provided by Sullivan Nicolaides Pathology, demonstrate that our framework offers significant improvements over existing methods.
1 Introduction
Automatic image captioning [17] is an important topic in medical research as it frees pathologists from manual medical image interpretation and significantly reduces costs [5]. Typically this involves conditioning a recurrent neural network (RNN) on image features encoded by a convolutional neural network (CNN). This approach has shown great promise in generic image captioning tasks but has not generalized well to the complex domain of medical images [18, 19].
A common solution is to employ a pathologist to annotate training data [19]. However, even with access to annotated image features and medical reports, the overall clinical context remains essential to medical image interpretation, because certain pathologies may be morphologically indistinguishable. One such case is the differential diagnosis of immunotactoid glomerulonephritis and diabetic nephropathy. In the Renal Direct Immunofluorescence (RDIF) assay, both conditions can present with linear accentuation of the glomerular basement membrane for IgG; the significance of this pattern cannot be determined without clinical confirmation/exclusion of diabetes mellitus [1]. For this reason, image captioning models conditioned solely on image data may not be the ideal solution for tasks in the medical domain.
The second major challenge of image captioning tasks in the medical domain is correlating information from multiple images. For example, a pathologist must interpret a set of eight different renal biopsy sections from the same patient to report the RDIF assay. Several methods have been proposed to enable captioning of multiple images by assuming the images in the set exhibit temporal dependence [3] or contain multiple views/instances of the same object [18]. These assumptions are unsuitable for the multi-object, temporally independent RDIF set. We refer to this as the ordered set to sequence problem: the set must be ordered to preserve the identity of the antibody applied to each section of the RDIF panel.
To address the problems outlined above, we describe a novel framework to overcome the clinical context bias and generate an RDIF medical report. The contributions of this paper are as follows:
1. To our knowledge, we are the first to address the ordered set to sequence problem, proposing a novel attention-based architecture which provides concurrent access to all images at each step and models the clinical notes as priors to regulate the clinical context.
2. We introduce two novel regularizers, Salient Alpha (SAL) and Time Distributed Variance (TDVAR), to discourage uniform attention weights.
3. We release a novel RDIF dataset with quantitative baseline results using the BLEU [13], ROUGE [11] and METEOR [10] metrics.
2 Renal Direct Immunofluorescence Dataset
The novel RDIF dataset used in this paper was assembled from routine clinical samples in collaboration with Sullivan Nicolaides Pathology, a subsidiary of Sonic Healthcare Limited. To prepare the RDIF slides, eight separate sections of renal biopsy tissue are each treated with a fluorescein isothiocyanate (FITC) conjugated antibody against one of IgG, IgA, IgM, Kappa, Lambda, C1q, C3 or Fibrinogen. The dataset comprises 144 patient samples split into 99 training, 15 validation and 30 test sets. Each sample contains the eight WSIs, the clinical notes of the requesting physician, and the medical report. This dataset can be accessed at https://github.com/cradleai/rdif.
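To make the structure of a sample concrete, the sketch below shows one plausible way to represent a dataset record in code; the field names and types are illustrative assumptions rather than the released dataset's actual schema.

```python
# Illustrative only: field names and layout are our assumptions, not the
# released dataset's actual schema.
from dataclasses import dataclass
from typing import List

# The eight antibodies, in panel order.
ANTIBODY_PANEL = ["IgG", "IgA", "IgM", "Kappa", "Lambda", "C1q", "C3", "Fibrinogen"]

@dataclass
class RDIFSample:
    wsi_paths: List[str]   # eight WSI files, ordered to match ANTIBODY_PANEL
    clinical_notes: str    # free-text notes from the requesting physician
    report: str            # ground-truth medical report (absent at inference)
```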
3 Proposed Method - CORAL8
3.1 Architecture
As illustrated in Fig. 1, the main aim of CORAL8 is to generate a medical report \(\mathcal {R}\) from an ordered set of images \(\mathcal {I}\) and clinical notes \(\mathcal {Q}\). To this end, we train a sentence generator \(\phi _{s}\) which receives a report context vector and a local image feature vector as input and generates a sequence of words. There are several desirable properties that we enforce in the sentence generator:
1. It must generate coherent sentences;
2. The generated sentences must be in concert with the clinical context described in the clinical notes;
3. The attention mechanism must produce diverse representations of local image features for the report generation, and must ensure that local features from each image in the set are equally represented in the generative sequence.
To generate coherent sentences, we train a neural network called the prior encoder \(\phi _{p}\) which extracts context features \(F_m\), where m is the index of the sentence. The context features are a joint representation of (1) the previous sentence features and (2) the previous context features. To encourage agreement between the generated report and the clinical context, we feed the prior encoder with (1) clinical note features and (2) global image set features. This can be interpreted as forming a general impression of the image features with respect to the clinical context. The context features are then fed into the sentence generator. Finally, to ensure that the model attends to each image in the set, we use a regulated attention mechanism to generate a dynamic local image feature conditioning vector \(L_{t}\) used to generate the next word in the sentence. We describe these components in detail below; a sketch of the overall generation loop follows.
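The following is a minimal sketch of how the three components interact during report generation, reflecting our description above under assumed PyTorch-style interfaces; the module signatures are ours, not the authors' released code.

```python
# Minimal sketch of the CORAL8 generation loop; module interfaces are
# assumptions made for illustration.
def generate_report(images, notes, image_encoder, prior_encoder,
                    sentence_generator, num_sentences=7):
    # Encode the ordered eight-image panel once.
    A, F_init = image_encoder(images)           # local and global features
    # m = 0: global image features and embedded clinical notes act as the
    # previous context and previous sentence respectively.
    F = prior_encoder(prev_sentence=notes, prev_context=F_init)
    report = []
    for m in range(num_sentences):
        sentence = sentence_generator(A, F)     # word-by-word decoding
        report.append(sentence)
        # Feed the generated sentence back to form the next context F_{m+1}.
        F = prior_encoder(prev_sentence=sentence, prev_context=F)
    return report
```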
Image Encoder encodes an ordered set of images \(\mathcal {I} = \{ I_{1}, \ldots , I_{N} \} \) into the set of local image features \(\mathcal {A}\) and the global image features \(F_{init}\). More specifically, we first extract the \(14\times 14\times 512\) features U from the 4th max-pool layer of the pre-trained VGG-16 network [16]. The extracted features U are then concatenated and flattened to compute the local image representations \(\mathcal {A} = \{ A_{1},\ldots ,A_{a} \}, A_{i} \in \mathbb {R}^{N \times d}\), where \(a = 196\) and \(d = 512\) are the dimensions of the flattened image and feature channels respectively. Meanwhile, in order to produce a global representation of \(\mathcal {I}\), we apply an FC layer to U to extract \(F_{init}\) with fixed dimensions \(1 \times H\). Then, \(\mathcal {A}\) and \(F_{init}\) are fed into the attention and prior encoder networks described below.
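A sketch of this encoder under assumed interfaces is given below; it uses torchvision's pre-trained VGG-16, and the reduction of U to \(F_{init}\) (average pooling before the FC layer) is our simplification of the FC layer described above.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ImageEncoder(nn.Module):
    """Sketch of the image encoder; exact layer choices are assumptions."""
    def __init__(self, H=512, N=8, d=512):
        super().__init__()
        # VGG-16 features up to (and including) the 4th max-pool layer,
        # which yields 14x14x512 maps for 224x224 inputs.
        self.backbone = nn.Sequential(
            *list(vgg16(weights="IMAGENET1K_V1").features)[:24])
        self.fc = nn.Linear(N * d, H)   # global feature projection

    def forward(self, images):          # images: (N=8, 3, 224, 224)
        U = self.backbone(images)       # (N, 512, 14, 14)
        # Local features A: a=196 spatial positions, each an (N x d) slice.
        A = U.flatten(2).permute(2, 0, 1)          # (196, N, d)
        # Global features: pool each image's map, concatenate, project to 1xH.
        F_init = self.fc(U.mean(dim=(2, 3)).flatten())
        return A, F_init
```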
Prior Encoder extracts the context vector \(F_{m}\) to represent features from the clinical notes and the previously generated sentence \(R_{m-1}\). More specifically, for each sentence we first use a word embedding to produce fixed vector representations \(\mathcal {S} = \{s_{1},\ldots ,s_{C}\},s_{t} \in \mathbb {R}^{V \times E}\) and \(\mathcal {Q}=\{q_{1},\ldots ,q_{C}\},q_{i} \in \mathbb {R}^{V \times E}\) for the sets of words in the previous sentence and clinical notes respectively, where V is the size of the vocabulary, C is the number of words in each sentence, \(E=512\) is the word embedding space, and \(q_{i}\) and \(s_{t}\) are both 1-of-V encoded words. \(R_{m-1}\) is then fed into a bidirectional LSTM [4] followed by an FC layer to encode a fixed \(1 \times H\) representation \(J_{m}\). We then concatenate \(J_{m}\) and \(F_{m-1}\) and apply an FC layer to encode \(F_{m}\) with fixed dimensions \(1 \times H\). At \(m = 0\), the output of the image encoder \(F_{init}\) and the embedded clinical notes \(\mathcal {Q}\) are used in place of \(F_{m-1}\) and \(R_{m-1}\) respectively.
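A sketch under the same assumed interfaces is shown below; the dimensions follow the text, while the use of the final BiLSTM output as \(J_{m}\) and the tanh activations are our choices.

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Sketch of the prior encoder; parametrisation details are assumptions."""
    def __init__(self, V, E=512, H=512):
        super().__init__()
        self.embed = nn.Embedding(V, E)
        self.bilstm = nn.LSTM(E, H, bidirectional=True, batch_first=True)
        self.sent_fc = nn.Linear(2 * H, H)  # fixed 1xH sentence summary J_m
        self.ctx_fc = nn.Linear(2 * H, H)   # fuses J_m with F_{m-1} into F_m

    def forward(self, prev_sentence_ids, prev_context):
        # prev_sentence_ids: (1, C) word indices; prev_context: (1, H).
        S = self.embed(prev_sentence_ids)            # (1, C, E)
        out, _ = self.bilstm(S)                      # (1, C, 2H)
        J = torch.tanh(self.sent_fc(out[:, -1]))     # (1, H) summary J_m
        F = torch.tanh(self.ctx_fc(torch.cat([J, prev_context], dim=-1)))
        return F                                     # context vector F_m, (1, H)
```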
Sentence Generator is an RNN that generates a sentence \(\hat{R}_{m}\) word by word, conditioned on the outputs of the image encoder \(\mathcal {A}\) and the prior encoder \(F_{m}\). After a sentence is generated, the prior encoder receives it as input to produce \(F_{m+1}\), which is fed back to the sentence generator to generate the next sentence; this is repeated until the entire medical report is generated. More specifically, the attention network first computes a probability distribution over \(\mathcal {A}\) using the deterministic soft attention method described in [17] to compute \(\kappa _{ti}\). \(\kappa _{ti}\) is then used to compute the weighted inter-image features \(\mathcal {Z}_{t} = \phi (\{A_{i}\},\{\kappa _{i}\}) = \sum ^a_{i} \kappa _{i}A_{i}\). The network then computes a second probability distribution \(\alpha _{ti}\) in the same way over \(\mathcal {Z}_{t} = \{z_{1},\ldots ,z_{N}\},z_{i} \in \mathbb {R}^{d}\). The second soft attention weighted inter-image feature vector \(L_{t}= \phi (\{z_{i}\},\{\alpha _{i}\}) = \sum ^N_{i} \alpha _{i}z_{i}\) serves as the \(1 \times d\) local feature conditioning vector used to generate \(s_{t+1}\). \(L_{t}\) and \(h_{t}\) are then passed to a visual sentinel, which multiplies \(L_{t}\) and \(F_{m}\) by gating scalars \(\beta _{L}\) and \(\beta _{F}\). Each gating scalar is computed from the hidden state through a learned sigmoid gate, \(\beta _{x} = \sigma (w_{x}h_{t} + b_{x})\), where \(w_{x}\) and \(b_{x}\) are parameters learned by the network. This allows the network to judge the importance of the \(L_{t}\) and \(F_{m}\) features when generating \(s_{t+1}\). \(\beta _{L}L_{t}\), \(\beta _{F}F_{m}\) and \(s_{t}\) are concatenated and fed into an LSTM. A deep multilayer perceptron (MLP) output layer [14] then computes \(\mathcal {X}_{prob} = \{ p_{1}, \ldots , p_{V} \}\), the probability distribution over the vocabulary of V words, from the \(1 \times H\) hidden state \(h_{t}\) of the LSTM. Specifically, the MLP takes \(s_{t}\), \(L_{t}\), \(h_{t}\) and \(F_{m}\) as input, projects them into the embedding space through the learned matrices \(W_{h} \in \mathbb {R}^{H \times E}\), \(W_{L} \in \mathbb {R}^{E \times d}\) and \(W_{f} \in \mathbb {R}^{E \times H}\), and applies \(W_{v} \in \mathbb {R}^{V \times E}\) followed by a softmax to obtain \(\mathcal {X}_{prob}\). We then apply an argmax over \(\mathcal {X}_{prob}\) to select the next word in the sentence. This process is repeated for all words in the sentence.
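A condensed sketch of one decoding step is shown below. The attention scoring functions, the sentinel gates and the output layer are plausible stand-ins for the parametrisation described above (the output layer here sees only \(h_{t}\) for brevity), and batch dimensions are omitted.

```python
import torch
import torch.nn as nn

class SentenceGeneratorStep(nn.Module):
    """One decoding step: two-level soft attention, sentinel gating, LSTM."""
    def __init__(self, d=512, H=512, E=512, V=596):
        super().__init__()
        self.att_spatial = nn.Linear(d + H, 1)  # scores kappa over a positions
        self.att_image = nn.Linear(d + H, 1)    # scores alpha over N images
        self.gate_L = nn.Linear(H, 1)           # sentinel gate beta_L
        self.gate_F = nn.Linear(H, 1)           # sentinel gate beta_F
        self.lstm = nn.LSTMCell(d + H + E, H)
        self.out = nn.Linear(H, V)              # stand-in for the deep MLP

    def forward(self, A, F_m, s_t, state):
        # A: (a, N, d) local features; F_m: (H,); s_t: (E,) word embedding.
        h, c = state
        # First level: kappa over the a spatial positions.
        hs = h.expand(A.size(0), A.size(1), h.size(0))
        kappa = torch.softmax(self.att_spatial(torch.cat([A, hs], -1)), dim=0)
        Z = (kappa * A).sum(0)                  # (N, d): one summary per image
        # Second level: alpha over the N images of the panel.
        hz = h.expand(Z.size(0), h.size(0))
        alpha = torch.softmax(self.att_image(torch.cat([Z, hz], -1)), dim=0)
        L = (alpha * Z).sum(0)                  # (d,) local conditioning vector
        # Visual sentinel: sigmoid gates weigh local vs. context features.
        b_L = torch.sigmoid(self.gate_L(h))
        b_F = torch.sigmoid(self.gate_F(h))
        h, c = self.lstm(torch.cat([b_L * L, b_F * F_m, s_t], -1), (h, c))
        probs = torch.softmax(self.out(h), dim=-1)  # distribution over V words
        return probs, (h, c), alpha
```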
3.2 Attention Regularization
When weights of the attention mechanism are not regularized, there is a possibility that the network will assign each data point equal attention weights. In this scenario, the attention mechanism does not offer any advantage over average pooling of image features. To this end, we apply a set of regularizations on the attention weights to enforce selectivity and attend only to image features that contribute to the model’s predictions at a given time step.
Xu et al. We first apply the regularization proposed by Xu et al. [17] to encourage the attention weights to sum to one along both the temporal and spatial directions of the alpha attention matrix. More specifically, Xu et al. encourage a doubly stochastic property on the attention matrix, which contains the attention weights for visual feature localisation at each time step.
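Since the displayed loss is not reproduced here, the snippet below sketches the familiar doubly stochastic penalty from [17] under our conventions: each row of the alpha matrix is already normalised by the softmax, so the penalty pushes each column to also accumulate a total weight of one over time.

```python
import torch

def xu_regularizer(alpha):
    """Doubly stochastic penalty of Xu et al. [17] (sketch).

    alpha: (T, N) attention matrix; each row is softmax-normalised over the
    N attended members, so only the column sums need penalising.
    """
    return ((1.0 - alpha.sum(dim=0)) ** 2).sum()
```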
Salient Alpha (SAL). SAL regularization increases the distance between the maximum attention weight \(max_{i}\) and the mean weight \(mean_{i}\), both taken along the column axis of \(\alpha _{ti}\), forcing the network to be highly selective when attending to image regions.
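As the displayed definition is not reproduced here, the sketch below implements SAL's stated aim under our conventions: at each time step, widen the gap between the maximum and mean attention weight; \(\delta \) (see the combination below) guards the division.

```python
import torch

def sal_regularizer(alpha, delta=1e-3):
    """SAL (sketch): reward a large max-vs-mean gap at each time step.

    alpha: (T, N). The gap is taken over the N members (the column axis of
    the alpha matrix); minimising the reciprocal maximises the gap.
    """
    gap = alpha.max(dim=1).values - alpha.mean(dim=1)   # (T,)
    return (1.0 / (gap + delta)).mean()
```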
Time Distributed Variance (TDVAR). TDVAR aims to increase the variance of the attention weights over time. This forces the network to assign different attention weights for each generated word and enforces high variability in the attended features when generating the text sequence. It is defined in terms of \(std_{t}\), the standard deviation, and \(mean_{t}\), the mean along the row axis of \(\alpha _{ti}\). We then combine these three regularization terms to produce \(C_{alpha}\), where \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) are hyperparameters that scale the contribution of each term, and \(\delta \) is used to avoid zero division and exploding gradients in the initial training steps.
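The sketch below completes the picture with TDVAR and the combined term. Since the displayed formulas are not reproduced here, reciprocals are used so that minimising \(C_{alpha}\) maximises selectivity and temporal variance, which matches the stated role of \(\delta \); it reuses the xu_regularizer and sal_regularizer helpers sketched above, with \(\lambda \) defaults following the values reported in Sect. 4.

```python
import torch

def tdvar_regularizer(alpha, delta=1e-3):
    """TDVAR (sketch): reward temporal variability of the attention weights.

    alpha: (T, N). std/mean are taken over time (the row axis), giving a
    coefficient of variation per member; minimising the reciprocal raises it.
    """
    cv = alpha.std(dim=0) / (alpha.mean(dim=0) + delta)  # (N,)
    return (1.0 / (cv + delta)).mean()

def c_alpha(alpha, lambdas=(1.0, 0.5, 0.5), delta=1e-3):
    """Combined attention regularizer C_alpha (sketch)."""
    l1, l2, l3 = lambdas
    return (l1 * xu_regularizer(alpha)
            + l2 * sal_regularizer(alpha, delta)
            + l3 * tdvar_regularizer(alpha, delta))
```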
3.3 Training Protocol
We use random initialisation for the neural network weights and zero initialisation for the biases. The embedding matrix \(\mathbb {R}^{V \times E}\) is randomly initialised with values between \(-1\) and \(+1\). During training, the input to the network is \(\mathcal {D}_{t} = \{\mathcal {I}, \mathcal {R}, \mathcal {Q}\}^{D_{t}}_{i=0}\). The cost function used to train the network combines the word prediction loss with the attention regularization term \(C_{alpha}\) described above.
We update the gradients of the network using truncated backpropagation through time (TBTT) with \(\tau = 2m\) [12] and Adam optimisation [8]; i.e., the error is computed over the generated sentence and the prior encoded previous sentence, each of length m. We implement the norm clipping strategy of [15] to stabilize the network and prevent exploding gradients. During inference, the inputs to the network are \(\mathcal {D}_{in} = \{\mathcal {I}, \mathcal {Q}\}^{D_{in}}_{i=0}\). NEWLINE tokens serve as the initial word inputs to the sentence generator, which generates each sentence word by word. Sentences are then generated one by one until the entire medical report sequence is complete.
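The sketch below shows one plausible training step consistent with this protocol: a teacher-forced decode, a cross-entropy word loss combined with \(C_{alpha}\) (our assumed composition of the cost), Adam, and gradient-norm clipping [15]; the model interface and the clipping threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, notes, report_ids, max_norm=5.0):
    """One optimisation step (sketch); `model` returns word logits and alpha."""
    optimizer.zero_grad()
    logits, alpha = model(images, notes, report_ids)  # teacher-forced decode
    word_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                report_ids.view(-1))
    loss = word_loss + c_alpha(alpha)                 # assumed: NLL + C_alpha
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # [15]
    optimizer.step()
    return loss.item()
```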
4 Experiments and Results
We prepare the dataset by resizing all images to \(224\times 224\times 3\) pixels in order to make use of a VGG-16 network pre-trained on ImageNet [2]. All words from the medical reports and clinical notes are tokenized, and we replace any word that occurs fewer than two times in the dataset with a special UNK token. This creates a vocabulary of 596 words. NEWLINE and EOS tokens are added to every sentence to indicate the start and end of the sentence respectively. Each sentence is padded to a fixed length of 40 words with NULL tokens. Finally, each medical report is padded to a fixed length of seven sentences.
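A sketch of this text preprocessing is shown below; the token spellings follow the text, while the helper functions themselves are ours.

```python
from collections import Counter

def build_vocab(texts, min_count=2):
    """Map words occurring at least min_count times to ids; rest become UNK."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = ["NULL", "UNK", "NEWLINE", "EOS"]
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}

def encode_sentence(sentence, vocab, max_len=40):
    """NEWLINE ... EOS framing, UNK substitution, NULL padding to max_len."""
    words = ["NEWLINE"] + sentence.split() + ["EOS"]
    ids = [vocab.get(w, vocab["UNK"]) for w in words]
    return (ids + [vocab["NULL"]] * max_len)[:max_len]
```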
We trained our model for 30 epochs using a learning rate of 0.001, \(\lambda _{1}=1\), \(\lambda _{2}=0.5\), \(\lambda _{3}=0.5\) and \(\delta = 0.001\). Performance was then evaluated using BLEU [13], ROUGE [11], and METEOR [10]. We use an implementation of [18] as the baseline for our experiment; we refer to this method as Recurrent Attention. As the authors of [18] did not publish the source code for their model, we include validation experiments for our implementation in the supplementary materials. The alpha regularization method proposed in [17], applied to our CORAL8 model, is also compared as a baseline. We conduct ablation studies to evaluate the contribution of each proposed component to the overall performance of our model. To determine the significance of clinical note features, we train a model that omits the initial prior encoder step and uses \(F_{init}\) as the context vector for the first sentence, i.e. \(F_{0} = F_{init}\). Examples of generated reports with and without clinical notes can be found in Table 2. To assess the effect of each alpha regularization term, we train a model that omits that term from the cost function. Visualizations of the effects these omissions have on the attention mechanism are provided in the supplementary materials. We also include a vanilla implementation where the sentence generator consists of an LSTM conditioned only on \(s_{t}\) and \(F_{init}\). The quantitative results for all models are provided in Table 1.
5 Discussion and Future Direction
Table 1 shows that removing any component of the proposed model decreases performance across all quantitative metrics. This suggests that SAL, TDVAR and the clinical notes all contribute to the final model performance. It is important to note that these metrics only measure the alignment of machine generated and ground truth texts; they do not provide sufficient means to validate the utility of these models in a clinical setting. However, as this is a pilot study in the application of deep learning to automatic reporting of the RDIF assay, these metrics are useful in establishing a quantitative baseline for measuring relative performance in terms of generating narrative texts.
Insights into how clinical data improve accuracy can be inferred from Table 2. The first example refers to a case of IgA nephropathy, Oxford classification S1 T1. S1 indicates that some glomeruli are segmentally sclerosed; this is a feature of both IgA nephropathy and focal segmental glomerular sclerosis (FSGS). The model without clinical notes concluded the image was FSGS, but the presence of suspected IgA nephropathy in the clinical notes resulted in the proposed model predicting IgA nephropathy. In the second example, the proposed model accurately predicts pauci-immune glomerulonephritis. This condition is often referred to as anti-neutrophil cytoplasmic antibody (ANCA) associated vasculitis due to its strong association with ANCA antibodies [6]. The presence of "ANCA positive ?ANCA vasculitis" in the clinical notes suggests that the proposed model is capturing the associations between the clinical context and the pathologist's impressions of the RDIF assay. Example 3 illustrates an incorrect impression generated by the CORAL8 model. Despite being incorrect, the example demonstrates how conditions with similar clinical features are modelled by the network: the underlying pathophysiology of both arterionephrosclerosis and the predicted condition (thrombotic microangiopathy) can be due to the observed hypertension [7, 9]. This indicates that clinical notes may help stratify candidate medical conditions into groups with shared clinical features.
These results indicate that, although the proposed architecture leads in relative terms, generating narrative-style medical reports remains a challenging problem. By releasing the dataset to the community, we hope to encourage further research into models that generate narrative medical reports for such multi-image medical panels. We advocate the inclusion of additional clinical data in such models in order to accommodate the ethos of stratified medicine in the modern clinical landscape. Once models achieve absolute, rather than relative, proficiency in generating narrative medical texts, future work will explore additional quantitative methods for validating the clinical utility of the machine generated reports for this task.
References
Alsaad, K., Herzenberg, A.: Distinguishing diabetic nephropathy from other causes of glomerulosclerosis: an update. J. Clin. Pathol. 60(1), 18–26 (2007). https://doi.org/10.1136/jcp.2005.035592
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255. IEEE (2009). https://doi.org/10.1109/cvprw.2009.5206848
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR, pp. 2625–2634 (2015). https://doi.org/10.1109/cvpr.2015.7298878
Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005). https://doi.org/10.1016/j.neunet.2005.06.042
Ho, J., et al.: Can digital pathology result in cost savings? A financial projection for digital pathology implementation at a large integrated health care organization. J. Pathol. Inform. 5(1), 33 (2014). https://doi.org/10.4103/2153-3539.139714
Kallenberg, C.G., Heeringa, P., Stegeman, C.A.: Mechanisms of disease: pathogenesis and treatment of ANCA-associated vasculitides. Nat. Rev. Rheumatol. 2(12), 661 (2006). https://doi.org/10.1038/ncprheum0355
Khanal, N., Dahal, S., Upadhyay, S., Bhatt, V.R., Bierman, P.J.: Differentiating malignant hypertension-induced thrombotic microangiopathy from thrombotic thrombocytopenic purpura. Ther. Adv. Hematol. 6(3), 97–102 (2015). https://doi.org/10.1177/2040620715571076
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kopp, J.B.: Rethinking hypertensive kidney disease. Curr. Opin. Nephrol. Hypertens. 22(3), 266–272 (2013). https://doi.org/10.1097/mnh.0b013e3283600f8c
Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: StatMT. Association for Computational Linguistics (2007). https://doi.org/10.3115/1626355.1626389
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Eleventh Annual Conference of the International Speech Communication Association (2010)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL. Association for Computational Linguistics (2001). https://doi.org/10.3115/1073083.1073135
Pascanu, R., Gulcehre, C., Cho, K., Bengio, Y.: How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026 (2013)
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML, pp. 1310–1318 (2013)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML, pp. 2048–2057 (2015)
Xue, Y., et al.: Multimodal recurrent model with attention for automated radiology report generation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 457–466. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_52
Zhang, Z., Xie, Y., Xing, F., McGough, M., Yang, L.: MDNet: a semantically and visually interpretable medical image diagnosis network. In: CVPR, pp. 6428–6436 (2017). https://doi.org/10.1109/cvpr.2017.378
Acknowledgements
This research was funded by the Australian Government through the Australian Research Council and Sullivan Nicolaides Pathology under Linkage Project LP160101797.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Maksoud, S., Wiliem, A., Zhao, K., Zhang, T., Wu, L., Lovell, B. (2019). CORAL8: Concurrent Object Regression for Area Localization in Medical Image Panels. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science, vol. 11764. Springer, Cham. https://doi.org/10.1007/978-3-030-32239-7_48
DOI: https://doi.org/10.1007/978-3-030-32239-7_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32238-0
Online ISBN: 978-3-030-32239-7