1 Introduction

In breast examinations, such as mammography, detected actionable tumors are further examined through invasive histology. Interpretation of these modalities suffers from high inter-observer variability and limited reproducibility [1]. In this context, a reference-based assessment, such as presenting prior cases with similar disease manifestations (termed Content-Based Image Retrieval (CBIR)), could help circumvent discrepancies in cancer grading. With the growing size of clinical databases, such a CBIR system ought to be both scalable and accurate. Towards this, hashing approaches for CBIR are being actively investigated to represent images as compact binary codes that support fast and accurate retrieval [2,3,4].

Fig. 1. Overview of DMIH for end-to-end generation of bag-level hash codes. Breast anatomy image is attributed to Cancer Research UK/Wikimedia Commons.

Malignant carcinomas are often co-located with benign manifestations and suspect normal tissues [5]. In such cases, describing the whole image with a single label is inadequate for objective machine learning and instead requires expert annotations delineating the exact location of the region of interest. This argument extends to screening modalities such as mammography, where multiple anatomical views are acquired. In such scenarios, the status of the tumor is best represented to a CBIR system by constituting a bag of all associated images, making the problem multiple instance (MI) in nature. With this as our premise, we present, for the first time, a novel deep learning based MI hashing method, termed Deep Multiple Instance Hashing (DMIH).

Seminal works on shallow learning-based hashing, such as Iterative Quantization (ITQ) [6] and Kernel Sensitive Hashing (KSH) [2], propose a two-stage framework involving extraction of hand-crafted features followed by binarization. Yang et al. extend these methods to MI learning scenarios with two variants: Instance-Level MI Hashing (IMIH) and Bag-Level MI Hashing (BMIH) [7]. However, these approaches are not end-to-end and are susceptible to the semantic gap between features and associated concepts. Alternatively, deep hashing methods such as simultaneous feature learning and hashing (SFLH) [8], deep hashing networks (DHN) [9] and deep residual hashing (DRH) [3] learn representations and hash codes in an end-to-end fashion, in effect bridging this semantic gap. It must be noted that all of the above deep hashing works target single instance (SI) hashing scenarios and an extension to MI hashing has not been investigated.

Earlier works on MI deep learning in computer vision include that of Wu et al. [10], where the concept of an MI pooling (MIPool) layer is introduced to aggregate representations for multi-label classification. Yan et al. leveraged MI deep learning for efficient body part recognition [11]. Unlike MI classification, which potentially substitutes the decision of the clinician, retrieval aims at presenting the clinician with richer contextual information to facilitate decision-making.

DMIH effectively bridges the two concepts for CBIR systems by combining the representation learning strength of deep MI learning with the potential for scalability arising from hashing. Within CBIR for breast cancer, notable prior art includes work on mammogram image retrieval by Jiang et al. [12] and large-scale histology retrieval by Zhang et al. [4]. Both of these works pose CBIR as an SI retrieval problem. Contrasting with [4, 12], within DMIH we create a bag of images to represent a particular pathological case and generate a bag-level hash code, as shown in Fig. 1. Our contributions in this paper are: (1) the introduction of a robust supervised retrieval loss for learning in the presence of weak labels and potential outliers; (2) training with an auxiliary SI arm with a gradual loss trade-off for improved trainability; and (3) the incorporation of the MIPool layer to aggregate representations across a variable number of instances within a bag, generating bag-level discriminative hash codes.

2 Methodology

Let us consider a database \(\mathcal{B} = \{B_1, \dots , B_{N_{B}}\}\) with \(N_{B}\) bags. Each bag \(B_{i}\), with a varying number \(n_{i}\) of instances, is denoted as \(B_i = \{I_1,\dots ,I_{n_{i}}\}\). We aim at learning a mapping \(\mathcal{H}:\mathcal{B} \rightarrow \{-1,1\}^K\) from each bag to a K-dimensional Hamming space, such that bags with similar instances and labels are mapped to similar codes. For supervised learning of \(\mathcal{H}\), we define a bag-level pairwise similarity matrix \(\mathcal{S}^{\text{MI}}=\{s_{ij}\}_{i,j=1}^{N_{B}}\), such that \(s_{ij}=1\) if the bags are similar and zero otherwise. In applications such as this one, where retrieval ground truth is unavailable, classification labels can be used as a surrogate for generating \(\mathcal{S}^{\text{MI}}\).
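As an illustration, a minimal NumPy sketch of how such a surrogate similarity matrix could be built from bag-level class labels (the function name is ours):

```python
import numpy as np

def bag_similarity_matrix(bag_labels):
    """Surrogate bag-level similarity matrix S^MI built from class labels:
    s_ij = 1 if bags i and j carry the same label, and 0 otherwise."""
    labels = np.asarray(bag_labels)
    return (labels[:, None] == labels[None, :]).astype(np.int8)

# e.g. three bags labelled normal / benign / normal
S = bag_similarity_matrix(["normal", "benign", "normal"])
# [[1 0 1]
#  [0 1 0]
#  [1 0 1]]
```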

Fig. 2. DMIH architecture.

Architecture: As shown in Fig. 2, the proposed DMIH framework consists of a deep CNN terminating in a fully connected layer (FCL). Its outputs \(\{z_{ij}\}_{j=1}^{n_{i}}\) are fed into the MIPool layer, which pools across the instances of a bag (e.g. \(\max_{\forall j}\{z_{ij}\}_{j=1}^{n_{i}}\), mean\((\cdot)\), etc.) to generate the aggregated representation \(\hat{z}_{i}\). \(\hat{z}_{i}\) is an embedding in the space of the bags and is the input to a fully connected MI hashing layer. The output of this layer is squashed to \([-1, 1]\) by a tanh\(\{\cdot\}\) activation to generate \(\mathbf{h}_{i}^{\text{MI}}\), which is quantized to produce the bag-level hash code \(\mathbf{b}_{i}^{\text{MI}} = \text{sgn}(\mathbf{h}_{i}^{\text{MI}})\). The deep CNN mentioned above could be a pretrained network, such as VGGF [13], GoogleNet [14] or ResNet50 (R50) [15], or an application-specific network.

During training of DMIH, we introduce an auxiliary SI hashing (aux-SI) arm, as shown in Fig. 2. It taps off at the FCL and feeds directly into a fully connected SI hashing layer with tanh\(\{\cdot\}\) activation to generate instance-level non-quantized hash codes, denoted as \(\{h_{ij}^{\text{SI}}\}_{j=1}^{n_{i}}\). While training DMIH with backpropagation, the MIPool layer significantly sparsifies the gradients (analogous to using a very high dropout rate while training CNNs), thus limiting the trainability of the preceding layers. The SI hashing arm helps mitigate this by producing auxiliary instance-level gradients.
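To make the data flow concrete, the following is a minimal PyTorch sketch of the forward pass described above, assuming an R50 backbone and max-pooling within the MIPool layer; the class and variable names are illustrative and not part of the original (MatConvNet) implementation:

```python
import torch
import torch.nn as nn
import torchvision

class DMIHSketch(nn.Module):
    """Minimal sketch of the DMIH forward pass (illustrative names and sizes)."""

    def __init__(self, code_size=16, feat_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet50()              # stand-in for R50
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.fcl = nn.Linear(2048, feat_dim)                   # fully connected layer (FCL)
        self.mi_hash = nn.Linear(feat_dim, code_size)          # MI hashing layer
        self.si_hash = nn.Linear(feat_dim, code_size)          # auxiliary SI hashing arm

    def forward(self, bag):                                    # bag: (n_i, 3, 224, 224)
        z = self.fcl(self.cnn(bag).flatten(1))                 # instance representations z_ij
        h_si = torch.tanh(self.si_hash(z))                     # instance-level codes (training only)
        z_hat = z.max(dim=0, keepdim=True).values              # MIPool: max over instances in the bag
        h_mi = torch.tanh(self.mi_hash(z_hat))                 # relaxed bag-level code h_i^MI
        b_mi = torch.sign(h_mi)                                 # quantized bag-level hash code b_i^MI
        return h_mi, h_si, b_mi
```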

Model Learning and Robust Optimization: To learn similarity-preserving hash codes, we propose a robust version of the supervised retrieval loss based on neighborhood component analysis (NCA) employed by [16]. The motivation for introducing robustness within the loss function is two-fold: (1) robustness induces immunity to potentially noisy labels arising from the high inter-observer variability and limited reproducibility of the applications at hand [1]; (2) it can effectively counter ambiguous label assignment while training with the aux-SI hashing arm. Given \(\mathcal{S}^{\text{MI}}\), the robust supervised retrieval loss \(J_{\mathcal{S}}^{\text{MI}}\) is defined as \(J_{\mathcal{S}}^{\text{MI}} = 1 - \frac{1}{N_{B}^{2}} \sum_{i,j = 1}^{N_{B}} s_{ij}p_{ij}\), where \(p_{ij}\) is the probability that two bags (indexed as i and j) are neighbors. Given hash codes \(\mathbf{h}_i = \{ h_{i}^{k} \}_{k=1}^{K}\) and \(\mathbf{h}_j\), we define a bit-wise residual operation \(r_{ij}\) as \(r_{ij}^k = (h_i^k - h_j^k)\). We estimate \(p_{ij}\) as:

$$\begin{aligned} p_{ij} = \frac{e^{-\mathcal{L}_{\text{Huber}}(\mathbf{h}_i,\mathbf{h}_j)}}{\sum_{l\ne i}^{N_{B}}e^{-\mathcal{L}_{\text{Huber}}(\mathbf{h}_i,\mathbf{h}_l)}},\text{ where } \mathcal{L}_{\text{Huber}}(\mathbf{h}_i,\mathbf{h}_j) = \sum_{\forall k}\rho_k(r_{ij}^{k}). \end{aligned}$$
(1)

\(\mathcal {L}_{\text {Huber}}(\mathbf {h_i},\mathbf {h_j})\) is the Huber norm between hash codes for bags i and j, while the robustness operation \(\rho _k\) is defined as:

$$\begin{aligned} \rho _k(r_{ij}^k) = {\left\{ \begin{array}{ll} \frac{1}{2}(r_{ij}^k)^2, &{}\text {if}\, |r_{ij}^k| \leqslant c_k \\ c_k |r_{ij}^k| - \frac{1}{2} c_k^2, &{}\text {if}\, |r_{ij}^k| > c_k \end{array}\right. } \end{aligned}$$
(2)

In Eq. (2), the tuning factor \(c_k\) is estimated inherently from the data and is set to \(c_k=1.345 \times \sigma _k\). The factor of 1.345 is chosen to provide approximately 95% asymptotic efficiency and \(\sigma _k\) is a robust measure of bit-wise variance of \(r_{ij}^k\). Specifically, \(\sigma _k\) is estimated as 1.485 times the median absolute deviation of \(r_{ij}^k\) as empirically suggested in [17]. This robust formulation provides immunity to outliers during training by clipping their gradients. For training with the aux-SI hashing arm, we employ a similar robust retrieval loss \(J_{\mathcal {S}}^{\text {SI}}\) defined over single instances with bag-labels assigned to member instances.
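A minimal PyTorch sketch of this robust retrieval loss for a mini-batch of relaxed bag-level codes is given below; computing the MAD-based estimate of \(c_k\) per batch and keeping it fixed with respect to the gradients is our reading of the description above:

```python
import torch

def huber_rho(r, c):
    """Bit-wise robustness operation rho_k of Eq. (2): quadratic inside c_k, linear outside."""
    return torch.where(r.abs() <= c, 0.5 * r ** 2, c * r.abs() - 0.5 * c ** 2)

def robust_nca_loss(h, s):
    """Robust supervised retrieval loss over a batch of relaxed hash codes.

    h: (N, K) tanh outputs, s: (N, N) binary similarity matrix.
    A sketch of Eqs. (1)-(2); the batch-wise MAD estimate of c_k is our reading."""
    n, k = h.shape
    r = h.unsqueeze(1) - h.unsqueeze(0)                     # residuals r_ij^k, shape (N, N, K)
    r_flat = r.reshape(-1, k)
    # c_k = 1.345 * sigma_k, with sigma_k = 1.485 * median absolute deviation of r^k
    mad = (r_flat - r_flat.median(dim=0).values).abs().median(dim=0).values
    c = (1.345 * 1.485 * mad).detach()                      # data-driven constant, no gradient
    d_huber = huber_rho(r, c).sum(dim=-1)                   # L_Huber(h_i, h_j), shape (N, N)
    mask = torch.eye(n, dtype=torch.bool, device=h.device)  # exclude l = i from the normalizer
    p = torch.softmax((-d_huber).masked_fill(mask, float("-inf")), dim=1)
    return 1.0 - (s * p).sum() / n ** 2
```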

To minimize the loss of retrieval quality due to quantization, we use the differentiable quantization loss \(J_{Q} = \sum_{i=1}^{M}\log\cosh(|\mathbf{h}_i| - \mathbf{1})\) proposed in [9]. This loss also counters the effect of using a continuous relaxation in the definition of \(p_{ij}\) instead of the Hamming distance. As is standard practice in deep learning, we also add a weight decay regularization term \(R_{W}\), the Frobenius norm of the weights and biases, to regularize the cost function and avoid over-fitting.
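The quantization penalty itself is compact; a sketch consistent with the formula above, applied here to the relaxed bag-level codes:

```python
import torch

def quantization_loss(h):
    """Differentiable quantization loss J_Q = sum log(cosh(|h| - 1)),
    pulling each relaxed code component towards {-1, +1} before sgn(.) is applied."""
    return torch.log(torch.cosh(h.abs() - 1.0)).sum()
```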

The following composite loss is used to train DMIH:

$$\begin{aligned} J = \lambda _{\text {MI}}^{t} J_{\mathcal {S}}^{\text {MI}}+ \lambda _{\text {SI}}^{t} J_{\mathcal {S}}^{\text {SI}}+ \lambda _q J_{Q} + \lambda _w R_{W} \end{aligned}$$
(3)
Fig. 3. Weight trade.

where \(\lambda_{\text{MI}}^{t}\), \(\lambda_{\text{SI}}^{t}\), \(\lambda_q\) and \(\lambda_w\) are hyper-parameters that control the contribution of each loss term. Specifically, \(\lambda_{\text{MI}}^{t}\) and \(\lambda_{\text{SI}}^{t}\) control the trade-off between the MI and SI hashing losses. The SI arm plays a significant role only in the early stages of training and is eventually traded off to avoid sub-optimal MI hashing. For this, we introduce a weight trade-off formulation that gradually down-regulates \(\lambda_{\text{SI}}^{t}\) while simultaneously up-regulating \(\lambda_{\text{MI}}^{t}\), setting \(\lambda_{\text{MI}}^{t} = 1 - \lambda_{\text{SI}}^{t}\), where t is the current epoch and \(t_{\text{max}}\) is the maximum number of epochs (see Fig. 3). We train DMIH with mini-batch stochastic gradient descent (SGD) with momentum. To guard against potential outliers at the beginning of training, we scale \(c_{k}\) up by a factor of 7 for \(t = 1\) to allow a stable state to be reached.
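Putting these pieces together, the sketch below assembles the composite objective of Eq. (3) from the loss sketches above; since the exact decay schedule for \(\lambda_{\text{SI}}^{t}\) is not reproduced in the text, a simple linear ramp over epochs is assumed here purely as a placeholder:

```python
def tradeoff_weights(t, t_max):
    """Hypothetical weight trade-off: a linear ramp is assumed as a placeholder,
    down-regulating the aux-SI arm while keeping lam_mi = 1 - lam_si."""
    lam_si = max(0.0, 1.0 - t / float(t_max))
    return 1.0 - lam_si, lam_si

def composite_loss(h_mi, h_si, s_mi, s_si, t, t_max, params, lam_q=0.05, lam_w=0.001):
    """Composite objective of Eq. (3), assembled from the loss sketches above.

    h_mi/s_mi: bag-level codes and similarity; h_si/s_si: instance-level codes with
    bag labels assigned to member instances; params: iterable of weight tensors."""
    lam_mi, lam_si = tradeoff_weights(t, t_max)
    j = lam_mi * robust_nca_loss(h_mi, s_mi) + lam_si * robust_nca_loss(h_si, s_si)
    j = j + lam_q * quantization_loss(h_mi)                 # limit the quantization error
    j = j + lam_w * sum(p.pow(2).sum() for p in params)     # weight decay regularizer R_W
    return j
```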

3 Experiments

Databases: The clinical applicability of DMIH has been validated on two large-scale datasets, namely the Digital Database for Screening Mammography (DDSM) [12, 18] and a retrospectively acquired histology dataset from the Indiana University Health Pathology Lab (IUPHL) [4, 19]. The DDSM dataset comprises 11,617 expert-selected regions of interest (ROI) curated from 1861 patients. Multiple ROIs associated with a single breast from two anatomical views constitute a bag (size: 1–12; median: 2), which has been annotated as normal, benign or malignant by expert radiologists. A bag labeled malignant could potentially contain multiple suspect normal and benign masses, which have not been individually identified. The IUPHL dataset is a collection of 653 ROIs from histology slides of 40 patients (20 with usual ductal hyperplasia (UDH) and the rest with ductal carcinoma in situ (DCIS)), with ROI-level annotations done by expert histopathologists. Due to the high variability in the sizes of these ROIs (up to 9K \(\times\) 8K pixels), we extract multiple patches and populate an ROI-level bag (size: 1–15; median: 8). From both datasets, we use patient-level non-overlapping splits to constitute the training (80%) and testing (20%) sets.

Model Settings and Validations: To validate the proposed contributions, namely robustness within the NCA loss and the trade-off with the aux-SI arm, we perform ablative testing with combinations of their baseline variants by fine-tuning multiple network architectures. Additionally, we compare DMIH against four state-of-the-art methods: ITQ [6], KSH [2], SFLH [8] and DHN [9]. For a fair comparison, we use R50 for both SFLH and DHN, since, as discussed later, it performs the best. Since SFLH and DHN were originally proposed for SI hashing, we introduce additional MI variants by hashing through the MIPool layer. For ITQ and KSH, we further create two comparative settings: (1) using IMIH [7], which learns instance-level hash codes followed by bag-level distance computation; and (2) using BMIH [7] with bag-level kernelized representations followed by binarization.

For IMIH and SI variants of SFLH, DHN and DMIH, given two bags \(B_p\) and \(B_q\) with SI hash codes, say \(\mathcal {H}(B_q) = \{h_{q1},\dots ,h_{qM}\}\) and \(\mathcal {H}(B_p) = \{h_{p1},\dots ,h_{pN}\}\), the bag-level distance is computed as:

$$\begin{aligned} d(B_p,B_q) = \frac{1}{M}\sum_{i=1}^{M}\left( \min_{\forall j}\ \text{Hamming}(h_{qi},h_{pj})\right) . \end{aligned}$$
(4)
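This bag-level distance can be computed directly from the instance-level codes; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def bag_distance(codes_q, codes_p):
    """Bag-level distance of Eq. (4) between two bags of {-1, +1} instance codes.

    codes_q: (M, K) codes of the query bag B_q, codes_p: (N, K) codes of B_p.
    For each code in B_q, take its closest (minimum-Hamming) match in B_p and average."""
    k = codes_q.shape[1]
    hamming = (k - codes_q @ codes_p.T) / 2.0   # pairwise Hamming distances, shape (M, N)
    return hamming.min(axis=1).mean()
```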

All images were resized to \(224 \times 224\) and the training data were augmented with random rigid transformations to create equally balanced classes. \(\lambda_{\text{MI}}^{t}\) and \(\lambda_{\text{SI}}^{t}\) were set assuming a \(t_{\text{max}}\) of 150 epochs; \(\lambda_q\) and \(\lambda_w\) were set to 0.05 and 0.001, respectively. The momentum term within SGD was set to 0.9 and the batch size to 128 for DDSM and 32 for IUPHL. For efficient learning, we use an exponentially decaying learning rate initialized at 0.01. The DMIH framework was implemented in MatConvNet [20]. We use standard retrieval quality metrics, nearest neighbor classification accuracy (nnCA) and precision-recall (PR) curves, to perform the aforementioned comparisons. The results (nnCA) from ablative testing and from the comparative methods are tabulated in Tables 1 and 2, respectively. Within Table 2, methods were evaluated at two different code sizes (16 and 32 bits). We also present the PR curves of selected bag-level methods (32 bits) in Fig. 5.

Fig. 4. Retrieval results for DMIH at code size 16 bits.

Fig. 5. PR curves for DDSM and IUPHL datasets at code size of 32.

Table 1. Performance of ablative testing at a code size of 16 bits. We report the nearest neighbor classification accuracy (nnCA) estimated over unseen test data. Letters A–E are introduced for easier comparison, as discussed in Sect. 4.

4 Results and Discussion

Effect of aux-SI Loss: To justify using the aux-SI loss, we introduce a variant of DMIH without it (E in Table 1), which leads to a significant decline of 3% to 14% relative to DMIH. This decline can be attributed to the gradient sparsification caused by the MIPool layer, which the aux-SI arm otherwise counteracts. From Table 1, we further observe a 3%–10% increase in performance when comparing the gradually decaying trade-off (B) against the baseline setting (\(\lambda^{t}_{\text{MI}} = \lambda^{t}_{\text{SI}} = 0.5\); A, C).

Effect of Robustness: For robust-NCA, we compared against the original NCA formulation proposed in [16] (A, B, D in Table 1). Robustness helps handle potentially noisy MI labels, inconsistencies within a bag and the ambiguity in assigning SI labels. Comparing the effect of robustness for baselines without the SI hashing arm (D vs. E), we observe a marginally positive improvement across architectures and datasets, with a substantial 7% gain for ResNet50 on DDSM. Robustness contributes more with the addition of the aux-SI hashing arm (proposed vs. E), with improved performance in the range of 4%–5% across all settings. This observation further validates our earlier argument.

Effect of Quantization: To assess the effect of quantization, we define two baselines: (1) setting \(\lambda_{q} = 0\) and (2) using non-quantized hash codes for retrieval (DMIH-NB). The latter potentially acts as an upper bound for performance evaluation. From Table 1, we observe a consistent increase in performance by margins of 3%–5% when DMIH is learnt with an explicit quantization loss that limits the associated error. It must also be noted that, compared with DMIH-NB, there is only a marginal fall in performance (2%–4%), which is desired.

As a whole, the proposed two-pronged approach, combining robustness and the trade-off along with the quantization loss, delivers the highest performance, demonstrating that DMIH learns effectively despite the ambiguity induced by the SI hashing arm. Figure 4 shows the retrieval performance of DMIH on the target databases. For IUPHL, the retrieved images are semantically similar to the query, as consistent anatomical signatures are evident in the retrieved neighbors. For DDSM, the retrieved neighbors are consistent for the cancer and normal cases; however, it is hard to distinguish between benign and malignant cases. The retrieval time for a single query was observed at 31.62 ms for IUPHL and 17.48 ms for DDSM, showing potential for fast and scalable search.

Table 2. Results of comparison with state-of-the-art hashing methods.

Comparative Methods

In the contrastive experiments against ITQ and KSH, hand-crafted GIST [21] features underperformed significantly, while the improvement with the R50 features ranged from 5%–30%. However, DMIH still performed 10%–25% better.

Comparing the SI with the MI variants of DHN, SFLH and DMIH, we observe that performance improves in the range of 3%–11%, suggesting that end-to-end learning of MI hash codes is preferable to two-stage hashing, i.e. hashing at the SI level and comparing at the bag level with Eq. (4). DMIH fares better than both the SI and MI versions of SFLH and DHN, owing to the robustness of the proposed retrieval loss function. As also seen from the associated PR curves in Fig. 5, the performance gap between shallow and deep hashing methods remains significant despite the use of R50 features. Overall, the comparative results strongly support our premise that end-to-end learning of MI hash codes is preferable to conventional two-stage approaches.

5 Conclusion

In this paper, for the first time, we propose an end-to-end deep robust hashing framework, termed DMIH, for retrieval under a multiple instance setting. We incorporate the MIPool layer to aggregate representations across instances to generate a bag-level discriminative hash code. We introduce the notion of robustness into our supervised retrieval loss and improve the trainability of DMIH by utilizing an aux-SI hashing arm regulated by a trade-off. Extensive validations and ablative testing on two public breast cancer datasets demonstrate the superiority of DMIH and its potential for future extension to other MI applications.