1 Introduction

Deep learning models [1, 10] have achieved many successes in biomedical image segmentation. To obtain good segmentation performance, a considerable number of (pixel-wise) annotated images is often required to train such models. Due to the high cost of pixel-wise annotation and the large image sizes in many applications (e.g., 3D image stacks with hundreds of slices, 2D whole-tissue images with hundreds of millions of pixels), it is common that annotation is available for only a small subset of all image data. Thus, when training a deep learning model using annotated images, one may also have a considerable number of unannotated images at hand. Such unannotated images are often drawn from the original data distribution (and thus contain useful information) and are free to use. Hence, a natural question is: How can we utilize unannotated images to improve segmentation?

Some recent attempts [5, 7] have been made to utilize weakly annotated images in natural scene image segmentation. Bounding boxes (to bound objects of interest) and image-level labels (to indicate which objects appear in an image) are two common forms of weak annotation in such settings. However, in biomedical image segmentation, there can be far more object instances (e.g., cells) than in natural scene images, and drawing bounding boxes still requires a great deal of effort. Also, there are often far fewer object classes in biomedical images than in natural scene images, and image-level labels may be less useful in biomedical settings since nearly all the images may contain all the object classes of interest (e.g., cells, glands). Thus, it is important to exploit unannotated images, in addition to annotated images, for effective biomedical image segmentation.

Using unannotated data together with annotated data to train a learning model is not new. The work in [14] combined an auxiliary unsupervised learning task with the supervised training of a neural network, sharing the intermediate layers between the supervised and unsupervised tasks; consequently, the network can be trained for better generalization. Following this approach, different choices of unsupervised learning tasks were proposed (e.g., reconstructing the input of the model through an encoding and decoding stage [8], or a classification task that transforms the input into specially designed class labels [3]). As pointed out in [9], a key drawback of this approach is that, since the unsupervised and supervised learning tasks have different goals, the unsupervised part may not always help the supervised part via the shared model parameters. To alleviate this problem, Ladder networks (with skip connections) were used to reduce the burden that the unsupervised learning part puts on the encoding layers [9]. Despite this, the inherent problem of the supervised and unsupervised learning tasks having different goals was still not well resolved.

It would be ideal to use both annotated and unannotated data to serve the same goal (e.g., using both for training a segmentation network, as in our problem). A major difficulty is that, since no ground truth is given for unannotated data, back-propagation errors cannot be directly computed for such data after the forward pass. Our key idea is to train a deep neural network to compute approximate errors for unannotated data, using adversarial training [4, 11].

In this paper, we propose a new adversarial training approach, i.e., a deep adversarial network (DAN) model, for producing consistently good segmentation for both annotated and unannotated images. Our DAN model consists of two networks: (1) a segmentation network (SN) to conduct segmentation; (2) an evaluation network (EN) to assess the quality of SN’s segmentation. During training, EN is encouraged to distinguish between segmentation results of unannotated and annotated samples by giving them different scores, while SN is encouraged to produce segmentation results of unannotated images such that EN cannot distinguish these from the annotated ones. Through an iterative adversarial training process, because EN is constantly “criticizing” the segmentation of unannotated images using its learned feature mappings (describing what good segmentation looks like), SN can be trained to produce more and more accurate segmentation for unannotated and unseen samples. Our method is inspired by [4, 6]. Different from [6], our adversarial networks are designed to utilize unannotated images.

Experiments using the 2015 MICCAI Gland Challenge dataset [13] and a 3D fungus segmentation dataset show that our DAN model is effective in utilizing unannotated image data to obtain segmentation of considerably better quality.

2 Method

This section describes our adversarial training model utilizing unannotated data, and discusses a key issue: How to construct the input for the evaluation network.

2.1 Adversarial Networks Using Unannotated Data

There are two networks in our DAN model: a segmentation network SN and an evaluation network EN. SN takes an input image I and produces segmentation probability maps for I. EN takes the segmentation probability maps and the corresponding input image I, and determines a score indicating the quality of the segmentation: 1 (for good quality) or 0 (for not good quality).
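To make these interfaces concrete, the following is a minimal PyTorch-style sketch of the two networks. The architectures here are simplified stand-ins (a few convolution layers) rather than the DCAN-based SN and VGG16-based EN described below; the class names, channel counts, and layer choices are illustrative assumptions only.

```python
# Minimal, simplified stand-ins for SN and EN (illustrative assumptions only;
# the actual SN follows DCAN and the actual EN follows VGG16 -- see Fig. 2).
import torch
import torch.nn as nn


class SegmentationNet(nn.Module):
    """SN: maps an input image to per-class segmentation probability maps."""
    def __init__(self, in_channels=3, n_classes=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, 3, padding=1),
        )

    def forward(self, image):
        # [B, K, H, W] segmentation probability maps
        return torch.softmax(self.body(image), dim=1)


class EvaluationNet(nn.Module):
    """EN: maps segmentation maps (combined with the image) to a score in [0, 1]."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.score = nn.Linear(16, 1)

    def forward(self, en_input):
        f = self.features(en_input).flatten(1)
        return torch.sigmoid(self.score(f))  # [B, 1] segmentation-quality score
```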

During the model training, EN is encouraged to give high scores (1) for segmentation of annotated images and low scores (0) for segmentation of unannotated images. SN is trained using annotated images and is also encouraged to produce segmentation results of unannotated images such that EN might give them high scores. Below we describe the details of our adversarial training model.

Given M annotated training images \(X_m\), their corresponding segmentation ground truth \(Y_m\), and N unannotated images \(U_n\), we define the loss function as

$$\begin{aligned} \ell (\theta _S, \theta _E) = \sum _{m=1}^{M} \ell _{mce}(S(X_m),Y_m) - \lambda \Big [\sum _{m=1}^{M} \ell _{bce}(E(S(X_m),X_m),1) + \sum _{n=1}^{N} \ell _{bce}(E(S(U_n),U_n),0)\Big ] \end{aligned}$$

where \(\theta _S\) and \(\theta _E\) are the parameters of the two networks SN and EN respectively, \(\ell _{mce}\) is the multi-class cross-entropy loss, and \(\ell _{bce}\) is the binary-class cross-entropy loss. The first term in the loss function is for the supervised training of SN using annotated images, and the second term forms the adversarial training part. The training process minimizes part of the loss with respect to the parameters \(\theta _S\) of SN, while maximizing the loss with respect to the parameters \(\theta _E\) of EN. More specifically, training EN aims to minimize

$$\begin{aligned} {\lambda [\sum _{m=1}^{M} \ell _{bce}(E(S(X_m),X_m),1)+\sum _{n=1}^{N} \ell _{bce}(E(S(U_n),U_n),0)]} \end{aligned}$$

with respect to the parameters \(\theta _E\) of EN, and training SN aims to minimize

$$\begin{aligned} { \sum _{m=1}^{M} \ell _{mce}(S(X_m),Y_m)-\lambda (\sum _{n=1}^{N} \ell _{bce}(E(S(U_n),U_n),0))} \end{aligned}$$

with respect to the parameters \(\theta _S\) of SN. As in [4], when updating SN, we replace the term \(-\lambda (\sum _{n=1}^{N}\ell _{bce}(E(S(U_n),U_n),0))\) by \(\lambda (\sum _{n=1}^{N} \ell _{bce}(E(S(U_n),U_n),1))\). A standard stochastic gradient descent method can be applied to optimize this loss function. Since the adversarial training part may be less useful before SN can produce reasonably good segmentation for the annotated training images, we set \(\lambda = 0.1\) initially, and set \(\lambda =1\) after 30000 iterations. The value of \(\lambda\) should be small (\(<1\)) before SN can produce decent segmentation results; a too large value (e.g., \(\lambda = 10\)) may prevent training a reasonable SN at all. Figure 1 shows our training process. Figure 2 gives more details of the SN and EN architectures. Our SN largely follows the architecture of DCAN [2], but without the split up-sampling (deconvolution) paths. Our EN follows the main architecture of the classic VGG16 network [12].
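For concreteness, the alternating updates described above can be written as in the following hedged sketch. It assumes SN and EN modules with the interfaces sketched earlier, standard SGD optimizers, integer class-label maps y_ann for the annotated images, and (for simplicity) EN input formed by plain concatenation of the probability maps and the image; Sect. 2.2 discusses better ways to construct this input. The function and variable names are our own and not from the authors' implementation.

```python
# A hedged sketch of one DAN training iteration, including the label flip used
# for the SN update on unannotated images (minimizing bce(E(.), 1) instead of
# maximizing bce(E(.), 0)). All names here are illustrative assumptions.
import torch
import torch.nn.functional as F


def en_input(probs, image):
    # Placeholder: simple concatenation of probability maps and image.
    # Sect. 2.2 describes the element-wise-multiplication mixing actually used.
    return torch.cat([probs, image], dim=1)


def train_iteration(sn, en, opt_sn, opt_en, x_ann, y_ann, x_un, lam):
    # (1) Supervised SN update on annotated images (multi-class cross entropy).
    opt_sn.zero_grad()
    loss_sup = F.nll_loss(torch.log(sn(x_ann) + 1e-8), y_ann)
    loss_sup.backward()
    opt_sn.step()

    # (2) EN update: push scores toward 1 for segmentations of annotated images
    #     and toward 0 for segmentations of unannotated images; SN is frozen here.
    opt_en.zero_grad()
    with torch.no_grad():
        p_ann, p_un = sn(x_ann), sn(x_un)
    s_ann = en(en_input(p_ann, x_ann))
    s_un = en(en_input(p_un, x_un))
    loss_en = lam * (F.binary_cross_entropy(s_ann, torch.ones_like(s_ann)) +
                     F.binary_cross_entropy(s_un, torch.zeros_like(s_un)))
    loss_en.backward()
    opt_en.step()

    # (3) Adversarial SN update on unannotated images with the flipped label 1,
    #     pushing SN toward segmentations that EN scores as "annotated-like".
    opt_sn.zero_grad()
    s_un = en(en_input(sn(x_un), x_un))
    loss_adv = lam * F.binary_cross_entropy(s_un, torch.ones_like(s_un))
    loss_adv.backward()  # gradients flow through EN into SN; only SN is stepped
    opt_sn.step()
    return loss_sup.item(), loss_en.item(), loss_adv.item()


# Outer loop with the lambda schedule from the text (0.1 early, 1 after 30000
# iterations); `batches` is a hypothetical mini-batch iterator.
# for it, (x_ann, y_ann, x_un) in enumerate(batches):
#     lam = 0.1 if it < 30000 else 1.0
#     train_iteration(sn, en, opt_sn, opt_en, x_ann, y_ann, x_un, lam)
```

Note that in step (3) the loss is back-propagated through EN, but only SN's parameters are updated; EN's accumulated gradients are simply cleared at its next zero_grad call.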

Fig. 1. Illustrating the processes in one iteration of our model training (with mini-batch size = 4). First, SN is trained using annotated images and their corresponding ground truth images; then, EN is trained to give different scores to segmentations of annotated images and unannotated images; finally, SN is trained to improve the segmentation quality of unannotated images (based on EN’s learned feature mappings).

2.2 Constructing the Input of the Evaluation Network

The input information provided to EN is crucial to the whole adversarial training system. A simple form for the input of EN could be just the segmentation probability maps, which allow EN to examine useful morphological properties of the segmented biomedical objects and help assess segmentation quality.

Fig. 2. Architectural details of our segmentation network (SN) and evaluation network (EN).

A more effective way to construct input for EN is to combine segmentation probability maps and the corresponding input image. This allows EN to explore the correlations between the segmentation and input image for evaluating the segmentation quality. However, giving the input image to EN could potentially be problematic since EN might come up with a way to give an evaluation score only based on the appearance of the input image without examining the segmentation probability maps. This would make the whole adversarial training useless with respect to improving the segmentation performance for unannotated images. Below we discuss two main methods for combining the segmentation maps and the input image to construct the input for EN.

Concatenation. Two possible ways to concatenate the segmentation probability map and input image are: directly concatenate them, or transform them to two feature maps and concatenate the feature maps. With either method, since EN has separate model parameters for handling information from the segmentation maps and from the input image, it is possible that only information from the raw image input is utilized for EN’s decision making.

Element-wise multiplication. A good aspect of element-wise multiplication is that it can “force” the segmentation probability maps and the input image to mix at the very first stage. Thus, all the model parameters are jointly trained using information from both the segmentation and the input image. This ensures that the segmentation probability maps are used in EN’s decision making and in the entire adversarial training process. However, since element-wise multiplication essentially performs a pixel-wise gate operation (using the input image) on the segmentation probability maps, both in the forward pass from SN to EN and in the backward propagation from EN to SN, lower-intensity structures (e.g., cell nuclei, gland borders in H&E stained images) may have very little influence on both the decision making of EN and the parameter updates of SN. To reduce this bias, we use both the input image and its inverted image for mixing with the segmentation probability maps. Suppose I is one channel of the raw input image and P is one probability map produced by SN. We mix them as \(I\cdot P\) and \((1-I)\cdot P\), obtaining two maps. We do this for every possible pair of I and P and concatenate all the resulting maps to form the input of EN.
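As a concrete illustration of this construction, the following sketch mixes each image channel with each probability map in the way just described; the exact tensor layout and the assumption that intensities are scaled to [0, 1] are ours, made for this example.

```python
# Sketch of EN's input construction via element-wise multiplication: every
# image channel I is mixed with every probability map P as I*P and (1-I)*P,
# and all resulting maps are concatenated along the channel dimension.
import torch


def make_en_input(probs, image):
    """probs: [B, K, H, W] probability maps from SN, values in [0, 1].
    image: [B, C, H, W] raw image with intensities scaled to [0, 1].
    Returns a [B, 2*C*K, H, W] tensor used as EN's input."""
    mixed = []
    for c in range(image.shape[1]):
        i_ch = image[:, c:c + 1]             # one image channel I, shape [B, 1, H, W]
        mixed.append(i_ch * probs)           # I * P for every probability map P
        mixed.append((1.0 - i_ch) * probs)   # (1 - I) * P, so dark structures also contribute
    return torch.cat(mixed, dim=1)
```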

3 Experiments and Results

To evaluate the effectiveness of our DAN model on utilizing unannotated images for segmentation, we test and compare DAN and several related models using two data sets: the 2015 MICCAI Gland Challenge dataset [13] for gland segmentation in H&E stained tissue images (e.g., see the top row of Fig. 3), and an in-house 3D electron microscopy (EM) image dataset for fungus segmentation.

Gland segmentation. This dataset [13] has 85 training images (37 benign (BN), 48 malignant (MT)), 60 testing images (33 BN, 27 MT) in part A, and 20 testing images (4 BN, 16 MT) in part B. As our unannotated training data, we acquired 100 additional H&E stained intestinal images from an in-house dataset (e.g., see the bottom row of Fig. 3).

Fig. 3. Top row: Image samples and their corresponding instance-level segmentation in the Gland Challenge dataset. Bottom row: Our unannotated training image samples.

Table 1 shows the gland segmentation results of our DAN model and several closely related models. For fair comparison, an adversarial training model (SSAN [6]), a semi-supervised learning model (Ladder networks [9]), and our DAN model all use the same segmentation network as SN (the base model). CUMedVision [2] and the multichannel models [15, 16] were recently designed specifically for gland segmentation. CUMedVision [2] won the 2015 MICCAI Gland Segmentation Challenge, and the multichannel model in [15] is the best previously reported model, with a sophisticated network structure. As shown, based on a relatively simple segmentation network (SN) and effective use of unannotated images via adversarial training, DAN improves the segmentation performance and gives better overall segmentation results than these state-of-the-art methods. Figure 4 gives visual segmentation results of difficult cases in malignant tissues.

Fig. 4. Instance-level segmentation results on some malignant cases.

Table 1. Summary of the gland segmentation results. SSAN [6] is a recent adversarial network for semantic segmentation, Ladder networks [9] are a state-of-the-art model for semi-supervised learning, CUMedVision [2] won the 2015 MICCAI Gland Segmentation Challenge, and Multichannel2 [15] is the current best published model for gland segmentation on the MICCAI dataset.
Table 2. Results for pixel-level fungus segmentation in EM images.

Fungus segmentation. We also test our DAN model using four 3D EM image stacks (each of size about \(1658\times 1705 \times 100\)) for fungus segmentation. The 3D EM images are captured from body tissues of ants. In biomedical applications, one often has only a limited number of annotated 2D slices for training models for 3D segmentation problems. To model such scenarios, we use only one slice from each stack to form the annotated images; 10 additional slices from each stack form the unannotated images; and 20 slices in each stack are marked with ground truth for testing the segmentation performance of the different models.

Table 2 shows the pixel-level fungus segmentation results of our model and three closely related models. Our model produces considerably better results.

4 Conclusions

In this paper, we proposed a deep adversarial network that can effectively utilize unannotated image data for training biomedical image segmentation neural networks with better generalization and robustness.