
Medical Image Analysis

Volume 59, January 2020, 101587

‘Squeeze & excite’ guided few-shot segmentation of volumetric images

https://doi.org/10.1016/j.media.2019.101587

Highlights

  • We present the first few-shot segmentation framework for volumetric medical scans.

  • We introduce strong interactions at multiple locations between the conditioner and segmenter arms, instead of only one interaction at the final layer.

  • ‘Channel squeeze & spatial excitation’ modules for effectuating the interaction.

  • Stable training of few-shot segmenter from scratch without requiring a pre-trained model.

  • A volumetric segmentation strategy that optimally pairs the slices of query and support volumes.

Abstract

Deep neural networks enable highly accurate image segmentation, but require large amounts of manually annotated data for supervised training. Few-shot learning aims to address this shortcoming by learning a new class from a few annotated support examples. We introduce a novel few-shot framework for the segmentation of volumetric medical images with only a few annotated slices. Compared to other related works in computer vision, the major challenges are the absence of pre-trained networks and the volumetric nature of medical scans. We address these challenges by proposing a new architecture for few-shot segmentation that incorporates ‘squeeze & excite’ blocks. Our two-armed architecture consists of a conditioner arm, which processes the annotated support input and generates a task-specific representation. This representation is passed on to the segmenter arm, which uses this information to segment the new query image. To facilitate efficient interaction between the conditioner and the segmenter arm, we propose to use ‘channel squeeze & spatial excitation’ blocks – a light-weight computational module – that enable heavy interaction between both arms with a negligible increase in model complexity. This contribution allows us to perform image segmentation without relying on a pre-trained model, which is generally unavailable for medical scans. Furthermore, we propose an efficient strategy for volumetric segmentation by optimally pairing a few slices of the support volume with all the slices of the query volume. We perform experiments for organ segmentation on whole-body contrast-enhanced CT scans from the Visceral dataset. Our proposed model outperforms multiple baselines and existing approaches in segmentation accuracy by a significant margin. The source code is available at https://github.com/abhi4ssj/few-shot-segmentation.

Introduction

Fully convolutional neural networks (F-CNNs) have achieved state-of-the-art performance in semantic image segmentation for both natural (Jégou et al., 2017; Zhao et al., 2017; Long et al., 2015; Noh et al., 2015) and medical images (Ronneberger et al., 2015; Milletari et al., 2016). Despite their tremendous success in image segmentation, they are of limited use when only a few labeled images are available. F-CNNs are in general highly complex models with millions of trainable weight parameters that require thousands of densely annotated images to train effectively. A better strategy can be to adapt an already trained F-CNN model to segment a new semantic class from a few labeled images. This strategy often works well in computer vision applications, where a pre-trained model provides a good initialization and is subsequently fine-tuned with the new data to tailor it to the new semantic class. However, fine-tuning an existing pre-trained network without risking over-fitting still requires a fair amount of annotated images (at least on the order of hundreds). In an extremely low-data regime, where only a single or a few annotated images of the new class are available, such fine-tuning-based transfer learning often fails and may cause overfitting (Shaban et al., 2017; Rakelly et al., 2018).

Few-shot learning is a machine learning technique that aims to rapidly generalize an existing model to an unknown semantic class from only a few examples (Fei-Fei et al., 2006; Miller et al., 2000; Fei-Fei, 2006). The basic concept of few-shot learning is motivated by the learning process of humans, who learn new semantics rapidly from very few observations by leveraging strong prior knowledge acquired from past experience. While few-shot learning for image classification and object detection is a well-studied topic, few-shot learning for semantic image segmentation with neural networks has only recently been proposed (Shaban et al., 2017; Rakelly et al., 2018). Making dense pixel-level, high-dimensional predictions in such an extremely low-data regime is an immensely challenging task. At the same time, few-shot learning could have a big impact on medical image analysis because it addresses learning from scarcely annotated data, which is the norm due to the dependence on medical experts for manual labeling. In this article, we propose a few-shot segmentation framework designed exclusively for segmenting volumetric medical scans. Key to achieving this goal is the integration of the recently proposed ‘squeeze & excite’ blocks (Roy et al., 2018b) within the design of our novel few-shot architecture.

Few-shot learning algorithms try to generalize a model to a new, previously unseen class with only a few labeled examples by utilizing knowledge previously acquired from differently labeled training data. Fig. 1 illustrates the overall setup, where we want to segment the liver in a new scan given the annotation of the liver in only a single slice. A few-shot segmentation network architecture (Shaban et al., 2017; Rakelly et al., 2018) commonly consists of three parts: (i) a conditioner arm, (ii) a set of interaction blocks, and (iii) a segmenter arm. During inference, the model is provided with a support set (Is, Ls(α)), consisting of an image Is with the new semantic class (or organ) α outlined as a binary mask Ls(α). In addition, a query image Iq is provided, in which the new semantic class is to be segmented. The conditioner takes in the support set and performs a forward pass, generating feature maps in all the intermediate layers of the conditioner arm. This set of feature maps is referred to as the task representation, as it encodes the information required to segment the new semantic class. The task representation is taken up by the interaction blocks, whose role is to pass the relevant information to the segmenter arm. The segmenter arm takes the query image as input, leverages the task information provided by the interaction blocks, and generates a segmentation mask Mq for the query input Iq. Thus, the interaction blocks pass information from the conditioner to the segmenter and form the backbone of few-shot semantic image segmentation. Existing approaches use weak interactions with a single connection either at the bottleneck or the last layer of the network (Shaban et al., 2017; Rakelly et al., 2018).
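The data flow of such an episode can be sketched in a few lines. Everything below is purely illustrative: the downsampling stands in for learned convolutions, and the element-wise modulation stands in for a real interaction block.

```python
import numpy as np

def conditioner(support_img, support_mask):
    # Inject the annotation by masking the support image, then produce
    # one feature map per block (toy downsampling in place of learned
    # convolutions) -- the "task representation".
    x = support_img * support_mask
    return [x[::2, ::2], x[::4, ::4]]

def interaction(seg_feat, cond_feat):
    # Placeholder interaction block: element-wise modulation of the
    # segmenter features by the conditioner's task representation.
    return seg_feat * (1.0 + cond_feat)

def segmenter(query_img, task_repr):
    # Segment the query image, consuming the task representation at
    # every interaction point rather than only at the final layer.
    feat = query_img[::2, ::2]
    feat = interaction(feat, task_repr[0])
    feat = feat[::2, ::2]
    feat = interaction(feat, task_repr[1])
    return (feat > feat.mean()).astype(np.uint8)  # toy binary mask

rng = np.random.default_rng(0)
support_img = rng.random((64, 64))
support_mask = (rng.random((64, 64)) > 0.5).astype(float)
query_img = rng.random((64, 64))

task_repr = conditioner(support_img, support_mask)
mask_q = segmenter(query_img, task_repr)
print(mask_q.shape)  # (16, 16)
```

The point of the sketch is the wiring: the support set only ever enters through the conditioner, and the query prediction depends on it solely via the interaction blocks.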

Existing work in computer vision on few-shot segmentation processes 2D RGB images and uses a pre-trained model for both the segmenter and conditioner arms to aid training (Shaban et al., 2017; Rakelly et al., 2018). Pre-trained models provide strong prior knowledge with more powerful features from the start of training; hence, a weak interaction between conditioner and segmenter is sufficient to train the model effectively. The direct extension to medical images is challenging due to the lack of pre-trained models. Instead, both the conditioner and the segmenter need to be trained from scratch. However, training the network from scratch with only a weak interaction is prone to instability and mode collapse.

Instead of a weak interaction, we propose strong interactions at multiple locations between both arms. The strong interaction facilitates effective gradient flow across the two arms, which eases training both arms without the need for any pre-trained model. For effectuating the interaction, we build on our recently introduced ‘channel squeeze & spatial excitation’ (sSE) module (Roy et al., 2018a, 2018b). In our previous work, we used sSE blocks for adaptive self re-calibration of feature maps within a single segmentation network. Here, we use sSE blocks to communicate between the two arms of the few-shot segmentation network. The block takes the learned conditioner feature map as input and performs a ‘channel squeeze’ to learn a spatial map, which is then used to perform ‘spatial excitation’ on the segmenter feature map. We place sSE blocks between all the encoder, bottleneck and decoder blocks. SE blocks are well suited for effectuating the interaction between arms, as they are light-weight and therefore only marginally increase the model complexity. Despite their light-weight nature, they can have a strong impact on the segmenter’s features via re-calibration.
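The cross-arm sSE operation described above can be sketched in NumPy. The single 1×1-convolution weight vector with a sigmoid follows the ‘channel squeeze & spatial excitation’ description; the channel counts and random weights are illustrative stand-ins for learned parameters.

```python
import numpy as np

def sse_interaction(cond_feat, seg_feat, w, b):
    """Cross-arm 'channel squeeze & spatial excitation' sketch.

    cond_feat: conditioner feature map, shape (C_cond, H, W)
    seg_feat:  segmenter feature map,   shape (C_seg, H, W)
    w, b:      1x1-conv weights (C_cond,) and scalar bias -- the only
               parameters of the block, hence its light weight.
    """
    # Channel squeeze: a 1x1 convolution collapses the conditioner
    # channels into a single spatial map of shape (H, W).
    spatial_map = np.tensordot(w, cond_feat, axes=([0], [0])) + b
    spatial_map = 1.0 / (1.0 + np.exp(-spatial_map))  # sigmoid -> (0, 1)
    # Spatial excitation: re-calibrate every segmenter channel with the
    # same spatial map (broadcast over the channel axis).
    return seg_feat * spatial_map[None, :, :]

rng = np.random.default_rng(0)
cond = rng.standard_normal((16, 8, 8))  # 16 conditioner channels
seg = rng.standard_normal((64, 8, 8))   # 64 segmenter channels
out = sse_interaction(cond, seg, rng.standard_normal(16), 0.0)
print(out.shape)  # (64, 8, 8): segmenter shape is preserved
```

Because the sigmoid output lies in (0, 1), the block can only attenuate segmenter activations spatially; it adds just C_cond + 1 parameters per interaction point.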

Existing work on few-shot segmentation focuses on 2D images, while we are dealing with volumetric medical scans. Manually annotating organs on all slices of a 3D image is time-consuming; following the idea of few-shot learning, the annotation should rather happen on a few sparsely selected slices. To this end, we propose a volumetric segmentation strategy that properly pairs a few annotated slices of the support volume with all the slices of the query volume, maintaining inter-slice consistency of the segmentation.
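A minimal sketch of one such pairing, grouping query slices by relative position along the volume axis, is shown below. This is only an illustrative scheme under the assumption that both volumes cover the organ over comparable extents; the optimal pairing used in the paper is described in the method section.

```python
def pair_slices(num_query_slices, support_indices):
    """Assign every query slice to one of the k annotated support
    slices by relative position along the volume axis, so that
    neighbouring query slices share the same (or an adjacent) support
    slice, keeping the segmentation consistent across slices."""
    k = len(support_indices)
    pairing = {}
    for q in range(num_query_slices):
        # Map the query slice's relative position to a support group.
        g = min(int(q / num_query_slices * k), k - 1)
        pairing[q] = support_indices[g]
    return pairing

# k = 3 annotated support slices for a 9-slice query volume:
print(pair_slices(9, [10, 20, 30]))
# {0: 10, 1: 10, 2: 10, 3: 20, 4: 20, 5: 20, 6: 30, 7: 30, 8: 30}
```

Each support slice thus conditions a contiguous block of query slices, which is what preserves inter-slice consistency.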

In this work, we propose:

  1. A novel few-shot segmentation framework for volumetric medical scans.

  2. Strong interactions at multiple locations between the conditioner and segmenter arms, instead of only one interaction at the final layer.

  3. ‘Squeeze & excitation’ modules for effectuating the interaction.

  4. Stable training from scratch without requiring a pre-trained model.

  5. A volumetric segmentation strategy that optimally pairs the slices of query and support volumes.

We discuss related work in Section 2, present our few-shot segmentation algorithm in Section 3, the experimental setup in Section 4 and experimental results and discussion in Section 5. We conclude with a summary of our contributions in Section 6.

Section snippets

Few-shot learning

Methods for few-shot learning can be broadly divided into three groups. The first group of methods adapts a base classifier to the new class (Bart and Ullman, 2005; Fei-Fei et al., 2006; Hariharan and Girshick, 2017). These approaches are often prone to overfitting, as they attempt to fit a complex model to a few new samples. Methods in the second group aim to predict classifiers close to the base classifier to prevent overfitting. The basic idea is to use a two-branch network, where the

Method

In this section, we first introduce the problem setup, then detail the architecture of our network and the training strategy, and finally describe the evaluation strategy for segmenting volumetric scans.

Dataset description

We choose the challenging task of organ segmentation from contrast-enhanced CT (ceCT) scans for evaluating our few-shot volumetric segmentation framework. We use the Visceral dataset (Jimenez-del Toro et al., 2016), which consists of two parts: (i) the silver corpus (65 scans) and (ii) the gold corpus (20 scans). All scans were resampled to a voxel resolution of 2 mm³.

Problem formulation

As there is no existing benchmark for few-shot image segmentation on volumetric medical images, we formulate our own

‘Squeeze & excitation’ based interaction

In this section, we investigate the optimal positions of the SE blocks for facilitating interaction and compare the performance of cSE and sSE blocks. Here, we set the number of convolution kernels of the conditioner arm to 16 and the segmenter arm to 64. We use k=12 support slices from the support volume. Since the aim of this experiment is to evaluate the position and the type of SE blocks, we keep the above parameters fixed, but evaluate them later. With four different possibilities of
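For reference, the cSE variant compared in this experiment is the complement of sSE: it squeezes spatially and excites channel-wise. A minimal sketch follows, with illustrative bottleneck size and random weights standing in for learned parameters.

```python
import numpy as np

def cse_interaction(cond_feat, seg_feat, w1, w2):
    """'Spatial squeeze & channel excitation' (cSE) sketch.

    cond_feat: conditioner feature map, shape (C_cond, H, W)
    seg_feat:  segmenter feature map,   shape (C_seg, H, W)
    w1, w2:    fully connected weights of the excitation bottleneck.
    """
    # Spatial squeeze: global average pooling -> one value per channel.
    z = cond_feat.mean(axis=(1, 2))                 # (C_cond,)
    # Excitation: bottleneck FC with ReLU, then FC with sigmoid.
    h = np.maximum(w1 @ z, 0.0)                     # (bottleneck,)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ h)))         # (C_seg,) in (0, 1)
    # Channel excitation: rescale each segmenter channel uniformly
    # over space (broadcast over the spatial axes).
    return seg_feat * scale[:, None, None]

rng = np.random.default_rng(1)
cond = rng.standard_normal((16, 8, 8))   # 16 conditioner channels
seg = rng.standard_normal((64, 8, 8))    # 64 segmenter channels
w1 = rng.standard_normal((8, 16))        # squeeze to an 8-unit bottleneck
w2 = rng.standard_normal((64, 8))        # expand to segmenter channels
out = cse_interaction(cond, seg, w1, w2)
print(out.shape)  # (64, 8, 8)
```

The contrast with sSE is where the modulation varies: cSE applies one scalar per channel everywhere in space, whereas sSE applies one scalar per spatial location across all channels.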

Conclusion

In this article, we introduced a few-shot segmentation framework for volumetric medical scans. The main challenges were the absence of pre-trained models to start from, and the volumetric nature of the scans. We proposed to use ‘channel squeeze and spatial excitation’ blocks for aiding proper training of our framework from scratch. In addition, we proposed a volumetric segmentation strategy for segmenting a query volume scan with a support volume scan by strategically pairing 2D slices

Declaration of Competing Interest

The authors declare that they do not have any financial or non-financial conflicts of interest.

Acknowledgement

We thank SAP SE and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B) for funding, and the NVIDIA Corporation for the GPU donation.

References (26)

  • E. Bart et al., Cross-generalization: learning novel classes from a single example by feature replacement, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
  • L. Bertinetto et al., Learning feed-forward one-shot learners, Advances in Neural Information Processing Systems (2016)
  • S. Caelles et al., One-shot video object segmentation, CVPR (2017)
  • N. Dong et al., Few-shot semantic segmentation with prototype learning, BMVC (2018)
  • L. Fei-Fei, Knowledge transfer in learning to recognize visual object classes, Proceedings of the International Conference on Development and Learning (ICDL) (2006)
  • L. Fei-Fei et al., One-shot learning of object categories, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
  • B. Hariharan et al., Low-shot visual recognition by shrinking and hallucinating features, Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy (2017)
  • K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • J. Hu et al., Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • S. Jégou et al., The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017)
  • O. Jimenez-del Toro et al., Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: VISCERAL anatomy benchmarks, IEEE Trans. Med. Imag. (2016)
  • G. Koch et al., Siamese neural networks for one-shot image recognition, ICML Deep Learning Workshop (2015)
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)

A. Guha Roy, S. Siddiqui and S. Pölsterl have contributed equally to this work.
