1 Introduction and Previous Work

The usefulness of multi-scale feature representations has been acknowledged by the computer vision community for decades, cf. Burt and Adelson [2] or Koenderink and van Doorn [13], and both traditional and modern approaches to image segmentation, registration, or stereo rely on the integration of features from multiple scales. In recent years, convolutional neural networks (CNNs) have advanced the state of the art in image segmentation tremendously, as they allow rich feature representations to be learned over multiple scales. However, integrating these representations to obtain full-resolution segmentations is not straightforward and remains an active field of research.

Besides applying CNNs in a patch-based fashion as proposed by [6], which is still common practice for very large data such as whole-slide images in digital pathology, fully convolutional neural networks (FCNNs) that make use of whole-image information, originally suggested by Long et al. [16], have turned out to be powerful tools. Based on this work, Ronneberger et al. [19] extended the idea of feature forwarding and proposed a symmetrical architecture in which the expanding (decoding) path takes advantage of fine-grained features from the compressing path that are forwarded via skip connections. Feature forwarding has also turned out to be a very successful concept for 3D volumetric segmentation, as demonstrated by Milletari et al. [17] and Çiçek et al. [5]. Further improvements to these architectures have been achieved through improved up-sampling, e.g. [1], improved training strategies, e.g. [10], the integration of random fields and atrous convolutions [4], and particularly the application of residual learning [8], e.g. [3, 14]. For a more complete overview of related work, we refer the interested reader to the recent review by Litjens et al. [15]. In summary, most state-of-the-art segmentation architectures use skip connections for feature forwarding and multi-scale context integration. However, most current approaches resort to simple feature fusion schemes based on concatenation or summation. An exception is the gated feedback refinement network by Islam et al. [11], which comprises gate units that control the information flow and filter out ambiguity.

In this work, we present an alternative approach to multi-scale feature integration, which we term Coarse-to-Fine Context Memory (CFCM). It is based on Long Short-Term Memory units (LSTMs), initially proposed by Hochreiter and Schmidhuber [9]. The rationale behind this approach is that LSTMs implement a memory mechanism in which information can be maintained across steps and updated with new information only when necessary. We employ this idea to manage features extracted at different resolutions from the compressing path of the network. To demonstrate the potential of this approach, we compare our method to established architectures on two different datasets.

2 Method

Our segmentation approach is based on a fully convolutional architecture consisting of an encoding and a decoding part, cf. Fig. 1. While encoding is based on a standard ResNet architecture, decoding is implemented using convolutional LSTMs. The core idea of this approach is to use a memory mechanism, implemented via convolutional LSTMs, for fusing features extracted from different layers of the encoder. The convolutional LSTMs thereby act as a coarse-to-fine focusing mechanism: they first perceive the global context of the input data, as the deepest activations are fed to the first LSTM, and later process fine-grained details, when shallower, high-resolution features are considered. Code is available at http://github.com/faustomilletari/CFCM-2D.

2.1 Encoder

Recent works [5, 17, 19] have shown that forwarding features extracted by the layers of the encoding path to the corresponding layers of the decoding path greatly improves performance: at training time, convergence is achieved within a smaller number of epochs, and at testing time the segmentation performance is better. To this end, feature fusion strategies based on concatenation and summation have been employed by various authors [5, 14, 16, 19], but alternatives have rarely been investigated, which constitutes one of the motivations for this work. Our aim is to explicitly model the hierarchical nature of the features we extract from the encoding path in order to build a principled and more effective way of fusing them.

Fig. 1. Graphical representation of the ResNet+Skip connection architecture and the proposed Coarse-to-Fine Context Memory (CFCM) based on LSTMs. The number of layers in each block of the ResNet varies according to the architecture (ResNet-18, -34, -50, -101). The number of skip connections follows accordingly.

As shown in Fig. 1, we employ a ResNet architecture and derive features at each residual block. These features are interpreted as a coarse-to-fine scale sequence, starting from the bottom of the ResNet up to its top. The deepest features are characterized by low resolution but a large receptive field. As shown by Zeiler et al. [21] as well as other recent works, these features take into account global image information and high-level, complex patterns. Due to their coarse resolution, however, they do not yield information about fine-grained details. The uppermost features, on the other hand, capture low-level, fine-grained details, owing to their high resolution and limited receptive field.

2.2 Decoder

Our decoder treats each block of the ResNet encoder as a single time step. As shown in Fig. 1, we forward the outputs of these blocks to our decoder, where the features are processed by LSTM cells. To this end, we employ convolutional LSTMs [20], which are capable of selectively updating their internal states at each step depending on the result of a convolution. As shown in Fig. 2, each time step makes use of three feature sets: the inputs, the hidden state, and the cell state. The inputs are concatenated with the hidden state. A convolution is performed and its result is used to (1) pass a part of the information stored in the cell state through the forget gate; (2) compute new activations, which contribute to the cell state after being (3) attenuated by the input gate; and (4) compute a new hidden state.
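The gating steps described above can be sketched in NumPy as follows. This is a minimal illustration, assuming the standard convolutional LSTM formulation without peephole connections; the function names, the gate ordering within the convolution output, and the tensor layout (height, width, channels) are assumptions made for this sketch and are not taken from the released code.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' convolution (cross-correlation, as in deep learning).
    x: (H, W, Cin), w: (k, k, Cin, Cout) with odd k."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for dy in range(k):
        for dx in range(k):
            # Shifted input window multiplied by the kernel slice at (dy, dx).
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_lstm_step(x, h, c, w, b):
    """One convolutional LSTM step (hypothetical helper).
    x: (H, W, Cx); h, c: (H, W, Ch); w: (k, k, Cx + Ch, 4 * Ch); b: (4 * Ch,)."""
    # Inputs are concatenated with the hidden state, then convolved once.
    z = conv2d_same(np.concatenate([x, h], axis=-1), w) + b
    f, i, g, o = np.split(z, 4, axis=-1)
    # (1) forget gate keeps part of the old cell state;
    # (2) tanh(g) are the new activations, (3) attenuated by the input gate.
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    # (4) new hidden state from the updated cell state.
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
h = np.zeros((8, 8, 4))
c = np.zeros((8, 8, 4))
w = rng.normal(scale=0.1, size=(3, 3, 3 + 4, 4 * 4))
b = np.zeros(4 * 4)
h1, c1 = conv_lstm_step(x, h, c, w, b)
print(h1.shape, c1.shape)  # (8, 8, 4) (8, 8, 4)
```

A single convolution produces all four gate pre-activations at once, which is why its output has four times as many channels as the hidden state.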

The initial hidden and cell states of the first LSTM are set to zero. The states of all other LSTM blocks in the decoder are initialized to be the up-sampled versions of the hidden and cell activations of the cell below. Intuitively, this mechanism can be understood as a coarse-to-fine context integration mechanism. The whole context of the picture is perceived first and, gradually, fine-grained details are added as the feature receptive fields decrease. Compared to other strategies, depicted also in Fig. 2, this architecture allows global context to be kept in memory while details are gradually being added. This aims at imitating the mechanism allowing humans to focus on details of an object while keeping in mind its global appearance. The hidden state of the last LSTM cell of the decoder constitutes the output of the memory mechanism and the last two convolutional layers produce the final segmentation.
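The state hand-over between decoder levels can be sketched as below. This is a simplified illustration under several assumptions that are not stated in the text: a 1x1 kernel is used instead of a spatial convolution to keep the example short, the hidden state has the same number of channels at every level, up-sampling is nearest-neighbour with a factor of two, and all names (`cfcm_decode`, `lstm_step_1x1`, `upsample2x`) and channel counts are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_1x1(x, h, c, w):
    """LSTM step with a 1x1 kernel, i.e. a per-pixel linear map (assumption)."""
    z = np.concatenate([x, h], axis=-1) @ w  # (H, W, 4 * Ch)
    f, i, g, o = np.split(z, 4, axis=-1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

def upsample2x(a):
    """Nearest-neighbour 2x up-sampling of an (H, W, C) array."""
    return a.repeat(2, axis=0).repeat(2, axis=1)

def cfcm_decode(features, weights, hidden_channels):
    """features: encoder outputs ordered from deepest (coarsest) to shallowest,
    each level doubling the spatial resolution of the previous one."""
    H, W, _ = features[0].shape
    # The first LSTM starts from zero hidden and cell states.
    h = np.zeros((H, W, hidden_channels))
    c = np.zeros((H, W, hidden_channels))
    for level, (x, w) in enumerate(zip(features, weights)):
        if level > 0:
            # Later cells inherit the up-sampled states of the cell below.
            h, c = upsample2x(h), upsample2x(c)
        h, c = lstm_step_1x1(x, h, c, w)
    # The last hidden state is what the final convolutions would consume.
    return h

rng = np.random.default_rng(0)
channels = [64, 32, 16]  # hypothetical encoder channel counts, deepest first
sizes = [4, 8, 16]       # hypothetical spatial sizes, coarsest first
n_hidden = 8
features = [rng.normal(size=(s, s, ch)) for s, ch in zip(sizes, channels)]
weights = [rng.normal(scale=0.1, size=(ch + n_hidden, 4 * n_hidden)) for ch in channels]
out = cfcm_decode(features, weights, n_hidden)
print(out.shape)  # (16, 16, 8)
```

Because the cell state is carried upward rather than recomputed, the coarse context written at the first step remains available at every finer level unless the forget gate discards it.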

Fig. 2. Schematic comparison of convolutional LSTMs (A) and feature fusion by summation (B) and concatenation (C).
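For reference, the two simple fusion schemes (B) and (C) amount to the following operations on a decoder feature map and a skip feature map of equal spatial size. This is only an illustrative sketch; the function names and the channels-last layout are assumptions.

```python
import numpy as np

def fuse_sum(decoder_feat, skip_feat):
    """Fusion by summation: requires identical shapes, keeps the channel count."""
    return decoder_feat + skip_feat

def fuse_concat(decoder_feat, skip_feat):
    """Fusion by concatenation: stacks channels, so the channel count grows."""
    return np.concatenate([decoder_feat, skip_feat], axis=-1)

decoder_feat = np.ones((16, 16, 32))
skip_feat = np.full((16, 16, 32), 2.0)
print(fuse_sum(decoder_feat, skip_feat).shape)     # (16, 16, 32)
print(fuse_concat(decoder_feat, skip_feat).shape)  # (16, 16, 64)
```

Neither operation has a state or a gate: every skip feature is passed on in full, and any selection is left to the convolutions that follow.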

3 Experimental Evaluation

To demonstrate the advantages and general applicability of the proposed technique, we evaluate the segmentation performance on two different datasets. First, we compare CFCM against two baselines, U-Net and ResNet+Skip, on the Montgomery County X-ray dataset. In the second experiment, we focus on general applicability and test the performance on the segmentation of surgical instruments in endoscopic surgery sequences. To show the superior performance of the method, we compare it with state-of-the-art networks that are specialized in instrument tracking.

Implementation Details. Our networks are initialized with the same set of parameters and trained with a batch size of 16 and a learning rate of 0.00001, optimizing for the Dice coefficient [17]. We train for 150 epochs on the Montgomery County X-ray dataset and for 30 epochs on EndoVis. The images are scaled to an input size of \(256\times 256\) pixels. Our ResNet+Skip baseline is depicted in Fig. 1. Its decoder is a mirrored version of the encoder with skip connections at each block. Our method CFCM is implemented in TensorFlow and will be made publicly available upon paper publication.
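A Dice-based training objective of the kind referred to above can be sketched as follows. This is a minimal NumPy illustration, assuming a soft Dice formulation with squared terms in the denominator and a small smoothing constant `eps`; the exact variant used for training is not specified in the text, and the function names are hypothetical.

```python
import numpy as np

def soft_dice(pred, target, eps=1e-7):
    """Soft Dice coefficient between predicted probabilities and a binary mask.
    eps avoids division by zero when both inputs are empty (assumption)."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    denominator = np.sum(pred ** 2) + np.sum(target ** 2)
    return (2.0 * intersection + eps) / (denominator + eps)

def dice_loss(pred, target):
    """Loss to minimize: 0 for a perfect overlap, close to 1 for no overlap."""
    return 1.0 - soft_dice(pred, target)

target = np.array([[1, 1], [0, 0]])
print(dice_loss(target, target))                      # perfect prediction
print(dice_loss(np.array([[0, 0], [1, 1]]), target))  # no overlap
```

Unlike a pixel-wise cross-entropy, this objective is computed over the whole image, so a small foreground region is not outweighed by the background.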

Montgomery County X-ray Set. This dataset comprises 138 annotated posterior-anterior chest X-rays and was acquired from the tuberculosis control program of the Department of Health and Human Services of Montgomery County, MD, USA [12]. The set contains 80 normal cases and 58 abnormal cases with manifestations of tuberculosis, including effusions and miliary patterns. For testing, we perform a three-fold cross-validation for binary lung segmentation and report the mean scores in Table 1. As shown in Fig. 3, U-Net and ResNet tend to misclassify the air-filled upper trachea or fractions of the shoulder as part of the lung, while the proposed CFCM succeeds in capturing both the global shape and the fine outlines of the anatomy. In particular, the leakage into the shoulder region is reduced. The improved performance is also reflected in consistently better quantitative results (see Table 1). It can be observed that the performance of CFCM improves with depth, while the ResNet with simple skip connections starts to overfit due to its high number of parameters.
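The three-fold protocol can be sketched as follows. This is a minimal illustration, assuming a shuffled split with a fixed seed; how the 138 cases were actually assigned to folds is not stated, and `k_fold_splits` is a hypothetical helper.

```python
import random

def k_fold_splits(n_samples, k=3, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train_idx, test_idx in k_fold_splits(138, k=3):
    # A model would be trained on train_idx and scored on test_idx here;
    # the per-fold scores are then averaged to obtain the reported mean.
    print(len(train_idx), len(test_idx))  # 92 46
```

Each case is used for testing exactly once, so the mean over the three folds covers the whole dataset.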

Fig. 3. Qualitative results on the Montgomery County X-ray dataset.

Table 1. Results for the Montgomery County X-ray set. Abbreviations: DICE = Dice coefficient, MAD = Mean Absolute Distance, RMS = Root-Mean-Square distance, HD = Hausdorff Distance

Fig. 4. Qualitative results on the EndoVis dataset.

EndoVis 2015. The dataset covers a total of six ex-vivo endoscopic surgery sequences with an image resolution of \(720 \times 576\) pixels. The training data contains four 45 s sequences. The remaining 15 s of the same sequences, together with two new 60 s videos, form the testing dataset. There are three semantic classes (manipulator, shaft and background). As specified in the guidelines, we performed a cross-validation by leaving one surgery out of the training data. The segmentation results were compared to generic methods as well as algorithms that were explicitly published for this task [7, 14, 18]. García-Peraza-Herrera et al. [7] proposed a fully convolutional network for segmentation in minimally invasive surgery. To achieve real-time performance, they applied the network only to every couple of frames and propagated the information with optical flow (FCN+OF). DLR [18] represents a deep residual network with dilated convolutions. Laina and Rieke et al. [14] suggested a unified deep learning approach for simultaneous segmentation and 2D pose estimation using a fully convolutional residual network with skip connections. As shown in Table 2, we outperform the state of the art in both binary and multi-class segmentation. The major advantage of the proposed method over alternative approaches lies in its robustness to specular noise and its precision for the grasper (Table 2, Fig. 4). While the other methods have problems with the most flexible part of the instrument, CFCM can still recover the fine segmentation through the deep feature integration with LSTMs.

Table 2. Results for EndoVis. Abbreviations: B.Acc = Balanced Accuracy, Rec = Recall, Spec = Specificity, DICE = Dice coefficient
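The abbreviations in Table 2 follow the usual confusion-matrix definitions, which can be sketched for a binary mask as below. This is a minimal illustration with a hypothetical helper name; it assumes that both foreground and background pixels are present in the ground truth, since the ratios are otherwise undefined.

```python
import numpy as np

def binary_metrics(pred, truth):
    """Recall, specificity, balanced accuracy and Dice for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # foreground predicted as foreground
    tn = np.sum(~pred & ~truth)  # background predicted as background
    fp = np.sum(pred & ~truth)   # background predicted as foreground
    fn = np.sum(~pred & truth)   # foreground predicted as background
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "recall": float(recall),
        "specificity": float(specificity),
        "balanced_accuracy": float((recall + specificity) / 2.0),
        "dice": float(2.0 * tp / (2.0 * tp + fp + fn)),
    }

pred = np.array([1, 1, 0, 0, 0, 0])
truth = np.array([1, 0, 1, 0, 0, 0])
print(binary_metrics(pred, truth))
```

Balanced accuracy averages recall and specificity, so it is not dominated by the background class, which covers most pixels in instrument segmentation.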

4 Conclusion

We presented a novel approach to CNN-based image segmentation, termed Coarse-to-Fine Context Memory (CFCM), that achieves multi-scale feature integration via LSTMs. The approach has been evaluated on two challenging segmentation datasets: chest radiographs and video data showing surgical instruments during endoscopic interventions. The experiments demonstrate that the proposed method achieves superior performance and can outperform generic as well as application-specific networks. Future research might include the extension of this concept to 3D and the exploration of different memory mechanisms.