CFCM: Segmentation via Coarse to Fine Context Memory

Milletari, Fausto; Rieke, Nicola; Baust, Maximilian; Esposito, Marco; Navab, Nassir

doi:10.1007/978-3-030-00937-3_76


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11073)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Recent neural-network-based architectures for image segmentation make extensive use of feature forwarding mechanisms to integrate information from multiple scales. Although these yield good results, even deeper architectures and alternative methods for feature fusion at different resolutions have scarcely been investigated for medical applications. In this work we propose to implement segmentation via an encoder-decoder architecture which differs from previously published methods in that (i) it employs a very deep architecture based on residual learning and (ii) it combines features via a convolutional Long Short Term Memory (LSTM) instead of concatenation or summation. The intuition is that the memory mechanism implemented by LSTMs can better integrate features from different scales through a coarse-to-fine strategy; hence the name Coarse-to-Fine Context Memory (CFCM). We demonstrate the advantages of this approach on two datasets: the Montgomery County lung segmentation dataset and the EndoVis 2015 challenge dataset for surgical instrument segmentation.

Maximilian Baust is now working for Konica Minolta Laboratory Europe.

1 Introduction and Previous Work

The usefulness of multi-scale feature representations has been acknowledged by the computer vision community for decades, cf. Burt and Adelson [2] or Koenderink and van Doorn [13], for instance, and both traditional and modern approaches for image segmentation, registration or stereo rely on the integration of features from multiple scales. In recent years, convolutional neural networks (CNNs) have advanced the state of the art in image segmentation tremendously, as this technique allows rich feature representations to be learned over multiple scales. However, integrating these representations to obtain full-resolution segmentations is not straightforward and remains an active field of research.

Besides applying CNNs in a patch-based fashion as proposed by [6], which is still common practice for very large data such as whole slide images in digital pathology, fully convolutional neural networks (FCNNs) that make use of whole-image information, originally suggested by Long et al. [16], have turned out to be powerful tools. Based on this work, Ronneberger et al. [19] extended the idea of feature forwarding and proposed a symmetrical architecture in which the expanding or decoding path takes advantage of fine-grained features from the compressing path that are forwarded via skip connections. Feature forwarding has also turned out to be a very successful concept for 3D volumetric segmentation, as demonstrated by Milletari et al. [17] and Çiçek et al. [5]. Further improvements to these architectures have since been achieved through improved up-sampling, e.g. [1], improved training strategies, e.g. [10], integration of random fields and atrous convolutions [4], and particularly the application of residual learning [8], e.g. [3, 14]. For a more complete review of related works, we refer the interested reader to the recent review of Litjens et al. [15]. In summary, most state-of-the-art segmentation architectures use skip connections for feature forwarding and multi-scale context integration. However, most current approaches resort to simple feature fusion schemes based on concatenation or summation. An exception is the gated feedback refinement network by Islam et al. [11], which comprises gate units to control the information flow and filter out ambiguity.

In this work, we present an alternative approach to multi-scale feature integration based on Long Short-Term Memory units (LSTMs), initially proposed by Hochreiter and Schmidhuber [9], which we term Coarse-to-Fine Context Memory (CFCM). The rationale behind this approach is that LSTMs implement a memory mechanism in which information can be maintained through different steps and updated with new information only when necessary. We employ this idea to manage features extracted at different resolutions from the compressing path of the network. To demonstrate the potential of this approach, we compare our method to established architectures on two different datasets.

2 Method

Our segmentation approach is based on a fully convolutional architecture consisting of an encoding and a decoding part, cf. Fig. 1. While encoding is based on a standard ResNet architecture, decoding is implemented using convolutional LSTMs. The core idea of this approach is to use a memory mechanism, implemented via convolutional LSTMs, for fusing features extracted from different layers of the encoder. The convolutional LSTMs thereby take the role of a coarse-to-fine focusing mechanism which first perceives the global context of the input data, as the deepest activations are fed to the inputs of the LSTM, and later processes fine-grained details when shallower, high-resolution features are considered. Code is available at http://github.com/faustomilletari/CFCM-2D.

2.1 Encoder

Recent works [5, 17, 19] have shown that forwarding features extracted by the layers of the encoding path to the corresponding layers of the decoding path greatly improves performance: at training time, convergence can be achieved within a smaller number of epochs, and at testing time the segmentation performance is better. To this end, feature fusion strategies based on concatenation and summation have been employed by various authors [5, 14, 16, 19], but alternatives have rarely been investigated, which constitutes one of the motivations for this work. Our aim is to explicitly model the hierarchical nature of the features we extract from the encoding path in order to build a principled and more effective way of fusing them.

As shown in Fig. 1, we employ a ResNet architecture and derive features at each residual block. These features are interpreted as a coarse-to-fine scale sequence, starting from the bottom of the ResNet up to its top. The deepest features are characterized by low resolution but a large receptive field. As shown by Zeiler et al. [21] as well as other recent works, these features take into account global image information and high-level, complex patterns. Due to their coarse resolution, however, they do not yield information about fine-grained details. The uppermost features, on the other hand, capture much more low-level and fine-grained details, owing to their high resolution and limited receptive field.

2.2 Decoder

Our decoder treats each block of the ResNet encoder as a single time step. As shown in Fig. 1, we forward the outputs of these blocks to our decoder, where the features are processed through LSTM cells. To this end, we employ convolutional LSTMs [20], which have the capability of selectively updating their internal states at each step depending on the result of a convolution. As shown in Fig. 2, each time step makes use of three feature sets: inputs, hidden state and cell state. Inputs are concatenated with the hidden state. A convolution is performed and its result is used to (1) pass a part of the information stored in the cell state through the forget gate; (2) compute new activations, which (3) are decimated before contributing to the cell state; and (4) compute a new hidden state.
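To make the four steps above concrete, the following is a minimal NumPy sketch of one convolutional LSTM step. It is an illustration under stated assumptions rather than the released implementation: the function names, the gate ordering (input, forget, output, candidate), the single fused kernel `w`, and the omission of any peephole terms are all choices made here for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w, b):
    """Naive zero-padded 'same' convolution (cross-correlation).

    x: (H, W, C_in), w: (k, k, C_in, C_out) with odd k, b: (C_out,).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3])) + b
    for i in range(k):
        for j in range(k):
            # Shifted window of the padded input times one kernel tap.
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def conv_lstm_step(x, h_prev, c_prev, w, b):
    """One step: concatenate input and hidden state, convolve, apply gates.

    w has shape (k, k, C_in + C_hidden, 4 * C_hidden) (assumed layout).
    """
    z = conv2d_same(np.concatenate([x, h_prev], axis=-1), w, b)
    i, f, o, g = np.split(z, 4, axis=-1)
    # (1) forget gate scales the old cell state;
    # (2)-(3) candidate activations are scaled by the input gate.
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    # (4) new hidden state.
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Hypothetical sizes: 3 input channels, 4 hidden channels, 3x3 kernel.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
h0 = np.zeros((8, 8, 4))
c0 = np.zeros((8, 8, 4))
w = rng.normal(scale=0.1, size=(3, 3, 3 + 4, 4 * 4))
b = np.zeros(4 * 4)
h1, c1 = conv_lstm_step(x, h0, c0, w, b)
print(h1.shape)  # (8, 8, 4)
```

Fusing the four gate convolutions into one kernel with `4 * C_hidden` output channels is a common implementation convenience; it is equivalent to four separate convolutions over the same concatenated input.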

The initial hidden and cell states of the first LSTM are set to zero. The states of all other LSTM blocks in the decoder are initialized to be the up-sampled versions of the hidden and cell activations of the cell below. Intuitively, this mechanism can be understood as a coarse-to-fine context integration mechanism. The whole context of the picture is perceived first and, gradually, fine-grained details are added as the feature receptive fields decrease. Compared to other strategies, depicted also in Fig. 2, this architecture allows global context to be kept in memory while details are gradually being added. This aims at imitating the mechanism allowing humans to focus on details of an object while keeping in mind its global appearance. The hidden state of the last LSTM cell of the decoder constitutes the output of the memory mechanism and the last two convolutional layers produce the final segmentation.
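The state hand-over described above can be sketched as a short loop. This is a simplified illustration, not the authors' code: it assumes each encoder level has exactly twice the spatial resolution of the previous one, uses nearest-neighbour up-sampling (the text does not specify the up-sampling method), and replaces the spatial convolution by a 1x1 convolution so the example stays short. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(a):
    """Nearest-neighbour 2x up-sampling of an (H, W, C) array (assumed method)."""
    return a.repeat(2, axis=0).repeat(2, axis=1)

def lstm_step_1x1(x, h, c, w, b):
    """LSTM step with a 1x1 convolution, i.e. a per-pixel linear map.

    w: (C_in + C_hidden, 4 * C_hidden), b: (4 * C_hidden,).
    """
    z = np.concatenate([x, h], axis=-1) @ w + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c_new), c_new

def coarse_to_fine_decode(features, weights, n_hidden):
    """features: encoder outputs ordered from coarsest to finest."""
    h = c = None
    for x, (w, b) in zip(features, weights):
        if h is None:
            # First (coarsest) LSTM: hidden and cell states start at zero.
            h = np.zeros(x.shape[:2] + (n_hidden,))
            c = np.zeros_like(h)
        else:
            # Later LSTMs: start from the up-sampled states of the level below.
            h, c = upsample2x(h), upsample2x(c)
        h, c = lstm_step_1x1(x, h, c, w, b)
    return h  # hidden state of the last cell, fed to the final layers

# Hypothetical three-level encoder: 4x4, 8x8 and 16x16 feature maps.
rng = np.random.default_rng(1)
feats = [rng.normal(size=(4, 4, 6)),
         rng.normal(size=(8, 8, 4)),
         rng.normal(size=(16, 16, 2))]
n_hidden = 5
ws = [(rng.normal(scale=0.1, size=(f.shape[-1] + n_hidden, 4 * n_hidden)),
       np.zeros(4 * n_hidden)) for f in feats]
out = coarse_to_fine_decode(feats, ws, n_hidden)
print(out.shape)  # (16, 16, 5)
```

Because the cell state is carried upward rather than reset, the coarse context written at the first step remains available to every finer level unless the forget gate suppresses it.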

3 Experimental Evaluation

To demonstrate the advantages and general applicability of the proposed technique, we evaluate the segmentation performance on two different datasets. First, we compare CFCM against two baselines, U-Net and ResNet+Skip, on the Montgomery County X-ray dataset. In the second experiment, we focus on general applicability and test the performance on the segmentation of surgical instruments in endoscopic surgery sequences. To demonstrate the performance of the method, we compare it to state-of-the-art networks that are specialized for instrument tracking.

Implementation Details. Our networks are initialized with the same set of parameters and trained with a batch size of 16 and a learning rate of 0.00001, optimizing for the Dice coefficient [17]. For the Montgomery County X-ray dataset we train for 150 epochs; for EndoVis we train for 30 epochs. The images are scaled to an input size of \(256\times 256\) pixels. Our ResNet+Skip baseline is depicted in Fig. 1. Its decoder is a mirrored version of the encoder with skip connections at each block. Our method CFCM is implemented in TensorFlow and will be made publicly available upon paper publication.
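As an illustration of optimizing for the Dice coefficient, the sketch below computes a soft Dice score and the corresponding loss in NumPy. It assumes the formulation with squared terms in the denominator used in [17]; the smoothing constant `eps` and the function names are additions made here, and the exact loss used in the experiments may differ.

```python
import numpy as np

def soft_dice(pred, target, eps=1e-7):
    """Soft Dice between predicted probabilities and a binary mask.

    D = 2 * sum(p * g) / (sum(p^2) + sum(g^2)); eps avoids division by zero.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    denominator = np.sum(pred ** 2) + np.sum(target ** 2)
    return (2.0 * intersection + eps) / (denominator + eps)

def dice_loss(pred, target):
    """Loss to minimise: one minus the soft Dice score."""
    return 1.0 - soft_dice(pred, target)

# Two predicted foreground pixels, one of which is correct.
score = soft_dice([1, 1, 0, 0], [1, 0, 0, 0])
print(round(score, 4))  # 0.6667
```

Unlike per-pixel cross-entropy, this objective is insensitive to the number of true-negative background pixels, which is one reason it is popular for segmentation with strong class imbalance.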

Montgomery County X-ray Set. This dataset comprises 138 annotated posterior-anterior chest X-rays and was acquired from the tuberculosis control program of the Department of Health and Human Services of Montgomery County, MD, USA [12]. The set contains 80 normal cases and 58 abnormal cases with manifestations of tuberculosis, including effusions and miliary patterns. For testing, we perform a three-fold cross-validation for binary lung segmentation and report the mean scores in Table 1. As shown in Fig. 3, U-Net and ResNet tend to misclassify the air-filled upper trachea or fractions of the shoulder as part of the lung, while the proposed CFCM is successful in capturing the global shape and fine outlines of the anatomy. In particular, the leakage into the region of the shoulder is reduced. The improved performance is also reflected in consistently better quantitative results (see Table 1). It can be observed that the performance of CFCM improves with depth, while the ResNet with simple skip connections starts to overfit due to its high number of parameters.
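The three-fold evaluation protocol can be sketched as follows. The text does not state how the 138 cases were assigned to folds, so the random shuffling, the seed, and the absence of stratification by normal and abnormal cases are assumptions made purely for illustration.

```python
import numpy as np

def k_fold_indices(n_cases, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled split."""
    order = np.random.default_rng(seed).permutation(n_cases)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, folds[k]

# 138 cases split into three folds of 46 test cases each.
for train_idx, test_idx in k_fold_indices(138, 3):
    print(len(train_idx), len(test_idx))  # 92 46
```

Reported scores would then be the mean of the per-fold test metrics.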

Table 1. Results for Montgomery county X-Ray set. Abbreviations: DICE = Dice coefficient, MAD = Mean Absolute Distance, RMS = Root-Mean-Square distance, HD = Hausdorff Distance
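For readers unfamiliar with the distance-based measures, the sketch below shows one common way to compute MAD, RMS and HD between two contours given as point sets. Definitions vary in the literature; the symmetric variants used here, in which nearest-neighbour distances from both directions are pooled, are an assumption and not necessarily the exact definitions behind Table 1.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 2), the distance to its nearest point in b (M, 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def contour_metrics(a, b):
    """Symmetric MAD, RMS and Hausdorff distance between two point sets."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = np.concatenate([nearest_distances(a, b), nearest_distances(b, a)])
    return {
        "MAD": float(both.mean()),
        "RMS": float(np.sqrt(np.mean(both ** 2))),
        "HD": float(both.max()),
    }

# Toy contours: the second has one outlying point 3 pixels away.
m = contour_metrics([[0, 0], [1, 0]], [[0, 0], [1, 0], [1, 3]])
print(m["MAD"], m["HD"])  # 0.6 3.0
```

The Hausdorff distance reacts to a single outlying point, whereas MAD and RMS average over the whole contour, which is why the three measures are usually reported together.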

EndoVis 2015 (see Note 1). The dataset comprises a total of six ex-vivo endoscopic surgery sequences with an image resolution of \(720 \times 576\) pixels. The training data contains four 45 s sequences. The remaining 15 s of the same sequences together with two new 60 s videos form the testing dataset. There are three semantic classes (manipulator, shaft and background). As specified in the guidelines, we performed a cross-validation by leaving one surgery out of the training data. The segmentation result was compared to generic methods as well as algorithms that were explicitly published for this task [7, 14, 18]. García-Peraza-Herrera et al. [7] proposed a fully convolutional network for segmentation in minimally invasive surgery. To achieve real-time performance, they applied the network only to every few frames and propagated the information with optical flow (FCN+OF). DLR [18] represents a deep residual network with dilated convolutions. Laina and Rieke et al. [14] suggested a unified deep learning approach for simultaneous segmentation and 2D pose estimation using a fully convolutional residual network with skip connections. As shown in Table 2, we outperform the state of the art in both binary and multi-class segmentation. The major advantage of the proposed method over alternative approaches can be seen in the robustness to specular noise and the precision for the grasper (Table 2, Fig. 4). While the other methods have problems with the most flexible part of the instrument, CFCM can still recover the fine segmentation through the deep feature integration with LSTMs.

Table 2. Results for EndoVis. Abbreviations: B.Acc = Balanced Accuracy, Rec = Recall, Spec = Specificity, DICE = Dice coefficient
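The abbreviations above correspond to standard confusion-matrix quantities. The following sketch computes them for a binary mask; it assumes both foreground and background are present in the ground truth so that no denominator is zero, and the function name is hypothetical. For the multi-class setting, one plausible reading is to apply the same computation per class in a one-vs-rest fashion.

```python
import numpy as np

def binary_scores(pred, target):
    """Recall, specificity, balanced accuracy and Dice for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    rec = tp / (tp + fn)    # fraction of foreground pixels found
    spec = tn / (tn + fp)   # fraction of background pixels kept
    return {
        "Rec": float(rec),
        "Spec": float(spec),
        "B.Acc": float((rec + spec) / 2.0),
        "DICE": float(2 * tp / (2 * tp + fp + fn)),
    }

# One true positive, one false positive, two true negatives.
s = binary_scores([1, 1, 0, 0], [1, 0, 0, 0])
print(s["Rec"], round(s["B.Acc"], 4))  # 1.0 0.8333
```

Balanced accuracy averages recall and specificity, so it is not dominated by the background class, which occupies most pixels in instrument segmentation.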

4 Conclusion

We presented a novel approach for CNN-based image segmentation that achieves multi-scale feature integration via LSTMs, which we term Coarse-to-Fine Context Memory (CFCM). This approach has been evaluated on two challenging segmentation datasets: chest radiographs and video data showing surgical instruments during an endoscopic intervention. The experiments demonstrate that the proposed method achieves superior performance and can outperform generic as well as application-specific networks. Future research might include the extension of this concept to 3D and the exploration of different memory mechanisms.

Notes

1. MICCAI 2015 Endoscopic Vision Challenge Instrument Segmentation and Tracking Sub-challenge http://endovissub-instrument.grand-challenge.org.

References

1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
2. Burt, P.J., Adelson, E.H.: The laplacian pyramid as a compact image code. In: Readings in Computer Vision, pp. 671–679. Elsevier (1987)
3. Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: VoxresNet: deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage (2017)
4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 (2016)
5. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
6. Ciresan, D., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: Advances in Neural Information Processing Systems, pp. 2843–2851 (2012)
7. García-Peraza-Herrera, L.C., et al.: Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. In: Peters, T., et al. (eds.) CARE 2016. LNCS, vol. 10170, pp. 84–95. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54057-3_8
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
10. Hwang, S., Kim, H.-E.: Self-transfer learning for weakly supervised lesion localization. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 239–246. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_28
11. Islam, M.A., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4877–4885. IEEE (2017)
12. Jaeger, S., Candemir, S., Antani, S., Wáng, Y.X.J., Lu, P.X., Thoma, G.: Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4(6), 475 (2014)
13. Koenderink, J.J., van Doorn, A.J.: Representation of local geometry in the visual system. Biol. Cybern. 55(6), 367–375 (1987)
14. Laina, I., et al.: Concurrent segmentation and localization for tracking of surgical instruments. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10434, pp. 664–672. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66185-8_75
15. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
16. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
17. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
18. Pakhomov, D., Premachandran, V., Allan, M., Azizian, M., Navab, N.: Deep residual learning for instrument segmentation in robotic surgery. arXiv preprint arXiv:1703.08580 (2017)
19. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
20. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)
21. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53

Author information

Authors and Affiliations

NVIDIA, Santa Clara, USA
Fausto Milletari & Nicola Rieke
Technische Universität München, Munich, Germany
Nicola Rieke, Maximilian Baust, Marco Esposito & Nassir Navab

Corresponding author

Correspondence to Fausto Milletari.

Editor information

Editors and Affiliations

University of Leeds, Leeds, UK
Alejandro F. Frangi
King’s College London, London, UK
Julia A. Schnabel
University of Pennsylvania, Philadelphia, PA, USA
Christos Davatzikos
Universidad de Valladolid, Valladolid, Spain
Carlos Alberola-López
Queen’s University, Kingston, ON, Canada
Gabor Fichtinger

About this paper

Cite this paper

Milletari, F., Rieke, N., Baust, M., Esposito, M., Navab, N. (2018). CFCM: Segmentation via Coarse to Fine Context Memory. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science, vol 11073. Springer, Cham. https://doi.org/10.1007/978-3-030-00937-3_76

DOI: https://doi.org/10.1007/978-3-030-00937-3_76
Published: 13 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00936-6
Online ISBN: 978-3-030-00937-3
eBook Packages: Computer Science, Computer Science (R0)