Elsevier

Neural Networks

Volume 124, April 2020, Pages 75-85

2020 Special Issue
A 3D deep supervised densely network for small organs of human temporal bone segmentation in CT images

https://doi.org/10.1016/j.neunet.2020.01.005

Abstract

Computed Tomography (CT) has become an important modality for examining the critical anatomical organs of the human temporal bone in the diagnosis and treatment of ear diseases. Segmentation of these critical anatomical organs is a fundamental step in the computer-assisted analysis of human temporal bone CT images. However, it is challenging to segment such sophisticated and small organs. To deal with this issue, a novel 3D Deep Supervised Densely Network (3D-DSD Net) is proposed in this paper. The network adopts a dense connection design and a 3D multi-pooling feature fusion strategy in the encoding stage of the 3D-Unet, and a 3D deep supervised mechanism is employed in the decoding stage. The experimental results show that our method achieves competitive performance in segmenting the small organs of the temporal bone in CT data.

Introduction

Temporal bone Computed Tomography (CT) is an established standard for the examination of otopathy and the detection of anatomical abnormalities in the human temporal bone (Jager et al., 2005). With the increase in clinical diagnoses of ear diseases, the number of temporal bone CT images is growing rapidly, and doctors face massive volumes of CT images to process, which increases their workload. Therefore, automatically segmenting the critical anatomical organs of the temporal bone from CT scans is important for reducing both workload and misdiagnosis. Accurate segmentation of the critical anatomical organs of the temporal bone not only helps to understand their structural features but also plays an important role in determining clinical surgical procedures (Yamashita et al., 2018).

The anatomical structure of the human temporal bone is illustrated in Fig. 1. More than 30 anatomical structures are embedded in the temporal bone; most of the critical anatomical organs investigated in this paper are labeled in Fig. 1, including the malleus, the incus, the external contour of cochlea (ECC), the internal cavity of cochlea (ICC), the vestibule, the superior semicircular canal (SSC), the posterior semicircular canal (PSC) and the lateral semicircular canal (LSC). The internal acoustic meatus (IAM) cannot be observed in this view.

Automatic segmentation of critical anatomical organs from human temporal bone CT images is a challenging task, because the organs are sophisticated in structure, small in volume, low in resolution, and highly diverse in shape and texture. For example, in a 512 × 512 × 199-voxel human temporal bone CT sequence, the largest organ, the internal acoustic meatus, contains only about 1298 voxels, while the smallest organ, the malleus, contains merely about 184 voxels. The target area in a single CT slice accounts for less than 1% of the overall image. Different critical anatomical organs differ greatly in shape, size, density, etc., and the boundaries between the target area and surrounding organs are not clear. These characteristics make the automatic segmentation of temporal bone CT images very challenging.
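To make the scale of this imbalance concrete, the voxel counts quoted above can be turned into fractions with a few lines of arithmetic (the per-slice figure below is a worst-case illustration, not a measurement from the dataset):

```python
# Back-of-envelope check of the class imbalance described above, using the
# voxel counts reported for a 512 x 512 x 199 CT sequence. All organ counts
# come from the text; the per-slice figure is an illustrative upper bound.

volume_voxels = 512 * 512 * 199          # total voxels in the sequence
iam_voxels = 1298                        # largest organ (internal acoustic meatus)
malleus_voxels = 184                     # smallest organ (malleus)

iam_fraction = iam_voxels / volume_voxels
malleus_fraction = malleus_voxels / volume_voxels

slice_pixels = 512 * 512
# Even if every IAM voxel fell on a single slice, it would cover under 1% of it.
worst_case_slice_fraction = iam_voxels / slice_pixels

print(f"IAM fraction of volume:     {iam_fraction:.2e}")
print(f"malleus fraction of volume: {malleus_fraction:.2e}")
print(f"worst-case slice coverage:  {worst_case_slice_fraction:.2%}")
```

Even the largest target occupies only a few hundred-thousandths of the volume, which is why conventional whole-volume losses tend to be dominated by background.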

Various segmentation algorithms have been intensively investigated. The classic medical image segmentation methods can be divided into three generations (Masood, Sharif, Masood, Yasmin, & Raza, 2015). The first generation includes thresholding, region growing, and edge-based methods. The second generation includes classifiers, clustering, deformable models, and graph searching. The third generation includes graph-guided approaches, shape models, and appearance models. Each method has its pros and cons: for example, threshold segmentation requires selecting a suitable threshold, and region growing requires selecting appropriate seed points. Although much research has been done in medical image segmentation, there is still considerable room for more efficient and effective techniques.

In recent years, the rise of artificial intelligence technology, represented by deep learning, has provided additional technical means for the accurate segmentation of medical images.

Recently, the most commonly used convolutional neural network in medical image segmentation has been the U-net proposed by Ronneberger, Fischer, and Brox (2015). According to a recent survey on deep semantic segmentation of natural and medical images (Taghanaki, Abhishek, Cohen, Cohen-Adad, & Hamarneh, 2019), the U-net architecture improved model accuracy and addressed the vanishing-gradient problem, which has made it one of the most popular architectures for medical image segmentation tasks. Unlike the fully convolutional network (FCN) (Long, Shelhamer, & Darrell, 2015), SegNet (Badrinarayanan, Kendall, & Cipolla, 2017), and Deeplab (Chen, Zhu, Papandreou, Schroff, & Adam, 2018), U-net adopts a bilaterally symmetric architecture consisting of an encoder path to capture context information and a symmetric decoder path to recover spatial position, with the two paths connected by skip connections rather than relying solely on high-level semantic features. Therefore, we select U-net as the baseline of this paper.
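As a rough illustration of this encoder-decoder idea, the following minimal NumPy sketch traces feature-map shapes through a U-net-like structure. Average pooling and nearest-neighbour upsampling stand in for the learned convolutions, and `down`, `up`, and `unet_like` are hypothetical helpers, not the paper's implementation:

```python
import numpy as np

# Shape-level sketch of a U-net-style encoder-decoder with skip connections.
# Downsampling halves the spatial resolution; upsampling restores it and
# concatenates the matching encoder feature map along the channel axis.

def down(x):
    # 2x average pooling along the spatial axis: (C, N) -> (C, N // 2)
    return x.reshape(x.shape[0], -1, 2).mean(axis=2)

def up(x):
    # nearest-neighbour 2x upsampling: (C, N) -> (C, 2 * N)
    return np.repeat(x, 2, axis=1)

def unet_like(x, depth=2):
    skips = []
    for _ in range(depth):          # encoder path: capture context
        skips.append(x)
        x = down(x)
    for skip in reversed(skips):    # decoder path: recover spatial position
        x = np.concatenate([up(x), skip], axis=0)  # skip connection
    return x

x = np.random.rand(4, 16)           # (channels, spatial length)
y = unet_like(x)
print(y.shape)                      # channels accumulate from the skips
```

The concatenation at each decoder level is what lets low-level detail from the encoder reach the output, which is exactly the property the small-organ discussion below builds on.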

However, the features extracted by the U-net in the encoding phase are not fully exploited, and details are lost during down-sampling. In addition, there is a semantic gap between low-level and high-level semantic features (Zhou, Siddiquee, Tajbakhsh, & Liang, 2018), and joining features via skip connections alone is not sufficient for small-organ segmentation.

In view of the existing organ segmentation methods for 3D medical images, this paper proposes a 3D Deep Supervised Densely Network (3D-DSD Net), which exploits the traditional advantages of fully convolutional networks to train an end-to-end encoder-decoder network for segmenting temporal bone CT volumetric data. Different from existing methods, the proposed network receives 3D volumetric data and improves the segmentation accuracy of small organs through a densely-connected-block design, a multi-pooling feature fusion strategy, and a deep supervised hidden-layer design. It achieved the best segmentation accuracy on our temporal bone CT dataset.

We make the following contributions:

  • (1)

A novel 3D-DSD Net, designed to segment 3D medical images and applied to the segmentation of critical anatomical organs in temporal bone CT for the first time. The architecture automatically segments critical anatomical organs of the temporal bone from volume to volume and assists radiologists in clinical diagnosis.

  • (2)

We designed a 3D multi-pooling feature fusion strategy that makes full use of multi-scale and multi-level features. Furthermore, a densely connected block and a combination of long and short skip connections are employed to enhance boundaries and other details via feature fusion across different layers and scales. This improves the segmentation accuracy of the critical anatomical organs of the temporal bone.

  • (3)

    A 3D deep supervised mechanism is introduced to construct the accompanying objective function for the hidden layer output of the 3D network, which guides the network training process and improves the robustness of the segmentation model.

  • (4)

We provide extensive performance comparisons across varying configurations of 3D-DSD Net and with other leading medical image segmentation methods.
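The deep supervision in contribution (3), an accompanying objective on hidden-layer outputs, reduces at training time to a weighted sum of losses. The following sketch shows that combination; the weights and loss values are illustrative placeholders, not the paper's hyperparameters, and `deep_supervised_loss` is a hypothetical helper:

```python
# Deep-supervision sketch: auxiliary losses computed on hidden-layer
# (decoder) outputs are added to the main loss with decaying weights.
# The numeric loss values stand in for e.g. Dice or cross-entropy terms.

def deep_supervised_loss(main_loss, aux_losses, aux_weights):
    assert len(aux_losses) == len(aux_weights)
    return main_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))

main = 0.40                 # loss on the final full-resolution output
aux = [0.55, 0.70]          # losses on two upsampled hidden-layer outputs
weights = [0.5, 0.25]       # deeper (coarser) outputs get smaller weights

total = deep_supervised_loss(main, aux, weights)
print(round(total, 4))      # -> 0.85
```

Because gradients flow into the hidden layers directly through the auxiliary terms, intermediate decoder stages receive a training signal even when the final output's gradient is weak.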

For the clinical field, the work in this paper contributes a fundamental step toward computer-aided otopathy CT image processing. Specifically, the segmentation results can be used to build an automatic otopathy CT image reading system or a virtual-reality simulation of temporal bone surgery. Because of the highly sophisticated anatomical structure of the human temporal bone, it is difficult for junior otologists to identify all of its critical organs, so this work will also be helpful in training otologists.

For the field of machine learning, we provide a solution to the segmentation of small 3D objects. The schemes proposed in this paper may inspire researchers to design similar networks for other small-object segmentation tasks.

To the best of our knowledge, no other high-performance temporal bone segmentation method similar to 3D-DSD Net exists. The rest of this paper is organized as follows: Section 2 briefly reviews related work, and Section 3 describes the proposed 3D-DSD Net. After the experiments are presented in Section 4, we draw conclusions in Section 5.

Section snippets

Handcrafted-feature-based segmentation methods

A large number of segmentation approaches, such as threshold segmentation, region growing, and active contour models, were applied to medical images before the rise of deep-learning-based segmentation. Threshold segmentation uses a fixed or adaptive threshold to divide the image into one or more target regions and the background; these methods usually depend on high contrast between targets and background (Zhang, Yan, Chui, & Ong, 2010). Region-growing-based segmentation methods are
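A minimal sketch of the thresholding idea on a synthetic array is shown below; the threshold value is arbitrary, whereas real CT thresholding would be chosen from intensity statistics or Hounsfield units:

```python
import numpy as np

# First-generation segmentation in its simplest form: a fixed global
# threshold splits a small synthetic "image" into target and background.
# This only works because the bright target contrasts sharply with the
# dark background, which is exactly the limitation noted in the text.

image = np.array([
    [10, 12, 200, 210],
    [11, 13, 205, 215],
    [ 9, 14, 198, 220],
])

threshold = 100                 # separates bright target from dark background
mask = image > threshold        # boolean target mask, same shape as image

print(int(mask.sum()))          # -> 6 target pixels (the right two columns)
```

For low-contrast structures such as the temporal bone organs, no single threshold separates target from background, which is why the paper moves to learned features.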

The proposed method

A novel method that extends 3D U-net (Poudel et al., 2016) is proposed, redesigning the feature extraction, feature fusion, and supervision mechanisms to tackle the small volume of the critical anatomical organs of the temporal bone. The overall network architecture is illustrated in Fig. 2. In the encoder, three densely connected blocks extract low-level features. A 3D multi-pooling fusion strategy is designed to make full use of multi-scale and multi-level features to improve the
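The dense connectivity used in the encoder blocks can be sketched at the shape level as follows; the random-weight 1x1 `conv`, the layer count, and the growth rate are hypothetical stand-ins in the DenseNet style, not the paper's actual block:

```python
import numpy as np

# Densely connected block, shapes only: each layer receives the
# concatenation of ALL preceding feature maps, so the channel count grows
# linearly with depth (DenseNet-style feature reuse).

def conv(x, out_channels):
    # stand-in 1x1 convolution: (C, H, W) -> (out_channels, H, W)
    w = np.random.rand(out_channels, x.shape[0])
    return np.tensordot(w, x, axes=1)

def dense_block(x, num_layers=3, growth_rate=4):
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)    # dense connection
        features.append(conv(inp, growth_rate))   # each layer adds k channels
    return np.concatenate(features, axis=0)

x = np.random.rand(8, 6, 6)        # (channels, height, width)
y = dense_block(x)
print(y.shape)                     # channels: 8 + 3 * 4 = 20
```

Because every layer keeps direct access to the earliest feature maps, fine boundary detail is preserved alongside increasingly abstract features, which is the motivation the text gives for using dense blocks on small organs.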

Experiments and discussions

In this section, the experiment results are provided to show the performance of the proposed method.

Conclusion

This paper proposes a 3D network suitable for the segmentation of small organs of the temporal bone. The network makes extensive use of volumetric data, improves the design of the feature extraction module in the encoding stage relative to the traditional 3D-Unet, raises feature utilization, and retains detailed information as much as possible. In the decoding stage, single supervision is replaced with joint supervised training of the backbone network and the auxiliary

Acknowledgments

This work is supported by the Science and Technology Development Program of Beijing Education Committee, China (Grant Number KM201810005026) and National Natural Science Foundation of China (Grant Number 61527807).

References (42)

  • Çiçek, Ö., et al.

    3D U-net: learning dense volumetric segmentation from sparse annotation

  • Dice, L. R.

    Measures of the amount of ecologic association between species

    Ecology

    (1945)
  • Drozdzal, M., et al.

    The importance of skip connections in biomedical image segmentation

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In...
  • Gruber, N., et al.

    A joint deep learning approach for automated liver and tumor segmentation

    (2019)
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In...
  • Jager, L., et al.

CT of the normal temporal bone: Comparison of multi- and single-detector row CT

    Radiology

    (2005)
  • Kontschieder, P., et al.

    Structured class-labels in random forests for semantic image labelling

  • Lee, C.-Y., et al.

    Deeply-supervised nets

  • Li, X., et al.

    H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes

    IEEE Transactions on Medical Imaging

    (2018)