An attention-guided network for surgical instrument segmentation from endoscopic images

https://doi.org/10.1016/j.compbiomed.2022.106216

Highlights

  • An attention-guided network is proposed for surgical instrument segmentation.

  • A residual path is proposed to realize effective feature representation.

  • A dual-attention block is proposed to highlight features of surgical instruments.

  • A non-local attention block is introduced to acquire global contexts.

Abstract

Accurate surgical instrument segmentation provides precise location and pose information to surgeons, helping them judge subsequent operations during robot-assisted surgery. Owing to their strong context-extraction ability, deep networks, especially U-Net and its variants, have driven significant advances in automatic surgical instrument segmentation. However, several problems still limit segmentation accuracy, such as insufficient processing of local features and class imbalance. To address these problems, an effective surgical instrument segmentation network with a typical encoder–decoder structure is proposed to provide an end-to-end detection scheme. Specifically, to tackle the insufficient processing of local features, residual paths are introduced for full feature extraction and to strengthen the propagation of low-level features. Furthermore, to enhance local feature maps, a non-local attention block is inserted into the bottleneck layer to acquire global context. In addition, to highlight the pixel regions of surgical instruments, a dual-attention module (DAM) is introduced that exploits both the high-level features from the decoder and the low-level features delivered by the encoder to obtain attention features and suppress irrelevant ones. To demonstrate the effectiveness and superiority of the proposed segmentation model, experiments are conducted on two public surgical instrument segmentation data sets, the Kvasir-instrument set and the Endovis2017 set, on which the model achieves a 95.77% Dice score and 92.13% mIoU on Kvasir-instrument and a 95.60% Dice score and 92.74% mIoU on Endovis2017. The experimental results show that the proposed model outperforms other advanced models on surgical instrument segmentation and can serve as a useful reference for the further development of intelligent surgical robots. The source code is available at https://github.com/lyangucas92/Surg_Net.

Introduction

Surgical robots have become increasingly popular in the medical field; they can effectively reduce the trauma that surgical procedures inflict on the human body and are more stable than manual surgery [1]. The precise segmentation and analysis of surgical images are the basis for the steady and efficient operation of surgical robots, providing the exact position and pose of surgical instruments to surgical robots and computer-aided systems [2]. However, challenging conditions such as varying illumination, blood, and complex textures make precise segmentation of surgical instruments very difficult, and even humans may have difficulty accurately identifying the position and pose of surgical instruments in some extreme cases [3], [4]. Therefore, precise surgical instrument segmentation lays a solid foundation for robot-assisted surgery.

As a long-standing problem in image analysis, image segmentation has attracted the enthusiasm of many researchers, and numerous methods have been designed for surgical instrument segmentation. In general, they can be divided into two classes: traditional image processing methods [5], [6], [7] and learning-based methods [8], [9], [10]. Traditional image processing methods typically require designing various mathematical models to divide a raw image into several disjoint regions with the help of specific salient features, such as gray level, color, spatial texture, and geometric shape [11]. Combined with region-based homogeneity enhancement, Chen et al. proposed a flexible interactive segmentation model based on the eikonal partial differential equation framework [12]. By constructing a local geodesic metric, it enhanced the relational representation of homogeneous regions and showed a good capability of integrating edge features and shape differences. Khadidos et al. proposed a boundary segmentation method based on the level set [13], which weighted local edge features according to their importance for contour segmentation and used the level set method to minimize an objective energy function, thereby achieving accurate segmentation of object boundaries. The traditional methods described above usually build specific models for different segmentation tasks to adapt to the scale changes and shape differences of the targets, which leads to poor robustness to complex samples and complicated backgrounds. Moreover, establishing an effective detection model requires rich domain experience, which greatly limits the generality and flexibility of such models across different segmentation tasks.

Supported by the infrastructure of high-performance computing equipment, learning-based segmentation methods, which differ from traditional image processing methods, have developed rapidly. Learning-based methods do not need to build specific models for specific targets; they automatically learn effective image features to realize end-to-end image segmentation and are generally implemented with convolutional neural networks (CNNs). The segmentation pipeline has also become increasingly simple with the widespread application of deep learning [14]. To improve inference efficiency, Long et al. presented the fully convolutional network (FCN) as a novel image segmentation method [15]. High-level features were extracted by cascaded convolutional layers and successive down-sampling operations, and the segmentation masks were recovered from these high-level features through successive convolution and upsampling operations. However, the repeated pooling operations cause a loss of information on details and small objects. To acquire more global contextual information, Badrinarayanan et al. proposed the SegNet model for automatic image segmentation. SegNet no longer adopted deconvolution for feature upsampling but directly used pooling indices to realize non-linear upsampling, which greatly reduced the number of parameters [16]. However, when unpooling was performed on lower-resolution features, the relationship between neighboring pixels was ignored. To address the insufficient attention to global contextual information, Ronneberger et al. proposed the U-Net, which added information transmission paths between mirrored layers so that the high-level features produced by the decoder can be fused with the low-level features delivered by the encoder [17]. It combines global and local information as much as possible to support segmenting the target as accurately as possible. In short, U-Net processes both high-level and low-level features more thoroughly, recovering detail features while localizing the approximate positions of different targets. Here, to ensure the detection precision of surgical instrument segmentation, a novel segmentation network is built on top of the U-Net architecture.
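
The following is a minimal PyTorch sketch of the encoder–decoder skip-connection idea behind U-Net discussed above. The module names, channel widths, and single-stage depth are illustrative assumptions for clarity, not the configuration of any of the cited networks.

```python
# Minimal sketch of a U-Net-style encoder-decoder with one skip connection.
# Channel widths and depth are illustrative assumptions, not a cited configuration.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with BatchNorm and ReLU, as in a typical U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One encoder stage, one bottleneck, one decoder stage with a skip connection.
    Input height and width are assumed to be divisible by 2."""
    def __init__(self, in_ch=3, num_classes=1):
        super().__init__()
        self.enc = conv_block(in_ch, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = conv_block(128, 64)          # 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        low = self.enc(x)                        # low-level features from the encoder
        high = self.bottleneck(self.pool(low))   # high-level features
        up = self.up(high)                       # recover spatial resolution
        fused = torch.cat([low, up], dim=1)      # skip connection fuses low- and high-level features
        return self.head(self.dec(fused))
```

The concatenation in the decoder is the "information transmission path between mirrored layers" referred to above; deeper variants simply repeat the encoder and decoder stages.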

On account of the significant breakthroughs of the U-Net, many researchers have focused on U-Net-style segmentation networks and introduced various mechanisms to pursue higher accuracy and efficiency. In the shallow layers of U-Net, when the low-level features delivered from the encoder are concatenated with feature maps from the decoder through skip connections, the two differ greatly and a semantic gap tends to form. To address this issue, Zhou et al. proposed the U-Net++ network, whose encoder and decoder are connected by nested dense convolutional blocks that mitigate the semantic gap between feature maps of different levels before fusion [18]. To obtain more useful features and ignore unnecessary ones, Oktay et al. introduced an attention gate block and proposed the attention U-Net [19]. The attention U-Net replaces the plain skip connections with attention gates, using feature maps with higher-level semantics to guide lower-level features in selecting regions of interest, which provides an effective means of handling the class imbalance issue. Taking full advantage of the extracted local features can effectively improve segmentation accuracy. Verifying this idea, Huang et al. proposed the densely connected convolutional network (DCCN), which makes intermediate features more abundant and achieves an obvious improvement in segmentation accuracy by reusing the features of each network layer [20]. In addition, for targets with large size differences, the segmentation model needs multiple receptive fields to segment multi-scale targets accurately at the same time. Zhang et al. introduced a multi-scale feature fusion block on the basis of dense connectivity to accurately segment targets with great scale differences simultaneously [21]. With these different blocks, the segmentation accuracy of the above networks has improved to some extent, but deficiencies remain, such as insufficient processing of local feature maps and the class imbalance issue.
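
A condensed PyTorch sketch of the attention-gated skip connection described for the attention U-Net [19] is shown below. In the original design the gating signal comes from a coarser decoder level and is resampled; here both inputs are assumed to share the same spatial size for brevity, and the channel sizes are illustrative assumptions.

```python
# Condensed sketch of an attention gate on a skip connection:
# a decoder-side gating signal weights the encoder features before fusion.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, enc_ch, dec_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)   # project encoder (skip) features
        self.w_g = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)   # project decoder (gating) features
        self.psi = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_ch, 1, kernel_size=1),
            nn.Sigmoid(),                                        # per-pixel attention coefficients in [0, 1]
        )

    def forward(self, x, g):
        # x: encoder features, g: decoder features, assumed to have the same spatial size
        alpha = self.psi(self.w_x(x) + self.w_g(g))
        return x * alpha                                         # suppress irrelevant regions, keep instrument areas
```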

Inspired by previous research on surgical instrument segmentation, an effective segmentation network with an encoder–decoder structure is proposed to furnish an accurate pixel-level detection method for surgical instruments. Both quantitative and qualitative analyses show that the proposed model achieves promising performance on surgical instrument segmentation. The main contributions are summarized below.

(1) An attention-guided network is constructed for effective surgical instrument segmentation.

(2) To process local features effectively, residual paths are introduced to capture more context, which also facilitates the fusion of feature maps. Meanwhile, a non-local block is applied in the bottleneck layer to capture pixel-to-pixel dependencies and preserve more feature information (a minimal sketch of such a non-local block is given after this list).

(3) To address the class imbalance issue, a DAM block is introduced at the front end of the decoder to acquire sufficient attention features and suppress useless information; it extracts features in both the channel and spatial dimensions, highlighting the significant pixel areas of endoscopic images.
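
As referenced in contribution (2), the following is a compact PyTorch sketch of an embedded-Gaussian non-local block of the kind that can sit in the bottleneck layer to capture pixel-to-pixel dependencies. It is a generic formulation with an assumed channel-reduction ratio, not necessarily the authors' exact configuration.

```python
# Compact sketch of an embedded-Gaussian non-local block for the bottleneck layer.
# Channel reduction ratio is an assumption.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)   # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)     # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)       # value
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)             # (B, HW, C')
        k = self.phi(x).flatten(2)                               # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)                 # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)                      # pixel-to-pixel affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                                   # residual connection preserves original features
```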

The remainder of this paper is organized as follows: Section 2 describes the proposed segmentation method, including the overall network framework, the residual path, and the DAM block. Section 3 presents the experimental data sets, parameter settings, and evaluation indicators. Section 4 reports the detailed experimental results and analysis. Conclusions are given in Section 5.

Section snippets

Proposed segmentation network

Inspired by U-Net, a novel segmentation network is built for automatic surgical instrument segmentation of endoscopic images by introducing the DAM block, residual path, and non-local module. This section and the following subsections describe the proposed segmentation model and each network block in detail.
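
As an illustration of the dual-attention idea, the sketch below fuses low-level encoder features with high-level decoder features and re-weights the result along the channel and spatial dimensions. It follows a CBAM-style design; the exact structure of the paper's DAM may differ, and the module name and reduction ratio are assumptions.

```python
# Illustrative channel + spatial attention fusion in the spirit of the DAM.
# Not the authors' exact module; a CBAM-style sketch under stated assumptions.
import torch
import torch.nn as nn

class DualAttentionFusion(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Channel attention: squeeze spatial dimensions, emphasize informative channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: highlight instrument pixel areas, suppress background.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # low: encoder features, high: decoder features, same shape (B, C, H, W)
        f = self.fuse(torch.cat([low, high], dim=1))
        f = f * self.channel(f)
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        return f * self.spatial(pooled)
```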

Experiment dataset and settings

To verify the effectiveness and superiority of the proposed surgical instrument segmentation network, extensive segmentation experiments are carried out on two public surgical instrument data sets: the gastrointestinal endoscopy Kvasir-instrument set and the robot-assisted surgery Endovis2017 set. The details of the experimental data sets and model training are given in this section.
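
For reference, a small sketch of the Dice score and IoU evaluation indicators reported in the experiments is given below; the probability threshold and per-image computation shown here are assumptions about the evaluation protocol.

```python
# Per-image Dice score and IoU for binary instrument masks.
# Threshold and averaging conventions are assumptions, not the paper's exact protocol.
import torch

def dice_and_iou(pred, target, threshold=0.5, eps=1e-7):
    """pred: predicted probabilities, target: ground-truth mask, both (H, W) tensors."""
    p = (pred > threshold).float()
    t = (target > 0.5).float()
    inter = (p * t).sum()
    union = p.sum() + t.sum() - inter
    dice = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice.item(), iou.item()
```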

Experimental results and analysis

This section presents the detailed experimental results and analysis of the proposed segmentation network on the public Kvasir-instrument and Endovis2017 sets, from both quantitative and qualitative perspectives.

First, several advanced segmentation models widely used for medical image segmentation are adopted as comparative baselines to illustrate the effect of the proposed network. Then, ablation experiments are conducted to show the effectiveness of

Conclusion

In this work, an effective surgical instrument segmentation network with an encoder–decoder architecture is proposed to provide an automatic detection scheme. To take full advantage of local context features, residual paths are introduced to optimize the plain skip connections and relieve the semantic gap issue. Meanwhile, a non-local block is employed to capture global context for feature enhancement. Further, for the segmentation task affected by class

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors thank the reviewers in advance for their reviews and will gladly revise the manuscript according to their valuable comments.

References (38)

  • Minaee, S., et al. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. (2021).

  • Fu, L.-Y., et al. Transformer based U-shaped medical image segmentation network: A survey. J. Comput. Appl. (2022).

  • Sevak, J.S., et al. Survey on semantic image segmentation techniques.

  • Chen, D., et al. Geodesic paths for image segmentation with implicit region-based homogeneity enhancement. IEEE Trans. Image Process. (2021).

  • Khadidos, A., et al. Weighted level set evolution based on local edge features for medical image segmentation. IEEE Trans. Image Process. (2017).

  • Long, J., et al. Fully convolutional networks for semantic segmentation.

  • Badrinarayanan, V., et al. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2017).

  • Ronneberger, O., et al. U-Net: Convolutional networks for biomedical image segmentation.

  • Zhou, Z., et al. UNet++: A nested U-Net architecture for medical image segmentation.

This work was supported by the National Key Research & Development Project of China (2020YFB1313701), the National Natural Science Foundation of China (No. 62003309), and the Outstanding Foreign Scientist Support Project in Henan Province of China (No. GZS2019008).
