An attention-guided network for surgical instrument segmentation from endoscopic images

https://doi.org/10.1016/j.compbiomed.2022.106216

Highlights

  • An attention-guided network is proposed for surgical instrument segmentation.

  • A residual path is proposed to realize effective feature representation.

  • A dual-attention block is proposed to highlight features of surgical instruments.

  • A non-local attention block is introduced to acquire global contexts.

Abstract

Accurate surgical instrument segmentation provides precise location and pose information to surgeons, helping them judge subsequent operations during robot-assisted surgery. Owing to their strong context-extraction ability, deep networks, especially U-Net and its variants, have driven significant advances in automatic surgical instrument segmentation. However, several problems still limit segmentation accuracy, such as insufficient processing of local features and class imbalance. To address these problems, an effective surgical instrument segmentation network with a typical encoder–decoder structure is proposed to provide an end-to-end detection scheme. Specifically, to tackle the insufficient processing of local features, residual paths are introduced for full feature extraction and to strengthen the propagation of low-level features. Furthermore, to enhance local feature maps, a non-local attention block is inserted into the bottleneck layer to acquire global context. In addition, to highlight the pixel regions of surgical instruments, a dual-attention module (DAM) is introduced that exploits both the high-level features from the decoder and the low-level features delivered by the encoder to obtain attention features and suppress irrelevant ones. To demonstrate the effectiveness and superiority of the proposed segmentation model, experiments are conducted on two public surgical instrument segmentation data sets, the Kvasir-instrument set and the Endovis2017 set, on which the model achieves a 95.77% Dice score and 92.13% mIoU on Kvasir-instrument and a 95.60% Dice score and 92.74% mIoU on Endovis2017. The experimental results show that the proposed model outperforms other advanced models on surgical instrument segmentation and can serve as a useful reference for the further development of intelligent surgical robots. The source code is available at https://github.com/lyangucas92/Surg_Net.

Introduction

Surgical robots have become increasingly popular in the medical field; they can effectively reduce the trauma that surgical procedures inflict on the human body and are more stable than manual surgery [1]. The precise segmentation and analysis of surgical images are the basis for the steady and efficient operation of surgical robots, providing the exact position and pose of surgical instruments to surgical robots and computer-aided systems [2]. However, challenging conditions such as varying illumination, blood, and complex textures make precise segmentation of surgical instruments very difficult, and even humans may have difficulty accurately identifying the position and pose of surgical instruments in some extreme cases [3], [4]. Therefore, precise surgical instrument segmentation lays a solid foundation for robot-assisted surgery.

As a long-standing problem in image analysis, image segmentation has attracted the enthusiasm of many researchers, and numerous methods have been designed for surgical instrument segmentation. In general, they can be divided into two classes: traditional image processing methods [5], [6], [7] and learning-based methods [8], [9], [10]. Traditional image processing methods typically require designing various mathematical models to divide a raw image into several disjoint regions with the help of specific salient features, such as gray level, color, spatial texture, and geometric shape [11]. Combined with region-based homogeneity enhancement, Chen et al. proposed a flexible interactive segmentation model based on the eikonal partial differential equation framework [12]. By constructing a local geodesic metric, it enhanced the relational representation of homogeneous regions and showed a good capability of integrating edge features and shape differences. Khadidos et al. proposed a boundary segmentation method based on the level set [13], which weighted local edge features according to their importance for contour segmentation and used the level set method to minimize an objective energy function, thereby achieving accurate segmentation of object boundaries. The traditional methods described above usually build specific models for different segmentation tasks to adapt to the scale changes and shape differences of the targets, which leads to poor robustness to complex samples and complicated backgrounds. Moreover, establishing an effective detection model requires rich domain experience, which greatly limits the generality and flexibility of such models across different segmentation tasks.

Supported by the infrastructure of high-performance computing equipment, learning-based segmentation methods, which differ from traditional image processing methods, have developed rapidly. Learning-based methods do not need to build specific models for specific targets; they automatically learn effective image features to realize end-to-end image segmentation and are generally implemented with convolutional neural networks (CNNs). The segmentation pipeline has also become increasingly simple with the widespread application of deep learning [14]. To improve inference efficiency, Long et al. presented the fully convolutional network (FCN) as a novel image segmentation method [15]. High-level features were extracted by cascaded convolutional layers and successive down-sampling operations, and the segmentation masks were recovered from these high-level features through successive convolution and upsampling operations. However, the repeated pooling operations cause a loss of information on details and small objects. To acquire more global contextual information, Badrinarayanan et al. proposed the SegNet model for automatic image segmentation. SegNet no longer adopted deconvolution for feature upsampling but directly used pooling indices to realize non-linear upsampling, which greatly reduced the number of parameters [16]. However, when unpooling was performed on lower-resolution features, the relationship between neighboring pixels was ignored. To address the insufficient attention to global contextual information, Ronneberger et al. proposed the U-Net, which added information transmission paths between mirrored layers so that the high-level features produced by the decoder can be fused with the low-level features delivered by the encoder [17]. It combines global and local information as much as possible to support segmenting the target as accurately as possible. In short, U-Net processes both high-level and low-level features more thoroughly, recovering detail features while localizing the approximate positions of different targets. Here, to ensure the detection precision of surgical instrument segmentation, a novel segmentation network is built on top of the U-Net architecture.
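
The following is a minimal PyTorch sketch of the encoder–decoder skip-connection idea behind U-Net discussed above. The module names, channel widths, and single-stage depth are illustrative assumptions for clarity, not the configuration of any of the cited networks.

```python
# Minimal sketch of a U-Net-style encoder-decoder with one skip connection.
# Channel widths and depth are illustrative assumptions, not a cited configuration.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with BatchNorm and ReLU, as in a typical U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One encoder stage, one bottleneck, one decoder stage with a skip connection.
    Input height and width are assumed to be divisible by 2."""
    def __init__(self, in_ch=3, num_classes=1):
        super().__init__()
        self.enc = conv_block(in_ch, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = conv_block(128, 64)          # 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        low = self.enc(x)                        # low-level features from the encoder
        high = self.bottleneck(self.pool(low))   # high-level features
        up = self.up(high)                       # recover spatial resolution
        fused = torch.cat([low, up], dim=1)      # skip connection fuses low- and high-level features
        return self.head(self.dec(fused))
```

The concatenation in the decoder is the "information transmission path between mirrored layers" referred to above; deeper variants simply repeat the encoder and decoder stages.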

On account of the significant breakthroughs of the U-Net, many researchers have focused on U-Net-style segmentation networks and introduced various mechanisms to pursue higher accuracy and efficiency. In the shallow layers of U-Net, when the low-level features delivered from the encoder are concatenated with feature maps from the decoder through skip connections, the two differ greatly and a semantic gap tends to form. To address this issue, Zhou et al. proposed the U-Net++ network, whose encoder and decoder are connected by nested dense convolutional blocks that mitigate the semantic gap between feature maps of different levels before fusion [18]. To obtain more useful features and ignore unnecessary ones, Oktay et al. introduced an attention gate block and proposed the attention U-Net [19]. The attention U-Net replaces the plain skip connections with attention gates, using feature maps with higher-level semantics to guide lower-level features in selecting regions of interest, which provides an effective means of handling the class imbalance issue. Taking full advantage of the extracted local features can effectively improve segmentation accuracy. Verifying this idea, Huang et al. proposed the densely connected convolutional network (DCCN), which makes intermediate features more abundant and achieves an obvious improvement in segmentation accuracy by reusing the features of each network layer [20]. In addition, for targets with large size differences, the segmentation model needs multiple receptive fields to segment multi-scale targets accurately at the same time. Zhang et al. introduced a multi-scale feature fusion block on the basis of dense connectivity to accurately segment targets with great scale differences simultaneously [21]. With these different blocks, the segmentation accuracy of the above networks has improved to some extent, but deficiencies remain, such as insufficient processing of local feature maps and the class imbalance issue.
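
A condensed PyTorch sketch of the attention-gated skip connection described for the attention U-Net [19] is shown below. In the original design the gating signal comes from a coarser decoder level and is resampled; here both inputs are assumed to share the same spatial size for brevity, and the channel sizes are illustrative assumptions.

```python
# Condensed sketch of an attention gate on a skip connection:
# a decoder-side gating signal weights the encoder features before fusion.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, enc_ch, dec_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)   # project encoder (skip) features
        self.w_g = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)   # project decoder (gating) features
        self.psi = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_ch, 1, kernel_size=1),
            nn.Sigmoid(),                                        # per-pixel attention coefficients in [0, 1]
        )

    def forward(self, x, g):
        # x: encoder features, g: decoder features, assumed to have the same spatial size
        alpha = self.psi(self.w_x(x) + self.w_g(g))
        return x * alpha                                         # suppress irrelevant regions, keep instrument areas
```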

Inspired by previous research on surgical instrument segmentation, an effective segmentation network with an encoder–decoder structure is proposed to furnish an accurate pixel-level detection method for surgical instruments. Both quantitative and qualitative analyses show that the proposed model achieves promising performance on surgical instrument segmentation. The main contributions are summarized below.

(1) An attention-guided network is constructed for effective surgical instrument segmentation.

(2) To process local features effectively, residual paths are introduced to capture more context, which also facilitates the fusion of feature maps. Meanwhile, a non-local block is applied in the bottleneck layer to capture pixel-to-pixel dependencies and preserve more feature information (a minimal sketch of such a non-local block is given after this list).

(3) To address the class imbalance issue, a DAM block is introduced at the front end of the decoder to acquire sufficient attention features and suppress useless information; it extracts features in both the channel and spatial dimensions, highlighting the significant pixel areas of endoscopic images.
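
As referenced in contribution (2), the following is a compact PyTorch sketch of an embedded-Gaussian non-local block of the kind that can sit in the bottleneck layer to capture pixel-to-pixel dependencies. It is a generic formulation with an assumed channel-reduction ratio, not necessarily the authors' exact configuration.

```python
# Compact sketch of an embedded-Gaussian non-local block for the bottleneck layer.
# Channel reduction ratio is an assumption.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)   # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)     # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)       # value
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)             # (B, HW, C')
        k = self.phi(x).flatten(2)                               # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)                 # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)                      # pixel-to-pixel affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                                   # residual connection preserves original features
```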

The remainder of this paper is organized as follows: Section 2 describes the proposed segmentation method, including the overall network framework, the residual path, and the DAM block. Section 3 presents the experimental data sets, parameter settings, and evaluation indicators. Section 4 reports the detailed experimental results and analysis. Conclusions are given in Section 5.

Section snippets

Proposed segmentation network

Inspired by U-Net, a novel segmentation network is built for automatic surgical instrument segmentation of endoscopic images by introducing the DAM block, residual path, and non-local module. This section and the following subsections describe the proposed segmentation model and each network block in detail.
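
As an illustration of the dual-attention idea, the sketch below fuses low-level encoder features with high-level decoder features and re-weights the result along the channel and spatial dimensions. It follows a CBAM-style design; the exact structure of the paper's DAM may differ, and the module name and reduction ratio are assumptions.

```python
# Illustrative channel + spatial attention fusion in the spirit of the DAM.
# Not the authors' exact module; a CBAM-style sketch under stated assumptions.
import torch
import torch.nn as nn

class DualAttentionFusion(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Channel attention: squeeze spatial dimensions, emphasize informative channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: highlight instrument pixel areas, suppress background.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # low: encoder features, high: decoder features, same shape (B, C, H, W)
        f = self.fuse(torch.cat([low, high], dim=1))
        f = f * self.channel(f)
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        return f * self.spatial(pooled)
```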

Experiment dataset and settings

To verify the effectiveness and superiority of the proposed surgical instrument segmentation network, extensive segmentation experiments are carried out on two public surgical instrument data sets: the gastrointestinal endoscopy Kvasir-instrument set and the robot-assisted surgery Endovis2017 set. The details of the experimental data sets and model training are given in this section.
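
For reference, a small sketch of the Dice score and IoU evaluation indicators reported in the experiments is given below; the probability threshold and per-image computation shown here are assumptions about the evaluation protocol.

```python
# Per-image Dice score and IoU for binary instrument masks.
# Threshold and averaging conventions are assumptions, not the paper's exact protocol.
import torch

def dice_and_iou(pred, target, threshold=0.5, eps=1e-7):
    """pred: predicted probabilities, target: ground-truth mask, both (H, W) tensors."""
    p = (pred > threshold).float()
    t = (target > 0.5).float()
    inter = (p * t).sum()
    union = p.sum() + t.sum() - inter
    dice = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice.item(), iou.item()
```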

Experimental results and analysis

This section presents the detailed experimental results and analysis of the proposed segmentation network on the public Kvasir-instrument and Endovis2017 sets, from both quantitative and qualitative perspectives.

First, several advanced segmentation models widely used for medical image segmentation are adopted as comparative baselines to illustrate the effect of the proposed network. Then, ablation experiments are conducted to show the effectiveness of

Conclusion

In this work, an effective surgical instrument segmentation network with an encoder–decoder architecture is proposed to provide an automatic detection scheme. To take full advantage of local context features, residual paths are introduced to optimize the plain skip connections and relieve the semantic gap issue. Meanwhile, a non-local block is employed to capture global context for feature enhancement. Further, for the segmentation task affected by class

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors thank the reviewers in advance for their reviews and will gladly revise the manuscript according to their valuable comments.

References (38)

  • Minaee, S., et al. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. (2021).

  • Fu, L.-Y., et al. Transformer based U-shaped medical image segmentation network: A survey. J. Comput. Appl. (2022).

  • Sevak, J.S., et al. Survey on semantic image segmentation techniques.

  • Chen, D., et al. Geodesic paths for image segmentation with implicit region-based homogeneity enhancement. IEEE Trans. Image Process. (2021).

  • Khadidos, A., et al. Weighted level set evolution based on local edge features for medical image segmentation. IEEE Trans. Image Process. (2017).

  • Long, J., et al. Fully convolutional networks for semantic segmentation.

  • Badrinarayanan, V., et al. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2017).

  • Ronneberger, O., et al. U-Net: Convolutional networks for biomedical image segmentation.

  • Zhou, Z., et al. UNet++: A nested U-Net architecture for medical image segmentation.

This work was supported by the National Key Research & Development Project of China (2020YFB1313701), the National Natural Science Foundation of China (No. 62003309), and the Outstanding Foreign Scientist Support Project in Henan Province of China (No. GZS2019008).
