TANet: Triple Attention Network for medical image segmentation

https://doi.org/10.1016/j.bspc.2023.104608

Highlights

  • Helps clinicians make diagnoses by automatically locating and marking diseased tissues.

  • Low-level features contribute little to segmentation but incur substantial computational cost.

  • Correlation between channels and spatial positions can be exploited for medical image segmentation.

  • Attention mechanisms allow networks to focus on the areas of interest.

  • Selecting the most adaptable scale features can solve the scale adaptation problem.

Abstract

In recent years, deep learning-based methods have achieved remarkable progress in medical image processing, such as polyp segmentation in colonoscopy images and skin lesion segmentation in dermoscopy images. However, current state-of-the-art medical segmentation methods still suffer from low accuracy when segmenting small-scale and variable-scale objects. To solve this problem, we propose the Triple Attention Network (TANet). In TANet, a novel Triple Attention Module (TAM) is presented. TAM has two sub-modules: the Multi-scale Feature Selection Module (MFSM) and the Contextual Feature Extraction Module (CFEM). MFSM extracts more adaptable multi-scale features for capturing variable-scale objects, while CFEM captures small-scale objects by extracting contextual features. TAM combines MFSM and CFEM to enhance segmentation performance on medical images with small-scale and variable-scale lesions. Extensive experiments are conducted on five polyp datasets and one skin lesion dataset. The results show that the proposed model outperforms the previous state-of-the-art models on most evaluation metrics and improves the Dice score by up to 7.1%. All results consistently confirm the effectiveness of the proposed TANet and show that it achieves state-of-the-art performance on these datasets.

Introduction

With the advent of digital medical imaging equipment, the application of image processing technology in medical image analysis has received extensive attention. Medical image segmentation is an active and important field in medical image analysis. It helps clinicians make diagnoses by automatically locating and marking diseased tissues. Therefore, automatic medical image segmentation is significant for facilitating quantitative pathological evaluation, treatment planning, and monitoring of disease progression [1]. However, due to various factors, such as background artifacts, noise, varied shapes and sizes of lesions, and blurred boundaries, accurate segmentation remains a challenging task.

In recent years, Convolutional Neural Networks (CNNs) have made great progress on computer vision tasks, such as medical image classification [2], [3], [4], object detection [5], [6], [7] and image retrieval [8]. Unsurprisingly, CNNs have also achieved strong results on semantic segmentation tasks. Ciresan et al. [9] propose a sliding-window-based pipeline using a CNN for semantic segmentation. Long et al. [10] propose the fully convolutional network (FCN), which removes the fully connected layers and uses only convolutional layers for segmentation. Based on FCN, SegNet [11] employs a symmetrical encoder–decoder architecture: the encoder extracts spatial features, then the decoder restores the low-resolution feature maps to the original resolution and predicts the segmentation masks.

Naturally, CNNs have also been introduced for medical image segmentation. U-Net [12] is one of the most popular CNN-based methods for this task. Similar to SegNet, U-Net includes an encoding path and a decoding path. The encoding path gradually reduces the feature map resolution and learns sophisticated features of the input image. The decoding path restores the low-resolution feature maps to the original input size by upsampling. However, it is well known that downsampling leads to the loss of meaningful information and degrades segmentation performance [13]. To overcome this problem, U-Net introduces skip connections, which concatenate features from the encoder and decoder to obtain more meaningful features. U-Net beat FCN and achieved state-of-the-art (SOTA) performance on medical images at the time. Since then, many U-Net variants have been proposed, including U-Net++ [14], R2UNet [15], Attention-UNet [16], ConvLSTMU-Net [17], etc.
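The interplay of downsampling, upsampling, and skip connections described above can be illustrated with a minimal NumPy sketch. The pooling and upsampling operators and the array shapes here are illustrative assumptions for a single-channel feature map, not U-Net's actual learned layers:

```python
import numpy as np

def down(x):
    """2x2 max pooling: halves the resolution and discards fine detail."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def up(x):
    """Nearest-neighbour upsampling back to the original resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(8, 8)       # encoder feature map
decoded = up(down(x))          # resolution restored, but detail is lost
skip = np.stack([decoded, x])  # skip connection: concatenate the upsampled
                               # decoder map with the original encoder features
print(skip.shape)  # (2, 8, 8)
```

The concatenated tensor gives the decoder direct access to the high-resolution encoder features that pooling destroyed, which is the motivation for U-Net's skip connections.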

However, when the lesions in medical images vary significantly in size and shape, existing encoder–decoder-based architectures such as U-Net cannot produce accurate segmentation masks; we call this the scale adaptation problem. In this case, existing encoder–decoder-based architectures do not provide sufficient multi-scale features for generating accurate segmentations. A common way to tackle this problem is to design new skip connections to explore multi-scale features, as in MDU-Net [18], H-DenseUNet [19] and U-Net++ [14]. In addition, multi-scale-based methods have been developed to deal with the scale adaptation problem. The atrous spatial pyramid pooling (ASPP) module [20] and the pyramid pooling module (PPM) [21] are widely used to extract multi-scale features. For example, PoolNet [22] processes the feature maps from the deepest layer via multiple parallel pooling operations with different kernel sizes. CE-Net [23] adopts multiple dilated convolution branches with different dilation rates to obtain rich multi-scale context features. Although networks with new skip connections and multi-scale-based methods can alleviate the scale adaptation problem to a certain extent, they cannot automatically select the most adaptable scale features from the extracted multi-scale features, which is necessary for accurate segmentation of medical images [24].
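To make the two ideas concrete, the following NumPy sketch builds parallel dilated branches in the spirit of ASPP/CE-Net, then fuses them with a selective-kernel-style softmax gate over per-branch global descriptors [24]. Every function name, the averaging kernel, and the dilation rates are illustrative assumptions, not the implementation of any of the cited methods:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dilated_conv2d(x, kernel, rate):
    """Naive single-channel 2D convolution with dilation `rate` ('same' padding)."""
    k = kernel.shape[0]
    eff = rate * (k - 1) + 1          # effective receptive-field size
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + eff:rate, j:j + eff:rate] * kernel)
    return out

def multi_scale_features(x, rates=(1, 2, 4)):
    """One dilated branch per rate; branches stacked along axis 0."""
    kernel = np.ones((3, 3)) / 9.0    # illustrative averaging kernel
    return np.stack([dilated_conv2d(x, kernel, r) for r in rates])

def select_scale(branches):
    """Softmax gate over global-average-pooled branch descriptors."""
    desc = branches.mean(axis=(1, 2))       # one descriptor per scale
    w = softmax(desc)                       # attention weights over scales
    return np.tensordot(w, branches, axes=1)  # weighted fusion of branches

feat = np.random.rand(16, 16)
branches = multi_scale_features(feat)  # (3, 16, 16): one map per dilation rate
fused = select_scale(branches)         # (16, 16): adaptively selected scale mix
print(branches.shape, fused.shape)
```

A plain ASPP-style design would stop at `multi_scale_features` and concatenate the branches; the `select_scale` gate is what the text argues is missing from such methods.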

Additionally, in medical images containing multiple lesions of different sizes, the context of large-scale lesions may harm the segmentation of small-scale lesions, resulting in the so-called small-scale lesion problem. One class of methods [25], [26] attempts to capture more small-scale lesion features by fusing multi-scale context features. Another class [27], [28] addresses the problem by extracting global contextual features, using enlarged kernel sizes or an effective encoding layer on top of the network. However, all these methods only conduct simple explorations of global contextual features and do not model the relationships between these features, so features from dominant salient objects (e.g., large-scale lesions) still affect the segmentation of inconspicuous objects (e.g., small-scale lesions).

To solve the scale adaptation problem and the small-scale lesion problem in medical image segmentation, and to improve segmentation performance as much as possible, we propose a novel network architecture named Triple Attention Network (TANet). TANet simultaneously fuses scale attention, position attention and channel attention, and uses Res2Net [29] as the encoder backbone to extract features. Moreover, a novel module, the Triple Attention Module (TAM), is proposed in TANet to address the two problems above. TAM consists of the Multi-scale Feature Selection Module (MFSM) and the Contextual Feature Extraction Module (CFEM), which it combines to extract discriminative features sufficiently and efficiently. To solve the scale adaptation problem, TAM utilizes MFSM to extract more multi-scale features from the high-level layers and to dynamically select the most adaptable scale features among them. Simultaneously, TAM uses CFEM to highlight the feature representations of small-scale lesions and avoid the influence of large-scale lesions. CFEM exploits the correlation in the channel and spatial dimensions between similar lesion features, introducing a self-attention mechanism to establish inter-pixel and inter-channel correlations. In this manner, CFEM can selectively aggregate similar lesion features at any scale and improve their feature representations. Therefore, CFEM not only obtains global contextual features but also captures the relationships between them.
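The inter-pixel and inter-channel correlations that CFEM relies on can be sketched with generic position and channel self-attention in NumPy. This is a minimal illustration of the mechanism only: a real module would add learned query/key/value projections and residual scaling, none of which are shown here, and the function names are our own:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(x):
    """Re-weight each pixel by its affinity to every other pixel.
    x: (C, H, W) feature map."""
    C, H, W = x.shape
    f = x.reshape(C, H * W)           # flatten spatial dimensions
    attn = softmax(f.T @ f)           # (HW, HW) inter-pixel affinity
    out = f @ attn.T                  # aggregate features of similar positions
    return out.reshape(C, H, W)

def channel_attention(x):
    """Re-weight each channel by inter-channel affinity."""
    C, H, W = x.shape
    f = x.reshape(C, H * W)
    attn = softmax(f @ f.T)           # (C, C) inter-channel affinity
    return (attn @ f).reshape(C, H, W)

x = np.random.rand(4, 8, 8)
y = position_attention(x) + channel_attention(x)  # fused contextual features
print(y.shape)  # (4, 8, 8)
```

Because every pixel attends to every other pixel regardless of distance, similar lesion features are aggregated at any scale, which is the property the text attributes to CFEM.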

The main contributions of this paper are summarized as follows:

(1) A novel network architecture, Triple Attention Network (TANet), is presented for medical image segmentation tasks. TANet outperforms the existing SOTA methods on five polyp datasets and one skin lesion dataset.

(2) A new network module, the Triple Attention Module (TAM), is proposed to address the scale adaptation problem and the small-scale lesion problem. TAM captures more discriminative features by combining the Multi-scale Feature Selection Module and the Contextual Feature Extraction Module.

(3) Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. All experimental results consistently show that our method is superior to the existing medical image segmentation methods.

Section snippets

Related works

Traditional medical image segmentation methods are mainly based on hand-crafted features [30], [31], [32], [33], [34], [35]. These methods not only require a large amount of manual effort, but also tend to produce misjudgments or over-segmentation. In recent years, many CNN-based methods have been proposed and have made brilliant achievements in medical image analysis [10], [11], [12], [36], [37]. Among these CNN-based methods, encoder–decoder or U-shape-based networks are prevalent for medical image

Overview of network architecture

In this paper, we propose a new network, called Triple Attention Network (TANet), for medical image segmentation tasks. The overall architecture of TANet is presented in Fig. 1. TANet mainly includes three parts: (1) Feature Encoder; (2) Triple Attention Module (TAM), and (3) Feature Aggregation Module (FAM). Compared with traditional U-shape architecture (e.g., U-Net), the encoder and decoder in our model are not entirely symmetrical. According to the observations from [13], [49], the

Experiments

Experiments are conducted on two types of medical images, namely polyp and skin lesion images. The proposed TANet and the existing SOTA methods are compared on the polyp segmentation task and skin lesion segmentation task. The details of the experiments are presented in the following subsections.

Conclusions

In this paper, we have proposed a novel deep network called Triple Attention Network (TANet) for medical image segmentation. The Triple Attention Module (TAM) presented in TANet can capture more discriminative features by combining the proposed Multi-scale Feature Selection Module (MFSM) and Contextual Feature Extraction Module (CFEM). In TAM, MFSM can extract more multi-scale features and select adaptable scale features from all features. With these adaptable scale features, CFEM can extract

CRediT authorship contribution statement

Xin Wei: Conceptualization, Methodology, Software, Writing – review & editing. Fanghua Ye: Data curation, Writing – original draft. Huan Wan: Investigation. Jianfeng Xu: Software, Validation. Weidong Min: Writing – review & editing.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Xin Wei reports financial support was provided by National Natural Science Foundation of China. Weidong Min reports financial support was provided by National Natural Science Foundation of China. Weidong Min reports financial support was provided by Jiangxi key Laboratory of Smart City. Xin Wei reports a relationship with Beijing Jiaotong University that

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 62106093, 62076117, 62106090), Jiangxi Key Laboratory of Smart City (Grant No. 20192BCD40002), the Urgent Need for Overseas Talent project (Grant No. 20223BCJ25040, 20223BCJ25026) and Jiangxi Training Program for Academic and the Technical Leaders in Major Disciplines - Leading Talents Project (Grant No. 20225BCJ22016).

References (72)

  • Saha, M., et al., Her2Net: A deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation, IEEE Trans. Image Process. (2018)
  • Nardelli, P., et al., Pulmonary artery–vein classification in CT images using deep learning, IEEE Trans. Med. Imaging (2018)
  • Poudel, S., et al., Colorectal disease classification using efficiently scaled dilation in convolutional neural network, IEEE Access (2020)
  • Shin, H.-C., et al., Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Trans. Med. Imaging (2016)
  • Zhang, J., et al., Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks, IEEE Trans. Image Process. (2017)
  • Ding, L., et al., A novel deep learning pipeline for retinal vessel detection in fluorescein angiography, IEEE Trans. Image Process. (2020)
  • Ciresan, D., et al., Deep neural networks segment neuronal membranes in electron microscopy images, Adv. Neural Inf. Process. Syst. (2012)
  • Long, J., Shelhamer, E., Darrell, T., Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE...
  • Badrinarayanan, V., et al., SegNet: A deep convolutional encoder–decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • Ronneberger, O., et al., U-Net: Convolutional networks for biomedical image segmentation
  • Zhou, Z., et al., UNet++: A nested U-Net architecture for medical image segmentation
  • Alom, M.Z., et al., Recurrent residual U-Net for medical image segmentation, J. Med. Imaging (2019)
  • Oktay, O., et al., Attention U-Net: Learning where to look for the pancreas (2018)
  • Azad, R., Asadi-Aghbolaghi, M., Fathy, M., Escalera, S., Bi-directional ConvLSTM U-Net with densley connected convolutions, ...
  • Zhang, J., et al., MDU-Net: Multi-scale densely connected U-Net for biomedical image segmentation (2018)
  • Li, X., et al., H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes, IEEE Trans. Med. Imaging (2018)
  • Chen, L.-C., et al., Rethinking atrous convolution for semantic image segmentation (2017)
  • Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., Pyramid scene parsing network, in: Proceedings of the IEEE Conference on...
  • Liu, J.-J., Hou, Q., Cheng, M.-M., Feng, J., Jiang, J., A simple pooling-based design for real-time salient object...
  • Gu, Z., et al., CE-Net: Context encoder network for 2D medical image segmentation, IEEE Trans. Med. Imaging (2019)
  • Li, X., Wang, W., Hu, X., Yang, J., Selective kernel networks, in: Proceedings of the IEEE/CVF Conference on Computer...
  • Ding, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G., Context contrasted feature and gated multi-scale aggregation for scene...
  • Lin, G., Milan, A., Shen, C., Reid, I., RefineNet: Multi-path refinement networks for high-resolution semantic...
  • Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J., Large kernel matters–Improve semantic segmentation by global convolutional...
  • Wang, J., et al., Global context encoding for salient objects detection
  • Gao, S., et al., Res2Net: A new multi-scale backbone architecture, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
1 Equal contribution.