DOI: 10.1145/3702250.3702289
Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement for retinal vessel segmentation

Published: 31 December 2024

Abstract

Retinal vessels provide crucial information about eye health and systemic diseases. Thin vessels are particularly important as they can indicate early signs of disease. The challenges with thin vessels are that they have low contrast and can easily be confused with background noise. Variations in vessel width and intensity make uniform detection difficult, and the presence of other retinal structures (e.g., optic disc, fovea) can further complicate segmentation. In the context of retinal vessel segmentation, attention can help the model focus on areas likely to contain thin vessels. To ensure feature diversity, it is crucial to incorporate both fine-grained details (such as edges and textures) and broader, more complex elements (such as intricate objects or structural components). In this paper, we propose a Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement. The proposed architecture preserves U-Net's core design while innovating its components. Both the encoder and decoder are built from bottleneck transformer structures, a modification that substantially decreases computational demands compared to conventional Transformer models. Furthermore, the proposed BoT U-Net combines convolutional and Transformer elements (a hybrid design), eliminating the need for pre-trained weights. The architecture employs a Hierarchical Feature Integration Block (HFIB) and integrates a Contextual Feature Enhancement Block (CFEB) between the encoding and decoding stages to enhance segmentation performance. Additionally, backward fusion mechanisms aggregate and reintegrate context, enhancing the feature refinement process. This bidirectional fusion ensures effective capture and utilization of both local and global contexts. The proposed network achieves competitive retinal segmentation results on three public benchmark datasets: DRIVE, CHASE_DB1, and STARE.

1 Introduction

Retinal vessel segmentation is a crucial task in ophthalmology and medical image analysis, playing a vital role in the diagnosis and monitoring of various ocular and systemic diseases. This process involves the automated identification and delineation of blood vessels within retinal images, ranging from large arteries and veins to the finest capillaries [22]. Of particular importance and challenge is the accurate segmentation of thin vessels, which are often indicators of early-stage disease or subtle changes in retinal health.
Thin vessels can be early indicators of diseases such as diabetic retinopathy, glaucoma, and hypertensive retinopathy. Changes in these fine structures often precede more noticeable symptoms. Thin vessels are inherently difficult to detect due to their low contrast, small size, and similarity to background noise in retinal images. This makes their accurate segmentation a critical and challenging aspect of automated retinal analysis. Recent developments in imaging technology and computational methods have made it increasingly feasible to detect and analyze these fine vascular structures, opening new avenues for early disease detection and treatment monitoring.
In contemporary medical practice, physical examination remains the primary method for assessing retinal fundus abnormalities. However, this approach is not only time-consuming and laborious but also demands substantial medical expertise from physicians. Consequently, there is a pressing need for automated and precise retinal vessel segmentation from fundus images to support doctors in their diagnostic processes. To address this challenge, researchers have explored various machine learning techniques. These include the application of Gabor wavelet filters for vascular segmentation [29] and high-pass filtering to enhance vessel visibility [28]. Additionally, some studies have employed probabilistic methods such as the EM (Expectation-Maximization) maximum likelihood estimation [9] to differentiate between background pixels and retinal vessels. These efforts aim to streamline the diagnostic process and improve the accuracy of retinal abnormality detection.
The advent of deep learning has transformed the landscape of computer vision and machine learning. As deep learning technologies have evolved, automatic feature learning methods such as convolutional neural networks (CNNs) and fully convolutional networks (FCNs) have largely supplanted traditional manual feature extraction techniques. These methods have demonstrated remarkable efficacy in medical image segmentation tasks, including the delineation of blood vessels. In a pivotal work, Ronneberger et al. introduced U-Net [26]; characterized by its distinctive symmetric U-shaped structure, it marked a significant leap forward in medical image segmentation. U-Net's ability to effectively capture intricate details and boundary information in images set a new standard in the field. Building upon U-Net's success, a diverse array of novel network architectures has emerged, including UNet++, DU-Net, cross-modality learning, and R2UNet [3, 10, 13, 37], amongst others.
While convolutional neural networks (CNNs) excel at feature extraction, they are constrained by the limited receptive field of their convolutional kernels. This limitation means that CNNs are adept at processing local information but struggle to capture global contextual information effectively. Moreover, the challenge in retinal vessel segmentation lies in achieving precise pixel-level classification rather than broader image-level categorization. To address these limitations, researchers have begun incorporating the Transformer framework into computer vision tasks. This approach aims to overcome the localized focus of CNNs and enhance the ability to perform accurate, fine-grained segmentation by leveraging the Transformer’s capacity to model long-range dependencies and global context. Vision Transformers (ViTs) [5] offer significant advantages over CNN-based networks, including a broader receptive field and learned feature aggregation. These characteristics enable ViTs to transmit contextual and semantic information more effectively throughout the network, resulting in enhanced localization capabilities. However, these benefits come with increased demands for training data and computational resources.
This article presents a novel approach: a Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement. The proposed architecture preserves U-Net's core design while innovating its components. The novel contributions are three-fold:
Both the encoder and decoder are built from grouping and bottleneck structures, minimizing the computational time and cost associated with Transformer models. Furthermore, the proposed BoT U-Net combines convolutional and Transformer elements (a hybrid design), eliminating the need for pre-trained weights.
The HFI BoT-UNet architecture employs a Hierarchical Feature Integration Block (HFIB) and integrates a Contextual Feature Enhancement Block (CFEB) between the encoding and decoding stages to enhance segmentation performance.
Additionally, backward fusion mechanisms are employed to aggregate and reintegrate context, enhancing the feature refinement process. This bidirectional fusion ensures effective utilization of both local and global contexts.

2 Related Works

In recent years, CNN-based approaches have made significant strides in semantic and medical image segmentation, owing to their efficient feature extraction and robust representation capabilities. The Convolutional Block Attention Module (CBAM) [34] network stands out as a seminal work, integrating channel and spatial attention in a novel way. Building on this concept, the CA-Net [6] takes a more comprehensive approach by incorporating channel, spatial, and location attention into a single framework. This integrated attention convolutional neural network offers a more interpretable solution for medical image segmentation tasks. The VA-UFL [30] method combines the insights of visual attention mechanisms with multi-scale contextual information to selectively highlight the most pertinent parts of a structure within a given local patch. In CS2-Net [20] two self-attention mechanisms are utilized within the channel and spatial domains to produce attention-aware expressive features. These mechanisms improve the network’s ability to capture long-range dependencies and efficiently utilize the multi-channel space for feature representation and normalization, thereby enhancing the network’s ability to distinguish curvilinear structures from the background.
The Group Transformer Network (GT-Net) uses bottleneck structures to reduce transformer complexity, facilitating CNN-transformer integration [14]. IterNet [12], recognized for utilizing iterative UNet applications to improve segmentation outcomes, has emerged as one of the most effective methods; nonetheless, its use of arbitrary-angle rotation as a data augmentation strategy presents difficulties, resulting in ambiguous ground-truth labeling near vessel boundaries. M2-UNet [11] combines MobileNet-v2 with U-Net to achieve its segmentation results. [25] utilizes a U-Net architecture with an attention mechanism to concentrate primarily on relevant areas of the input image, complemented by the unfolded deep kernel estimation (UDKE) method to improve the performance of semantic segmentation models. FANet enhances segmentation by integrating the previous epoch's predicted mask with current feature maps, creating an attention mechanism during training; this process is followed by a rectification step, resulting in refined final predictions [32]. This technique was specifically developed to tackle common problems in vessel segmentation, such as the under-segmentation of faint vessels and edge pixels. By allowing the network weights to adapt dynamically during training and inference, it aims to improve the accuracy and robustness of the segmentation process, particularly for challenging areas of the image.
[31] proposes noise suppression through BM3D filtering combined with a multi-scale line detection technique, aimed at improving specificity; additionally, directional triple-stick filtering is employed to enhance the detection of small vessels. Accurately segmenting retinal vessels is crucial for diagnosing eye diseases but challenging due to scale variations, complex vessel anatomy, and low contrast with the background. To address these challenges, [35] proposed a novel scale and context sensitive network (SCS-Net) for retinal vessel segmentation, incorporating a scale-aware feature aggregation (SFA) module for dynamic multi-scale feature extraction, an adaptive feature fusion (AFF) module for efficient hierarchical fusion, and a multi-level semantic supervision (MSS) module for refining vessel maps. [15] introduces an edge detection-based dual encoder to preserve vessel edges and efficiently capture channel features. Convolutional neural networks are commonly used for image segmentation, but challenges such as ambiguous tumor edges, variable lesions, and weak vessel boundaries hinder their accuracy; to address these issues, [17] introduced a dual-tree complex wavelet scattering transform module within a novel learning scattering wavelet network, along with an improved active contour loss function for handling complex segmentation tasks. Segmentation methods struggle with accurate fine vessel detection due to information loss from pooling and insufficient local context processing in skip connections. To address this, [1] proposed a retinal vessel segmentation network that combines a residual depth-wise over-parameterized convolution (ResDO-conv) for enhanced feature extraction, a pooling fusion block (PFB) to mitigate information loss, and an attention fusion block (AFB) to improve multi-scale feature expression. Although U-Net and its variants perform well in image segmentation, they often suffer from feature loss in the encoder and mismatched contextual information due to skip connections; to address these issues, [33] proposed an improved retinal vessel segmentation method that incorporates ResNest for enhanced feature extraction and introduces a depthwise FCA Block (DFB) to handle local context mismatches. [8] proposes a U-Net variant for improved retinal vessel segmentation: a minimal U-Net (Mi-UNet) with only 0.07M parameters, compared to 31.03M in the traditional U-Net. Building on Mi-UNet, [8] presents Salient U-Net (S-UNet), a bridge-style architecture with a saliency mechanism that uses a cascading technique to enhance input images and address data imbalance.
In TP-UNet [18], the HaarNet framework combines elements from the UNet and SegNet architectures to create a novel three-pathway UNet structure. This design is further enhanced by integrating auto-encoding (AE) capabilities and deep supervised learning (DSL) strategies; the resulting model, referred to as TP-UNet+AE+DSL, leverages these additional features to boost its overall effectiveness in image segmentation tasks.
Despite the use of many different algorithms, accurate segmentation of small, low-contrast structures in medical images remains an important problem. One approach is the use of engineered feature extraction from medical images to detect thin and irregular appearances based on their unique shape and structural features. In another work, Xiao et al. [36] proposed a modified stick filter that employed three parallel sticks at equal distances to enhance the detection of slender formations; however, this method relied only on magnitude information and disregarded directional cues. To overcome this issue, Peng et al. [24] developed a method that combined both magnitude and orientation information to differentiate thin, irregular objects from surrounding structures more effectively, although their methodology failed when dealing with distorted or broken shapes. Another advancement by Liu et al. [16] applied geodesic models to handle shortcut-joining issues during curvilinear structure delineation, where different path types are allowed for each pixel; nonetheless, this technique applies only to planar cases, since it cannot extend beyond two dimensions. Modern approaches consist of various deep learning architectures for segmentation of curvilinear objects that significantly surpass prior methods in the medical imaging domain. For example, Ma et al. [19] proposed an anatomically grounded dual-branch cascaded neural network to segment thin, irregular objects in medical images, and Roy et al. [27] developed a deep learning architecture for multi-view detection of bright, thin curved structures.
Despite its effectiveness, the Vision Transformer (ViT) has a significant limitation: it typically requires pre-training on massive image datasets to perform well, which can lead to suboptimal results when applied to smaller, domain-specific datasets. To address this issue, Srinivas et al. introduced BoTNet, an innovative instance segmentation backbone that combines elements of both Transformers and convolutional networks. BoTNet's design takes into account the high computational demands of Transformer architectures: instead of fully replacing convolutional layers, it strategically integrates Transformer blocks into the final layers of a ResNet architecture. This hybrid approach aims to harness the strengths of both paradigms: the efficient local processing of CNNs and the long-range dependency modeling of Transformers. By doing so, BoTNet strikes a balance between computational efficiency and the ability to capture global context, potentially offering improved performance on smaller datasets without the need for extensive pre-training.

3 Proposed Methodology

The architecture of the Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement uses a bottleneck transformer in both the encoder and decoder, as shown in Fig. 1. The transformer block is composed of 3×3 convolutions, a multi-head self-attention (MHSA) module, and a bottleneck module. To incorporate multiscale features, a Hierarchical Feature Integration Block (HFIB) extracts characteristic features at multiple kernel levels and integrates them, while the channel mapping is enhanced by the Contextual Feature Enhancement Block (CFEB). The network is an image segmentation model that employs a combination of encoding and decoding pathways with backward feature fusion to enhance feature extraction. The encoding path consists of multiple convolutional layers and MHSA interspersed with max-pooling operations, which progressively reduce the spatial dimensions while capturing more abstract features. To enrich these features, Hierarchical Feature Integration Blocks (HFIB) are integrated at each level, providing comprehensive contextual information. Following the encoding phase, a Contextual Feature Enhancement Block (CFEB) further refines the features from the deepest layer, boosting their representational capacity. The decoding path then up-samples these features through up-convolution layers, progressively reconstructing the spatial dimensions. At each decoding stage, the model concatenates the up-sampled features with the corresponding encoding layer outputs via skip connections, preserving high-resolution details. Additionally, HFIB and backward fusion mechanisms are employed to aggregate and reintegrate context, enhancing the feature refinement process. This bidirectional fusion ensures effective utilization of both local and global contexts. A final 1×1 convolutional layer generates the output segmentation map. This architecture allows the network to maintain high spatial resolution and leverage rich contextual information, making it highly effective for fine-grained image segmentation tasks.
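To make the data flow concrete, the following schematic PyTorch sketch mirrors Fig. 1 at a high level. It is an illustration only: BoTBlock, HFIB, and CFEB (detailed in Sections 3.1-3.4) are replaced here by plain convolutional stand-ins so the sketch stays self-contained, and the depth, channel widths, input channels, and the exact wiring of the backward (bidirectional) fusion path are assumptions rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Stand-in for the BoT encoder/decoder block of Sec. 3.1 (plain convolutions
    # here, purely for illustration).
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class SchematicHFIBoTUNet(nn.Module):
    """Schematic flow of Fig. 1: encode (block + HFIB + pool), refine the deepest
    features with CFEB, decode (up-conv + skip concatenation + block), then a
    final 1x1 convolution. The backward-fusion path is omitted from this sketch."""
    def __init__(self, in_ch=1, out_ch=1, widths=(32, 64, 128)):
        super().__init__()
        chans = [in_ch, *widths]
        self.enc = nn.ModuleList(conv_block(chans[i], chans[i + 1]) for i in range(len(widths)))
        self.hfib = nn.ModuleList(conv_block(w, w) for w in widths)   # HFIB stand-ins
        self.cfeb = conv_block(widths[-1], widths[-1])                # CFEB stand-in
        self.up = nn.ModuleList(nn.ConvTranspose2d(widths[i], widths[i - 1], 2, 2)
                                for i in range(len(widths) - 1, 0, -1))
        self.dec = nn.ModuleList(conv_block(2 * widths[i - 1], widths[i - 1])
                                 for i in range(len(widths) - 1, 0, -1))
        self.head = nn.Conv2d(widths[0], out_ch, 1)                   # final 1x1 conv
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skips = []
        for i, (enc, hfib) in enumerate(zip(self.enc, self.hfib)):
            x = hfib(enc(x))                     # encoder block + HFIB enrichment
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        x = self.cfeb(x)                         # CFEB refines the deepest features
        for up, dec in zip(self.up, self.dec):
            x = torch.cat([up(x), skips.pop()], dim=1)   # skip connection
            x = dec(x)
        return self.head(x)
```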
Figure 1: Overall architecture of Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement
Figure 2: BoT-Net with enlarged MHSA module

3.1 Bottleneck transformer block (BoT Block)

Despite the success of deep neural networks in supervised learning for computer vision and NLP, their practical implementation is hindered by the need for extensive annotated data. This challenge is particularly pronounced with Transformer architectures, especially when applied to image segmentation tasks. Unlike text, which uses a finite set of words, images comprise countless pixels, significantly increasing computational complexity. To address this, researchers have developed the BoT-UNet architecture, featuring a novel BoTBlock component. This module combines convolutional operations with multi-headed self-attention in a bottleneck structure, enhancing feature extraction for long-range relationships. The BoTBlock first reduces input channels via convolution, processes the compressed representation through multi-head self-attention, and then expands it back using additional convolutions. It also employs residual connections to preserve original input information. By integrating multi-head self-attention with convolutions, the BoTBlock efficiently handles both local and global information, making it highly effective for tasks requiring detailed spatial understanding, such as medical image segmentation. This approach streamlines computation while maintaining the powerful capabilities of Transformer models, offering a more resource-efficient solution to the challenges posed by complex image data. The BoTBlock can be represented by,
\begin{equation} \mathrm{BoTBlock}(x)=\mathrm{ReLU}\left[\mathrm{Conv}_3\left(\mathrm{AvgPool}\left[\mathrm{ReLU}\left(\mathrm{Conv}_2\left(\mathrm{MHSA}\left(\mathrm{ReLU}\left(\mathrm{Conv}_1(x)\right)\right)\right)\right)\right]\right)\right] \end{equation}
(1)
where Conv1, Conv2, and Conv3 are convolutional layers with different dimensions and strides, and MHSA denotes multi-head self-attention.
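As an illustration, a minimal PyTorch sketch of the BoTBlock in Eq. (1) is given below. The channel widths, kernel sizes, pooling stride, and number of attention heads are assumptions made for readability, and the residual connection mentioned above is omitted because Eq. (1) does not specify where it attaches.

```python
import torch
import torch.nn as nn

class BoTBlock(nn.Module):
    """Minimal sketch of Eq. (1): Conv1 -> MHSA -> Conv2 -> AvgPool -> Conv3,
    each followed by ReLU where the equation indicates it."""
    def __init__(self, in_ch, bottleneck_ch, out_ch, heads=4):
        super().__init__()
        # bottleneck_ch must be divisible by heads for nn.MultiheadAttention
        self.conv1 = nn.Conv2d(in_ch, bottleneck_ch, 1)                 # compress channels
        self.attn = nn.MultiheadAttention(bottleneck_ch, heads, batch_first=True)
        self.conv2 = nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, padding=1)
        self.pool = nn.AvgPool2d(2)                                     # assumed stride-2 pooling
        self.conv3 = nn.Conv2d(bottleneck_ch, out_ch, 1)                # expand channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C_in, H, W)
        y = self.relu(self.conv1(x))
        b, c, h, w = y.shape
        tokens = y.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence for MHSA
        tokens, _ = self.attn(tokens, tokens, tokens)
        y = tokens.transpose(1, 2).reshape(b, c, h, w)
        y = self.relu(self.conv2(y))
        return self.relu(self.conv3(self.pool(y)))
```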

3.2 Multi-Head Self Attention (MHSA)

The Transformer architecture extensively uses the MHSA (Multi-Headed Self-Attention) module to effectively capture long-range dependencies. Multi-head attention (Fig. 2) is a key component within attention mechanisms that applies the attention process multiple times in parallel. The outputs from these individual attention heads are then concatenated and linearly transformed to achieve the required dimensionality. This approach allows the model to focus on different parts of the sequence simultaneously, helping to capture various aspects such as long-term dependencies. Consider an input \(\mathbf{I} \in \mathbb{R}^{(H \times B) \times C}\), where \(H \times B\) represents the spatial dimensions and C denotes the number of channels. For an input and output of the same dimension, the MHSA process can be mathematically expressed as,
\begin{equation} \mathbf{Att}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V} \end{equation}
(2)
\begin{equation} \mathbf {Q}=\mathbf {XW}^q,\mathbf {K}=\mathbf {XW}^k,\mathbf {V}=\mathbf {XW}^v \end{equation}
(3)
where \(\mathbf{W}^q \in \mathbb{R}^{C \times C}\), \(\mathbf{W}^k \in \mathbb{R}^{C \times C}\), and \(\mathbf{W}^v \in \mathbb{R}^{C \times C}\) are learnable parameter matrices, each of dimension C × C, representing linear transformations. A normalizing scaling factor \(\sqrt{d}\) is applied, and the softmax operation is performed row-wise on the resulting matrix. Furthermore, multiple attention heads can be introduced to expand the model's representational capacity. The attention matrix of the h-th head, \(\mathbf{Att}_h\), is calculated as,
\begin{equation} \mathbf{Att}_h=\mathrm{softmax}\left(\frac{\mathbf{Q}_h\mathbf{K}_h^{T}}{\sqrt{d}}\right)\mathbf{V}_h \end{equation}
(4)
\(\mathbf{Att}_h\), \(\mathbf{Q}_h\), and \(\mathbf{K}_h\) are the attention matrix, query, and key of the h-th head, respectively. We also split the value \(\mathbf{V}\) into H heads. The attention matrices are concatenated as,
\begin{equation} \mathbf {Att}=Concat(\mathbf {A}_1,\ldots,\mathbf {A}_h,\ldots,\mathbf {A}_H) \end{equation}
(5)
\begin{equation} \mathbf {V}=Concat(\mathbf {V}_1,\ldots,\mathbf {V}_h,\ldots,\mathbf {V}_H) \end{equation}
(6)
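For clarity, the computation in Eqs. (2)-(6) can be written as the following sketch, which assumes the feature map has already been flattened into a sequence of H×B tokens of width C and omits the position encodings that bottleneck transformers typically add.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Sketch of Eqs. (2)-(6): per-head scaled dot-product attention over a
    token sequence, with the heads concatenated back to width C."""
    def __init__(self, channels, heads=4):
        super().__init__()
        assert channels % heads == 0, "C must be divisible by the number of heads"
        self.heads = heads
        self.wq = nn.Linear(channels, channels, bias=False)   # W^q
        self.wk = nn.Linear(channels, channels, bias=False)   # W^k
        self.wv = nn.Linear(channels, channels, bias=False)   # W^v

    def forward(self, x):                       # x: (B, N, C) with N = H*B tokens
        b, n, c = x.shape
        d = c // self.heads
        # Eq. (3): linear projections, then split into H heads of width d
        q = self.wq(x).view(b, n, self.heads, d).transpose(1, 2)   # (B, heads, N, d)
        k = self.wk(x).view(b, n, self.heads, d).transpose(1, 2)
        v = self.wv(x).view(b, n, self.heads, d).transpose(1, 2)
        # Eq. (4): attention matrix of each head
        att = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = att @ v                                               # (B, heads, N, d)
        # Eqs. (5)-(6): concatenate the heads back to width C
        return out.transpose(1, 2).reshape(b, n, c)
```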
Figure 3: Hierarchical Feature Integration Block

3.3 Hierarchical Feature Integration Block

The Hierarchical Feature Integration Block introduces a multi-branch parallel processing architecture that analyzes information through various concurrent pathways [2]. This approach fosters feature diversity, allowing each channel to extract distinct attributes from the input data. Consequently, the network can recognize a broader range of patterns and subtleties, yielding a more nuanced and comprehensive representation of the input. At its core, this module constructs an information hierarchy via multiple paths of varying complexity. This structure is particularly beneficial for vascular segmentation tasks, enabling the capture of both fine-grained elements like edges and textures, as well as larger-scale features such as complex structures or objects. The design also functions as a regularization mechanism, promoting pathway diversity to mitigate overfitting. This enhances the model's resilience to training set bias, thereby improving its robustness and generalization capabilities. The module's architecture, as depicted in Fig. 3, employs atrous convolutions at different scales (1×1, 3×3, and 5×5) applied in parallel to the input data. Non-linearity is introduced through ReLU activation, followed by batch normalization to ensure training stability and acceleration. The outputs from these convolution blocks are then aggregated to produce intermediate results. In this work, we adjusted the network's architecture by altering the number of layers to better align with our experimental requirements. The Hierarchical Feature Integration Block can be represented by,
\begin{equation} out1 = in + BN(\sigma (\mathrm{f}_{h1}^{1\times 1}(in)))+BN(\sigma (\mathrm{f}_{h1}^{3\times 3}(in)))+BN(\sigma (\mathrm{f}_{h1}^{5\times 5}(in))) \end{equation}
(7)
\begin{equation} out2 = out1 + BN(\sigma (\mathrm{f}_{h2}^{1\times 1}(out1)))+BN(\sigma (\mathrm{f}_{h2}^{3\times 3}(out1)))+BN(\sigma (\mathrm{f}_{h2}^{5\times 5}(out1))) \end{equation}
(8)
where in represents the input feature map; \(\mathrm{f}_{h1}^{1\times 1}, \mathrm{f}_{h1}^{3\times 3}, \mathrm{f}_{h1}^{5\times 5}\) are convolution operations with kernel sizes 1×1, 3×3, and 5×5 in the first stage; out1 is the output of the first stage; \(\mathrm{f}_{h2}^{1\times 1}, \mathrm{f}_{h2}^{3\times 3}, \mathrm{f}_{h2}^{5\times 5}\) are the corresponding convolution operations in the second stage; and out2 is the final output feature map. This process iterates through successive stages, with each step's output serving as the input for the subsequent one. The final output undergoes dimensionality reduction via a bottleneck layer, a design choice that enhances parameter efficiency, enables increased network depth, and optimizes memory usage.
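A compact PyTorch sketch of this block is shown below. The atrous (dilation) rates, the number of stages, and the bottleneck reduction ratio are not fixed by Eqs. (7)-(8), so the values used here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def branch(ch, k):
    # One parallel path: conv -> ReLU -> BN, matching BN(sigma(f(.))) in Eqs. (7)-(8).
    # Dilation rates for the atrous convolutions are left at 1 in this sketch.
    return nn.Sequential(nn.Conv2d(ch, ch, k, padding=k // 2),
                         nn.ReLU(inplace=True),
                         nn.BatchNorm2d(ch))

class HFIB(nn.Module):
    """Sketch of the Hierarchical Feature Integration Block: at every stage the
    input is added to the sum of its 1x1, 3x3, and 5x5 branch responses, and a
    1x1 bottleneck reduces the channel count at the end."""
    def __init__(self, ch, stages=2, bottleneck_ratio=2):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleList([branch(ch, 1), branch(ch, 3), branch(ch, 5)])
            for _ in range(stages))
        self.bottleneck = nn.Conv2d(ch, ch // bottleneck_ratio, 1)   # final reduction

    def forward(self, x):
        out = x
        for branches in self.stages:       # Eq. (7) for stage 1, Eq. (8) for stage 2, ...
            out = out + sum(b(out) for b in branches)
        return self.bottleneck(out)
```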

3.4 Contextual Feature Enhancement Block

The Contextual Feature Enhancement Block (CFEB) (as in Fig. 4) is strategically integrated between the encoder and decoder to further refine feature representations by leveraging contextual information. It comprises three core components: Context Encoding, which utilizes a series of depth-wise convolutions to capture a range of visual contexts from local to global; Context Aggregation, where these encoded features are selectively combined based on learnable attention weights to prioritize significant context information for each query token; and Context Fusion, which merges the aggregated context features with the query token through an element-wise affine transformation facilitated by a learnable weight matrix. This approach enhances the model's ability to integrate and utilize contextual information, improving the overall quality of the feature representations. In Eq. (9), Query(X) is the query feature generated from the input X; Conv2d(·, W) applies an element-wise affine transformation using a learnable weight matrix W; \(Conv_{depth-wise}^i\) denotes the context features from the i-th depth-wise convolution layer; and \(Gate_i\) are point-wise convolutional layers that generate gating values. CFEB is represented as,
\begin{equation} O=Query(X)+Conv2d\left(Modulator\left(\sum _{i=1}^{N} Conv_{depth-wise}^{i}(X)\ast \sigma \left(Gate_i\left(Conv_{depth-wise}^{i}(X)\right)\right)\right),\,W\right) \end{equation}
(9)
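A minimal sketch of Eq. (9) follows. The number of context levels N and the depth-wise kernel size are assumptions, the Modulator is folded into the gated summation, and the depth-wise convolutions are stacked so that successive levels see progressively larger contexts, which is one reading of the "local to global" context encoding described above.

```python
import torch
import torch.nn as nn

class CFEB(nn.Module):
    """Sketch of the Contextual Feature Enhancement Block: context encoding with
    depth-wise convolutions, gated context aggregation, and fusion with the
    query features via a learnable 1x1 (element-wise affine) projection."""
    def __init__(self, ch, levels=3, kernel=3):
        super().__init__()
        self.query = nn.Conv2d(ch, ch, 1)                                  # Query(X)
        self.context = nn.ModuleList(                                      # Conv_depth-wise^i
            nn.Conv2d(ch, ch, kernel, padding=kernel // 2, groups=ch)
            for _ in range(levels))
        self.gates = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in range(levels))  # Gate_i
        self.fuse = nn.Conv2d(ch, ch, 1)                                   # projection with weight W

    def forward(self, x):
        q = self.query(x)
        agg, ctx = 0, x
        for dwconv, gate in zip(self.context, self.gates):
            ctx = dwconv(ctx)                                 # progressively larger context
            agg = agg + ctx * torch.sigmoid(gate(ctx))        # gated aggregation
        return q + self.fuse(agg)                             # context fusion with the query
```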
Figure 4: Contextual Feature Enhancement Block

4 Experimental Results And Discussions

To assess our proposed architecture, we have employed seven evaluation metrics to provide a thorough assessment of the segmentation performance improvements. The selected metrics include:
1. Area under the Receiver Operating Characteristic (ROC) curve
2. Area under the Precision-Recall curve
3. F1 score
4. Accuracy
5. Sensitivity
6. Specificity
7. Precision
This multi-faceted approach allows for a detailed and nuanced evaluation of the algorithm’s performance across various aspects of segmentation quality.

4.1 Datasets and Experimental settings

Our model's performance was assessed using three benchmark retinal fundus image datasets: DRIVE [21], CHASE_DB1 [23], and STARE [7]. The DRIVE dataset, originating from a Dutch diabetic retinopathy screening initiative, contains 40 color fundus images (565×584 pixels) and is pre-divided into equal training and testing sets of 20 images each. CHASE_DB1, stemming from the British Children's Hearing and Health Research Project, includes 28 fundus images (999×960 pixels); we allocated the first 21 images for training and the remaining 7 for testing. The STARE (Structured Analysis of the Retina) dataset consists of 20 fundus images (700×605 pixels). As it lacks a predefined split, we use the first 5 images for evaluation and the rest for training. For all datasets, expert-annotated manual segmentations serve as ground truth. Random patches are extracted for training, while overlapping 64×64 patches are extracted for testing. The overlapping patches are recombined into full images, and predictions are validated and filtered based on the field of view (FOV).
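The test-time tiling can be sketched as follows. Only the 64×64 patch size is fixed above; the stride (amount of overlap) and the averaging of overlapping predictions during recombination are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def extract_patches(img, patch=64, stride=32):
    """Slice one image tensor (C, H, W) into overlapping patch x patch tiles."""
    c, h, w = img.shape
    pad_h, pad_w = (-(h - patch)) % stride, (-(w - patch)) % stride
    img = F.pad(img, (0, pad_w, 0, pad_h))                     # pad so the patches tile fully
    p = img.unfold(1, patch, stride).unfold(2, patch, stride)  # (C, nH, nW, patch, patch)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

def recombine(patches, out_hw, patch=64, stride=32):
    """Average overlapping patch predictions back into a full-size (C, H, W) map."""
    h, w = out_hw
    pad_h, pad_w = (-(h - patch)) % stride, (-(w - patch)) % stride
    H, W = h + pad_h, w + pad_w
    c = patches.shape[1]
    cols = patches.reshape(-1, c * patch * patch).T.unsqueeze(0)      # (1, C*p*p, L)
    summed = F.fold(cols, (H, W), patch, stride=stride)               # sum of overlapping tiles
    counts = F.fold(torch.ones_like(cols), (H, W), patch, stride=stride)
    return (summed / counts)[0, :, :h, :w]                            # average, crop the padding
```

In use, the network's predicted patches would replace `patches` in `recombine`, and the recombined probability map is then masked with the FOV before thresholding.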
Figure 5: a-c ROC curves of DRIVE, CHASE_DB1 and STARE dataset.
Figure 6: Sample segmentation results for the DRIVE, CHASE_DB1, and STARE databases (in rows, respectively). Panels a-c show the original images, d-f display the predicted images, g-i present the predicted binary images, and j-l illustrate the ground truth images.
Figure 7: Segmentation results using different competing models on DRIVE, STARE and CHASE dataset.

4.2 Evaluation Metrics

To assess the quality of our segmentation results, we conducted a quantitative comparison against the corresponding ground-truth images. This evaluation process involved classifying pixels into four categories:
1. Correctly identified vessel pixels (true positives, TP)
2. Correctly identified background pixels (true negatives, TN)
3. Background pixels mistakenly labeled as vessels (false positives, FP)
4. Vessel pixels erroneously classified as background (false negatives, FN)
Using these pixel classifications, we computed several performance metrics:
1. Accuracy (Acc): measure of overall correct pixel classification
2. Sensitivity (Sen): ability to correctly identify vessel pixels
3. Specificity (Spe): ability to correctly identify background pixels
4. F1 score: harmonic mean of precision and recall
5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
These metrics provide a comprehensive view of the segmentation performance, allowing us to assess different aspects of the algorithm’s effectiveness.
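These quantities translate directly into code; the sketch below assumes a probability map thresholded at 0.5 and an optional FOV mask restricting the evaluation to retinal pixels, both of which are conventions rather than values stated above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vessel_metrics(pred_prob, gt, fov=None, thr=0.5):
    """Pixel-wise metrics from the TP/TN/FP/FN counts defined above."""
    pred_prob, gt = pred_prob.ravel(), gt.ravel().astype(bool)
    if fov is not None:                              # evaluate inside the field of view only
        keep = fov.ravel().astype(bool)
        pred_prob, gt = pred_prob[keep], gt[keep]
    pred = pred_prob >= thr
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    acc = (tp + tn) / (tp + tn + fp + fn)            # accuracy
    sen = tp / (tp + fn)                             # sensitivity (recall)
    spe = tn / (tn + fp)                             # specificity
    pre = tp / (tp + fp)                             # precision
    f1 = 2 * pre * sen / (pre + sen)                 # F1 score
    auc = roc_auc_score(gt, pred_prob)               # threshold-free AUC-ROC
    return dict(acc=acc, sen=sen, spe=spe, pre=pre, f1=f1, auc=auc)
```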

4.3 Performance comparison with other methods

We conducted a comprehensive evaluation of the Hierarchical Feature Integrated BoT-UNet with contextual feature enhancement architecture, comparing it to leading approaches in retinal vessel segmentation. To gauge performance, we examined various metrics presented in Table 1, which offers a detailed comparative analysis across the datasets. This comparison underscores the improved precision in retinal vessel segmentation achieved by each component of our proposed model.
Table 1: Comparative performance on retinal vessel segmentation on the DRIVE, CHASE_DB1, and STARE datasets. Each dataset column lists AUC-ROC / Acc / Sen / Spe / F1-score; "-" indicates a value not reported.

Method | DRIVE | CHASE_DB1 | STARE
U-Net 2015 [26] | 97.55 / 95.31 / 75.37 / 98.2 / 81.42 | 97.72 / 95.78 / 82.88 / 97.01 / 77.83 | 98.3 / 96.9 / 82.7 / 98.42 / 83.73
Cross modality 2016 [13] | 97.38 / 98.27 / 75.69 / 98.16 / - | 97.16 / 95.81 / 75.07 / 97.93 / - | 98.79 / 96.28 / 77.26 / 98.44 / -
R2 U-Net 2018 [3] | 97.84 / 95.56 / 77.99 / 98.13 / 81.71 | 99.14 / 96.34 / 77.56 / 98.62 / 79.28 | 99.14 / 97.12 / 82.98 / 98.62 / 84.75
Visual att 2018 [30] | 97.01 / 95.89 / 86.44 / 96.67 / 76.07 | 95.91 / 94.74 / 82.97 / 96.63 / 71.89 | 96.7 / 95.02 / 83.25 / 97.46 / 76.98
DU-Net 2018 [10] | 98.02 / 95.66 / 79.63 / - / 82.37 | 98.04 / 96.10 / 81.55 / - / 78.83 | 98.32 / 96.41 / 75.95 / - / 81.43
M2U-Net 2019 [11] | 97.14 / 96.3 / - / - / - | 96.66 / 97.03 / - / - / - | - / - / - / - / -
IterNet 2020 [12] | 98.16 / 95.73 / 77.35 / 98.38 / 82.05 | 98.51 / 96.55 / 79.7 / 98.23 / 80.73 | 98.81 / 97.01 / 77.15 / 98.86 / 81.46
Dense Unet 2020 [4] | 97.16 / 95.11 / 78.86 / 97.36 / - | - / - / - / - / - | 96.82 / 94.75 / 78.96 / 97.34 / -
CS2-Net 2020 [20] | 97.63 / 96.22 / 82.59 / 98.5 / - | 96.28 / 95.22 / 78.41 / 98.31 / - | 97.27 / 96.51 / 85.16 / 97.48 / -
GT-Unet 2021 [14] | 97.94 / 96.3 / 82.92 / 98.17 / 84.63 | 97.9 / 96.3 / 76.49 / 98.8 / 82 | 96.94 / 96 / 72.99 / 98.66 / 79
SCS-Net 2021 [35] | 98.37 / 96.97 / 82.89 / 98.38 / - | 98.67 / 97.62 / 83.65 / 98.39 / - | 98.44 / 97.36 / 82.07 / 98.39 / -
WWVB 2022 [31] | - / 96.1 / 81.25 / 97.63 / - | - / 95.78 / 80.12 / 97.3 / - | - / 95.86 / 80.78 / 97.21 / -
DE-DCGCN-EE 2022 [15] | 98.66 / 97.05 / 83.59 / 98.26 / 82.88 | 98.98 / 97.62 / 84.05 / 98.56 / 82.61 | 98.99 / 97.05 / 83.59 / 98.26 / 83.63
LSW-Net 2022 [17] | - / 98.65 / 78.76 / 98.37 / - | - / - / - / - / - | - / - / - / - / -
ResDO-Unet 2023 [1] | - / 95.61 / 79.85 / 97.91 / - | - / 96.72 / 80.2 / 97.94 / - | - / 95.67 / 79.63 / 97.92 / 81.72
U-Net improved 2023 [33] | - / 94.03 / 73.8 / 97.03 / - | - / 95.04 / 74.13 / 97.2 / - | - / - / - / - / -
UDKE 2023 [25] | - / 96.12 / 76.57 / - / 82.95 | - / 96.29 / 83.02 / - / 74.93 | - / 96.71 / 75.91 / - / 72.28
TP-Unet 2023 [18] | - / 95.71 / 81.84 / 97.73 / 82.91 | - / 96.64 / 82.42 / 98.05 / 81.62 | - / - / - / - / -
S-UNet 2023 [8] | 98.21 / 95.67 / 83.12 / 97.51 / 83.03 | 98.67 / 96.58 / 80.44 / 98.41 / 82.42 | - / - / - / - / -
HFI BoT-Unet 2024 (ours) | 98.05 / 96.2 / 88.7 / 96.88 / 84.88 | 97.88 / 96.4 / 76.08 / 98.82 / 81.82 | 96.87 / 95.76 / 68.24 / 98.91 / 76.8
While UNet-based segmentation has shown promising results, it is important to note its limitations. These include high computational demands, difficulty in explicitly modeling long-range dependencies, and the potential loss of fine-grained spatial information during the downsampling and decoding processes. The RU-Net and R2U-Net [3] frameworks are structured to maintain a parameter count equivalent to their U-Net [26] and ResU-Net counterparts. Despite this parity in network complexity, the RU-Net and R2U-Net models demonstrate superior results in segmentation tasks. It is noteworthy that while the incorporation of recurrent and residual functionalities does not expand the parameter set, these additions substantially influence both the training process and the model's performance during testing. The DU-Net architecture [10] introduces additional computational demands due to its inclusion of a convolution offset layer; this feature, while potentially beneficial for performance, comes at the expense of increased resource utilization during processing. Our proposed approach exhibited competitive performance across multiple datasets. On the DRIVE dataset, it achieved the top sensitivity (88.7) and F1-score (84.88) along with a very high AUC-ROC (98.05). On the CHASE_DB1 dataset, our method reached the highest specificity (98.82) while also maintaining high accuracy and F1-score. For the STARE dataset, our model demonstrated superior specificity (98.91) and competitive AUC-ROC and accuracy scores. The outstanding sensitivity on DRIVE indicates that our method identifies vessel (positive) pixels more reliably than competing techniques, while the high specificity indicates a low rate of false positives. These combined outcomes highlight our approach's proficiency in accurately distinguishing both vessel and non-vessel pixels in retinal images. Fig. 6 presents sample outputs from our proposed approach on the DRIVE, CHASE_DB1, and STARE datasets. Fig. 7 provides a qualitative comparison between our approach and the other methods on the DRIVE, CHASE_DB1, and STARE datasets, respectively. Different network architectures have varying strengths and weaknesses in retinal vessel segmentation: U-Net variants tend to overestimate vessel areas due to their difficulty in precisely defining vessel edges, and R2-UNet improves upon U-Net's segmentation accuracy but at the cost of slower processing. BoT-UNet, which serves as the foundation of our model, provides more precise segmentation results.
To provide a thorough evaluation of our model’s capabilities, we generated Receiver Operating Characteristic (ROC) curves for each dataset using our network model. These curves, depicted in Fig. 5, demonstrate both high True Positive Rates and low False Positive Rates.

4.4 Ablation study

To validate the effectiveness of the proposed network in vessel segmentation, ablation experiments were conducted on the DRIVE dataset. The assessment of network prediction results is based on five performance metrics: Accuracy (ACC), Sensitivity (SE), Specificity (SP), Area Under the Curve (AUC-ROC), and F1-score. Table 2 presents the segmentation performance of the different configurations, highlighting the gain in retinal vessel segmentation accuracy contributed by each module proposed in the model.
Table 2: Ablation studies on the DRIVE dataset.

Method | AUC-ROC | Acc | Sen | Spe | F1-score
BoT-Unet | 97.94 | 96.3 | 82.92 | 98.17 | 84.63
BoT-Unet + CFEB | 97.9 | 96.31 | 78.37 | 98.81 | 83.86
BoT-Unet + HFIB | 97.89 | 96.21 | 81.83 | 98.22 | 84.13
BoT-Unet + CFEB + HFIB | 97.99 | 95.87 | 86.77 | 97.53 | 84.06
BoT-Unet + CFEB + HFIB + Bi-skip | 98.05 | 96.2 | 88.7 | 96.88 | 84.88
This study involves the systematic addition or removal of various components to assess their impact on performance metrics. The model configurations are as follows:
BoT-Unet: the baseline model without any additional modifications.
BoT-Unet + CFEB: the baseline architecture combined with the Contextual Feature Enhancement Block (CFEB).
BoT-Unet + HFIB: the baseline architecture combined with the Hierarchical Feature Integration Block (HFIB).
BoT-Unet + CFEB + HFIB: both the Contextual Feature Enhancement Block and the Hierarchical Feature Integration Block incorporated into the baseline.
BoT-Unet + CFEB + HFIB + Bidirectional fusion: in addition to the CFEB and HFIB, the bidirectional (backward) fusion path is introduced.

5 Conclusion

In this work, we presented a U-shaped network featuring a convolutional branching mechanism inspired by the transformer architecture. HFIB enhances computational efficiency by using an appropriate number of filters, while CFEB gathers varying levels of context information for each image region and uses a gate to select the most relevant information. Backward fusion allows the local features obtained through the CNN to merge seamlessly with the global information from the transformer, capturing both global and local context and improving the extraction of fine-grained vessel features. Our method demonstrated competitive segmentation performance on the DRIVE, CHASE_DB1, and STARE datasets, excelling at identifying thin blood vessels, which is crucial for clinical diagnosis.

References

[1]
2023. ResDO-UNet: A deep residual network for accurate retinal vessel segmentation from fundus images. Biomedical Signal Processing and Control 79 (2023), 104087.
[2]
Mufassir M Abbasi, Shahzaib Iqbal, Asim Naveed, Tariq M Khan, Syed S Naqvi, and Wajeeha Khalid. 2023. LMBiS-Net: A Lightweight Multipath Bidirectional Skip Connection based CNN for Retinal Blood Vessel Segmentation. arXiv preprint arXiv:2309.04968 (2023).
[3]
Md Zahangir Alom, Mahmudul Hasan, Chris Yakopcic, Tarek M Taha, and Vijayan K Asari. 2018. Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. arXiv preprint arXiv:1802.06955 (2018).
[4]
Sijing Cai, Yunxian Tian, Harvey Lui, Haishan Zeng, Yi Wu, and Guannan Chen. 2020. Dense-UNet: a novel multiphoton in vivo cellular image segmentation model based on a convolutional neural network. Quantitative imaging in medicine and surgery 10, 6 (2020), 1275.
[5]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[6]
Ran Gu, Guotai Wang, Tao Song, Rui Huang, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren, and Shaoting Zhang. 2020. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE transactions on medical imaging 40, 2 (2020), 699–711.
[7]
A.D. Hoover, V. Kouznetsova, and M. Goldbaum. 2000. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging 19, 3 (2000), 203–210.
[8]
Jingfei Hu, Hua Wang, Shengbo Gao, Mingkun Bao, Tao Liu, Yaxing Wang, and Jicong Zhang. 2019. S-UNet: A Bridge-Style U-Net Framework With a Saliency Mechanism for Retinal Vessel Segmentation. IEEE Access 7 (2019), 174167–174177.
[9]
GR Jainish, G Wiselin Jiji, and P Alwin Infant. 2020. A novel automatic retinal vessel extraction using maximum entropy based EM algorithm. Multimedia Tools and Applications 79, 31 (2020), 22337–22353.
[10]
Qiangguo Jin, Zhaopeng Meng, Tuan D Pham, Qi Chen, Leyi Wei, and Ran Su. 2019. DUNet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems 178 (2019), 149–162.
[11]
Zhaokai Kong, Mengyi Zhang, Wenjun Zhu, Yang Yi, Tian Wang, and Baochang Zhang. 2023. Data enhancement based on M2-Unet for liver segmentation in Computed Tomography. Biomedical Signal Processing and Control 79 (2023), 104032.
[12]
Liangzhi Li, Manisha Verma, Yuta Nakashima, Hajime Nagahara, and Ryo Kawasaki. 2020. Iternet: Retinal image segmentation utilizing structural redundancy in vessel networks. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 3656–3665.
[13]
Qiaoliang Li, Bowei Feng, LinPei Xie, Ping Liang, Huisheng Zhang, and Tianfu Wang. 2015. A cross-modality learning approach for vessel segmentation in retinal images. IEEE transactions on medical imaging 35, 1 (2015), 109–118.
[14]
Yunxiang Li, Shuai Wang, Jun Wang, Guodong Zeng, Wenjun Liu, Qianni Zhang, Qun Jin, and Yaqi Wang. 2021. Gt u-net: A u-net like group transformer network for tooth root segmentation. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12. Springer, 386–395.
[15]
Yang Li, Yue Zhang, Weigang Cui, Baiying Lei, Xihe Kuang, and Teng Zhang. 2022. Dual Encoder-Based Dynamic-Channel Graph Convolutional Network With Edge Enhancement for Retinal Vessel Segmentation. IEEE Transactions on Medical Imaging 41, 8 (2022), 1975–1989.
[16]
Li Liu, Mingzhu Wang, Shuwang Zhou, Minglei Shu, Laurent D Cohen, and Da Chen. 2023. Curvilinear structure tracking based on dynamic curvature-penalized geodesics. Pattern Recognition 134 (2023), 109079.
[17]
Ruihua Liu, Haoyu Nan, Yangyang Zou, Ting Xie, and Zhiyong Ye. 2022. LSW-Net: A Learning Scattering Wavelet Network for Brain Tumor and Retinal Image Segmentation. Electronics 11, 16 (2022).
[18]
Ruihua Liu, Wei Pu, Haoyu Nan, and Yangyang Zou. 2023. Retina image segmentation using the three-path Unet model. Scientific Reports 13, 1 (2023), 22579.
[19]
Da Ma, Donghuan Lu, Shuo Chen, Morgan Heisler, Setareh Dabiri, Sieun Lee, Hyunwoo Lee, Gavin Weiguang Ding, Marinko V Sarunic, and Mirza Faisal Beg. 2021. Lf-unet–a novel anatomical-aware dual-branch cascaded deep neural network for segmentation of retinal layers and fluid from optical coherence tomography images. Computerized Medical Imaging and Graphics 94 (2021), 101988.
[20]
Lei Mou, Yitian Zhao, Huazhu Fu, Yonghuai Liu, Jun Cheng, Yalin Zheng, Pan Su, Jianlong Yang, Li Chen, Alejandro F Frangi, et al. 2021. CS2-Net: Deep learning segmentation of curvilinear structures in medical imaging. Medical image analysis 67 (2021), 101874.
[21]
Meindert Niemeijer, Joes Staal, Bram van Ginneken, Marco Loog, and Michael D. Abramoff. 2004. Comparative study of retinal vessel segmentation methods on a new publicly available database. In Medical Imaging 2004: Image Processing(Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 5370), J. Michael Fitzpatrick and Milan Sonka (Eds.). 648–656.
[22]
Toshiyuki Oshitari. 2022. Diabetic retinopathy: Neurovascular disease requiring neuroprotective and regenerative therapies. Neural regeneration research 17, 4 (2022), 795–796.
[23]
Christopher G. Owen, Alicja Regina Rudnicka, Robert Mullen, Sarah Ann Barman, Dorothy Ndedi Monekosso, Peter H. Whincup, Jeffrey Ng, and Carl Paterson. 2009. Measuring retinal vessel tortuosity in 10-year-old children: validation of the Computer-Assisted Image Analysis of the Retina (CAIAR) program. Investigative ophthalmology & visual science 50 5 (2009), 2004–10. https://api.semanticscholar.org/CorpusID:2545500
[24]
Yuanyuan Peng, Pengpeng Luan, Hongbin Tu, Xiong Li, and Ping Zhou. 2023. Pulmonary fissure segmentation in CT images based on ODoS filter and shape features. Multimedia Tools and Applications 82, 22 (2023), 34959–34980.
[25]
K Radha, Karuna Yepuganti, Saladi Saritha, Chinmayee Kamireddy, and Durga Prasad Bavirisetti. 2023. Unfolded deep kernel estimation-attention UNet-based retinal image segmentation. Scientific Reports 13, 1 (2023), 20712.
[26]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234–241.
[27]
Rukhmini Roy, Suparna Mazumdar, and Ananda S Chowdhury. 2020. MDL-IWS: multi-view deep learning with iterative watershed for pulmonary fissure segmentation. In 2020 42nd Annual international conference of the IEEE engineering in medicine & biology society (EMBC). IEEE, 1282–1285.
[28]
Sohini Roychowdhury, Dara D Koozekanani, and Keshab K Parhi. 2014. Blood vessel segmentation of fundus images by major vessel extraction and subimage classification. IEEE journal of biomedical and health informatics 19, 3 (2014), 1118–1128.
[29]
Syed Ayaz Ali Shah, Aamir Shahzad, Muhammad Amir Khan, Cheng-Kai Lu, and Tong Boon Tang. 2019. Unsupervised method for retinal vessel segmentation based on gabor wavelet and multiscale line detector. IEEE Access 7 (2019), 167221–167228.
[30]
Chetan L Srinidhi, P Aparna, and Jeny Rajan. 2018. A visual attention guided unsupervised feature learning for robust vessel delineation in retinal images. Biomedical Signal Processing and Control 44 (2018), 110–126.
[31]
Tariq M. Khan, Mohammad A.U. Khan, Naveed Ur Rehman, Khuram Naveed, Imran Uddin Afridi, Syed Saud Naqvi, and Imran Razzak. 2022. Width-wise vessel bifurcation for improved retinal vessel segmentation. Biomedical Signal Processing and Control 71 (2022), 103169.
[32]
Nikhil Kumar Tomar, Debesh Jha, Michael A Riegler, Håvard D Johansen, Dag Johansen, Jens Rittscher, Pål Halvorsen, and Sharib Ali. 2022. Fanet: A feedback attention network for improved biomedical image segmentation. IEEE Transactions on Neural Networks and Learning Systems 34, 11 (2022), 9375–9388.
[33]
Ning Wang, Kefeng Li, Guangyuan Zhang, Zhenfang Zhu, and Peng Wang. 2023. Improvement of Retinal Vessel Segmentation Method Based on U-Net. Electronics 12, 2 (2023).
[34]
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV). 3–19.
[35]
Huisi Wu, Wei Wang, Jiafu Zhong, Baiying Lei, Zhenkun Wen, and Jing Qin. 2021. SCS-Net: A Scale and Context Sensitive Network for Retinal Vessel Segmentation. Medical Image Analysis 70 (2021), 102025.
[36]
Changyan Xiao, Berend C Stoel, M Els Bakker, Yuanyuan Peng, Jan Stolk, and Marius Staring. 2016. Pulmonary fissure detection in CT images using a derivative of stick filter. IEEE transactions on medical imaging 35, 6 (2016), 1488–1500.
[37]
Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2019. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE transactions on medical imaging 39, 6 (2019), 1856–1867.

Keywords: Bottleneck Transformer; Retinal vessel segmentation; Hierarchical Feature Integration; Contextual Feature Enhancement

Published in: ICVGIP '24: Proceedings of the Fifteenth Indian Conference on Computer Vision Graphics and Image Processing, December 2024. Association for Computing Machinery, New York, NY, United States. ISBN 9798400710759. DOI: 10.1145/3702250.