Abstract
Convolutional neural networks (CNNs) are widely used in medical image segmentation, but they are limited in capturing long-range semantic interactions. Transformers, conversely, handle long-range dependencies well but struggle to preserve local semantic details. To address both limitations, we propose STA-Former, a hybrid CNN-Transformer model for medical image segmentation. Our approach rests on three design principles: (1) We propose the Shrinkage Triplet Attention (STA) module to enhance feature fusion within the decoder. It models spatial and channel interactions in the feature map, computes thresholds across dimensions, and suppresses irrelevant information through soft-thresholding. (2) We present a redesigned hierarchical hybrid CNN-Transformer encoder that connects CNN and Transformer blocks at multiple scales, capturing both long-range and short-range dependencies across feature maps of different scales. (3) Unlike traditional decoders that apply the attention mechanism exclusively to low-level features, our approach uses a multiscale attention hierarchical decoder that leverages feature map correlations at different scales for effective feature fusion. Our method outperforms state-of-the-art methods on three datasets: Synapse multiorgan CT, ACDC cardiac MRI scans, and a breast ultrasound image dataset.
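The soft-thresholding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the STA module itself: the abstract does not say how the thresholds are computed, so the rule below (a fixed fraction `alpha` of the mean absolute activation, in the spirit of residual shrinkage networks) and the names `soft_threshold` and `shrink_channel` are assumptions made for illustration only.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; any x with |x| <= tau becomes 0."""
    magnitude = abs(x) - tau
    if magnitude <= 0.0:
        return 0.0
    return magnitude if x > 0 else -magnitude


def shrink_channel(values, alpha=0.5):
    """Soft-threshold one flattened channel of a feature map.

    Assumption (not stated in the abstract): the threshold is alpha times
    the mean absolute activation. Here alpha is a fixed constant; in a
    trained network it would typically be predicted by an attention branch.
    """
    tau = alpha * sum(abs(v) for v in values) / len(values)
    return [soft_threshold(v, tau) for v in values]


# Small activations are zeroed, large ones are kept but reduced in magnitude.
channel = [0.1, -0.2, 3.0, -4.0, 0.05]
print(shrink_channel(channel))
```

Unlike a hard cut-off, soft-thresholding is continuous in its input, which keeps the operation well behaved under gradient-based training.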






Funding
Not applicable.
Author information
Contributions
Yuzhao Liu wrote the main manuscript text. Liming Han, Bin Yao, and Qing Li provided important suggestions for the manuscript. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
Not applicable.
Ethical Approval
Not applicable.
About this article
Cite this article
Liu, Y., Han, L., Yao, B. et al. STA-Former: enhancing medical image segmentation with Shrinkage Triplet Attention in a hybrid CNN-Transformer model. SIViP 18, 1901–1910 (2024). https://doi.org/10.1007/s11760-023-02893-5
DOI: https://doi.org/10.1007/s11760-023-02893-5