Abstract
Medical image segmentation plays a crucial role in medical artificial intelligence. Recent advances in computer vision have introduced multiscale ViT (Vision Transformer) architectures, which show robustness and strong feature extraction capabilities. However, because ViT processes image patches independently, it often pays insufficient attention to fine details. In medical image segmentation tasks such as organ and tumor segmentation, precise boundary delineation is of utmost importance. To address this challenge, this study proposes two novel CNN-Transformer feature fusion modules: SFM (Shallow Fusion Module) and DFM (Deep Fusion Module). These modules integrate high-level and low-level semantic information from the feature pyramid while maintaining network efficiency. To speed up network convergence, deep supervision is used during training. Extensive ablation experiments and comparative studies are conducted on two well-known public datasets, Synapse and ACDC, to evaluate the proposed approach. The experimental results demonstrate the efficacy of the proposed modules and training method and show that our architecture outperforms previous methods. The code and trained models will be available soon.
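The deep supervision idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss function, the number of supervised stages, or the stage weights, so everything below is an assumption made for illustration: a binary cross-entropy per stage, predictions already upsampled to the label resolution and flattened to lists, and hypothetical names `binary_cross_entropy` and `deep_supervision_loss`. It is not the authors' implementation.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def deep_supervision_loss(stage_probs, targets, weights):
    """Weighted sum of per-stage losses.

    Assumption: each entry of stage_probs is one decoder stage's prediction,
    already upsampled to the label resolution and flattened.
    """
    if len(stage_probs) != len(weights):
        raise ValueError("need exactly one weight per supervised stage")
    return sum(
        w * binary_cross_entropy(p, targets)
        for p, w in zip(stage_probs, weights)
    )


# Illustrative values only: a final output plus one auxiliary stage.
labels = [1, 0, 1, 0]
final_output = [0.9, 0.1, 0.8, 0.2]
auxiliary_output = [0.7, 0.3, 0.6, 0.4]
loss = deep_supervision_loss([final_output, auxiliary_output], labels, [1.0, 0.5])
print(loss)
```

Supervising intermediate stages gives the shallower layers a direct gradient signal, which is the usual motivation for using the technique to speed up convergence; the auxiliary outputs are typically discarded at inference time.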
Notes
1. This work is supported by the Natural Science Foundation of China (No. 62072388), the industry guidance project foundation of science technology bureau of Fujian province in 2020 (No. 2020H0047), and Fujian Sunshine Charity Foundation.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhou, R., Yao, J., Hong, Q., Li, X., Cao, X. (2024). Cross Attention Multi Scale CNN-Transformer Hybrid Encoder Is General Medical Image Learner. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14437. Springer, Singapore. https://doi.org/10.1007/978-981-99-8558-6_8
DOI: https://doi.org/10.1007/978-981-99-8558-6_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8557-9
Online ISBN: 978-981-99-8558-6
eBook Packages: Computer Science, Computer Science (R0)