Cross Attention Multi Scale CNN-Transformer Hybrid Encoder Is General Medical Image Learner

Zhou, Rongzhou; Yao, Junfeng; Hong, Qingqi; Li, Xingxin; Cao, Xianpeng

doi:10.1007/978-981-99-8558-6_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14437)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Medical image segmentation plays a crucial role in medical artificial intelligence. Recent advances in computer vision have introduced the multiscale ViT (Vision Transformer), which has shown robustness and strong feature extraction capabilities. However, because ViT processes data patches independently, it often pays insufficient attention to fine details. In medical image segmentation tasks such as organ and tumor segmentation, precise boundary delineation is of utmost importance. To address this challenge, this study proposes two novel CNN-Transformer feature fusion modules: SFM (Shallow Fusion Module) and DFM (Deep Fusion Module). These modules integrate high-level and low-level semantic information from the feature pyramid while maintaining network efficiency. To speed up network convergence, deep supervision is introduced during the training phase. Extensive ablation experiments and comparative studies are conducted on two well-known public datasets, Synapse and ACDC, to evaluate the proposed approach. The experimental results demonstrate the efficacy of the proposed modules and training method, and show that our architecture outperforms previous methods. The code and trained models will be available soon.
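The abstract does not say how deep supervision is implemented, so the following is only a generic sketch of the technique, not the authors' training code. It assumes a decoder that emits segmentation logits at several resolutions, a pixel-wise cross-entropy loss, nearest-neighbour downsampling of the label map, and hand-picked per-scale weights. The function names and weights are hypothetical.

```python
import numpy as np


def cross_entropy(logits, target):
    """Mean pixel-wise cross-entropy.

    logits: float array of shape (C, H, W); target: int array of shape (H, W).
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(target.shape)
    # Pick the log-probability of the true class at every pixel.
    return float(-log_probs[target, rows, cols].mean())


def downsample_labels(target, factor):
    """Nearest-neighbour downsampling of an integer label map by striding."""
    return target[::factor, ::factor]


def deep_supervision_loss(multi_scale_logits, target, weights):
    """Weighted sum of per-scale losses (assumed form of deep supervision).

    Each auxiliary output is compared against the label map resized to its
    own resolution; the weights are an assumption, not taken from the paper.
    """
    total = 0.0
    for logits, weight in zip(multi_scale_logits, weights):
        factor = target.shape[0] // logits.shape[1]
        total += weight * cross_entropy(logits, downsample_labels(target, factor))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 4, size=(8, 8))
    # Three hypothetical decoder outputs at full, 1/2 and 1/4 resolution.
    outputs = [rng.normal(size=(4, 8 // f, 8 // f)) for f in (1, 2, 4)]
    print(deep_supervision_loss(outputs, labels, weights=[1.0, 0.5, 0.25]))
```

Because the auxiliary terms attach a loss directly to intermediate decoder stages, gradients reach those stages without passing through the full decoder, which is the usual argument for faster convergence. At inference time only the full-resolution output would normally be used.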

Notes

1. This work is supported by the Natural Science Foundation of China (No. 62072388), the industry guidance project foundation of science technology bureau of Fujian province in 2020 (No. 2020H0047), and Fujian Sunshine Charity Foundation.

Author information

Authors and Affiliations

Center for Digital Media Computing, School of Film, School of Informatics, Xiamen University, Xiamen, China
Rongzhou Zhou, Junfeng Yao, Xingxin Li & Xianpeng Cao
Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen, China
Junfeng Yao
Institute of Artificial Intelligence, Xiamen University, Xiamen, 361005, China
Junfeng Yao
Xiamen University, Xiamen, 361005, China
Qingqi Hong

Corresponding authors

Correspondence to Junfeng Yao or Qingqi Hong.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Zhou, R., Yao, J., Hong, Q., Li, X., Cao, X. (2024). Cross Attention Multi Scale CNN-Transformer Hybrid Encoder Is General Medical Image Learner. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14437. Springer, Singapore. https://doi.org/10.1007/978-981-99-8558-6_8

DOI: https://doi.org/10.1007/978-981-99-8558-6_8
Published: 26 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8557-9
Online ISBN: 978-981-99-8558-6
eBook Packages: Computer Science, Computer Science (R0)
