FlashViT: A Flash Vision Transformer with Large-Scale Token Merging for Congenital Heart Disease Detection

Jiang, Lei; Cheng, Junlong; Chen, Jilong; Gu, Mingyang; Zhu, Min; Han, Peilun; Li, Kang; Yang, Zhigang

doi:10.1007/978-981-99-8558-6_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14437)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Congenital heart disease (CHD) is the most common congenital malformation, and imaging examination is an important means of diagnosing it. Deep learning-based methods have achieved remarkable results in various types of imaging examinations, but their large parameter counts and low throughput limit their clinical application. In this paper, we design an efficient, lightweight hybrid model named FlashViT to assist cardiovascular radiologists in the early screening and diagnosis of CHD. Specifically, we propose the Large-scale Token Merging Module (LTM), which merges similar tokens more aggressively without sacrificing accuracy and thereby alleviates the high computational complexity and resource consumption of the self-attention mechanism. In addition, we propose an unsupervised homogenous pre-training strategy to tackle the issues of insufficient medical image data and poor generalization ability. Compared with conventional pre-training strategies that use ImageNet1K, our strategy utilizes less than 1% of the class-agnostic medical images from ImageNet1K, resulting in faster convergence and advanced model performance. We conduct extensive validation on the collected CHD dataset, and the results indicate that our proposed FlashViT-S achieves an accuracy of 92.2% and a throughput of 3753 fps with about 3.8 million parameters. We hope that this work can provide some assistance in designing laboratory models for future application in clinical practice.
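The abstract does not describe how LTM works internally, so the following is only a minimal sketch of the general idea of merging similar tokens, not the authors' module. It assumes a simple bipartite scheme: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second set by cosine similarity, and the `r` most similar pairs are averaged. The names `cosine`, `merge_similar_tokens`, and `r` are hypothetical and chosen for illustration.

```python
import math
from typing import List

Token = List[float]


def cosine(u: Token, v: Token) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def merge_similar_tokens(tokens: List[Token], r: int) -> List[Token]:
    """Reduce len(tokens) by up to r by averaging the most similar pairs.

    Illustrative assumption: alternating split into sets A and B.
    The original token order is not preserved in the output.
    """
    a = tokens[0::2]  # set A: tokens at even positions
    b = tokens[1::2]  # set B: tokens at odd positions
    r = max(0, min(r, len(a)))
    if r == 0 or not b:
        return [list(t) for t in tokens]

    # For every token in A, find its most similar partner in B.
    best = []
    for i, u in enumerate(a):
        sims = [cosine(u, v) for v in b]
        j = max(range(len(b)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # Only the r most similar (A, B) pairs are merged.
    best.sort(key=lambda pair: pair[0], reverse=True)
    merged = best[:r]
    merged_a = {i for _, i, _ in merged}

    # Accumulate merged A tokens into their B partners, then average.
    sums = [list(v) for v in b]
    counts = [1] * len(b)
    for _, i, j in merged:
        sums[j] = [s + x for s, x in zip(sums[j], a[i])]
        counts[j] += 1

    kept = [list(u) for i, u in enumerate(a) if i not in merged_a]
    averaged = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return kept + averaged
```

Because self-attention cost grows quadratically with the number of tokens, removing `r` tokens before each attention layer is one way such a scheme could reduce computation; how aggressively LTM merges, and where it sits in FlashViT, is specified in the paper itself rather than here.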

Author information

Authors and Affiliations

College of Computer Science, Sichuan University, Chengdu, China
Lei Jiang, Junlong Cheng, Jilong Chen, Mingyang Gu & Min Zhu
Department of Radiology and West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
Peilun Han, Kang Li & Zhigang Yang

Corresponding author

Correspondence to Min Zhu.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

Cite this paper

Jiang, L. et al. (2024). FlashViT: A Flash Vision Transformer with Large-Scale Token Merging for Congenital Heart Disease Detection. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14437. Springer, Singapore. https://doi.org/10.1007/978-981-99-8558-6_12

DOI: https://doi.org/10.1007/978-981-99-8558-6_12
Published: 26 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8557-9
Online ISBN: 978-981-99-8558-6
eBook Packages: Computer Science, Computer Science (R0)
