A multimodal transformer to fuse images and metadata for skin disease classification

Cai, Gan; Zhu, Yu; Wu, Yue; Jiang, Xiaoben; Ye, Jiongyao; Yang, Dawei

doi:10.1007/s00371-022-02492-4

Original article
Published: 05 May 2022

Volume 39, pages 2781–2793, (2023)

The Visual Computer

Gan Cai¹,
Yu Zhu ORCID: orcid.org/0000-0003-1535-6520¹,
Yue Wu¹,
Xiaoben Jiang¹,
Jiongyao Ye¹ &
…
Dawei Yang²,³

Abstract

Skin diseases are rising in prevalence, and their diagnosis remains a challenging clinical task. Deep learning could help to meet this challenge. In this study, a novel neural network is proposed for the classification of skin diseases. Because the datasets used in this research consist of skin disease images and clinical metadata, we propose a novel multimodal Transformer, which consists of two encoders, one for images and one for metadata, and one decoder that fuses the multimodal information. In the proposed network, a suitable Vision Transformer (ViT) model is used as the backbone to extract deep image features. The metadata are regarded as labels, and a new Soft Label Encoder (SLE) is designed to embed them. Furthermore, in the decoder, a novel Mutual Attention (MA) block is proposed to better fuse image features and metadata features. To evaluate the model’s effectiveness, extensive experiments were conducted on a private skin disease dataset and the benchmark dataset ISIC 2018. Compared with state-of-the-art methods, the proposed model shows better performance and represents an advancement in skin disease diagnosis.
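The abstract does not give the internals of the SLE or the MA block, so the following is only a minimal sketch of the general idea of fusing two token sets with attention in both directions. Everything in it is an assumption made for illustration: the function names (`cross_attention`, `mutual_attention`), the toy embedding table standing in for a metadata encoder, the random array standing in for ViT patch features, and the mean-pool-and-concatenate fusion step. It should not be read as the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows.

    No learned projections are used here; this is a simplification.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (n_queries, n_context)
    return weights @ context, weights

def mutual_attention(img_tokens, meta_tokens):
    """Hypothetical two-way fusion: images attend to metadata and vice versa."""
    img_attended, _ = cross_attention(img_tokens, meta_tokens)
    meta_attended, _ = cross_attention(meta_tokens, img_tokens)
    # Assumed fusion step: mean-pool each attended stream, then concatenate.
    return np.concatenate([img_attended.mean(axis=0), meta_attended.mean(axis=0)])

rng = np.random.default_rng(0)
dim = 8

# Stand-in for image features from a ViT backbone: 16 patch tokens.
img_tokens = rng.normal(size=(16, dim))

# Toy metadata embedding (assumed): 3 categorical fields, 5 possible codes,
# one-hot encoded and projected through a random embedding table.
meta_onehot = np.eye(5)[[0, 3, 4]]        # (3, 5)
embed_table = rng.normal(size=(5, dim))   # (5, dim)
meta_tokens = meta_onehot @ embed_table   # (3, dim)

fused = mutual_attention(img_tokens, meta_tokens)
print(fused.shape)  # (16,)
```

In a trained model the queries, keys, and values would pass through learned projections and the fused vector would feed a classification head; both are omitted here to keep the sketch short.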

Acknowledgements

This research is supported in part by the Science and Technology Commission of Shanghai Municipality (20DZ2254400, 21DZ2200600), the National Scientific Foundation of China (82170110), the Zhongshan Hospital Clinical Research Foundation (2019ZSGG15), and the Shanghai Pujiang Program (20PJ1402400).

Author information

Authors and Affiliations

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
Gan Cai, Yu Zhu, Yue Wu, Xiaoben Jiang & Jiongyao Ye
Department of Pulmonary and Critical Care Medicine, Zhongshan Hospital, Fudan University, Shanghai, 200032, China
Dawei Yang
Shanghai Engineering Research Center of Internet of Things for Respiratory Medicine, Shanghai, 200032, China
Dawei Yang

Corresponding authors

Correspondence to Yu Zhu or Dawei Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cai, G., Zhu, Y., Wu, Y. et al. A multimodal transformer to fuse images and metadata for skin disease classification. Vis Comput 39, 2781–2793 (2023). https://doi.org/10.1007/s00371-022-02492-4

Accepted: 04 April 2022
Published: 05 May 2022
Issue Date: July 2023
DOI: https://doi.org/10.1007/s00371-022-02492-4
