A multimodal transformer to fuse images and metadata for skin disease classification

  • Original article
  • The Visual Computer

Abstract

Skin diseases are becoming more prevalent, and diagnosing them remains a challenging clinical task. Deep learning could help meet this challenge. In this study, we propose a novel neural network for skin disease classification. Because the datasets used in this research consist of skin disease images and clinical metadata, the network is a multimodal Transformer with two encoders, one for images and one for metadata, and one decoder that fuses the multimodal information. A suitable Vision Transformer (ViT) model serves as the backbone for extracting deep image features. The metadata are treated as labels, and a new Soft Label Encoder (SLE) is designed to embed them. In the decoder, a novel Mutual Attention (MA) block is proposed to better fuse image features and metadata features. To evaluate the model's effectiveness, extensive experiments were conducted on a private skin disease dataset and on the benchmark dataset ISIC 2018. Compared with state-of-the-art methods, the proposed model performs better and represents an advance in skin disease diagnosis.
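
The abstract names the components (an image encoder, a metadata encoder, and a fusing decoder) but does not give their internals. The toy NumPy sketch below only illustrates the general data flow, under loud assumptions: random vectors stand in for ViT patch features, a per-field embedding table stands in for the metadata encoder, and the fusion step is approximated by two plain scaled dot-product cross-attentions, one in each direction. All names (`cross_attention`, `embed_metadata`, `fuse`), the field sizes, and the embedding width are hypothetical and are not taken from the paper; this is not the authors' SLE or MA block.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # shared embedding width (assumed for illustration)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention: each query row attends over all context rows.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

def embed_metadata(field_vectors, tables):
    # Hypothetical stand-in for a metadata encoder: each field is a
    # (possibly soft) label vector, embedded as a weighted mix of table rows.
    return np.stack([v @ t for v, t in zip(field_vectors, tables)])

def fuse(image_tokens, meta_tokens, classifier):
    # Attend in both directions, pool each side, concatenate, classify.
    img_ctx = cross_attention(image_tokens, meta_tokens)   # (n_patches, DIM)
    meta_ctx = cross_attention(meta_tokens, image_tokens)  # (n_fields, DIM)
    fused = np.concatenate([img_ctx.mean(axis=0), meta_ctx.mean(axis=0)])
    return softmax(fused @ classifier)

# Toy inputs: 9 random "patch" features and two categorical metadata fields.
image_tokens = rng.normal(size=(9, DIM))
field_sizes = [2, 5]  # hypothetical fields with 2 and 5 categories
tables = [rng.normal(size=(n, DIM)) for n in field_sizes]
field_vectors = [np.eye(n)[i] for n, i in zip(field_sizes, [1, 3])]
meta_tokens = embed_metadata(field_vectors, tables)        # (2, DIM)

# 7 output classes, matching the ISIC 2018 lesion diagnosis task.
classifier = rng.normal(size=(2 * DIM, 7))
probs = fuse(image_tokens, meta_tokens, classifier)
print(probs.shape, probs.sum())
```

Because `embed_metadata` multiplies a label vector by an embedding table, a hard one-hot label selects a single row, while a soft label (a probability vector over categories) yields a weighted blend of rows. That is one plausible reading of "soft label" embedding, offered here as an assumption rather than as the paper's definition.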

Acknowledgements

This research is supported in part by the Science and Technology Commission of Shanghai Municipality (20DZ2254400, 21DZ2200600), the National Scientific Foundation of China (82170110), the Zhongshan Hospital Clinical Research Foundation (2019ZSGG15), and the Shanghai Pujiang Program (20PJ1402400).

Author information

Corresponding authors

Correspondence to Yu Zhu or Dawei Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cai, G., Zhu, Y., Wu, Y. et al. A multimodal transformer to fuse images and metadata for skin disease classification. Vis Comput 39, 2781–2793 (2023). https://doi.org/10.1007/s00371-022-02492-4

  • DOI: https://doi.org/10.1007/s00371-022-02492-4
