Abstract
Metaphors are ubiquitous in natural language, and metaphor detection, a key prerequisite for metaphor understanding, supports natural language processing tasks such as sentiment analysis, sarcasm interpretation, and text comprehension. Existing metaphor detection methods rely mainly on text, identifying metaphorical language through linguistic analysis alone. As a result, they overemphasize textual content, overlook visual metaphors, and lack effective mechanisms for integrating multimodal metaphor features. This paper proposes a visually enhanced metaphor detection model based on multimodal split fusion. Specifically, we first process image information with a multidimensional attention enhancement module, which applies channel and spatial attention in sequence to sharpen the recognition and processing of key visual features, improving the model's performance on visual tasks. To enable two-way interaction between multimodal metaphor features, we design a multimodal split-fusion module, which divides each modality's features into blocks of equal size and then aggregates and weights these blocks, strengthening the model's metaphor detection ability. Extensive experiments on the public multimodal metaphor dataset MET-Meme and the multimodal Sarcasm dataset verify the effectiveness of our model.
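For readers who want a concrete picture of the two modules described above, the sketch below shows one plausible PyTorch reading of them: a CBAM-style block that applies channel attention followed by spatial attention, and a split-fusion step that cuts each modality's feature vector into equal-size blocks before aggregating and weighting them. All class names, dimensions, and architectural details here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sequential channel-then-spatial attention (CBAM-style); a hypothetical
    stand-in for the paper's multidimensional attention enhancement module."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-excite channels (SENet-style)
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over pooled per-pixel channel statistics
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x = x * self.channel_mlp(x)                      # reweight channels
        avg_map = x.mean(dim=1, keepdim=True)            # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)            # (B, 1, H, W)
        return x * self.spatial_conv(torch.cat([avg_map, max_map], dim=1))

class SplitFusion(nn.Module):
    """Divide each modality's feature vector into equal-size blocks, then
    aggregate and weight the blocks; a hypothetical reading of the paper's
    multimodal split-fusion module."""
    def __init__(self, dim: int, num_blocks: int = 8):
        super().__init__()
        assert dim % num_blocks == 0, "feature dim must split into equal blocks"
        self.num_blocks = num_blocks
        self.block_dim = dim // num_blocks
        # One scalar score per paired (text-block, image-block) after concatenation
        self.score = nn.Linear(2 * self.block_dim, 1)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        B = text.size(0)
        t = text.view(B, self.num_blocks, self.block_dim)   # text blocks
        v = image.view(B, self.num_blocks, self.block_dim)  # image blocks
        pairs = torch.cat([t, v], dim=-1)                   # (B, K, 2*block_dim)
        weights = torch.softmax(self.score(pairs), dim=1)   # block weights
        return (weights * pairs).sum(dim=1)                 # weighted aggregation

# Hypothetical usage: fuse 768-d text and image embeddings from any encoders.
# text_feat, img_feat = text_encoder(x_t), image_encoder(x_v)  # both (B, 768)
# fused = SplitFusion(dim=768)(text_feat, img_feat)            # (B, 192) with K=8

The key property of the split step is that attention weights are learned per block pair rather than per whole modality, so the fusion can emphasize the image blocks for visually grounded metaphors and the text blocks elsewhere.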
Data Availability Statement
No datasets were generated or analyzed during the current study.
Acknowledgements
This research was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2023D01C176) and the Xinjiang Uygur Autonomous Region Universities Fundamental Research Funds Scientific Research Project (XJEDU2022P018). We sincerely thank these foundations for their support.
Author information
Contributions
M.H. was primarily responsible for conceptualization, methodology design, data visualization, drafting the original manuscript, and subsequent review and editing. Y.Q.M. handled data curation, provided supervision, and contributed to review and editing. Y.Y.B. oversaw supervision and participated in review and editing. G.S.S. conducted investigation and validation. W.Q.X. provided supervision. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Q., Meng, H., Yan, Y. et al. SFVE: visual information enhancement metaphor detection with multimodal splitting fusion. J Supercomput 81, 467 (2025). https://doi.org/10.1007/s11227-025-06958-9