Abstract
With the rapid rise of short video content, multimodal sentiment analysis (MSA) has attracted extensive attention. Most previous MSA studies have focused on manually transcribed benchmark datasets, which are costly to produce and limited in availability. In real-world applications, MSA often relies on Automatic Speech Recognition (ASR). However, the noise in ASR-generated transcripts causes a substantial performance decline in traditional MSA models. To address this problem, this paper proposes a Multi-level Sentiment-aware Clustering Denoising Model (MSCDM), which enhances robustness by introducing sentiment distance constraints both within and across modalities. Specifically, the model first compensates for the sentiment semantic information lost from the text modality due to ASR errors by letting samples with the same sentiment polarity guide one another. It then refines cross-modal sentiment representations by partitioning multimodal samples into positive and negative examples. We conduct extensive experiments on the real-world datasets MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek; the results demonstrate the model's effectiveness, with MSCDM outperforming current state-of-the-art models on all three datasets. An in-depth analysis further confirms the effectiveness of the proposed multi-level clustering denoising strategy for MSA.
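The abstract describes the two constraint levels only at a high level. As a rough, self-contained sketch (not the authors' released implementation), the following PyTorch-style code illustrates how an intra-modality clustering constraint and an inter-modality positive/negative constraint of this kind could be expressed as auxiliary losses; the function names, the binary polarity encoding, and the centroid and InfoNCE-style formulations are illustrative assumptions rather than details taken from the paper.

    # Illustrative sketch only, under assumptions stated above.
    import torch
    import torch.nn.functional as F

    def intra_modality_cluster_loss(text_feat: torch.Tensor, polarity: torch.Tensor) -> torch.Tensor:
        """Pull each text feature toward the centroid of samples sharing its sentiment polarity."""
        loss = text_feat.new_zeros(())
        groups = 0
        for p in polarity.unique():
            mask = polarity == p
            if mask.sum() < 2:
                continue
            centroid = text_feat[mask].mean(dim=0, keepdim=True)
            loss = loss + F.mse_loss(text_feat[mask], centroid.expand_as(text_feat[mask]))
            groups += 1
        return loss / max(groups, 1)

    def inter_modality_contrastive_loss(text_feat, audio_feat, polarity, temperature=0.1):
        """Cross-modal InfoNCE-style loss: same-polarity (text, audio) pairs act as positives."""
        t = F.normalize(text_feat, dim=-1)
        a = F.normalize(audio_feat, dim=-1)
        sim = t @ a.T / temperature                                 # (B, B) cross-modal similarities
        pos_mask = polarity.unsqueeze(0) == polarity.unsqueeze(1)   # same polarity -> positive pair
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        # Average log-probability over each anchor's positive set.
        return -(log_prob * pos_mask).sum(dim=1).div(pos_mask.sum(dim=1).clamp(min=1)).mean()

    if __name__ == "__main__":
        B, D = 8, 64
        text, audio = torch.randn(B, D), torch.randn(B, D)
        polarity = torch.randint(0, 2, (B,))                        # 0 = negative, 1 = positive
        total = intra_modality_cluster_loss(text, polarity) + \
            inter_modality_contrastive_loss(text, audio, polarity)
        print(float(total))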
Data availability statement
The datasets used in this article are available at https://github.com/albertwy/SWRM.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62366025, U21B2027, 62266028), the Natural Science Foundation project of the Yunnan Science and Technology Department (202301AT070444), and the Yunnan Key Research Projects (202203AA080004, 202303AP140008, 202302AD080003).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zixu Hu. The first draft of the manuscript was written by Zixu Hu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Z., Yu, Z. & Guo, J. Multi-level sentiment-aware clustering for denoising in multimodal sentiment analysis with ASR errors. Multimedia Systems 31, 116 (2025). https://doi.org/10.1007/s00530-025-01697-6