Multi-level sentiment-aware clustering for denoising in multimodal sentiment analysis with ASR errors

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

With the rapid popularity of short video content, multimodal sentiment analysis (MSA) has attracted extensive attention. Most previous MSA studies have focused on manually transcribed benchmark datasets, which are both costly to produce and limited in availability. In real-world applications, MSA often relies on Automatic Speech Recognition (ASR) technology. However, the noise in ASR-generated transcripts causes a substantial decline in the performance of traditional MSA models. To address this problem, this paper proposes a Multi-level Sentiment-aware Clustering Denoising Model (MSCDM), which enhances robustness by introducing sentiment distance constraints both within and across modalities. Specifically, the model first compensates for the sentiment semantic information lost from the text modality due to ASR errors by letting samples with the same sentiment polarity guide one another. Subsequently, it refines cross-modal sentiment representations by dividing multimodal samples into positive and negative examples. We conduct extensive experiments on the real-world datasets MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek, and the results demonstrate the model's effectiveness: it outperforms current state-of-the-art models on all three datasets. An in-depth analysis further confirms the effectiveness of the proposed multi-level clustering denoising strategy for MSA.
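As a rough illustration of the inter-modal constraint described above (grouping samples into positive and negative examples by sentiment polarity and constraining their distances), a minimal sketch of a polarity-aware, supervised-contrastive-style loss in PyTorch is shown below. The function name, tensor shapes, and exact loss formulation are assumptions for illustration only and are not taken from the paper's MSCDM implementation; the sketch simply pulls together fused embeddings of samples sharing a polarity and pushes apart those with opposite polarity.

# Minimal sketch (assumption, not the paper's implementation): a polarity-aware
# contrastive objective over fused multimodal representations.
import torch
import torch.nn.functional as F


def polarity_contrastive_loss(features: torch.Tensor,
                              polarity: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """features: (batch, dim) fused multimodal embeddings.
    polarity: (batch,) sentiment polarity labels, e.g. 0 = negative, 1 = positive.
    Returns a supervised-contrastive-style loss over the batch."""
    z = F.normalize(features, dim=1)                   # unit-length embeddings
    sim = z @ z.t() / temperature                      # pairwise cosine similarities
    batch = z.size(0)

    # Exclude each sample's similarity with itself.
    self_mask = torch.eye(batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Positive pairs: other samples that share the same sentiment polarity.
    pos_mask = (polarity.unsqueeze(0) == polarity.unsqueeze(1)) & ~self_mask

    # Log-probability of each pair under a softmax over the row.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average the log-probabilities of the positive pairs for each anchor.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)       # avoid division by zero
    loss = -torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(dim=1) / pos_count
    return loss.mean()


if __name__ == "__main__":
    feats = torch.randn(8, 64)                         # dummy fused multimodal embeddings
    labels = torch.randint(0, 2, (8,))                 # dummy polarity labels
    print(polarity_contrastive_loss(feats, labels).item())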


Data availability statement

The datasets used in this article are available at: https://github.com/albertwy/SWRM.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (62366025, U21B2027, 62266028), the Natural Science Foundation Project of the Yunnan Science and Technology Department (202301AT070444), and the Yunnan Key Research Projects (202203AA080004, 202303AP140008, 202302AD080003).

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zixu Hu. The first draft of the manuscript was written by Zixu Hu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhengtao Yu.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, Z., Yu, Z. & Guo, J. Multi-level sentiment-aware clustering for denoising in multimodal sentiment analysis with ASR errors. Multimedia Systems 31, 116 (2025). https://doi.org/10.1007/s00530-025-01697-6

  • DOI: https://doi.org/10.1007/s00530-025-01697-6

Keywords