Abstract
With the rapid rise of short video content, multimodal sentiment analysis (MSA) has attracted extensive attention. Most previous MSA studies have focused on manually transcribed benchmark datasets, which are costly to produce and limited in availability. In real-world applications, MSA often relies on Automatic Speech Recognition (ASR). However, the noise in ASR-generated transcripts causes a substantial performance decline in traditional MSA models. To address this problem, this paper proposes a Multi-level Sentiment-aware Clustering Denoising Model (MSCDM), which enhances robustness by introducing sentiment distance constraints both within and across modalities. Specifically, the model first compensates for the sentiment semantic information lost from the text modality due to ASR errors by letting samples with the same sentiment polarity guide one another. It then refines cross-modal sentiment representations by partitioning multimodal samples into positive and negative examples. We conduct extensive experiments on the real-world datasets MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek; the results demonstrate the model's effectiveness, with MSCDM outperforming current state-of-the-art models on all three datasets. An in-depth analysis further confirms the effectiveness of the proposed multi-level clustering denoising strategy for MSA.
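The abstract describes the two constraint levels only at a high level. As a rough, self-contained sketch (not the authors' released implementation), the following PyTorch-style code illustrates how an intra-modality clustering constraint and an inter-modality positive/negative constraint of this kind could be expressed as auxiliary losses; the function names, the binary polarity encoding, and the centroid and InfoNCE-style formulations are illustrative assumptions rather than details taken from the paper.

    # Illustrative sketch only, under assumptions stated above.
    import torch
    import torch.nn.functional as F

    def intra_modality_cluster_loss(text_feat: torch.Tensor, polarity: torch.Tensor) -> torch.Tensor:
        """Pull each text feature toward the centroid of samples sharing its sentiment polarity."""
        loss = text_feat.new_zeros(())
        groups = 0
        for p in polarity.unique():
            mask = polarity == p
            if mask.sum() < 2:
                continue
            centroid = text_feat[mask].mean(dim=0, keepdim=True)
            loss = loss + F.mse_loss(text_feat[mask], centroid.expand_as(text_feat[mask]))
            groups += 1
        return loss / max(groups, 1)

    def inter_modality_contrastive_loss(text_feat, audio_feat, polarity, temperature=0.1):
        """Cross-modal InfoNCE-style loss: same-polarity (text, audio) pairs act as positives."""
        t = F.normalize(text_feat, dim=-1)
        a = F.normalize(audio_feat, dim=-1)
        sim = t @ a.T / temperature                                 # (B, B) cross-modal similarities
        pos_mask = polarity.unsqueeze(0) == polarity.unsqueeze(1)   # same polarity -> positive pair
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        # Average log-probability over each anchor's positive set.
        return -(log_prob * pos_mask).sum(dim=1).div(pos_mask.sum(dim=1).clamp(min=1)).mean()

    if __name__ == "__main__":
        B, D = 8, 64
        text, audio = torch.randn(B, D), torch.randn(B, D)
        polarity = torch.randint(0, 2, (B,))                        # 0 = negative, 1 = positive
        total = intra_modality_cluster_loss(text, polarity) + \
            inter_modality_contrastive_loss(text, audio, polarity)
        print(float(total))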
Data availability statement
The datasets used in this article are available at https://github.com/albertwy/SWRM.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62366025, U21B2027, 62266028), the Natural Science Foundation project of the Yunnan Science and Technology Department (202301AT070444), and the Yunnan Key Research Projects (202203AA080004, 202303AP140008, 202302AD080003).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Zixu Hu. The first draft of the manuscript was written by Zixu Hu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Z., Yu, Z. & Guo, J. Multi-level sentiment-aware clustering for denoising in multimodal sentiment analysis with ASR errors. Multimedia Systems 31, 116 (2025). https://doi.org/10.1007/s00530-025-01697-6