Hierarchical Supervised Contrastive Learning for Multimodal Sentiment Analysis

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)

Abstract

Multimodal sentiment analysis (MSA) is dedicated to deciphering human emotions in videos. It is a challenging task due to the semantic disparities among the various modalities (e.g., linguistic, visual, and acoustic) present in video content. To bridge these gaps, we leverage contrastive learning and introduce a novel hierarchical multimodal approach termed Hierarchical Supervised Contrastive Learning (HSCL). Initially, we apply a unimodal fusion combined with a supervised contrastive learning strategy to distill pertinent content from each modality. Subsequently, we combine the modalities into bimodal pairs and further align them through supervised contrastive learning. These paired representations aid in understanding the intricate nuances of multidimensional human emotions. In our method, supervised contrastive learning is tailored to accentuate the significance of label information, facilitating the extraction of sentiment cues from diverse sources. Experimental studies on two benchmark datasets demonstrate the effectiveness of our method. Specifically, compared with the baseline on the CMU-MOSI benchmark, our method achieves a 2.59% accuracy improvement for emotion recognition. The code is released at https://github.com/Turdidae810/HSCL.
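
The core building block described above is a supervised contrastive objective in which representations sharing a sentiment label are pulled together and all others pushed apart, applied first to the unimodal representations and then to the bimodal pairs. The snippet below is a minimal PyTorch sketch of such a loss, following the general formulation of supervised contrastive learning (Khosla et al., 2020); the function name, the use of discretized sentiment labels, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss in the style of Khosla et al. (2020).

    features: (batch, dim) embeddings from one hierarchy level, e.g. a
              unimodal or bimodal representation per sample (assumption).
    labels:   (batch,) discrete sentiment labels; samples sharing a label
              serve as positives for one another.
    """
    z = F.normalize(features, dim=1)                 # unit-norm embeddings
    sim = torch.matmul(z, z.T) / temperature         # pairwise similarities

    batch = z.size(0)
    self_mask = torch.eye(batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs

    # Positives: samples with the same label, excluding the anchor itself.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Log-probability of each candidate against all other samples in the batch.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                           # anchors with at least one positive
    mean_log_prob_pos = (
        log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
    )
    return -mean_log_prob_pos.mean()
```

Under the hierarchical scheme sketched in the abstract, such a loss would plausibly be evaluated once on the unimodal embeddings and again on each bimodal pair, with the resulting terms added to the main sentiment-prediction loss; the exact weighting and pairing strategy follow the released code rather than this sketch.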

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62202439).

Author information

Corresponding author

Correspondence to Shuo Wang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, K., Wang, S., Hao, Y. (2024). Hierarchical Supervised Contrastive Learning for Multimodal Sentiment Analysis. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_5

  • DOI: https://doi.org/10.1007/978-3-031-53308-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53307-5

  • Online ISBN: 978-3-031-53308-2

  • eBook Packages: Computer Science, Computer Science (R0)
