ABSTRACT
Since talking occupies a large proportion of human life, a deeper understanding of human conversations is necessary. Speaking style recognition aims to identify the style of a conversation, providing a fine-grained description of talking. Existing works rely only on visual cues to recognize speaking styles and therefore cannot accurately distinguish styles that are visually similar. To recognize speaking styles more effectively, we propose a novel multimodal sentiment-fused method, MMSF, which extracts and integrates the visual, audio, and textual features of videos. In addition, since sentiment is one of the motivations of human behavior, we are the first to introduce sentiment into such a multimodal method, using a cross-attention mechanism that enhances the video features for speaking style recognition. The proposed MMSF is evaluated on a long-form video understanding benchmark, and the experimental results show that it is superior to the state of the art.
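To make the fusion step concrete, the following is a minimal PyTorch sketch of how a cross-attention layer could let fused video tokens attend to sentiment features, under assumptions of our own: the module name, feature dimensions, residual design, and token layout are illustrative, not the authors' implementation.

```python
# Minimal sketch of sentiment-fused cross-attention (assumed design,
# not the paper's actual code). All dimensions are illustrative.
import torch
import torch.nn as nn

class SentimentFusedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: video tokens act as queries over sentiment features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats, sentiment_feats):
        # video_feats:     (B, T, dim) fused visual/audio/text tokens
        # sentiment_feats: (B, S, dim) sentiment embedding sequence
        attended, _ = self.cross_attn(
            query=video_feats, key=sentiment_feats, value=sentiment_feats)
        # Residual connection: sentiment enhances, rather than replaces,
        # the multimodal video representation.
        return self.norm(video_feats + attended)

# Usage with placeholder features (hypothetical shapes)
B, T, S, dim = 2, 16, 4, 512
video = torch.randn(B, T, dim)      # e.g., concatenated multimodal tokens
sentiment = torch.randn(B, S, dim)  # e.g., from a pretrained sentiment model
fused = SentimentFusedCrossAttention(dim)(video, sentiment)
print(fused.shape)  # torch.Size([2, 16, 512])
```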