ABSTRACT
Since talking occupies a large proportion of human life, a deeper understanding of human conversations is necessary. Speaking style recognition aims to identify the style of a conversation, providing a fine-grained description of talking. Existing works rely only on visual cues to recognize speaking styles and therefore cannot accurately distinguish styles that are visually similar. To recognize speaking styles more effectively, we propose a novel multimodal sentiment-fused method, MMSF, which extracts and integrates the visual, audio, and textual features of videos. In addition, since sentiment is one of the motivations of human behavior, we are the first to introduce sentiment into such a multimodal method, using a cross-attention mechanism that enhances the video features for speaking style recognition. The proposed MMSF is evaluated on a long-form video understanding benchmark, and the experimental results show that it is superior to the state of the art.
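To make the fusion step concrete, the following is a minimal PyTorch sketch of how a cross-attention layer could let fused video tokens attend to sentiment features, under assumptions of our own: the module name, feature dimensions, residual design, and token layout are illustrative, not the authors' implementation.

```python
# Minimal sketch of sentiment-fused cross-attention (assumed design,
# not the paper's actual code). All dimensions are illustrative.
import torch
import torch.nn as nn

class SentimentFusedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: video tokens act as queries over sentiment features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats, sentiment_feats):
        # video_feats:     (B, T, dim) fused visual/audio/text tokens
        # sentiment_feats: (B, S, dim) sentiment embedding sequence
        attended, _ = self.cross_attn(
            query=video_feats, key=sentiment_feats, value=sentiment_feats)
        # Residual connection: sentiment enhances, rather than replaces,
        # the multimodal video representation.
        return self.norm(video_feats + attended)

# Usage with placeholder features (hypothetical shapes)
B, T, S, dim = 2, 16, 4, 512
video = torch.randn(B, T, dim)      # e.g., concatenated multimodal tokens
sentiment = torch.randn(B, S, dim)  # e.g., from a pretrained sentiment model
fused = SentimentFusedCrossAttention(dim)(video, sentiment)
print(fused.shape)  # torch.Size([2, 16, 512])
```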