DOI: 10.1145/3591106.3592219

MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style

Published: 12 June 2023

ABSTRACT

As talking occupies a large proportion of human life, a deeper understanding of human conversations is needed. Speaking style recognition aims to recognize the style of a conversation, providing a fine-grained description of talking. Existing works rely only on visual cues to recognize speaking styles, and thus cannot accurately distinguish speaking styles that are visually similar. To recognize speaking styles more effectively, we propose a novel multimodal sentiment-fused method, MMSF, which extracts and integrates visual, audio and textual features of videos. In addition, as sentiment is one of the motivations of human behavior, we are the first to introduce sentiment into such a multimodal method, using a cross-attention mechanism to enhance the video features for speaking style recognition. The proposed MMSF is evaluated on the long-form video understanding benchmark, and the experimental results show that it outperforms the state of the art.
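The abstract describes enhancing a fused video representation (visual, audio, textual) with sentiment features via cross-attention. The PyTorch module below is a minimal illustrative sketch of how such sentiment-guided cross-attention fusion could be wired up; it is not the authors' released code, and the feature dimension, token counts, and number of speaking-style classes are assumptions made for the example.

```python
# Hypothetical sketch of sentiment-guided cross-attention fusion (not the
# authors' implementation). Dimensions and class count are assumptions.
import torch
import torch.nn as nn


class SentimentCrossAttentionFusion(nn.Module):
    """Video tokens (fused visual/audio/text features) attend to sentiment
    tokens; the attended output enhances the video feature before a
    speaking-style classifier."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_styles: int = 5):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_styles)  # speaking-style classes (assumed)

    def forward(self, video_tokens: torch.Tensor, sentiment_tokens: torch.Tensor):
        # video_tokens:     (batch, num_video_tokens, dim)
        # sentiment_tokens: (batch, num_sentiment_tokens, dim)
        attended, _ = self.cross_attn(query=video_tokens,
                                      key=sentiment_tokens,
                                      value=sentiment_tokens)
        # Residual connection: sentiment only modulates the video feature.
        enhanced = self.norm(video_tokens + attended)
        # Pool over tokens and classify the speaking style.
        return self.classifier(enhanced.mean(dim=1))


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real encoder outputs.
    model = SentimentCrossAttentionFusion(dim=256, num_heads=4, num_styles=5)
    video = torch.randn(2, 16, 256)     # 16 video tokens per clip
    sentiment = torch.randn(2, 4, 256)  # 4 sentiment tokens per clip
    print(model(video, sentiment).shape)  # torch.Size([2, 5])
```

In this sketch the residual connection keeps the original video information intact, so the sentiment stream acts only as an enhancement signal, consistent with the abstract's description of sentiment enhancing the video feature.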


Published in
ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023, 694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106

            Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

