DOI: 10.1145/3503161.3548018

Representation Learning through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

Published: 10 October 2022

Abstract

Although temporal patterns inherent in visual and audio signals are crucial for affective video content analysis, they have not yet been thoroughly explored. In this paper, we propose a novel Temporal-Aware Multimodal (TAM) method to fully capture this temporal information. Specifically, we design a cross-temporal multimodal fusion module that applies attention-based fusion to the different modalities both within and across video segments, so that the temporal relations between modalities are fully captured. Furthermore, a single emotion label per video provides little supervision for learning the representation of each segment, which makes temporal pattern mining difficult. We therefore leverage time-synchronized comments (TSCs) as auxiliary supervision, since these comments are easily accessible and contain rich emotional cues. Two TSC-based self-supervised tasks are designed: the first predicts the emotional words in a TSC from the video representation and the TSC's contextual semantics, and the second predicts the segment in which the TSC appears by computing the correlation between the video representation and the TSC embedding. These self-supervised tasks are used to pre-train the cross-temporal multimodal fusion module on a large-scale video-TSC dataset crawled from the web without labeling cost. The pre-training prompts the fusion module to learn representations of the segments that contain TSCs, and thus to capture more temporal affective patterns. Experimental results on three benchmark datasets show that the proposed fusion module achieves state-of-the-art results in affective video content analysis. Ablation studies verify that, after TSC-based pre-training, the fusion module learns the affective patterns of more segments and achieves better performance.
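
The two core ideas of the abstract, attention-based fusion within and across segments and TSC-based self-supervision, can be pictured with a short sketch. The following is a minimal illustration in PyTorch, assuming a standard Transformer encoder over concatenated per-segment visual and audio tokens and a dot-product correlation for the segment-prediction task; the module names, dimensions, token layout, and loss formulation are assumptions made for clarity, not the authors' released implementation.

# Illustrative sketch (PyTorch) of attention-based cross-temporal multimodal fusion
# and the TSC segment-prediction pre-training task, reconstructed from the abstract.
# NOT the authors' implementation: layer sizes, the concatenated token layout, and
# the dot-product correlation loss are assumptions made for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossTemporalFusion(nn.Module):
    """Attention over visual and audio tokens within and across video segments."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, num_segments, d_model) per-segment features.
        # Concatenating the two modality sequences lets self-attention relate any
        # modality token of any segment to any other one.
        tokens = torch.cat([visual, audio], dim=1)      # (B, 2T, d)
        fused = self.encoder(tokens)                    # cross-temporal, cross-modal attention
        t = visual.size(1)
        # Merge each segment's two modality tokens back into one segment representation.
        return 0.5 * (fused[:, :t] + fused[:, t:])      # (B, T, d)


def tsc_segment_prediction_loss(segment_repr, tsc_embed, target_segment):
    # Self-supervised task: predict the segment in which a time-sync comment appears
    # from the correlation between segment representations and the TSC embedding.
    # segment_repr: (B, T, d); tsc_embed: (B, d); target_segment: (B,) segment indices.
    scores = torch.einsum('btd,bd->bt', segment_repr, tsc_embed)
    return F.cross_entropy(scores, target_segment)

The emotional-word prediction task mentioned in the abstract would add a second head that scores candidate emotional words from the fused video representation and the TSC context; it is omitted here for brevity.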

Supplementary Material

MP4 File (MM22-fp1134.mp4)
Presentation video

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN: 9781450392037
    DOI: 10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. affective computing
    2. multimodal fusion
    3. video content analysis
    4. vision and language

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2025) Enhancing video rumor detection through multimodal deep feature fusion with time-sync comments. Information Processing and Management 62(1). DOI: 10.1016/j.ipm.2024.103935. Online publication date: 1-Jan-2025.
    • (2024) Temporal Enhancement for Video Affective Content Analysis. Proceedings of the 32nd ACM International Conference on Multimedia, 642-650. DOI: 10.1145/3664647.3681631. Online publication date: 28-Oct-2024.
    • (2024) GRACE: GRadient-based Active Learning with Curriculum Enhancement for Multimodal Sentiment Analysis. Proceedings of the 32nd ACM International Conference on Multimedia, 5702-5711. DOI: 10.1145/3664647.3681617. Online publication date: 28-Oct-2024.
    • (2024) Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding. Proceedings of the 32nd ACM International Conference on Multimedia, 7316-7325. DOI: 10.1145/3664647.3680810. Online publication date: 28-Oct-2024.
    • (2024) Heterogeneous Dual-Attentional Network for WiFi and Video-Fused Multi-Modal Crowd Counting. IEEE Transactions on Mobile Computing 23(12), 14233-14247. DOI: 10.1109/TMC.2024.3444469. Online publication date: Dec-2024.
    • (2024) Hierarchical Multi-Modal Attention Network for Time-Sync Comment Video Recommendation. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2694-2705. DOI: 10.1109/TCSVT.2023.3309768. Online publication date: Apr-2024.
    • (2024) VAD: A Video Affective Dataset With Danmu. IEEE Transactions on Affective Computing 15(4), 1889-1905. DOI: 10.1109/TAFFC.2024.3382503. Online publication date: 28-Mar-2024.
    • (2024) MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12830-12840. DOI: 10.1109/CVPR52733.2024.01219. Online publication date: 16-Jun-2024.
    • (2023) Sentiment Analysis on Online Videos by Time-Sync Comments. Entropy 25(7), 1016. DOI: 10.3390/e25071016. Online publication date: 2-Jul-2023.
    • (2023) AffectFAL: Federated Active Affective Computing with Non-IID Data. Proceedings of the 31st ACM International Conference on Multimedia, 871-882. DOI: 10.1145/3581783.3612442. Online publication date: 26-Oct-2023.
