ABSTRACT
Although temporal patterns inherent in visual and audio signals are crucial for affective video content analysis, they have not yet been thoroughly explored. In this paper, we propose a novel Temporal-Aware Multimodal (TAM) method to fully capture this temporal information. Specifically, we design a cross-temporal multimodal fusion module that applies attention-based fusion to different modalities both within and across video segments, so that it fully captures the temporal relations between modalities. Furthermore, a single video-level emotion label provides insufficient supervision for learning the representation of each segment, which makes temporal pattern mining difficult. We therefore leverage time-synchronized comments (TSCs) as auxiliary supervision, since these comments are easily accessible and contain rich emotional cues. Two TSC-based self-supervised tasks are designed: the first predicts the emotional words in a TSC from the video representation and the TSC's contextual semantics, and the second predicts the segment in which the TSC appears by computing the correlation between the video representation and the TSC embedding. These self-supervised tasks are used to pre-train the cross-temporal multimodal fusion module on a large-scale video-TSC dataset crawled from the web without labeling cost. This pre-training prompts the fusion module to learn representations of the segments that contain TSCs, thereby capturing more temporal affective patterns. Experimental results on three benchmark datasets show that the proposed fusion module achieves state-of-the-art results in affective video content analysis. Ablation studies verify that, after TSC-based pre-training, the fusion module learns the affective patterns of more segments and achieves better performance.
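To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: an attention-based fusion over per-segment visual and audio tokens (so attention operates within and across segments), and the TSC segment-prediction objective computed as a correlation between segment representations and a TSC embedding. All module names, dimensions, and the exact token layout are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossTemporalFusion(nn.Module):
    """Attention-based fusion of visual and audio tokens within and across segments."""

    def __init__(self, dim=512, heads=8, layers=2, max_segments=64):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Learned embeddings marking modality and segment position (assumption).
        self.modality_emb = nn.Embedding(2, dim)            # 0 = visual, 1 = audio
        self.segment_emb = nn.Embedding(max_segments, dim)

    def forward(self, visual, audio):
        # visual, audio: (batch, num_segments, dim) per-segment features.
        b, s, d = visual.shape
        seg_idx = torch.arange(s, device=visual.device)
        v = visual + self.modality_emb.weight[0] + self.segment_emb(seg_idx)
        a = audio + self.modality_emb.weight[1] + self.segment_emb(seg_idx)
        tokens = torch.cat([v, a], dim=1)        # (batch, 2 * num_segments, dim)
        fused = self.encoder(tokens)             # self-attention within and across segments
        # Merge the two modality tokens of each segment into one representation.
        return (fused[:, :s] + fused[:, s:]) / 2  # (batch, num_segments, dim)


def tsc_segment_prediction_loss(segment_repr, tsc_emb, true_segment):
    """Self-supervised task: predict which segment a TSC appears in by
    correlating segment representations with the TSC embedding.
    segment_repr: (batch, num_segments, dim); tsc_emb: (batch, dim);
    true_segment: (batch,) segment indices."""
    logits = torch.einsum('bsd,bd->bs', segment_repr, tsc_emb)
    return F.cross_entropy(logits, true_segment)
```

The emotional-word prediction task described in the abstract would add an analogous classification head over an emotion lexicon, driven by the fused segment representation and the TSC context; it is omitted here for brevity.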