DOI: 10.1145/3581783.3611726
Research article, MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Emotion-Prior Awareness Network for Emotional Video Captioning

Published: 27 October 2023

Abstract

Emotional video captioning (EVC) is an emerging task that describes the factual content of a video together with the emotion it inherently expresses. For EVC, it is crucial to effectively perceive subtle and ambiguous visual emotion cues during caption generation. However, existing captioning methods usually overlook emotion learning in user-generated videos, which makes the generated sentences bland and soulless.
To address this issue, this paper proposes a new emotional captioning perspective that follows a human-like, perception-first manner: it first perceives the inherent emotion and then leverages the perceived emotion cue to support caption generation. Specifically, we devise an Emotion-Prior Awareness Network (EPAN). Its core is a novel tree-structured emotion learning module involving both catalog-level psychological categories and commonly used lexical-level emotion words, which enables explicit and fine-grained emotion perception. In addition, we develop a subordinate emotion masking mechanism between the catalog level and the lexical level that facilitates coarse-to-fine emotion learning. With this emotion prior, the decoder can effectively generate the emotional caption by exploiting the complementarity of visual, textual, and emotional semantics. We further introduce three simple yet effective optimization objectives, which significantly boost emotion learning from the perspectives of emotional captioning, hierarchical emotion classification, and emotional contrastive learning. Extensive experimental results on three benchmark datasets clearly demonstrate the advantages of the proposed EPAN over existing state-of-the-art methods on both semantic and emotional metrics. Ablation studies and visualization analyses further reveal the good interpretability of our emotional video captioning method. Code will be made available at https://github.com/songpipi/EPAN.
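The abstract describes a coarse-to-fine, tree-structured emotion module in which catalog-level (psychological category) predictions mask subordinate lexical-level (emotion word) predictions before the resulting emotion prior is handed to the caption decoder. The Python sketch below illustrates one plausible way such a mechanism could be wired up; the module name, the category and word counts, the feature dimension, and the thresholding rule are all illustrative assumptions rather than the authors' released implementation (see the linked repository for the actual code).

```python
# Minimal sketch of a coarse-to-fine emotion head with subordinate masking.
# All names, sizes, and the masking rule are assumptions for exposition only.
import torch
import torch.nn as nn

NUM_CATALOGS = 6        # assumed number of coarse psychological categories
WORDS_PER_CATALOG = 5   # assumed number of emotion words under each category
NUM_WORDS = NUM_CATALOGS * WORDS_PER_CATALOG
FEAT_DIM = 512          # assumed pooled video feature dimension


class TreeEmotionHead(nn.Module):
    """Catalog-level then lexical-level emotion perception (illustrative)."""

    def __init__(self):
        super().__init__()
        self.catalog_cls = nn.Linear(FEAT_DIM, NUM_CATALOGS)
        self.word_cls = nn.Linear(FEAT_DIM, NUM_WORDS)
        # Fixed tree structure: word j belongs to catalog j // WORDS_PER_CATALOG.
        owner = torch.arange(NUM_WORDS) // WORDS_PER_CATALOG
        self.register_buffer("owner", owner)

    def forward(self, video_feat):
        # video_feat: (B, FEAT_DIM) pooled visual representation of the clip.
        catalog_logits = self.catalog_cls(video_feat)            # (B, C)
        catalog_probs = catalog_logits.softmax(dim=-1)
        word_logits = self.word_cls(video_feat)                  # (B, W)
        # Subordinate masking: suppress emotion words whose parent catalog
        # scored below the uniform baseline, so the fine-grained prediction
        # follows the coarse decision (one plausible reading of the mechanism).
        keep = (catalog_probs >= 1.0 / NUM_CATALOGS).float()     # (B, C)
        word_mask = keep[:, self.owner]                          # (B, W)
        masked_word_logits = word_logits.masked_fill(word_mask == 0, float("-inf"))
        return catalog_logits, masked_word_logits


if __name__ == "__main__":
    head = TreeEmotionHead()
    feats = torch.randn(2, FEAT_DIM)
    cat_logits, word_logits = head(feats)
    emotion_prior = word_logits.softmax(dim=-1)  # would condition the caption decoder
    print(cat_logits.shape, emotion_prior.shape)
```

In a full model, the catalog and word logits would be supervised by the hierarchical emotion classification objective, and the resulting emotion prior would be fused with visual and textual features in the decoder, alongside the captioning and emotional contrastive losses mentioned in the abstract.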




    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. emotion learning
    2. video captioning
    3. video understanding

    Qualifiers

    • Research-article


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Cited By

    • (2025) Ensemble Prototype Network for Weakly Supervised Temporal Action Localization. IEEE Transactions on Neural Networks and Learning Systems 36(3), 4560-4574. https://doi.org/10.1109/TNNLS.2024.3377468. Online publication date: Mar-2025.
    • (2024) Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning. ACM Transactions on Intelligent Systems and Technology 15(6), 1-16. https://doi.org/10.1145/3682067. Online publication date: 29-Jul-2024.
    • (2024) Syntax-Controllable Video Captioning with Tree-Structural Syntax Augmentation. Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition, 1-7. https://doi.org/10.1145/3663976.3664004. Online publication date: 26-Apr-2024.
    • (2024) Active Exploration of Modality Complementarity for Multimodal Sentiment Analysis. Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition, 1-7. https://doi.org/10.1145/3663976.3663986. Online publication date: 26-Apr-2024.
    • (2024) Active Factor Graph Network for Group Activity Recognition. IEEE Transactions on Image Processing 33, 1574-1587. https://doi.org/10.1109/TIP.2024.3362140. Online publication date: 9-Feb-2024.
    • (2024) Learning topic emotion and logical semantic for video paragraph captioning. Displays 83, 102706. https://doi.org/10.1016/j.displa.2024.102706. Online publication date: Jul-2024.
