Research Article · DOI: 10.1145/3589334.3645711

Understanding Human Preferences: Towards More Personalized Video to Text Generation

Published: 13 May 2024

Abstract

While previous video-to-text models have achieved remarkable success, they mostly focus on understanding video content in a general sense and fail to capture personalized human preferences, which are highly demanded for engaging multimodal chatbots. Unlike user modeling in collaborative filtering, no other user behaviors are available at inference time when a real-time video stream arrives. In this paper, we formally define the personalized video commenting task and design an end-to-end personalized framework to solve it. Specifically, we argue that personalization in video comment generation is reflected in two aspects: (1) for the same video, different users may comment on different clips, and (2) for the same clip, different people may express various opinions in diverse commentary styles. Motivated by these considerations, we design our framework around two components. The first is a clip selector, which predicts the clips in the video that the user is likely to comment on. The second is a text generator, which produces the comment based on the predicted clips and the user's preferences. These two components are optimized in an end-to-end manner to mutually enhance each other, and we design confidence-aware scheduled sampling and iterative inference strategies to address the absence of ground-truth clips at inference time. Given the lack of a personalized video-to-text dataset, we collect and release a new dataset for studying this problem. We conduct extensive experiments to demonstrate the effectiveness of our model.
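To make the two-component design concrete, below is a minimal PyTorch-style sketch of a user-conditioned clip selector feeding a comment generator. All module names, dimensions, and the soft (softmax-weighted) selection used to keep the pipeline differentiable are illustrative assumptions, not the authors' implementation; in particular, the paper's confidence-aware scheduled sampling and iterative inference strategies are not shown here.

```python
# A rough sketch (assumed design, not the paper's code) of the framework:
# a clip selector scores which clips a user is likely to comment on, and a
# comment generator conditions on the selected clips plus the user embedding.
import torch
import torch.nn as nn


class ClipSelector(nn.Module):
    """Scores each clip for a given user; higher = more likely to be commented on."""
    def __init__(self, clip_dim: int, user_dim: int, hidden: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(clip_dim + user_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, clip_feats, user_emb):
        # clip_feats: (num_clips, clip_dim); user_emb: (user_dim,)
        user = user_emb.unsqueeze(0).expand(clip_feats.size(0), -1)
        return self.scorer(torch.cat([clip_feats, user], dim=-1)).squeeze(-1)


class CommentGenerator(nn.Module):
    """Stand-in decoder: pools the selected clips with the user embedding.

    In the actual framework this would be a text decoder emitting comment tokens.
    """
    def __init__(self, clip_dim: int, user_dim: int, vocab_size: int = 30522):
        super().__init__()
        self.proj = nn.Linear(clip_dim + user_dim, vocab_size)

    def forward(self, selected_feats, user_emb):
        pooled = selected_feats.mean(dim=0)
        return self.proj(torch.cat([pooled, user_emb], dim=-1))  # next-token logits


# Soft selection keeps the pipeline differentiable end to end, so the selector
# and generator can be trained jointly and mutually enhance each other.
clip_feats = torch.randn(12, 512)   # 12 clips, 512-d visual features (assumed)
user_emb = torch.randn(64)          # learned user preference embedding (assumed)
selector = ClipSelector(512, 64)
generator = CommentGenerator(512, 64)

scores = selector(clip_feats, user_emb)                # (12,) per-clip scores
weights = torch.softmax(scores, dim=0).unsqueeze(-1)   # soft clip selection
logits = generator(weights * clip_feats, user_emb)     # comment logits
```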

Supplemental Material

MP4 File
Supplemental video



      Published In

      WWW '24: Proceedings of the ACM Web Conference 2024
      May 2024
      4826 pages
      ISBN:9798400701719
      DOI:10.1145/3589334
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 May 2024


      Author Tags

      1. multimodal interaction
      2. personalized content generation
      3. user preference modeling
      4. video comments dataset
      5. video to text generation

      Qualifiers

      • Research-article

      Funding Sources

      • Migu Culture Technology Co.
      • National Natural Science Foundation of China
      • Beijing Natural Science Foundation
      • Research Funds of Renmin University of China
      • Outstanding Innovative Talents Cultivation Funded Programs 2024 of Renmin University of China
      • Huawei Poisson Lab

      Conference

WWW '24: The ACM Web Conference 2024
May 13 - 17, 2024
Singapore, Singapore

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

