research-article

Toward Human Perception-Centric Video Thumbnail Generation

Authors:
Tao Yang, Hong Kong Polytechnic University, Hong Kong, China (ORCID: 0000-0002-6943-2907)
Fan Wang, Hong Kong Polytechnic University, Hong Kong, China (ORCID: 0009-0007-7928-2723)
Junfan Lin, Hong Kong Polytechnic University & Sun Yat-sen University, Hong Kong, China (ORCID: 0000-0001-8717-8351)
Zhongang Qi, Tencent PCG, Shenzhen, China (ORCID: 0000-0001-8298-4063)
Yang Wu, Tencent, Shenzhen, China (ORCID: 0000-0001-8010-6857)
Jing Xu, Tencent PCG, Shenzhen, China (ORCID: 0009-0009-7932-2654)
Ying Shan, Tencent PCG, Shenzhen, China (ORCID: 0000-0001-7673-8325)
Changwen Chen, Hong Kong Polytechnic University, Hong Kong, China (ORCID: 0000-0002-6720-234X)

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 6653–6664. https://doi.org/10.1145/3581783.3612434

Published: 27 October 2023


ABSTRACT

Video thumbnails play an essential role in summarizing video content into a compact image that users can browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem because human aesthetic perception is difficult to formulate and paired training data is scarce. This work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) framework to address these challenges. Specifically, our framework first generates a set of thumbnails using a principle-based system that conforms to established aesthetic and human perception principles, such as keeping the layout visually balanced and avoiding overlapping elements. Then, rather than designing thumbnails from scratch, we ask human annotators to evaluate some of these thumbnails and select the ones they prefer. A Transformer-based Variational Auto-Encoder (VAE) model is first pre-trained with Model-Agnostic Meta-Learning (MAML) and then fine-tuned on these human-selected thumbnails. Combining the MAML pre-training paradigm with human feedback can reduce human involvement and make the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in both objective and subjective evaluations, highlighting its potential to improve the user experience when browsing videos and to inspire future research in human perception-centric content generation tasks. The code and dataset will be released via https://github.com/yangtao2019yt/HPCVTG.
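
To make the first stage more concrete, the following is a minimal sketch of how a principle-based system might score candidate layouts on the two principles the abstract names, visual balance and avoiding overlapping elements. It is not the authors' implementation: the `Box` type, the penalty definitions, the weights, and every function name are hypothetical choices made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical element box in normalized canvas coordinates (0..1)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)


def overlap_penalty(boxes: list[Box]) -> float:
    """Total pairwise overlap, normalized by the total element area."""
    total = sum(b.area for b in boxes)
    if total == 0:
        return 0.0
    return sum(overlap_area(a, b) for a, b in combinations(boxes, 2)) / total


def imbalance(boxes: list[Box]) -> float:
    """Distance of the area-weighted centroid from the canvas center."""
    total = sum(b.area for b in boxes)
    if total == 0:
        return 0.0
    cx = sum(b.area * b.center[0] for b in boxes) / total
    cy = sum(b.area * b.center[1] for b in boxes) / total
    return ((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5


def layout_score(boxes: list[Box], overlap_weight: float = 1.0,
                 balance_weight: float = 1.0) -> float:
    """Higher is better; the weights are assumed, not taken from the paper."""
    return -(overlap_weight * overlap_penalty(boxes)
             + balance_weight * imbalance(boxes))


def rank_layouts(candidates: list[list[Box]]) -> list[list[Box]]:
    """Sort candidate layouts from best to worst under layout_score."""
    return sorted(candidates, key=layout_score, reverse=True)


# Two toy candidates: one balanced and disjoint, one overlapping in a corner.
balanced = [Box(0.1, 0.4, 0.2, 0.2), Box(0.7, 0.4, 0.2, 0.2)]
cluttered = [Box(0.0, 0.0, 0.4, 0.4), Box(0.2, 0.2, 0.4, 0.4)]
best = rank_layouts([cluttered, balanced])[0]
```

In a pipeline like the one described, the top-ranked candidates from such a scorer would be the ones shown to annotators, whose selections then serve as fine-tuning data for the generative model.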


Index Terms

Toward Human Perception-Centric Video Thumbnail Generation
1. Applied computing
  1. Arts and humanities
    1. Media arts
2. Human-centered computing
  1. Visualization
    1. Visualization design and evaluation methods



Published in
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy
Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
few-shot learning
human preference
variational auto-encoder
video thumbnail
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%