ABSTRACT
Video thumbnail plays an essential role in summarizing video content into a compact and concise image for users to browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem due to the difficulty of formulating human aesthetic perception and the scarcity of paired training data. This work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) to address these challenges. Specifically, our framework first generates a set of thumbnails using a principle-based system, which conforms to established aesthetic and human perception principles, such as visual balance in the layout and avoiding overlapping elements. Then rather than designing from scratch, we ask human annotators to evaluate some of these thumbnails and select their preferred ones. A Transformer-based Variational Auto-Encoder (VAE) model is firstly pre-trained with Model-Agnostic Meta-Learning (MAML) and then fine-tuned on these human-selected thumbnails. The exploration of combining the MAML pre-training paradigm with human feedback in training can reduce human involvement and make the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in objective and subjective evaluations, highlighting its potential to improve the user experience when browsing videos and inspire future research in human perception-centric content generation tasks. The code and dataset will be released via https://github.com/yangtao2019yt/HPCVTG.
- Diego Martin Arroyo, Janis Postels, and Federico Tombari. 2021. Variational transformer networks for layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13642--13652.Google ScholarCross Ref
- Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086 (2016).Google Scholar
- Michael Bauerly and Yili Liu. 2006. Computational modeling and experimental investigation of effects of compositional elements on interface and design aesthetics. International journal of human-computer studies 64, 8 (2006), 670--682.Google ScholarDigital Library
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877--1901.Google Scholar
- Zhaowei Cai and Nuno Vasconcelos. 2018. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6154--6162.Google ScholarCross Ref
- Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017).Google Scholar
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic metalearning for fast adaptation of deep networks. In International conference on machine learning. PMLR, 1126--1135.Google Scholar
- Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. 2021. Layouttransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1004--1014.Google ScholarCross Ref
- Donald Joseph Hejna III and Dorsa Sadigh. 2023. Few-shot preference learning for human-in-the-loop RL. In Conference on Robot Learning. PMLR, 2014--2025.Google Scholar
- Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07--49. University of Massachusetts, Amherst.Google Scholar
- Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. 2018. Reward learning from human preferences and demonstrations in atari. Advances in neural information processing systems 31 (2018).Google Scholar
- Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2023. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. arXiv preprint arXiv:2303.08137 (2023).Google Scholar
- Zhaoyun Jiang, Shizhao Sun, Jihua Zhu, Jian-Guang Lou, and Dongmei Zhang. 2022. Coarse-to-Fine Generative Modeling for Graphic Layouts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 1096--1103.Google ScholarCross Ref
- Chuhao Jin, Hongteng Xu, Ruihua Song, and Zhiwu Lu. 2022. Text2Poster: Laying Out Stylized Texts on Retrieved Images. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4823--4827.Google ScholarCross Ref
- Hadi Kazemi, Fariborz Taherkhani, and Nasser Nasrabadi. 2020. Preferencebased image generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3404--3413.Google Scholar
- Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2021. Constrained graphic layout generation via latent optimization. In Proceedings of the 29th ACM International Conference on Multimedia. 88--96.Google ScholarDigital Library
- Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).Google Scholar
- Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, and Irfan Essa. 2022. BLT: bidirectional layout transformer for controllable layout generation. In Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XVII. Springer, 474--490.Google Scholar
- Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. 2018. Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958 (2018).Google Scholar
- Chien-Yin Lai, Pai-Hsun Chen, Sheng-Wen Shih, Yili Liu, and Jen-Shin Hong. 2010. Computational models and experimental investigations of effects of balance and symmetry on the aesthetics of text-overlaid images. International journal of human-computer studies 68, 1--2 (2010), 41--56.Google ScholarDigital Library
- Hsin-Ying Lee, Lu Jiang, Irfan Essa, Phuong B Le, Haifeng Gong, Ming-Hsuan Yang, and Weilong Yang. 2020. Neural design network: Graphic layout generation with constraints. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16. Springer, 491--506.Google ScholarDigital Library
- Kimin Lee, Laura Smith, and Pieter Abbeel. 2021. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091 (2021).Google Scholar
- Jinyu Li, Shujin Lin, Fan Zhou, and Ruomei Wang. 2022. NewsThumbnail: Automatic Generation of News Video Thumbnail. In 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 1383--1388.Google Scholar
- Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and Tingfa Xu. 2019. Layoutgan: Generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767 (2019).Google Scholar
- Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu, Christina Wang, and Tingfa Xu. 2020. Attribute-conditioned layout gan for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics 27, 10 (2020), 4039--4048.Google ScholarDigital Library
- Zhiwei Li, Shuming Shi, and Lei Zhang. 2008. Improving relevance judgment of web search results with image excerpts. In Proceedings of the 17th international conference on World Wide Web. 21--30.Google ScholarDigital Library
- Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11976--11986.Google ScholarCross Ref
- Shuang Ma and Chang Wen Chen. 2016. Automatic creation of magazine-pagelike social media visual summary for mobile browsing. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 469--473.Google ScholarCross Ref
- James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David L Roberts, Matthew E Taylor, and Michael L Littman. 2017. Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning. PMLR, 2285--2294.Google Scholar
- Roberto Martínez-Cruz, Alvaro J López-López, and José Portela. 2023. ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task. arXiv preprint arXiv:2304.14177 (2023).Google Scholar
- Tao Mei and Xian-Sheng Hua. 2010. Contextual internet multimedia advertising. Proc. IEEE 98, 8 (2010), 1416--1433.Google ScholarCross Ref
- Tao Mei, Xian-Sheng Hua, and Shipeng Li. 2008. Contextual in-image advertising. In Proceedings of the 16th ACM international conference on Multimedia. 439--448.Google ScholarDigital Library
- Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021).Google Scholar
- Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730--27744.Google Scholar
- Rik Pieters and Michel Wedel. 2004. Attention capture and transfer in advertising: Brand, pictorial, and text-size effects. Journal of marketing 68, 2 (2004), 36--50.Google ScholarCross Ref
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763.Google Scholar
- DM Rocke. 2000. Genetic Algorithms Data Structures= Evolution programs (3rd. J. Amer. Statist. Assoc. 95, 449 (2000), 347.Google Scholar
- Mingyang Song, Haiyun Jiang, Shuming Shi, Songfang Yao, Shilong Lu, Yi Feng, Huafeng Liu, and Liping Jing. 2023. Is ChatGPT A Good Keyphrase Generator? A Preliminary Study. arXiv preprint arXiv:2303.13001 (2023).Google Scholar
- Jaime Teevan, Edward Cutrell, Danyel Fisher, Steven M Drucker, Gonzalo Ramos, Paul André, and Chang Hu. 2009. Visual snippets: summarizing web pages for search and revisitation. In Proceedings of the SIGCHI conference on human factors in computing systems. 2023--2032.Google ScholarDigital Library
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).Google Scholar
- Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. 2018. Unified Perceptual Parsing for Scene Understanding. In European Conference on Computer Vision. Springer.Google Scholar
- Binbin Xie, Jia Song, Liangying Shao, Suhang Wu, Xiangpeng Wei, Baosong Yang, Huan Lin, Jun Xie, and Jinsong Su. 2023. From statistical methods to deep learning, automatic keyphrase prediction: A survey. Information Processing & Management 60, 4 (2023), 103382.Google ScholarDigital Library
- Yi Xu, Fan Bai, Yingxuan Shi, Qiuyu Chen, Longwen Gao, Kai Tian, Shuigeng Zhou, and Huyang Sun. 2021. Gif thumbnails: Attract more clicks to your videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3074--3082.Google ScholarCross Ref
- Kota Yamaguchi. 2021. Canvasvae: learning to generate vector graphic documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5481--5489.Google ScholarCross Ref
- Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and Shipeng Li. 2016. Automatic generation of visual-textual presentation layout. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 12, 2 (2016), 1--22.Google ScholarDigital Library
- Wenyuan Yin, Tao Mei, and Chang Wen Chen. 2013. Automatic generation of social media snippets for mobile browsing. In Proceedings of the 21st ACM international conference on Multimedia. 927--936.Google ScholarDigital Library
- Ning Yu, Chia-Chih Chen, Zeyuan Chen, Rui Meng, Gang Wu, Paul Josel, Juan Carlos Niebles, Caiming Xiong, and Ran Xu. 2022. LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer. arXiv preprint arXiv:2212.09877 (2022).Google Scholar
- Junyi Zhang, Jiaqi Guo, Shizhao Sun, Jian-Guang Lou, and Dongmei Zhang. 2023. LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models. arXiv preprint arXiv:2303.11589 (2023).Google Scholar
- Yunke Zhang, Kangkang Hu, Peiran Ren, Changyuan Yang, Weiwei Xu, and Xian-Sheng Hua. 2017. Layout style modeling for automating banner design. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017. 451--459.Google ScholarDigital Library
- Baoquan Zhao, Hanhui Li, Ruomei Wang, and Xiaonan Luo. 2020. Automatic generation of informative video thumbnail. In 2020 8th International Conference on Digital Home (ICDH). IEEE, 254--259.Google ScholarCross Ref
- Baoquan Zhao, Shujin Lin, Xin Qi, Zhiquan Zhang, Xiaonan Luo, and Ruomei Wang. 2017. Automatic generation of visual-textual web video thumbnail. In SIGGRAPH Asia 2017 Posters. 1--2.Google Scholar
- Ting Zhao and Xiangqian Wu. 2019. Pyramid Feature Attention Network for Saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
- Min Zhou, Chenchen Xu, Ye Ma, Tiezheng Ge, Yuning Jiang, and Weiwei Xu. 2022. Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs. arXiv preprint arXiv:2205.00303 (2022).Google Scholar
- Wangchunshu Zhou and Ke Xu. 2020. Learning to compare for better training and evaluation of open domain natural language generation models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9717--9724.Google ScholarCross Ref
Index Terms
- Toward Human Perception-Centric Video Thumbnail Generation
Recommendations
Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection
ICMR '21: Proceedings of the 2021 International Conference on Multimedia RetrievalThis paper presents a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on ...
Sentence Specified Dynamic Video Thumbnail Generation
MM '19: Proceedings of the 27th ACM International Conference on MultimediaWith the tremendous growth of videos over the Internet, video thumbnails, providing video content previews, are becoming increasingly crucial to influencing users' online searching experiences. Conventional video thumbnails are generated once purely ...
A Novel Framework for Web Video Thumbnail Generation
IIH-MSP '12: Proceedings of the 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal ProcessingWhen user uploads a video clip to the video sharing websites, a video thumbnail needs to be generated as the cover to represent the video content. In this paper, a novel video thumbnail generation framework is presented. For generating a good thumbnail, ...
Comments