Research Article
DOI: 10.1145/3581783.3612434

Toward Human Perception-Centric Video Thumbnail Generation

Published: 27 October 2023

ABSTRACT

Video thumbnails play an essential role in summarizing video content into compact, concise images that users can browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem because human aesthetic perception is difficult to formulate and paired training data are scarce. This work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) framework to address these challenges. Specifically, the framework first generates a set of thumbnails with a principle-based system that conforms to established aesthetic and human perception principles, such as visual balance in the layout and avoidance of overlapping elements. Then, rather than asking human annotators to design thumbnails from scratch, we ask them to evaluate a subset of these generated thumbnails and select the ones they prefer. A Transformer-based Variational Auto-Encoder (VAE) model is first pre-trained with Model-Agnostic Meta-Learning (MAML) and then fine-tuned on the human-selected thumbnails. Combining the MAML pre-training paradigm with human feedback reduces human involvement and makes the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in both objective and subjective evaluations, highlighting its potential to improve the user experience of browsing videos and to inspire future research on human perception-centric content generation tasks. The code and dataset will be released via https://github.com/yangtao2019yt/HPCVTG.
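The pipeline summarized above combines principle-based candidate generation, human preference selection, and MAML pre-training of a Transformer-based VAE followed by fine-tuning. The sketch below is only an illustrative, heavily simplified PyTorch outline of that meta-learning-then-fine-tuning idea; the LayoutVAE class, the box representation, the loss weights, and the first-order MAML loop are placeholder assumptions for exposition, not the authors' released implementation (see their repository for the actual code).

```python
# Illustrative sketch only (placeholder model, data, and hyper-parameters):
# meta-pre-train a layout VAE with first-order MAML, then fine-tune it on a
# small set of human-selected thumbnail layouts.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutVAE(nn.Module):
    """Toy stand-in for a Transformer-based VAE over thumbnail element boxes."""
    def __init__(self, dim=32):
        super().__init__()
        self.enc = nn.Linear(4, dim)          # encode (x, y, w, h) per element
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        self.dec = nn.Linear(dim, 4)          # reconstruct element boxes

    def forward(self, boxes):
        h = torch.relu(self.enc(boxes))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, boxes, mu, logvar, beta=0.1):
    rec = F.mse_loss(recon, boxes)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

def maml_pretrain(model, tasks, inner_lr=1e-2, outer_lr=1e-3, meta_steps=100):
    """First-order MAML: adapt a copy of the model on each task's support set,
    then update the meta-parameters with the adapted copies' query gradients."""
    meta_opt = torch.optim.Adam(model.parameters(), lr=outer_lr)
    for _ in range(meta_steps):
        meta_opt.zero_grad()
        for support, query in tasks:          # one task per layout style/category
            fast = copy.deepcopy(model)
            inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
            inner_opt.zero_grad()
            recon, mu, logvar = fast(support)
            vae_loss(recon, support, mu, logvar).backward()
            inner_opt.step()                  # inner-loop adaptation
            recon, mu, logvar = fast(query)
            grads = torch.autograd.grad(vae_loss(recon, query, mu, logvar),
                                        fast.parameters())
            for p, g in zip(model.parameters(), grads):
                p.grad = g if p.grad is None else p.grad + g
        meta_opt.step()                       # outer (meta) update
    return model

def finetune_on_preferences(model, preferred_layouts, lr=1e-4, epochs=5):
    """Fine-tune the meta-pre-trained model on human-selected thumbnail layouts."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for boxes in preferred_layouts:
            opt.zero_grad()
            recon, mu, logvar = model(boxes)
            vae_loss(recon, boxes, mu, logvar).backward()
            opt.step()
    return model
```

In the paper's actual setting, each meta-learning task would correspond to a group of principle-based candidate thumbnails and the fine-tuning data to the thumbnails annotators preferred; the exact architecture and objective are those in the released code, not this sketch.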


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States
