DOI: 10.1145/3591106.3592286

Dual-Stream Multimodal Learning for Topic-Adaptive Video Highlight Detection

Published: 12 June 2023

ABSTRACT

This paper targets topic-adaptive video highlight detection, which aims to identify the moments in a video described by arbitrary text inputs. The fundamental challenge is the scarcity of annotated training data: scaling up the number of topic-level categories is costly because it requires manually identifying and labeling the corresponding highlights. To overcome this challenge, our method offers a new perspective on highlight detection by exploiting the semantic information of the topic text rather than simply classifying whether a snippet is a highlight. Specifically, we decompose a topic into a set of key concepts and leverage the remarkable ability of visual-language pre-trained models to learn knowledge from both videos and language. With this reformulation, highlight detection can be modeled as a snippet-text matching problem within a dual-stream multimodal learning framework, which strengthens the video representation with semantic language supervision and enables our model to perform open-set topic-adaptive highlight detection without any further labeled data. Our empirical evaluation on several publicly available datasets shows that the proposed method outperforms competitive baselines and establishes a new state-of-the-art for topic-adaptive highlight detection. Furthermore, when transferred to the open-set video highlight detection task, our pre-trained model outperforms prior supervised work by a substantial margin.
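The snippet-text matching idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `highlight_scores` and `cosine` helpers are hypothetical names, the toy 2-D vectors stand in for embeddings that would come from the image and text encoders of a visual-language pre-trained model (e.g. CLIP), and scoring each snippet by its best match against the topic's key concepts is an assumption about how the decomposition is used.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def highlight_scores(snippet_embs, concept_embs):
    # A snippet is highlight-like for a topic if it matches ANY of the
    # topic's key concepts, so each snippet keeps its best concept match.
    return [max(cosine(s, c) for c in concept_embs) for s in snippet_embs]

# Toy 2-D embeddings standing in for encoder outputs.
snippets = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # three video snippets
concepts = [[1.0, 0.1]]                           # one key concept of the topic
scores = highlight_scores(snippets, concepts)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Because the topic enters only through text embeddings, adding a new topic requires no retraining or new labels: one simply encodes its key-concept phrases and reranks the snippets, which is what makes the formulation open-set.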


Published in
ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023, 694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106

            Copyright © 2023 ACM


            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

Overall acceptance rate: 254 of 830 submissions (31%)
