Dual-Stream Multimodal Learning for Topic-Adaptive Video Highlight Detection

ABSTRACT
This paper targets topic-adaptive video highlight detection, aiming to identify the moments in a video described by arbitrary text inputs. The fundamental challenge is the limited availability of annotated training data: scaling up the number of topic-level categories is costly, since it requires manually identifying and labeling the corresponding highlights. To overcome this challenge, our method provides a new perspective on highlight detection by attending to the semantic information of the topic text rather than simply classifying whether a snippet is a highlight. Specifically, we decompose a topic into a set of key concepts and exploit the remarkable ability of visual-language pre-trained models to learn from both videos and natural language. With the merits of this reformulation, highlight detection can be modeled as a snippet-text matching problem within a dual-stream multimodal learning framework, which strengthens the video representation with semantic language supervision and enables our model to perform open-set topic-adaptive highlight detection without any further labeled data. Our empirical evaluation demonstrates the effectiveness of our method on several publicly available datasets, where it outperforms competitive baselines and achieves a new state of the art for topic-adaptive highlight detection. Further, when transferring our pre-trained model to the open-set video highlight detection task, our method outperforms prior supervised work by a substantial margin.
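The snippet-text matching formulation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes snippet and concept embeddings have already been produced by a vision-language model (e.g. CLIP-style encoders), and all function names here (`score_snippets`, `cosine`) are hypothetical. Each snippet's highlight score is taken as its mean cosine similarity to the topic's key-concept embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def score_snippets(snippet_embs, concept_embs):
    """Score each video snippet as the mean cosine similarity
    to the topic's key-concept embeddings."""
    return [
        sum(cosine(s, c) for c in concept_embs) / len(concept_embs)
        for s in snippet_embs
    ]

# Toy usage: 3 snippet embeddings, a topic decomposed into 2 concepts.
# Real embeddings would be high-dimensional model outputs.
snippets = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
concepts = [[1.0, 0.1], [0.9, 0.2]]
scores = score_snippets(snippets, concepts)
best = max(range(len(scores)), key=scores.__getitem__)  # predicted highlight
```

Because matching is done in a shared embedding space, the same scoring function works for any topic text at inference time, which is what makes the open-set setting possible without further labeled data.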