Dual-Stream Multimodal Learning for Topic-Adaptive Video Highlight Detection

ABSTRACT
This paper targets topic-adaptive video highlight detection, aiming to identify the moments in a video described by arbitrary text inputs. The fundamental challenge is the limited availability of annotated training data: scaling up the number of topic-level categories is costly, since it requires manually identifying and labeling the corresponding highlights. To overcome this challenge, our method provides a new perspective on highlight detection by attending to the semantic information of the topic text rather than simply classifying whether a snippet is a highlight. Specifically, we decompose a topic into a set of key concepts and exploit the remarkable ability of visual-language pre-trained models to learn from both videos and natural language. With the merits of this reformulation, highlight detection can be modeled as a snippet-text matching problem within a dual-stream multimodal learning framework, which strengthens the video representation with semantic language supervision and enables our model to perform open-set topic-adaptive highlight detection without any further labeled data. Our empirical evaluation demonstrates the effectiveness of our method on several publicly available datasets, where it outperforms competitive baselines and achieves a new state of the art for topic-adaptive highlight detection. Further, when transferring our pre-trained model to the open-set video highlight detection task, our method outperforms prior supervised work by a substantial margin.
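The snippet-text matching formulation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes snippet and concept embeddings have already been produced by a vision-language model (e.g. CLIP-style encoders), and all function names here (`score_snippets`, `cosine`) are hypothetical. Each snippet's highlight score is taken as its mean cosine similarity to the topic's key-concept embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def score_snippets(snippet_embs, concept_embs):
    """Score each video snippet as the mean cosine similarity
    to the topic's key-concept embeddings."""
    return [
        sum(cosine(s, c) for c in concept_embs) / len(concept_embs)
        for s in snippet_embs
    ]

# Toy usage: 3 snippet embeddings, a topic decomposed into 2 concepts.
# Real embeddings would be high-dimensional model outputs.
snippets = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
concepts = [[1.0, 0.1], [0.9, 0.2]]
scores = score_snippets(snippets, concepts)
best = max(range(len(scores)), key=scores.__getitem__)  # predicted highlight
```

Because matching is done in a shared embedding space, the same scoring function works for any topic text at inference time, which is what makes the open-set setting possible without further labeled data.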