DOI: 10.1145/3581783.3612038

Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval

Published: 27 October 2023

Abstract

This paper addresses the challenging task of language-driven moment retrieval. Previous methods are typically trained to localize the target moment corresponding to a single sentence query in a complicated video. However, this specific moment generally conveys richer content than the query, i.e., a single query may omit certain object details or actions present in the complex foreground-background visual content. Such an information imbalance between the two modalities makes it difficult to finely align their representations. To this end, instead of training with a single query, we propose to exploit the diversity and complementarity among different queries that describe the same video moment to enrich the textual semantics. Specifically, we develop a Teacher-Student Moment Retrieval (TSMR) framework to fill this cross-modal information gap. A teacher model is trained to encode not only a given query but also extra complementary queries, aggregating contextual semantics into more comprehensive moment-related query representations. Since these additional queries are inaccessible during inference, we further introduce an adaptive knowledge distillation mechanism that trains a student model with a single query input to selectively absorb knowledge from the teacher model. In this manner, the student model becomes more robust to the cross-modal information gap when retrieving moments guided by a single query. Experimental results on two benchmarks demonstrate the effectiveness of the proposed method.
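To make the teacher-student idea concrete, the following is a minimal, hypothetical PyTorch-style sketch of such adaptive distillation between a teacher that sees complementary queries and a student that sees only one. It is not the authors' implementation: the GRU query encoder, the mean-pooled aggregation of complementary queries, and the cosine-gated per-sample weighting are all illustrative assumptions made for exposition.

```python
# Hypothetical sketch of teacher-student query distillation (NOT the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    """Encodes a batch of sentence queries from pre-extracted word embeddings."""
    def __init__(self, word_dim=300, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, words):                      # words: (B, L, word_dim)
        _, h = self.rnn(words)                     # h: (2, B, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)        # (B, 2 * hidden_dim)
        return self.proj(h)                        # (B, hidden_dim)

def teacher_query_repr(encoder, main_words, extra_words_list):
    """Teacher view: enrich the main query with complementary queries of the same
    moment; simple averaging is an assumed aggregation scheme."""
    reps = [encoder(main_words)] + [encoder(w) for w in extra_words_list]
    return torch.stack(reps, dim=0).mean(dim=0)    # (B, hidden_dim)

def adaptive_distill_loss(student_q, teacher_q):
    """Selective transfer: weight each sample's distillation loss by how far the
    student is from the teacher (1 - cosine similarity), a hypothetical gate."""
    teacher_q = teacher_q.detach()                                    # teacher frozen
    gate = 1.0 - F.cosine_similarity(student_q, teacher_q, dim=-1)    # (B,)
    per_sample = F.mse_loss(student_q, teacher_q, reduction="none").mean(dim=-1)
    return (gate * per_sample).mean()

# Usage sketch: the student sees only the single query (as at inference time),
# while the teacher additionally sees complementary queries during training.
if __name__ == "__main__":
    B, L, D = 4, 12, 300
    student_enc, teacher_enc = QueryEncoder(), QueryEncoder()
    main_q = torch.randn(B, L, D)
    extra_qs = [torch.randn(B, L, D) for _ in range(2)]
    loss = adaptive_distill_loss(student_enc(main_q),
                                 teacher_query_repr(teacher_enc, main_q, extra_qs))
    loss.backward()
```

The gating term is only one plausible way to make the distillation "adaptive"; the key point illustrated here is that the extra queries influence training only through the teacher's representation, so nothing beyond the single query is required at inference.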

Supplemental Material

MP4 File


Cited By

  • (2024) Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 5260-5269. DOI: 10.1145/3664647.3680731
  • (2024) Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective. In Computer Vision – ECCV 2024, 290-311. DOI: 10.1007/978-3-031-72995-9_17


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. information imbalance
  2. language-driven moment retrieval
  3. teacher-student model

Qualifiers

  • Research-article

Funding Sources

  • Open Research Fund of the State Key Laboratory of Multimodal Artificial Intelligence Systems
  • Young Elite Scientists Sponsorship Program by CAST
  • Fundamental Research Funds for the Provincial Universities of Zhejiang
  • National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
