DOI: 10.1145/3477495.3531795

Short paper

Point Prompt Tuning for Temporally Language Grounding

Published: 07 July 2022

Abstract

The task of temporally language grounding (TLG), which aims to locate the video moment in an untrimmed video that matches a given textual query, has attracted considerable research attention. Typical retrieval-based TLG methods are inefficient because they rely on pre-segmented candidate moments, while localization-based solutions adopt reinforcement learning and consequently suffer from unstable convergence. Performing the TLG task both efficiently and stably is therefore non-trivial.
Toward this end, we contribute a novel solution, Point Prompt Tuning (PPT), which formulates the task as a prompt-based multi-modal problem and integrates multiple sub-tasks to improve performance. Specifically, a flexible prompt strategy first rewrites the query so that it contains the original query together with start-point and end-point tokens. A multi-modal Transformer is then adopted to fully learn the multi-modal context. Meanwhile, we design two sub-tasks to constrain the framework, namely a matching task and a localization task. Finally, the start and end points of the matched video moment are predicted directly, in a simple yet stable manner. Extensive experiments on two real-world datasets verify the effectiveness of the proposed solution.
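To make the described pipeline concrete, below is a minimal sketch, in PyTorch, of how a prompt-based point-prediction model of this kind might look. Everything in it is an assumption made for illustration: the class name PointPromptTuning, the prompt template with [START] and [END] placeholder tokens, the 2048-dimensional clip features, and the exact prediction heads are reconstructed from the abstract alone, not taken from the authors' implementation.

import torch
import torch.nn as nn

class PointPromptTuning(nn.Module):
    """Hypothetical reconstruction of the PPT idea: fuse a rewritten
    prompt with video clip features in one Transformer, then attach a
    matching head and a start/end localization head."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4,
                 vocab_size=30522, feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)  # assumed 2048-d clip features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.match_head = nn.Linear(d_model, 1)  # matching sub-task: query-video score
        self.point_head = nn.Linear(d_model, 2)  # localization sub-task: start/end logits

    def forward(self, prompt_ids, video_feats):
        # prompt_ids: (B, Lt) token ids of the rewritten query, e.g. a template
        # like "<query> starts at [START] and ends at [END]" (assumed).
        # video_feats: (B, Lv, feat_dim) pre-extracted clip features.
        text = self.text_embed(prompt_ids)
        video = self.video_proj(video_feats)
        fused = self.encoder(torch.cat([text, video], dim=1))  # joint multi-modal context
        clips = fused[:, prompt_ids.size(1):]       # contextualized video clips
        match_logit = self.match_head(fused[:, 0])  # first prompt token as summary
        point_logits = self.point_head(clips)       # (B, Lv, 2)
        # Softmax over the clip axis: one distribution for the start point and
        # one for the end point, predicted directly without candidate moments.
        return match_logit, point_logits.softmax(dim=1)

# Toy usage: a 12-token prompt against a 64-clip video.
model = PointPromptTuning()
match, points = model(torch.randint(0, 30522, (1, 12)), torch.randn(1, 64, 2048))
print(match.shape, points.shape)  # torch.Size([1, 1]) torch.Size([1, 64, 2])

In this sketch, the matching head plays the role of the matching sub-task (does this query belong to this video at all?), while the per-clip start/end distributions realize the localization sub-task, so the grounded moment can be read off as the argmax start and end positions.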




Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. multi-modal understanding
  3. prompt learning
  4. temporally language grounding

Qualifiers

  • Short paper

Conference

SIGIR '22

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Cited By

  • Multi-prompts learning with cross-modal alignment for attribute-based person re-identification. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (2024), 6979-6987. DOI: 10.1609/aaai.v38i7.28524
  • RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation. In Proceedings of the 2024 International Conference on Multimedia Retrieval (2024), 860-868. DOI: 10.1145/3652583.3658018
  • FinReport: Explainable Stock Earnings Forecasting via News Factor Analyzing Model. In Companion Proceedings of the ACM on Web Conference 2024 (2024), 319-327. DOI: 10.1145/3589335.3648330
  • Temporally Language Grounding With Multi-Modal Multi-Prompt Tuning. IEEE Transactions on Multimedia, Vol. 26 (2024), 3366-3377. DOI: 10.1109/TMM.2023.3310282
  • Frame as Video Clip: Proposal-Free Moment Retrieval by Semantic Aligned Frames. IEEE Transactions on Industrial Informatics, Vol. 20, No. 11 (2024), 13158-13168. DOI: 10.1109/TII.2024.3431097
  • CPT: Colorful Prompt Tuning for pre-trained vision-language models. AI Open, Vol. 5 (2024), 30-38. DOI: 10.1016/j.aiopen.2024.01.004
  • Deep Learning for Video Localization. In Deep Learning for Video Understanding (2024), 39-68. DOI: 10.1007/978-3-031-57679-9_4
  • RewardTLG: Learning to Temporally Language Grounding from Flexible Reward. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 2344-2348. DOI: 10.1145/3539618.3592054
  • Adapting Generative Pretrained Language Model for Open-domain Multimodal Sentence Summarization. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 195-204. DOI: 10.1145/3539618.3591633
  • Temporal Sentence Grounding in Videos: A Survey and Future Directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 8 (2023), 10443-10465. DOI: 10.1109/TPAMI.2023.3258628
