DOI: 10.1145/3477495.3531795

Short paper

Point Prompt Tuning for Temporally Language Grounding

Published: 07 July 2022

Abstract

The task of temporally language grounding (TLG), which aims to locate the video moment in an untrimmed video that matches a given textual query, has attracted considerable research attention. Typical retrieval-based TLG methods are inefficient because they rely on pre-segmented candidate moments, while localization-based solutions adopt reinforcement learning and consequently suffer from unstable convergence. Performing the TLG task both efficiently and stably is therefore non-trivial.
Toward this end, we contribute a novel solution, Point Prompt Tuning (PPT), which formulates the task as a prompt-based multi-modal problem and integrates multiple sub-tasks to improve performance. Specifically, a flexible prompt strategy first rewrites the query so that it contains the original query together with start-point and end-point tokens. A multi-modal Transformer is then adopted to fully learn the multi-modal context. Meanwhile, we design two sub-tasks to constrain the framework, namely a matching task and a localization task. Finally, the start and end points of the matched video moment are predicted directly, in a simple yet stable manner. Extensive experiments on two real-world datasets verify the effectiveness of the proposed solution.
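To make the described pipeline concrete, below is a minimal sketch, in PyTorch, of how a prompt-based point-prediction model of this kind might look. Everything in it is an assumption made for illustration: the class name PointPromptTuning, the prompt template with [START] and [END] placeholder tokens, the 2048-dimensional clip features, and the exact prediction heads are reconstructed from the abstract alone, not taken from the authors' implementation.

import torch
import torch.nn as nn

class PointPromptTuning(nn.Module):
    """Hypothetical reconstruction of the PPT idea: fuse a rewritten
    prompt with video clip features in one Transformer, then attach a
    matching head and a start/end localization head."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4,
                 vocab_size=30522, feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)  # assumed 2048-d clip features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.match_head = nn.Linear(d_model, 1)  # matching sub-task: query-video score
        self.point_head = nn.Linear(d_model, 2)  # localization sub-task: start/end logits

    def forward(self, prompt_ids, video_feats):
        # prompt_ids: (B, Lt) token ids of the rewritten query, e.g. a template
        # like "<query> starts at [START] and ends at [END]" (assumed).
        # video_feats: (B, Lv, feat_dim) pre-extracted clip features.
        text = self.text_embed(prompt_ids)
        video = self.video_proj(video_feats)
        fused = self.encoder(torch.cat([text, video], dim=1))  # joint multi-modal context
        clips = fused[:, prompt_ids.size(1):]       # contextualized video clips
        match_logit = self.match_head(fused[:, 0])  # first prompt token as summary
        point_logits = self.point_head(clips)       # (B, Lv, 2)
        # Softmax over the clip axis: one distribution for the start point and
        # one for the end point, predicted directly without candidate moments.
        return match_logit, point_logits.softmax(dim=1)

# Toy usage: a 12-token prompt against a 64-clip video.
model = PointPromptTuning()
match, points = model(torch.randint(0, 30522, (1, 12)), torch.randn(1, 64, 2048))
print(match.shape, points.shape)  # torch.Size([1, 1]) torch.Size([1, 64, 2])

In this sketch, the matching head plays the role of the matching sub-task (does this query belong to this video at all?), while the per-clip start/end distributions realize the localization sub-task, so the grounded moment can be read off as the argmax start and end positions.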




Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. multi-modal understanding
  3. prompt learning
  4. temporally language grounding

Qualifiers

  • Short paper

Conference

SIGIR '22

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Cited By

  • Multi-prompts learning with cross-modal alignment for attribute-based person re-identification. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (2024), 6979-6987. DOI: 10.1609/aaai.v38i7.28524
  • RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation. In Proceedings of the 2024 International Conference on Multimedia Retrieval (2024), 860-868. DOI: 10.1145/3652583.3658018
  • FinReport: Explainable Stock Earnings Forecasting via News Factor Analyzing Model. In Companion Proceedings of the ACM on Web Conference 2024 (2024), 319-327. DOI: 10.1145/3589335.3648330
  • Temporally Language Grounding With Multi-Modal Multi-Prompt Tuning. IEEE Transactions on Multimedia, Vol. 26 (2024), 3366-3377. DOI: 10.1109/TMM.2023.3310282
  • Frame as Video Clip: Proposal-Free Moment Retrieval by Semantic Aligned Frames. IEEE Transactions on Industrial Informatics, Vol. 20, No. 11 (2024), 13158-13168. DOI: 10.1109/TII.2024.3431097
  • CPT: Colorful Prompt Tuning for pre-trained vision-language models. AI Open, Vol. 5 (2024), 30-38. DOI: 10.1016/j.aiopen.2024.01.004
  • Deep Learning for Video Localization. In Deep Learning for Video Understanding (2024), 39-68. DOI: 10.1007/978-3-031-57679-9_4
  • RewardTLG: Learning to Temporally Language Grounding from Flexible Reward. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 2344-2348. DOI: 10.1145/3539618.3592054
  • Adapting Generative Pretrained Language Model for Open-domain Multimodal Sentence Summarization. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 195-204. DOI: 10.1145/3539618.3591633
  • Temporal Sentence Grounding in Videos: A Survey and Future Directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 8 (2023), 10443-10465. DOI: 10.1109/TPAMI.2023.3258628
