DOI: 10.1145/3539618.3592054

RewardTLG: Learning to Temporally Language Grounding from Flexible Reward

Published: 18 July 2023

Abstract

Given a textual sentence provided by a user, the Temporal Language Grounding (TLG) task is to find a semantically relevant moment or clip in an untrimmed video. In recent years, localization-based TLG methods have been explored, which adopt reinforcement learning to locate a clip within the video. However, these methods are not stable enough, because the stochastic exploration mechanism of reinforcement learning is sensitive to the reward design. Providing a more flexible and reasonable reward has therefore become a focus of attention in both academia and industry.
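For context, reinforcement-learning TLG agents in prior work typically shape their reward around the temporal IoU between the currently predicted window and the ground-truth moment, so small changes in the reward definition can noticeably change training behaviour. The following is a minimal sketch of such an IoU-style step reward; the function names and the fixed step bonus are illustrative assumptions, not the formulation used in this paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def step_reward(prev_window, new_window, gt_window, bonus=1.0):
    # Reward the agent when its adjustment action increases overlap with the
    # ground-truth moment, penalize it otherwise. RL-based TLG methods differ
    # mainly in how this shaping is done, which is why they are reward-sensitive.
    delta = temporal_iou(new_window, gt_window) - temporal_iou(prev_window, gt_window)
    return bonus if delta > 0 else -bonus
```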
Inspired by the training process of ChatGPT, we adopt a vision-language pre-training (VLP) model as a reward model, which provides flexible rewards that help the localization-based TLG model converge. Specifically, a reinforcement learning-based localization module is introduced to predict the start and end timestamps in multi-modal scenarios. We then fine-tune a reward model based on a VLP model, optionally incorporating human feedback, and it supplies a flexible reward score to the localization module. In this way, our model is able to capture subtle differences within the untrimmed video. Extensive experiments on two datasets verify the effectiveness of the proposed solution.
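As a rough illustration of the reward-model idea, the sketch below scores a candidate clip with the text-frame similarity of a frozen vision-language model, using an off-the-shelf CLIP checkpoint from Hugging Face Transformers as a stand-in. The checkpoint name, frame sampling, and mean-pooling are assumptions for illustration only; the reward model described in the abstract is a fine-tuned VLP model, optionally trained with human feedback, rather than a frozen off-the-shelf one.

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in reward model (assumption): a frozen CLIP checkpoint. The paper's
# reward model is instead a VLP model fine-tuned, optionally with human feedback.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def vlp_reward(query: str, frames: list) -> float:
    """Reward = mean query/frame similarity over frames sampled from the predicted clip."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_text  # shape: (1, num_frames)
    return sims.mean().item()

# Toy usage with dummy frames; in practice the frames would be sampled from the
# clip bounded by the start/end timestamps predicted by the localization module.
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(4)]
print(vlp_reward("a person opens the refrigerator", frames))
```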


Cited By

  • (2024) Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval. IEEE Signal Processing Letters, 31, 1535-1539. DOI: 10.1109/LSP.2024.3403533
  • (2023) BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(3), 1-21. DOI: 10.1145/3627103

    Published In

    SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2023, 3567 pages
    ISBN: 9781450394086
    DOI: 10.1145/3539618

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. cross-modal moment retrieval
    2. temporal language grounding

    Qualifiers

    • Short-paper

    Conference

    SIGIR '23

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

