Abstract
For a general-purpose robot to operate in the real world, it must be able to execute a broad range of instructions across diverse environments. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent vision-language models such as CLIP have shown remarkable performance across deep-learning domains, paving the way for open-domain visual recognition. However, collecting data of robots executing varied language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, using robot video data from only a minimal number of tasks in a single environment. Unlike the robotic datasets commonly used to train reward functions, human video-language datasets rarely contain trivial failure videos. To strengthen the model's ability to distinguish successful from failed robot executions, we cluster failure-video features so the model can identify patterns within them. For each cluster, we train a new failure prompt and integrate it into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
Y. Yang and M. Chen contributed equally.
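The failure-prompt mechanism summarized in the abstract can be sketched in code. The following is a minimal NumPy illustration, not the paper's implementation: the encoders are abstracted away as precomputed embeddings, the embedding dimension, the k-means clustering routine, and the contrastive softmax reward form are all assumptions made for illustration. Cluster centers stand in for the learned per-cluster failure prompts.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Cluster failure-video embeddings X (n, d) into k failure modes.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center, then recompute centers.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

def reward(video_emb, task_emb, failure_prompts, temp=0.1):
    # Score a rollout: softmax over cosine similarities between the video
    # embedding and [task prompt; failure prompts]; the reward is the
    # probability mass on the task prompt (hypothetical reward form).
    cands = np.vstack([task_emb, failure_prompts])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb)
    logits = cands @ v / temp
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[0]
```

In this sketch a video that embeds close to the task prompt receives a reward near 1, while one that embeds close to any discovered failure cluster receives a reward near 0; in the paper the failure prompts are trained end to end rather than fixed to cluster centers.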
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62273303 and 62303406), in part by the Key R&D Program of Zhejiang Province, China (2023C01135), in part by the Ningbo Key R&D Program (Nos. 2023Z231 and 2023Z229), and in part by the Yongjiang Talent Introduction Programme (Grant Nos. 2022A-240-G and 2023A-194-G).
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yang, Y. et al. (2025). Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15115. Springer, Cham. https://doi.org/10.1007/978-3-031-72998-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72997-3
Online ISBN: 978-3-031-72998-0