Abstract
For a general-purpose robot to operate in the real world, it must be able to execute a broad range of instructions across diverse environments. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent vision-language models such as CLIP have shown remarkable performance across deep-learning domains, paving the way for open-domain visual recognition. However, collecting data of robots executing varied language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, using robot video data from only a minimal number of tasks in a single environment. Unlike the robotic datasets commonly used to train reward functions, human video-language datasets rarely contain trivial failure videos. To strengthen the model's ability to distinguish successful from failed robot executions, we cluster failure-video features so the model can identify patterns within them. For each cluster, we train a new failure prompt and integrate it into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
Y. Yang and M. Chen contributed equally.
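The failure-prompt mechanism summarized in the abstract can be sketched in code. The following is a minimal NumPy illustration, not the paper's implementation: the encoders are abstracted away as precomputed embeddings, the embedding dimension, the k-means clustering routine, and the contrastive softmax reward form are all assumptions made for illustration. Cluster centers stand in for the learned per-cluster failure prompts.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Cluster failure-video embeddings X (n, d) into k failure modes.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center, then recompute centers.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

def reward(video_emb, task_emb, failure_prompts, temp=0.1):
    # Score a rollout: softmax over cosine similarities between the video
    # embedding and [task prompt; failure prompts]; the reward is the
    # probability mass on the task prompt (hypothetical reward form).
    cands = np.vstack([task_emb, failure_prompts])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb)
    logits = cands @ v / temp
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[0]
```

In this sketch a video that embeds close to the task prompt receives a reward near 1, while one that embeds close to any discovered failure cluster receives a reward near 0; in the paper the failure prompts are trained end to end rather than fixed to cluster centers.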
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62273303 and 62303406), in part by the Key R&D Program of Zhejiang Province, China (2023C01135), in part by the Ningbo Key R&D Program (Nos. 2023Z231 and 2023Z229), and in part by the Yongjiang Talent Introduction Programme (Grant Nos. 2022A-240-G and 2023A-194-G).
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yang, Y. et al. (2025). Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15115. Springer, Cham. https://doi.org/10.1007/978-3-031-72998-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72997-3
Online ISBN: 978-3-031-72998-0