
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15115)

Abstract

For a general-purpose robot to operate in the real world, it must execute a broad range of instructions across varied environments. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models such as CLIP have paved the way for open-domain visual recognition. However, collecting data of robots executing diverse language instructions across multiple environments remains a challenge. This paper aims to transfer a video-language model with strong generalization into a generalizable language-conditioned reward function, using robot video data from only a small number of tasks in a single environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model’s ability to distinguish successful from failed robot executions, we cluster failure video features so the model can identify patterns within them. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
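The failure-prompt idea in the abstract can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the feature vectors would in practice come from a video encoder (e.g. CLIP's), the k-means cluster centers here merely stand in for the failure-prompt embeddings that the paper learns through the text encoder, and the hypothetical `reward` function simply scores a video by its similarity to the instruction embedding minus its similarity to the nearest failure mode.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means over failure-video feature vectors."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster empties.
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return centers, labels

def reward(video_feat, instruction_feat, failure_prompt_feats):
    """Reward = similarity to the instruction embedding minus
    similarity to the closest failure-mode embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    success = cos(video_feat, instruction_feat)
    failure = max(cos(video_feat, p) for p in failure_prompt_feats)
    return success - failure
```

In this sketch, clustering the failure features first groups them into recoverable failure modes; one embedding per cluster then plays the role of a failure prompt when scoring new rollouts.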

Y. Yang and M. Chen—Equal contribution.



Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62273303 and 62303406), in part by the Key R&D Program of Zhejiang Province, China (2023C01135), in part by the Ningbo Key R&D Program (Nos. 2023Z231 and 2023Z229), and in part by the Yongjiang Talent Introduction Programme (Grant Nos. 2022A-240-G and 2023A-194-G).

Author information


Corresponding author

Correspondence to Minghao Chen.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2545 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yang, Y. et al. (2025). Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15115. Springer, Cham. https://doi.org/10.1007/978-3-031-72998-0_10

  • DOI: https://doi.org/10.1007/978-3-031-72998-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72997-3

  • Online ISBN: 978-3-031-72998-0
