
Pre-trained Visual Dynamics Representations for Efficient Policy Learning

  • Conference paper: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15139)

Abstract

Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and embody a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder their use for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. Adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations that capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective approach to pre-training with videos for promoting policy learning.
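The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of a Transformer-based CVAE pre-trained with a video prediction objective, of the kind the abstract describes: a dynamics encoder infers a latent \(z_t\) from context and future frame embeddings, and a decoder predicts the future frames conditioned on the context and \(z_t\). All names, shapes, and hyperparameters (VideoCVAE, frame_dim, latent_dim, the KL weighting) are our own assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (not the authors' code) of a Transformer-based CVAE
# pre-trained with a video prediction objective. Frame embeddings are
# assumed to come from a separate visual encoder (e.g. a ViT); all
# names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoCVAE(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=64, n_future=4,
                 n_heads=8, n_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(frame_dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(frame_dim, n_heads, batch_first=True)
        # Dynamics encoder: reads context + future embeddings and infers z_t.
        self.dynamics_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_stats = nn.Linear(frame_dim, 2 * latent_dim)   # mean, log-variance
        # Decoder: predicts future embeddings from context and z_t.
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.z_proj = nn.Linear(latent_dim, frame_dim)
        self.queries = nn.Parameter(0.02 * torch.randn(1, n_future, frame_dim))
        self.head = nn.Linear(frame_dim, frame_dim)

    def forward(self, context, future):
        # context: (B, T_ctx, D) embeddings; future: (B, n_future, D)
        h = self.dynamics_encoder(torch.cat([context, future], dim=1))
        mean, logvar = self.to_stats(h.mean(dim=1)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterize
        q = self.queries.expand(context.size(0), -1, -1)
        dec_in = torch.cat([context, self.z_proj(z).unsqueeze(1), q], dim=1)
        pred = self.head(self.decoder(dec_in)[:, -q.size(1):])
        recon = F.mse_loss(pred, future)                     # prediction loss
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        return recon + 1e-3 * kl                             # ELBO-style objective

# Example: 8 context frames, predicting 4 future frames.
model = VideoCVAE()
loss = model(torch.randn(2, 8, 512), torch.randn(2, 4, 512))
loss.backward()
```

After pre-training on this objective, the latent \(z_t\) serves as the visual dynamics representation that downstream online adaptation aligns with executable actions.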


Notes

  1. To maintain the pre-trained visual dynamics knowledge, we freeze the dynamics encoder except for the last layer, which generates the mean and variance of \(z_t\).

  2. Actor-critic algorithms require a critic network; we construct one by simply replacing the last layer of the action alignment module (a code sketch of both notes follows).
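Taken together, the two notes describe small surgical changes to the pre-trained model during online adaptation. Below is a hedged PyTorch sketch of both; the attribute names (dynamics_encoder, to_stats, action_alignment) are hypothetical stand-ins for the corresponding PVDR modules, and we assume the action alignment module ends in a linear layer.

```python
# Hypothetical sketch of the two notes above; the attribute names
# (dynamics_encoder, to_stats) and the action_alignment structure are
# assumptions, not the authors' actual module names.
import copy
import torch.nn as nn

def freeze_dynamics_encoder(model: nn.Module) -> None:
    # Note 1: freeze the pre-trained dynamics encoder to preserve the
    # visual dynamics knowledge, keeping only the final layer that
    # outputs the mean and variance of z_t trainable.
    for p in model.dynamics_encoder.parameters():
        p.requires_grad = False
    for p in model.to_stats.parameters():
        p.requires_grad = True

def make_critic(action_alignment: nn.Sequential) -> nn.Sequential:
    # Note 2: construct the critic by copying the action alignment
    # module and replacing its last (assumed linear) layer with a
    # scalar value head.
    critic = copy.deepcopy(action_alignment)
    critic[-1] = nn.Linear(critic[-1].in_features, 1)
    return critic
```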


Acknowledgements

This work was supported by NSFC under grant 62250068.

Author information


Correspondence to Zongqing Lu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3843 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Luo, H., Zhou, B., Lu, Z. (2025). Pre-trained Visual Dynamics Representations for Efficient Policy Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_15


  • DOI: https://doi.org/10.1007/978-3-031-73004-7_15


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73003-0

  • Online ISBN: 978-3-031-73004-7

  • eBook Packages: Computer Science; Computer Science (R0)
