
Pre-trained Visual Dynamics Representations for Efficient Policy Learning

  • Conference paper: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15139)

Abstract

Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and embody a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder their use for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. Adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations that capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective approach to pre-training with videos for promoting policy learning.
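The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of a Transformer-based CVAE pre-trained with a video prediction objective, of the kind the abstract describes: a dynamics encoder infers a latent \(z_t\) from context and future frame embeddings, and a decoder predicts the future frames conditioned on the context and \(z_t\). All names, shapes, and hyperparameters (VideoCVAE, frame_dim, latent_dim, the KL weighting) are our own assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (not the authors' code) of a Transformer-based CVAE
# pre-trained with a video prediction objective. Frame embeddings are
# assumed to come from a separate visual encoder (e.g. a ViT); all
# names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoCVAE(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=64, n_future=4,
                 n_heads=8, n_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(frame_dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(frame_dim, n_heads, batch_first=True)
        # Dynamics encoder: reads context + future embeddings and infers z_t.
        self.dynamics_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_stats = nn.Linear(frame_dim, 2 * latent_dim)   # mean, log-variance
        # Decoder: predicts future embeddings from context and z_t.
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.z_proj = nn.Linear(latent_dim, frame_dim)
        self.queries = nn.Parameter(0.02 * torch.randn(1, n_future, frame_dim))
        self.head = nn.Linear(frame_dim, frame_dim)

    def forward(self, context, future):
        # context: (B, T_ctx, D) embeddings; future: (B, n_future, D)
        h = self.dynamics_encoder(torch.cat([context, future], dim=1))
        mean, logvar = self.to_stats(h.mean(dim=1)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterize
        q = self.queries.expand(context.size(0), -1, -1)
        dec_in = torch.cat([context, self.z_proj(z).unsqueeze(1), q], dim=1)
        pred = self.head(self.decoder(dec_in)[:, -q.size(1):])
        recon = F.mse_loss(pred, future)                     # prediction loss
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        return recon + 1e-3 * kl                             # ELBO-style objective

# Example: 8 context frames, predicting 4 future frames.
model = VideoCVAE()
loss = model(torch.randn(2, 8, 512), torch.randn(2, 4, 512))
loss.backward()
```

After pre-training on this objective, the latent \(z_t\) serves as the visual dynamics representation that downstream online adaptation aligns with executable actions.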


Notes

  1. To maintain the pre-trained visual dynamics knowledge, we freeze the dynamics encoder except for the last layer, which generates the mean and variance of \(z_t\).

  2. Actor-critic algorithms require a critic network; we construct one by simply replacing the last layer of the action alignment module (a code sketch of both notes follows).
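Taken together, the two notes describe small surgical changes to the pre-trained model during online adaptation. Below is a hedged PyTorch sketch of both; the attribute names (dynamics_encoder, to_stats, action_alignment) are hypothetical stand-ins for the corresponding PVDR modules, and we assume the action alignment module ends in a linear layer.

```python
# Hypothetical sketch of the two notes above; the attribute names
# (dynamics_encoder, to_stats) and the action_alignment structure are
# assumptions, not the authors' actual module names.
import copy
import torch.nn as nn

def freeze_dynamics_encoder(model: nn.Module) -> None:
    # Note 1: freeze the pre-trained dynamics encoder to preserve the
    # visual dynamics knowledge, keeping only the final layer that
    # outputs the mean and variance of z_t trainable.
    for p in model.dynamics_encoder.parameters():
        p.requires_grad = False
    for p in model.to_stats.parameters():
        p.requires_grad = True

def make_critic(action_alignment: nn.Sequential) -> nn.Sequential:
    # Note 2: construct the critic by copying the action alignment
    # module and replacing its last (assumed linear) layer with a
    # scalar value head.
    critic = copy.deepcopy(action_alignment)
    critic[-1] = nn.Linear(critic[-1].in_features, 1)
    return critic
```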


Acknowledgements

This work was supported by NSFC under grant 62250068.

Author information


Correspondence to Zongqing Lu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3843 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Luo, H., Zhou, B., Lu, Z. (2025). Pre-trained Visual Dynamics Representations for Efficient Policy Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_15


  • DOI: https://doi.org/10.1007/978-3-031-73004-7_15


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73003-0

  • Online ISBN: 978-3-031-73004-7

  • eBook Packages: Computer Science; Computer Science (R0)
