Abstract
Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and carry a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder their use for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR), which bridge the domain gap between videos and downstream tasks for efficient policy learning. Adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations that capture the prior knowledge of visual dynamics in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotic visual control tasks and verify that PVDR is an effective form of pre-training with videos for promoting policy learning.
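To make the abstract's recipe concrete, below is a minimal PyTorch sketch of a Transformer-based CVAE trained on video prediction: a context encoder summarizes observed frames, a posterior encoder over context plus future frames produces the dynamics latent \(z\), and a decoder reconstructs the future frames from the context summary and \(z\). This is an illustration under stated assumptions, not the authors' implementation; all module names, sizes, the flattened-frame embedding, and the \(\beta\) weight are invented for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoCVAE(nn.Module):
    """Illustrative Transformer-based CVAE for video prediction.

    Frames are flattened and linearly embedded as a stand-in for a
    real visual tokenizer (e.g. a ViT or VQ-VAE encoder).
    """

    def __init__(self, frame_dim=3 * 64 * 64, d_model=256, z_dim=32,
                 n_layers=4, n_heads=8, max_len=64):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.context_enc = encoder()   # summarizes observed frames
        self.dynamics_enc = encoder()  # posterior over the dynamics latent
        self.to_mu = nn.Linear(d_model, z_dim)
        self.to_logvar = nn.Linear(d_model, z_dim)
        # one learned query per future step to be predicted
        self.queries = nn.Parameter(0.02 * torch.randn(max_len, d_model))
        self.decode = nn.Sequential(
            nn.Linear(2 * d_model + z_dim, d_model),
            nn.ReLU(),
            nn.Linear(d_model, frame_dim),
        )

    def _tokens(self, frames):
        x = self.embed(frames)              # (B, T, d_model)
        return x + self.pos[:, :x.size(1)]  # add learned positions

    def forward(self, context, future, beta=0.1):
        # context: (B, Tc, frame_dim); future: (B, Tf, frame_dim)
        B, Tf, _ = future.shape
        c = self.context_enc(self._tokens(context)).mean(dim=1)
        h = self.dynamics_enc(
            self._tokens(torch.cat([context, future], dim=1))).mean(dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        cz = torch.cat([c, z], dim=-1).unsqueeze(1).expand(B, Tf, -1)
        q = self.queries[:Tf].unsqueeze(0).expand(B, Tf, -1)
        pred = self.decode(torch.cat([q, cz], dim=-1))         # future frames
        recon = F.mse_loss(pred, future)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon + beta * kl                               # ELBO-style loss


# toy usage: 4 context and 4 future frames of flattened 64x64 RGB, batch of 2
model = VideoCVAE()
ctx, fut = torch.randn(2, 4, 3 * 64 * 64), torch.randn(2, 4, 3 * 64 * 64)
loss = model(ctx, fut)
loss.backward()
```

Training such a model on action-free clips yields the latent \(z\) as the visual dynamics representation; the downstream stage then adapts this latent space and aligns it with executable actions, as the abstract describes.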
Notes
- 1. To maintain the pre-trained visual dynamics knowledge, we freeze the dynamics encoder except for the last layer that generates the mean and variance of \(z_t\) (see the first sketch following these notes).
- 2. Actor-critic algorithms require a critic network, and we simply replace the last layer of the action alignment module to construct a critic network (see the second sketch following these notes).
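Footnote 1 describes partially freezing the pre-trained dynamics encoder. A minimal PyTorch sketch of that pattern, reusing the `VideoCVAE` sketch from the abstract and assuming the mean/variance layers are exposed as `to_mu` and `to_logvar` (attribute names invented here):

```python
import torch

# the pre-trained dynamics encoder (VideoCVAE sketch from above)
encoder = VideoCVAE()

# freeze everything, then re-enable only the latent head that
# produces the mean and (log-)variance of z_t
for p in encoder.parameters():
    p.requires_grad = False
for head in (encoder.to_mu, encoder.to_logvar):
    for p in head.parameters():
        p.requires_grad = True

# hand only the still-trainable parameters to the optimizer
optimizer = torch.optim.Adam(
    (p for p in encoder.parameters() if p.requires_grad), lr=1e-4)
```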
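Footnote 2 describes building the critic by replacing the last layer of the action alignment module. Assuming that module ends in a linear layer mapping features to action outputs (the paper's exact architecture is not shown on this page), the swap could look like:

```python
import copy

import torch.nn as nn


def make_critic(alignment: nn.Sequential) -> nn.Sequential:
    """Copy an action alignment module and swap its final linear
    layer (features -> action outputs) for a scalar value head."""
    critic = copy.deepcopy(alignment)
    last = critic[-1]
    assert isinstance(last, nn.Linear), "sketch assumes a Linear last layer"
    critic[-1] = nn.Linear(last.in_features, 1)  # scalar state value
    return critic


# toy usage with an assumed two-layer alignment module (sizes invented)
actor = nn.Sequential(nn.Linear(288, 256), nn.ReLU(), nn.Linear(256, 4))
critic = make_critic(actor)
```

Deep-copying before the swap leaves the actor's alignment module untouched while the critic inherits its pre-trained feature layers.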
Acknowledgements
This work was supported by NSFC under grant 62250068.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Luo, H., Zhou, B., Lu, Z. (2025). Pre-trained Visual Dynamics Representations for Efficient Policy Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_15