Reinforcement Learning Friendly Vision-Language Model for Minecraft

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

One of the essential missions in the AI research community is to build an autonomous embodied agent that can achieve high-level performance across a wide spectrum of tasks. However, acquiring or manually designing rewards for all open-ended tasks is unrealistic. In this paper, we propose CLIP4MC, a novel cross-modal contrastive learning framework that learns a reinforcement learning (RL) friendly vision-language model (VLM) to serve as an intrinsic reward function for open-ended tasks. Simply using the similarity between a video snippet and a language prompt is not RL-friendly, since standard VLMs capture this similarity only at a coarse level. To achieve RL-friendliness, we incorporate the task completion degree into the VLM training objective, as this information helps agents distinguish the importance of different states. Moreover, we provide clean YouTube datasets built from the large-scale YouTube database provided by MineDojo. Specifically, two rounds of filtering ensure that the dataset covers the essential information and that each video-text pair is highly correlated. Empirically, we demonstrate that the proposed method achieves better performance on RL tasks than baselines. The code and datasets are available at https://github.com/PKU-RL/CLIP4MC.
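The mechanism sketched in the abstract can be made concrete. Below is a minimal PyTorch sketch, assuming CLIP-style video and text encoders, of (a) using video-prompt similarity as an intrinsic reward and (b) an InfoNCE-style loss in which each positive pair is weighted by its task completion degree. All names here (intrinsic_reward, completion_weighted_nce, the encoder interfaces) are hypothetical illustrations of the idea, not the CLIP4MC API; the actual implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(video_encoder, text_encoder, frames, prompt_tokens):
    """Score a snippet of recent observations against a task prompt.

    frames:        (T, C, H, W) tensor of recent frames
    prompt_tokens: tokenized language prompt, e.g. "shear a sheep"
    Returns a scalar in [-1, 1]: cosine similarity between the snippet
    embedding and the prompt embedding, usable as an intrinsic reward.
    """
    with torch.no_grad():
        v = F.normalize(video_encoder(frames.unsqueeze(0)), dim=-1)  # (1, D)
        t = F.normalize(text_encoder(prompt_tokens), dim=-1)         # (1, D)
    return (v * t).sum(dim=-1).item()

def completion_weighted_nce(video_emb, text_emb, completion, temperature=0.07):
    """InfoNCE over a batch of (video, text) pairs, with each positive pair
    weighted by its task completion degree in [0, 1]. The weighting is an
    assumption about the spirit of the objective, not the exact CLIP4MC loss.

    video_emb, text_emb: (B, D) L2-normalized embeddings
    completion:          (B,) completion degree per video snippet
    """
    logits = video_emb @ text_emb.t() / temperature        # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    per_pair = F.cross_entropy(logits, labels, reduction="none")  # (B,)
    return (completion * per_pair).mean()
```

One plausible reading of RL-friendliness under this sketch is that the completion weighting encourages the similarity score to grow as the agent approaches task completion, so the score provides a dense, progress-sensitive signal rather than a coarse matched/unmatched judgment.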

H. Jiang and J. Yue—Equal contribution.

Acknowledgements

This work was supported by NSFC under grant 62250068. The authors would like to thank the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Zongqing Lu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 804 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jiang, H., Yue, J., Luo, H., Ding, Z., Lu, Z. (2025). Reinforcement Learning Friendly Vision-Language Model for Minecraft. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15126. Springer, Cham. https://doi.org/10.1007/978-3-031-73113-6_1

  • DOI: https://doi.org/10.1007/978-3-031-73113-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73112-9

  • Online ISBN: 978-3-031-73113-6

  • eBook Packages: Computer Science, Computer Science (R0)
