Abstract
Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose the State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending over image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation learning settings. StARformer also handles longer input sequences better than the baseline. Our code is available at https://github.com/elicassion/StARformer.
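To make the two-stage attention described above concrete, the sketch below is a minimal, hypothetical PyTorch rendering, not the authors' released implementation: all module and parameter names are illustrative assumptions, and details such as causal masking, return-to-go conditioning, and the exact token fusion are omitted. A first Transformer self-attends over patch, action, and reward tokens within each timestep to produce a StAR token; a second Transformer then attends across time over these StAR tokens interleaved with convolutional state features.

```python
# Hypothetical sketch of StARformer-style two-stage attention (not the official code).
import torch
import torch.nn as nn


class StARBlockSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4, patch=16, img_size=84, n_actions=18):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Step-level tokens: image patches + action + reward within one timestep.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.reward_embed = nn.Linear(1, d_model)
        self.step_pos = nn.Parameter(torch.zeros(1, n_patches + 2, d_model))
        step_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.step_attn = nn.TransformerEncoder(step_layer, num_layers=1)
        # Sequence-level branch: convolutional state features attended over time.
        self.state_conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        seq_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.seq_attn = nn.TransformerEncoder(seq_layer, num_layers=1)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, states, actions, rewards):
        # states: (B, T, 3, H, W); actions: (B, T) long; rewards: (B, T) float.
        B, T = actions.shape
        imgs = states.flatten(0, 1)                                  # (B*T, 3, H, W)
        patches = self.patch_embed(imgs).flatten(2).transpose(1, 2)  # (B*T, P, D)
        a = self.action_embed(actions).flatten(0, 1).unsqueeze(1)    # (B*T, 1, D)
        r = self.reward_embed(rewards.reshape(-1, 1)).unsqueeze(1)   # (B*T, 1, D)
        # Short-term attention within a single step yields one StAR token per step.
        step_tokens = torch.cat([patches, a, r], dim=1) + self.step_pos
        star = self.step_attn(step_tokens).mean(dim=1)               # (B*T, D)
        conv = self.state_conv(imgs)                                 # (B*T, D)
        # Long-term attention over interleaved StAR and conv state tokens.
        seq = torch.stack([star, conv], dim=1).reshape(B, 2 * T, self.head.in_features)
        out = self.seq_attn(seq)
        return self.head(out[:, 1::2])                               # (B, T, n_actions)
```

As a usage illustration, calling `StARBlockSketch()(states, actions, rewards)` with `states` of shape (B, T, 3, 84, 84), integer `actions` of shape (B, T), and float `rewards` of shape (B, T) returns one action logit vector per timestep.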
Acknowledgements
We thank members of the Robotics Lab at Stony Brook for valuable discussions. This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Ministry of Science and ICT (No. 2018-0-00205, Development of Core Technology of Robot Task-Intelligence for Improvement of Labor Condition).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shang, J., Kahatapitiya, K., Li, X., Ryoo, M.S. (2022). StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_27
DOI: https://doi.org/10.1007/978-3-031-19842-7_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer Science, Computer Science (R0)