
StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Reinforcement Learning (RL) can be considered a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation learning settings. StARformer also scales better to longer input sequences. Our code is available at https://github.com/elicassion/StARformer.
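To make the two-stream design described above concrete, the sketch below shows one way the idea could be wired up in PyTorch: a step-level Transformer self-attends over patch, action, and reward tokens inside each timestep to form StAR tokens, a small ConvNet produces a pure image state feature per frame, and a causal sequence-level Transformer attends over the interleaved tokens to predict actions. This is only an illustrative reading of the abstract, not the authors' implementation (which is at https://github.com/elicassion/StARformer); the class name StARformerSketch, the token layout, the mean-pooling, and all layer sizes are assumptions.

```python
# Illustrative sketch only (NOT the authors' code); module names, pooling,
# and all dimensions are assumptions made for readability.
import torch
import torch.nn as nn


class StARformerSketch(nn.Module):
    def __init__(self, img_size=84, patch=14, in_ch=4, n_actions=18, d=128, max_len=10):
        super().__init__()
        # Step-level ("StAR") stream: patch, action, and reward tokens per timestep.
        self.patch_embed = nn.Conv2d(in_ch, d, kernel_size=patch, stride=patch)
        self.action_embed = nn.Embedding(n_actions, d)
        self.reward_embed = nn.Linear(1, d)
        step_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.step_transformer = nn.TransformerEncoder(step_layer, num_layers=2)
        # Pure-state stream: convolutional features, one vector per frame.
        self.state_conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d),
        )
        # Sequence-level stream: causal self-attention over the whole trajectory.
        seq_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.seq_transformer = nn.TransformerEncoder(seq_layer, num_layers=2)
        self.pos = nn.Parameter(torch.zeros(1, 2 * max_len, d))
        self.action_head = nn.Linear(d, n_actions)

    def forward(self, states, actions, rewards):
        # states: (B, T, C, H, W), actions: (B, T) long, rewards: (B, T) float
        B, T = actions.shape
        # 1) Self-attend [patches, action, reward] within each short (one-step) window.
        p = self.patch_embed(states.flatten(0, 1)).flatten(2).transpose(1, 2)   # (B*T, N, d)
        a = self.action_embed(actions).flatten(0, 1).unsqueeze(1)               # (B*T, 1, d)
        r = self.reward_embed(rewards.flatten(0, 1).unsqueeze(-1)).unsqueeze(1)
        star = self.step_transformer(torch.cat([p, a, r], dim=1))
        star = star.mean(dim=1).view(B, T, -1)             # (B, T, d) StAR-representations
        # 2) Pure image state representations from the ConvNet.
        conv = self.state_conv(states.flatten(0, 1)).view(B, T, -1)             # (B, T, d)
        # 3) Interleave [star_t, conv_t] and attend causally over the whole sequence.
        seq = torch.stack([star, conv], dim=2).view(B, 2 * T, -1) + self.pos[:, : 2 * T]
        mask = torch.triu(torch.full((2 * T, 2 * T), float("-inf")), diagonal=1)
        out = self.seq_transformer(seq, mask=mask)
        return self.action_head(out[:, 1::2])               # per-step action logits (B, T, A)


# Example: 2 trajectories of 10 steps of stacked 84x84 Atari-style frames.
model = StARformerSketch()
logits = model(torch.randn(2, 10, 4, 84, 84),
               torch.randint(0, 18, (2, 10)),
               torch.randn(2, 10))
print(logits.shape)  # torch.Size([2, 10, 18])
```

In this reading, the step-level attention carries the Markovian-like inductive bias (each StAR token only sees its own timestep), while the sequence-level attention handles long-term dependencies; the actual token fusion and readout in the paper may differ.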



Acknowledgements

We thank members of the Robotics Lab at Stony Brook for valuable discussions. This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Ministry of Science and ICT (No. 2018-0-00205, Development of Core Technology of Robot Task-Intelligence for Improvement of Labor Condition).

Author information


Corresponding author

Correspondence to Jinghuan Shang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1437 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shang, J., Kahatapitiya, K., Li, X., Ryoo, M.S. (2022). StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_27


  • DOI: https://doi.org/10.1007/978-3-031-19842-7_27


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
