
Efficient Deep Reinforcement Learning via Policy-Extended Successor Feature Approximator

  • Conference paper
  • First Online:
Distributed Artificial Intelligence (DAI 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13824)

Abstract

Successor Features (SFs) improve the generalization of reinforcement learning across unseen tasks by decoupling the dynamics of the environment from the rewards. However, the decomposition depends heavily on the policy learned on the original task, which may not be optimal for other tasks. To improve the generalization of SFs, in this paper we propose a novel SF learning paradigm, the Policy-extended Successor Feature Approximator (PeSFA), which decouples the SFs from the policy by learning a policy representation module and feeding the policy representation into the SFs. In this way, once the SFs are fitted well over the policy representation space, we can directly obtain better SFs for any task by searching that space. Experimental results show that PeSFA significantly improves the generalizability of SFs and accelerates learning in two representative environments.

References

  1. Alegre, L.N., Bazzan, A.L.C., da Silva, B.C.: Optimistic linear support and successor features as a basis for optimal policy transfer. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, Baltimore, Maryland, USA, 17–23 July 2022. Proceedings of Machine Learning Research, vol. 162, pp. 394–413. PMLR (2022)

  2. Alver, S., Precup, D.: Constructing a good behavior basis for transfer using generalized policy updates. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)

  3. Barreto, A., et al.: Successor features for transfer in reinforcement learning. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 4055–4065 (2017)

  4. Borsa, D., et al.: Universal successor features approximators. CoRR abs/1812.07626 (2018)

  5. Ellenberger, B.: PyBullet Gymperium (2018–2019)

  6. Feinberg, A.: Markov decision processes: discrete stochastic dynamic programming (Martin L. Puterman). SIAM Rev. 38(4), 689 (1996)

  7. Filos, A., Lyle, C., Gal, Y., Levine, S., Jaques, N., Farquhar, G.: PsiPhi-learning: reinforcement learning with demonstrations using successor features and inverse temporal difference learning. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 3305–3317. PMLR (2021)

  8. Gimelfarb, M., Barreto, A., Sanner, S., Lee, C.: Risk-aware transfer in reinforcement learning using successor features. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December 2021, Virtual, pp. 17298–17310 (2021)

  9. Han, D., Tschiatschek, S.: Option transfer and SMDP abstraction with successor features. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022, pp. 3036–3042. ijcai.org (2022)

  10. Hansen, S., Dabney, W., Barreto, A., Warde-Farley, D., de Wiele, T.V., Mnih, V.: Fast task inference with variational intrinsic successor features. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)

  11. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016, Conference Track Proceedings (2016)

  12. Liu, H., Abbeel, P.: APS: active pretraining with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 6736–6747. PMLR (2021)

  13. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  14. Nemecek, M.W., Parr, R.: Policy caches with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8025–8033. PMLR (2021)

  15. Raileanu, R., Goldstein, M., Szlam, A., Fergus, R.: Fast adaptation to new environments via policy-dynamics value functions. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 7920–7931. PMLR (2020)

  16. Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1312–1320. JMLR.org (2015)

  17. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  18. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)

  19. Tang, H., et al.: What about inputting policy in value function: policy representation and policy-extended value function approximator. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022, pp. 8441–8449. AAAI Press (2022)

  20. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, 1633–1685 (2009). https://dl.acm.org/doi/10.5555/1577069.1755839

  21. Yang, T., et al.: Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3094–3100 (2020)

  22. Zhu, Z., Lin, K., Zhou, J.: Transfer learning in deep reinforcement learning: a survey. CoRR abs/2009.07888 (2020)

Author information

Correspondence to Tianpei Yang or Jianye Hao.

Appendices

A Training PeSFA

In this section, we show how to train \(\tilde{\psi }(s,a,\chi _\pi )\) in an on-policy manner based on the Sarsa method.

Algorithm 1 shows the overall process of training PeSFA. First, at the beginning of a task, we reset the environment and select an action for interacting with the environment (lines 4–6). The action is executed, the resulting transition information is obtained from the environment, and the next action to execute is selected before the policy is updated (lines 9–12). After updating PeSFA, for the reasons described in Sect. 4.1, the state-action pairs \(\omega _{\pi '}\) corresponding to the new policy must be recalculated according to Eq. 18 (lines 15–16). We also search for a better policy in the policy representation space according to Eq. 19, and the resulting \(\chi _{opt}\) is chosen as the initial policy for subsequent training, which improves sample efficiency and exploration in the policy representation space. Finally, we select the optimized policy representation found in the representation space as described in Sect. 4.3.

Algorithm 1. Training PeSFA (pseudocode figure not reproduced here)
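
Since the pseudocode figure is not reproduced here, the following is a minimal Python/PyTorch sketch of the training loop as described: a policy representation module encodes state-action pairs \(\omega _\pi \) into \(\chi _\pi \), the SF approximator is updated with a Sarsa-style TD target, and a gradient search over the representation space looks for a better \(\chi _{opt}\). The class names, network sizes, and the specific encoder and search procedure are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Sarsa-style PeSFA update loop (cf. Algorithm 1).
import torch
import torch.nn as nn

GAMMA, LR, SEARCH_LR = 0.99, 1e-3, 1e-2
STATE_DIM, ACTION_DIM, PHI_DIM, CHI_DIM = 8, 4, 4, 16   # assumed dimensions

class PolicyEncoder(nn.Module):
    """Encodes a set of state-action pairs omega_pi into a policy representation chi_pi."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, CHI_DIM))

    def forward(self, omega):                                # omega: (N, STATE_DIM + ACTION_DIM)
        return self.net(omega).mean(dim=0, keepdim=True)     # permutation-invariant pooling

class PeSFA(nn.Module):
    """SF approximator psi~(s, a, chi_pi) conditioned on the policy representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + CHI_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, PHI_DIM))

    def forward(self, s, a, chi):
        return self.net(torch.cat([s, a, chi.expand(s.shape[0], -1)], dim=-1))

encoder, psi = PolicyEncoder(), PeSFA()
opt = torch.optim.Adam(list(encoder.parameters()) + list(psi.parameters()), lr=LR)

def sarsa_update(omega_pi, s, a, phi_t, s_next, a_next, done):
    """One on-policy TD step: target = phi + gamma * psi(s', a', chi_pi)."""
    chi = encoder(omega_pi)                                  # chi_pi recomputed for the current policy
    with torch.no_grad():
        target = phi_t + GAMMA * (1.0 - done) * psi(s_next, a_next, chi)
    loss = ((psi(s, a, chi) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def search_policy_representation(chi_init, s, a, w, steps=10):
    """Gradient search in the representation space for a chi that maximizes psi^T w."""
    chi_opt = chi_init.detach().clone().requires_grad_(True)
    search_opt = torch.optim.Adam([chi_opt], lr=SEARCH_LR)
    for _ in range(steps):
        q = (psi(s, a, chi_opt) @ w.t()).mean()              # evaluate under the task weight w
        search_opt.zero_grad(); (-q).backward(); search_opt.step()
    return chi_opt.detach()
```

The returned \(\chi _{opt}\) would then serve as the initial policy representation for subsequent training, as described above.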

B Additional Experimental Details

Our code is implemented with Python 3.6.9 and PyTorch 1.11.0. All experiments were run on a single NVIDIA GeForce GTX 1660 Ti GPU. The hyperparameters used in the Grid World and Reacher experiments are shown in Table 1, and the task weights are shown in Table 2.

The first experimental environment is a navigation task in Grid World, a two-dimensional discrete space consisting of four rooms. The agent starts from a location in one room and needs to reach a goal point in another room; along the way it can collect objects and obtain their corresponding rewards by passing over them, similarly to [3, 8]. Each object belongs to one of three types, and each type has a specific reward. The locations of the objects are the same for all tasks, but the reward of each object type varies with the task. The goal is to maximize the cumulative reward over tasks. The features \(\phi \) and weights \(\textrm{w}\) are constructed by hand so as to satisfy the reward function in Eq. 2: \(\phi \in \mathbb {R}^4\) indicates whether an object of a particular type is collected in that transition, and \(\textrm{w}\in \mathbb {R}^4\) gives the reward corresponding to each object type.
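
As a concrete illustration of this decomposition, the snippet below computes the reward as \(\phi ^\top \textrm{w}\) with an indicator-style \(\phi \); the exact feature encoding and the example task weight are assumptions, not the values used in the paper (cf. Table 2).

```python
# Illustrative sketch of the hand-constructed Grid World decomposition (Eq. 2).
import numpy as np

NUM_TYPES = 4                                    # dimensionality of phi and w (assumed)

def phi_gridworld(collected_type=None):
    """Indicator feature: which object type, if any, was collected in the transition."""
    phi = np.zeros(NUM_TYPES)
    if collected_type is not None:
        phi[collected_type] = 1.0
    return phi

def reward(phi, w):
    """Reward decomposition r(s, a, s') = phi(s, a, s')^T w."""
    return float(phi @ w)

w_task = np.array([1.0, -1.0, 0.5, 0.0])         # hypothetical task weight
print(reward(phi_gridworld(2), w_task))           # collecting a type-2 object yields 0.5
```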

The second environment has a continuous state space and is built on the PyBullet physics engine [5]; the agent controls a robotic arm to reach preferred target locations, as done in [3, 8]. In each task, the degree of preference for each target locus is controlled by the task weight \(\textrm{w}\), while each component of \(\phi \in \mathbb {R}^4\) is the negative Euclidean distance from the robotic arm's tip to the corresponding target locus, plus one.
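
For completeness, here is a minimal sketch of this feature construction; the target positions and tip coordinates are hypothetical placeholders.

```python
# Sketch of the Reacher features: phi_i = 1 - ||tip - target_i||_2 for each target locus.
import numpy as np

def phi_reacher(tip_pos, target_positions):
    """One feature per target locus: one minus the Euclidean distance from the tip."""
    return np.array([1.0 - np.linalg.norm(tip_pos - t) for t in target_positions])

targets = [np.array([0.1, 0.2]), np.array([-0.2, 0.1]),
           np.array([0.0, -0.3]), np.array([0.25, 0.0])]    # hypothetical target loci
phi = phi_reacher(np.array([0.05, 0.05]), targets)
# As in the Grid World, the reward for a task with weight w is r = phi @ w.
```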

Table 1. PeSFA’s hyperparameters per environment.
Table 2. Task weight per environment.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Yang, T., Hao, J., Zheng, Y., Tang, H. (2023). Efficient Deep Reinforcement Learning via Policy-Extended Successor Feature Approximator. In: Yokoo, M., Qiao, H., Vorobeychik, Y., Hao, J. (eds) Distributed Artificial Intelligence. DAI 2022. Lecture Notes in Computer Science, vol 13824. Springer, Cham. https://doi.org/10.1007/978-3-031-25549-6_3

  • DOI: https://doi.org/10.1007/978-3-031-25549-6_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25548-9

  • Online ISBN: 978-3-031-25549-6

  • eBook Packages: Computer Science, Computer Science (R0)
