
Efficient Deep Reinforcement Learning via Policy-Extended Successor Feature Approximator

  • Conference paper
  • First Online:
Distributed Artificial Intelligence (DAI 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13824)

Abstract

Successor Features (SFs) improve the generalization of reinforcement learning across unseen tasks by decoupling the dynamics of the environment from the rewards. However, the decomposition depends heavily on the policy learned on the original task, which may not be optimal for other tasks. To improve the generalization of SFs, in this paper we propose a novel SF learning paradigm, the Policy-extended Successor Feature Approximator (PeSFA), which decouples the SFs from the policy by learning a policy representation module and feeding the policy representation into the SFs. In this way, once the SFs are fitted well over the policy representation space, we can directly obtain better SFs for any task by searching that space. Experimental results show that PeSFA significantly improves the generalizability of SFs and accelerates learning in two representative environments.

References

  1. Alegre, L.N., Bazzan, A.L.C., da Silva, B.C.: Optimistic linear support and successor features as a basis for optimal policy transfer. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, Baltimore, Maryland, USA, 17–23 July 2022. Proceedings of Machine Learning Research, vol. 162, pp. 394–413. PMLR (2022)

  2. Alver, S., Precup, D.: Constructing a good behavior basis for transfer using generalized policy updates. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)

  3. Barreto, A., et al.: Successor features for transfer in reinforcement learning. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 4055–4065 (2017)

  4. Borsa, D., et al.: Universal successor features approximators. CoRR abs/1812.07626 (2018)

  5. Ellenberger, B.: PyBullet Gymperium (2018–2019)

  6. Feinberg, A.: Markov decision processes: discrete stochastic dynamic programming (Martin L. Puterman). SIAM Rev. 38(4), 689 (1996)

  7. Filos, A., Lyle, C., Gal, Y., Levine, S., Jaques, N., Farquhar, G.: PsiPhi-learning: reinforcement learning with demonstrations using successor features and inverse temporal difference learning. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 3305–3317. PMLR (2021)

  8. Gimelfarb, M., Barreto, A., Sanner, S., Lee, C.: Risk-aware transfer in reinforcement learning using successor features. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December 2021, Virtual, pp. 17298–17310 (2021)

  9. Han, D., Tschiatschek, S.: Option transfer and SMDP abstraction with successor features. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022, pp. 3036–3042. ijcai.org (2022)

  10. Hansen, S., Dabney, W., Barreto, A., Warde-Farley, D., de Wiele, T.V., Mnih, V.: Fast task inference with variational intrinsic successor features. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)

  11. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016, Conference Track Proceedings (2016)

  12. Liu, H., Abbeel, P.: APS: active pretraining with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 6736–6747. PMLR (2021)

  13. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  14. Nemecek, M.W., Parr, R.: Policy caches with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8025–8033. PMLR (2021)

  15. Raileanu, R., Goldstein, M., Szlam, A., Fergus, R.: Fast adaptation to new environments via policy-dynamics value functions. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 7920–7931. PMLR (2020)

  16. Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1312–1320. JMLR.org (2015)

  17. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  18. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)

  19. Tang, H., et al.: What about inputting policy in value function: policy representation and policy-extended value function approximator. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022, pp. 8441–8449. AAAI Press (2022)

  20. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, 1633–1685 (2009). https://dl.acm.org/doi/10.5555/1577069.1755839

  21. Yang, T., et al.: Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3094–3100 (2020)

  22. Zhu, Z., Lin, K., Zhou, J.: Transfer learning in deep reinforcement learning: a survey. CoRR abs/2009.07888 (2020)

Author information

Correspondence to Tianpei Yang or Jianye Hao.

Appendices

A Training PeSFA

In this section, we show how to train \(\tilde{\psi }(s,a,\chi _\pi )\) in an on-policy manner based on the Sarsa method.

Algorithm 1 shows the overall process of training PeSFA. First, at the beginning of a task, we reset the environment and select an action for interacting with the environment (lines 4–6). The action is executed, the resulting transition information is obtained from the environment, and the next action to execute is selected before the policy is updated (lines 9–12). After updating PeSFA, for the reasons described in Sect. 4.1, the state-action pairs \(\omega _{\pi '}\) corresponding to the new policy must be recalculated according to Eq. 18 (lines 15–16). We also search for a better policy in the policy representation space according to Eq. 19, and the resulting \(\chi _{opt}\) is chosen as the initial policy for subsequent training, which improves sample efficiency and exploration in the policy representation space. Finally, we select the optimized policy representation found in the representation space as described in Sect. 4.3.

Algorithm 1. Training PeSFA (pseudocode figure not reproduced here)
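
Since the pseudocode figure is not reproduced here, the following is a minimal Python/PyTorch sketch of the training loop as described: a policy representation module encodes state-action pairs \(\omega _\pi \) into \(\chi _\pi \), the SF approximator is updated with a Sarsa-style TD target, and a gradient search over the representation space looks for a better \(\chi _{opt}\). The class names, network sizes, and the specific encoder and search procedure are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Sarsa-style PeSFA update loop (cf. Algorithm 1).
import torch
import torch.nn as nn

GAMMA, LR, SEARCH_LR = 0.99, 1e-3, 1e-2
STATE_DIM, ACTION_DIM, PHI_DIM, CHI_DIM = 8, 4, 4, 16   # assumed dimensions

class PolicyEncoder(nn.Module):
    """Encodes a set of state-action pairs omega_pi into a policy representation chi_pi."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, CHI_DIM))

    def forward(self, omega):                                # omega: (N, STATE_DIM + ACTION_DIM)
        return self.net(omega).mean(dim=0, keepdim=True)     # permutation-invariant pooling

class PeSFA(nn.Module):
    """SF approximator psi~(s, a, chi_pi) conditioned on the policy representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + CHI_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, PHI_DIM))

    def forward(self, s, a, chi):
        return self.net(torch.cat([s, a, chi.expand(s.shape[0], -1)], dim=-1))

encoder, psi = PolicyEncoder(), PeSFA()
opt = torch.optim.Adam(list(encoder.parameters()) + list(psi.parameters()), lr=LR)

def sarsa_update(omega_pi, s, a, phi_t, s_next, a_next, done):
    """One on-policy TD step: target = phi + gamma * psi(s', a', chi_pi)."""
    chi = encoder(omega_pi)                                  # chi_pi recomputed for the current policy
    with torch.no_grad():
        target = phi_t + GAMMA * (1.0 - done) * psi(s_next, a_next, chi)
    loss = ((psi(s, a, chi) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def search_policy_representation(chi_init, s, a, w, steps=10):
    """Gradient search in the representation space for a chi that maximizes psi^T w."""
    chi_opt = chi_init.detach().clone().requires_grad_(True)
    search_opt = torch.optim.Adam([chi_opt], lr=SEARCH_LR)
    for _ in range(steps):
        q = (psi(s, a, chi_opt) @ w.t()).mean()              # evaluate under the task weight w
        search_opt.zero_grad(); (-q).backward(); search_opt.step()
    return chi_opt.detach()
```

The returned \(\chi _{opt}\) would then serve as the initial policy representation for subsequent training, as described above.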

B Additional Experimental Details

Our code is implemented with Python 3.6.9 and PyTorch 1.11.0. All experiments were run on a single NVIDIA GeForce GTX 1660 Ti GPU. The hyperparameters used in the Grid World and Reacher experiments are shown in Table 1, and the task weights are shown in Table 2.

The first experimental environment is a navigation task in Grid World, a two-dimensional discrete space consisting of four rooms. The agent starts from a location in one room and needs to reach a goal point in another room; along the way it can collect objects and obtain their corresponding rewards by passing over them, similarly to [3, 8]. Each object belongs to one of three types, and each type has a specific reward. The locations of the objects are the same for all tasks, but the reward of each object type varies with the task. The goal is to maximize the cumulative reward over tasks. The features \(\phi \) and weights \(\textrm{w}\) are constructed by hand so as to satisfy the reward function in Eq. 2: \(\phi \in \mathbb {R}^4\) indicates whether an object of a particular type is collected in that transition, and \(\textrm{w}\in \mathbb {R}^4\) gives the reward corresponding to each object type.
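
As a concrete illustration of this decomposition, the snippet below computes the reward as \(\phi ^\top \textrm{w}\) with an indicator-style \(\phi \); the exact feature encoding and the example task weight are assumptions, not the values used in the paper (cf. Table 2).

```python
# Illustrative sketch of the hand-constructed Grid World decomposition (Eq. 2).
import numpy as np

NUM_TYPES = 4                                    # dimensionality of phi and w (assumed)

def phi_gridworld(collected_type=None):
    """Indicator feature: which object type, if any, was collected in the transition."""
    phi = np.zeros(NUM_TYPES)
    if collected_type is not None:
        phi[collected_type] = 1.0
    return phi

def reward(phi, w):
    """Reward decomposition r(s, a, s') = phi(s, a, s')^T w."""
    return float(phi @ w)

w_task = np.array([1.0, -1.0, 0.5, 0.0])         # hypothetical task weight
print(reward(phi_gridworld(2), w_task))           # collecting a type-2 object yields 0.5
```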

The second environment has a continuous state space and is built on the PyBullet physics engine [5]; the agent controls a robotic arm to reach preferred target locations, as done in [3, 8]. In each task, the degree of preference for each target locus is controlled by the task weight \(\textrm{w}\), while each component of \(\phi \in \mathbb {R}^4\) is the negative Euclidean distance from the robotic arm's tip to the corresponding target locus, plus one.
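
For completeness, here is a minimal sketch of this feature construction; the target positions and tip coordinates are hypothetical placeholders.

```python
# Sketch of the Reacher features: phi_i = 1 - ||tip - target_i||_2 for each target locus.
import numpy as np

def phi_reacher(tip_pos, target_positions):
    """One feature per target locus: one minus the Euclidean distance from the tip."""
    return np.array([1.0 - np.linalg.norm(tip_pos - t) for t in target_positions])

targets = [np.array([0.1, 0.2]), np.array([-0.2, 0.1]),
           np.array([0.0, -0.3]), np.array([0.25, 0.0])]    # hypothetical target loci
phi = phi_reacher(np.array([0.05, 0.05]), targets)
# As in the Grid World, the reward for a task with weight w is r = phi @ w.
```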

Table 1. PeSFA’s hyperparameters per environment.
Table 2. Task weight per environment.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Yang, T., Hao, J., Zheng, Y., Tang, H. (2023). Efficient Deep Reinforcement Learning via Policy-Extended Successor Feature Approximator. In: Yokoo, M., Qiao, H., Vorobeychik, Y., Hao, J. (eds) Distributed Artificial Intelligence. DAI 2022. Lecture Notes in Computer Science, vol 13824. Springer, Cham. https://doi.org/10.1007/978-3-031-25549-6_3

  • DOI: https://doi.org/10.1007/978-3-031-25549-6_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25548-9

  • Online ISBN: 978-3-031-25549-6

  • eBook Packages: Computer Science, Computer Science (R0)
