Abstract
Successor Features (SFs) improve the generalization of reinforcement learning across unseen tasks by decoupling the dynamics of the environment from the rewards. However, the decomposition depends heavily on the policy learned on the training task, which may not be optimal for other tasks. To improve the generalization of SFs, we propose a novel SF learning paradigm, the Policy-extended Successor Feature Approximator (PeSFA), which decouples the SFs from the policy by learning a policy representation module and feeding the policy representation into the SFs. In this way, once the SFs are fitted well over the policy representation space, we can directly obtain better SFs for any task by searching that space. Experimental results show that PeSFA significantly improves the generalizability of SFs and accelerates learning in two representative environments.
References
Alegre, L.N., Bazzan, A.L.C., da Silva, B.C.: Optimistic linear support and successor features as a basis for optimal policy transfer. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, Baltimore, Maryland, USA, 17–23 July 2022. Proceedings of Machine Learning Research, vol. 162, pp. 394–413. PMLR (2022)
Alver, S., Precup, D.: Constructing a good behavior basis for transfer using generalized policy updates. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. OpenReview.net (2022)
Barreto, A., et al.: Successor features for transfer in reinforcement learning. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017, pp. 4055–4065 (2017)
Borsa, D., et al.: Universal successor features approximators. CoRR abs/1812.07626 (2018)
Ellenberger, B.: PyBullet Gymperium (2018–2019)
Feinberg, A.: Markov decision processes: discrete stochastic dynamic programming (Martin L. Puterman). SIAM Rev. 38(4), 689 (1996)
Filos, A., Lyle, C., Gal, Y., Levine, S., Jaques, N., Farquhar, G.: Psiphi-learning: reinforcement learning with demonstrations using successor features and inverse temporal difference learning. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 3305–3317. PMLR (2021)
Gimelfarb, M., Barreto, A., Sanner, S., Lee, C.: Risk-aware transfer in reinforcement learning using successor features. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December 2021, Virtual, pp. 17298–17310 (2021)
Han, D., Tschiatschek, S.: Option transfer and SMDP abstraction with successor features. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022, pp. 3036–3042. ijcai.org (2022)
Hansen, S., Dabney, W., Barreto, A., Warde-Farley, D., de Wiele, T.V., Mnih, V.: Fast task inference with variational intrinsic successor features. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. OpenReview.net (2020)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016, Conference Track Proceedings (2016)
Liu, H., Abbeel, P.: APS: active pretraining with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 6736–6747. PMLR (2021)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Nemecek, M.W., Parr, R.: Policy caches with successor features. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8025–8033. PMLR (2021)
Raileanu, R., Goldstein, M., Szlam, A., Fergus, R.: Fast adaptation to new environments via policy-dynamics value functions. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 7920–7931. PMLR (2020)
Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 1312–1320. JMLR.org (2015)
Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Tang, H., et al.: What about inputting policy in value function: policy representation and policy-extended value function approximator. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022, pp. 8441–8449. AAAI Press (2022)
Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, 1633–1685 (2009). https://dl.acm.org/doi/10.5555/1577069.1755839
Yang, T., et al.: Efficient deep reinforcement learning via adaptive policy transfer. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3094–3100 (2020)
Zhu, Z., Lin, K., Zhou, J.: Transfer learning in deep reinforcement learning: a survey. CoRR abs/2009.07888 (2020)
Appendices
A Training PeSFA
In this section, we show how to train \(\tilde{\psi }(s,a,\chi _\pi )\) in an on-policy manner based on the Sarsa method.
Algorithm 1 shows the overall procedure for training PeSFA. At the beginning of each task, we reset the environment and select an action to interact with it (lines 4–6). After the action is executed and the resulting transition information is obtained from the environment, the next action is selected before the policy is updated (lines 9–12). After updating PeSFA, for the reasons described in Sect. 4.1, the state-action pairs \(\omega _{\pi '}\) representing the new policy must be recalculated according to Eq. 18 (lines 15–16). We also search the policy representation space for a better policy according to Eq. 19, and the resulting \(\chi _{opt}\) is used as the initial policy for subsequent training, which improves sample efficiency and exploration in the policy representation space. Finally, we select the optimized policy representation found in the representation space as described in Sect. 4.3.
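To make the on-policy update in lines 9–12 concrete, the sketch below performs one Sarsa-style TD step on \(\tilde{\psi }(s,a,\chi _\pi )\), regressing it toward \(\phi + \gamma \,\tilde{\psi }(s',a',\chi _\pi )\). This is a minimal illustration, not the authors' implementation: the names `psi_net`, `phi_fn`, and `encode_policy` are placeholder assumptions for the SF network, the feature function, and the policy representation module.

```python
# Minimal sketch (assumed interfaces, not the released code) of one on-policy,
# Sarsa-style TD update for a policy-extended SF approximator psi(s, a, chi_pi).
import torch
import torch.nn.functional as F

def pesfa_sarsa_update(psi_net, optimizer, phi_fn, encode_policy,
                       s, a, s_next, a_next, policy, gamma=0.99):
    """One TD step: regress psi(s, a, chi) toward phi + gamma * psi(s', a', chi)."""
    chi = encode_policy(policy)                      # policy representation chi_pi
    phi = phi_fn(s, a, s_next)                       # reward features, shape [d]
    with torch.no_grad():                            # bootstrap with a' (on-policy)
        target = phi + gamma * psi_net(s_next, a_next, chi)
    pred = psi_net(s, a, chi)
    loss = F.mse_loss(pred, target)                  # TD error on the SF vector
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```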

B Additional Experimental Details
At the code level, our implementation uses Python 3.6.9 and PyTorch 1.11.0. All experiments were run on a single NVIDIA GeForce GTX 1660 Ti GPU. The hyperparameters used in the Grid World and Reacher experiments are listed in Table 1, and the task weights are listed in Table 2.
The first experimental environment is a navigation task in Grid World, a two-dimensional discrete space consisting of four rooms. The agent starts from a location in one room and needs to reach a goal point in another room, picking up objects along the way and collecting their corresponding rewards, similarly to [3, 8]. Each object belongs to one of three types, and each type has a specific reward. The location of every object remains the same across tasks, but the reward of each type varies with the task. The goal is to maximize the cumulative reward over tasks. Here \(\phi \) and \(\textrm{w}\) are handcrafted so as to satisfy the reward function in Eq. 2: \(\phi \in \mathbb {R}^4\) indicates whether an object of a particular type is picked up in a transition, and \(\textrm{w}\in \mathbb {R}^4\) gives the reward associated with each object type.
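The sketch below illustrates this handcrafted decomposition \(r = \phi (s,a,s')^\top \textrm{w}\) from Eq. 2. Since the paper uses three object types but \(\phi \in \mathbb {R}^4\), the layout of the fourth component (a goal indicator) and the numeric weights are our assumptions for illustration only.

```python
# Sketch of the Grid World reward decomposition r = phi(s, a, s')^T w (Eq. 2).
# The indicator layout (three object types plus a goal flag) is an assumption.
import numpy as np

def grid_world_phi(picked_object_type, reached_goal):
    """phi in R^4: one indicator per object type, plus an assumed goal indicator."""
    phi = np.zeros(4)
    if picked_object_type is not None:      # object type index: 0, 1, or 2
        phi[picked_object_type] = 1.0
    if reached_goal:
        phi[3] = 1.0
    return phi

# w in R^4 assigns a task-specific reward to each component; only w changes
# between tasks, while the object locations stay fixed.
w = np.array([1.0, -1.0, 0.5, 10.0])        # hypothetical task weights
r = grid_world_phi(picked_object_type=1, reached_goal=False) @ w   # -> -1.0
```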
The second environment, Reacher, has a continuous state space and is built on the PyBullet physics engine [5]; the agent controls a robotic arm to reach the preferred target location, as in [3, 8]. In each task, the degree of preference for each target location is set by the task weights \(\textrm{w}\), while each component of \(\phi \in \mathbb {R}^4\) is one minus the Euclidean distance from the robotic arm's tip to the corresponding target location.
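As a hedged illustration of the Reacher features, the sketch below computes each component of \(\phi \) as one minus the tip-to-target distance and forms the reward as \(\phi \cdot \textrm{w}\); the target coordinates and weights are hypothetical and not taken from the paper.

```python
# Sketch (assumed geometry, not the released code) of the Reacher features:
# phi_i = 1 - ||tip - target_i||, and the reward is the weighted sum phi . w.
import numpy as np

def reacher_phi(tip_pos, target_positions):
    """phi in R^4: one minus the Euclidean distance from the tip to each target."""
    tip = np.asarray(tip_pos, dtype=float)
    return np.array([1.0 - np.linalg.norm(tip - np.asarray(t, dtype=float))
                     for t in target_positions])

targets = [(0.1, 0.1), (-0.1, 0.1), (0.1, -0.1), (-0.1, -0.1)]   # hypothetical layout
w = np.array([1.0, 0.0, 0.0, 0.0])      # task weights: prefer the first target only
r = reacher_phi((0.05, 0.08), targets) @ w
```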