Abstract
In reinforcement learning, an agent makes decisions to maximize rewards in an environment. Rewards are an integral part of reinforcement learning, as they guide the agent towards its learning objective. However, consistent rewards can be infeasible in certain scenarios, due to cost, the nature of the problem, or other constraints. In this paper, we investigate the problem of delayed, aggregated, and anonymous rewards. We propose and analyze two strategies for policy evaluation under cumulative periodic rewards, and study them in simulation environments. Our findings indicate that both strategies can achieve sample efficiency similar to the setting with consistent rewards.
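The strategies themselves are developed in the full paper; purely as an illustration of the reward model the abstract describes, the sketch below shows per-step rewards being withheld from the agent, with only their sum over each fixed-length window revealed and nothing attributing that sum to individual steps. The names and signatures (`env_step`, `policy`) are hypothetical, not the paper's API.

```python
def rollout_with_periodic_rewards(env_step, policy, state, horizon, period):
    """Illustrative sketch of the delayed, aggregated, anonymous reward
    model; `env_step` and `policy` are hypothetical callables, not the
    paper's API.

    Per-step rewards are hidden; only their sum over each `period`-step
    window is observed, anonymously with respect to individual steps.
    """
    transitions = []
    window_sum = 0.0
    for t in range(1, horizon + 1):
        action = policy(state)
        next_state, reward = env_step(state, action)  # true reward, hidden
        window_sum += reward
        if t % period == 0:
            observed = window_sum  # aggregate revealed at window boundary
            window_sum = 0.0
        else:
            observed = None        # no reward signal at this step
        transitions.append((state, action, observed, next_state))
        state = next_state
    return transitions
```

An evaluation algorithm consuming these transitions receives a reward signal only every `period` steps, which is precisely what makes credit assignment to individual transitions ambiguous and what the proposed strategies must cope with.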
Notes
- 1.
E.g., if \(N = 5\), in state 4 the agent ought to select action 4; selecting 3 yields a penalty of \(-(N - 4 + 3)\), while selecting 5 yields a penalty of \(-1\) (see the sketch following these notes).
- 2.
Complete results can be found at https://github.com/dsv-data-science/rl-daaf.git.
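For concreteness, here is a minimal sketch of a penalty rule consistent with note 1. The note alone does not fully pin down the environment, so reading the penalty as the negative cyclic "upward" distance from the current state to the chosen action is an assumption; it does, however, reproduce both numbers in the example.

```python
def penalty(state: int, action: int, n: int) -> int:
    """Hypothetical penalty consistent with note 1: the negative cyclic
    "upward" distance from `state` to `action` over actions 1..n.

    For n = 5 and state 4: action 4 -> 0, action 5 -> -1,
    and action 3 -> -(n - 4 + 3) = -4, matching the footnote.
    """
    if action == state:
        return 0                      # correct action: no penalty
    if action > state:
        return -(action - state)      # overshoot: direct distance upward
    return -(n - state + action)      # undershoot: wrap around past n
```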
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dinis Junior, G., Magnússon, S., Hollmén, J. (2022). Policy Evaluation with Delayed, Aggregated Anonymous Feedback. In: Poncelet, P., Ienco, D. (eds.) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol. 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_9