Abstract
In reinforcement learning, an agent makes decisions to maximize rewards in an environment. Rewards are an integral part of reinforcement learning, as they guide the agent towards its learning objective. However, consistent rewards can be infeasible in certain scenarios, due to cost, the nature of the problem, or other constraints. In this paper, we investigate the problem of delayed, aggregated, and anonymous rewards. We propose and analyze two strategies for policy evaluation under cumulative periodic rewards, and study them in simulation environments. Our findings indicate that both strategies can achieve sample efficiency similar to the setting with consistent rewards.
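The strategies themselves are developed in the full paper; purely as an illustration of the reward model the abstract describes, the sketch below shows per-step rewards being withheld from the agent, with only their sum over each fixed-length window revealed and nothing attributing that sum to individual steps. The names and signatures (`env_step`, `policy`) are hypothetical, not the paper's API.

```python
def rollout_with_periodic_rewards(env_step, policy, state, horizon, period):
    """Illustrative sketch of the delayed, aggregated, anonymous reward
    model; `env_step` and `policy` are hypothetical callables, not the
    paper's API.

    Per-step rewards are hidden; only their sum over each `period`-step
    window is observed, anonymously with respect to individual steps.
    """
    transitions = []
    window_sum = 0.0
    for t in range(1, horizon + 1):
        action = policy(state)
        next_state, reward = env_step(state, action)  # true reward, hidden
        window_sum += reward
        if t % period == 0:
            observed = window_sum  # aggregate revealed at window boundary
            window_sum = 0.0
        else:
            observed = None        # no reward signal at this step
        transitions.append((state, action, observed, next_state))
        state = next_state
    return transitions
```

An evaluation algorithm consuming these transitions receives a reward signal only every `period` steps, which is precisely what makes credit assignment to individual transitions ambiguous and what the proposed strategies must cope with.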
Notes
- 1.
E.g., if \(N = 5\), in state 4 the agent ought to select action 4; selecting 3 yields a penalty of \(-(N - 4 + 3)\), while selecting 5 yields a penalty of \(-1\) (see the sketch following these notes).
- 2.
Complete results can be found at https://github.com/dsv-data-science/rl-daaf.git.
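For concreteness, here is a minimal sketch of a penalty rule consistent with note 1. The note alone does not fully pin down the environment, so reading the penalty as the negative cyclic "upward" distance from the current state to the chosen action is an assumption; it does, however, reproduce both numbers in the example.

```python
def penalty(state: int, action: int, n: int) -> int:
    """Hypothetical penalty consistent with note 1: the negative cyclic
    "upward" distance from `state` to `action` over actions 1..n.

    For n = 5 and state 4: action 4 -> 0, action 5 -> -1,
    and action 3 -> -(n - 4 + 3) = -4, matching the footnote.
    """
    if action == state:
        return 0                      # correct action: no penalty
    if action > state:
        return -(action - state)      # overshoot: direct distance upward
    return -(n - state + action)      # undershoot: wrap around past n
```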
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dinis Junior, G., Magnússon, S., Hollmén, J. (2022). Policy Evaluation with Delayed, Aggregated Anonymous Feedback. In: Poncelet, P., Ienco, D. (eds.) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol. 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_9