Abstract
In a conventional contextual multi-armed bandit problem, the feedback (or reward) is observable immediately after an action. In many real-life situations, however, feedback arrives with a delay, which is particularly critical in time-sensitive applications. Delays make the exploration-exploitation dilemma more challenging because they interact with limited resources, and a limited budget often aggravates the problem by restricting the exploration potential. A motivating example is the distribution of medical supplies in the early stage of COVID-19, where delayed test results left insufficient information for learning and degraded the efficiency of resource allocation. Motivated by such applications, we study the effect of delayed feedback on constrained contextual bandits. We develop a decision-making policy, delay-oriented resource allocation with learning (DORAL), to optimize resource expenditure in a contextual multi-armed bandit problem with arm-dependent delayed feedback.
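The problem setting described above can be illustrated with a minimal toy simulation. The sketch below is not the DORAL policy, whose details are not given in this abstract. It only shows a budgeted contextual bandit in which each arm reveals its reward after an arm-specific delay, using a plain UCB index as a stand-in learner. The function name `simulate`, the reward probabilities, the fixed per-arm delays, and the unit cost per pull are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def simulate(horizon=2000, budget=500, seed=0):
    """Toy budgeted contextual bandit with arm-dependent delayed feedback.

    Illustrative only: a plain UCB learner, NOT the DORAL policy.
    """
    rng = random.Random(seed)
    n_contexts, n_arms = 2, 2
    # Assumed environment: success probability for each (context, arm) pair.
    reward_prob = [[0.2, 0.7], [0.6, 0.3]]
    # Assumed fixed feedback delay, in rounds, for each arm.
    delay = [5, 50]

    counts = [[0] * n_arms for _ in range(n_contexts)]   # observed feedback count
    sums = [[0.0] * n_arms for _ in range(n_contexts)]   # observed reward sum
    pending = defaultdict(list)  # arrival round -> [(context, arm, reward)]
    spent, total_reward = 0, 0.0

    for t in range(horizon):
        # Feedback scheduled to arrive in this round becomes observable now.
        for ctx, arm, r in pending.pop(t, []):
            counts[ctx][arm] += 1
            sums[ctx][arm] += r

        if spent >= budget:
            break  # budget exhausted: no further actions are possible

        context = rng.randrange(n_contexts)

        # UCB index computed from *observed* feedback only.
        def index(arm):
            n = counts[context][arm]
            if n == 0:
                return float("inf")  # no feedback observed yet for this pair
            return sums[context][arm] / n + math.sqrt(2 * math.log(t + 1) / n)

        best = max(index(a) for a in range(n_arms))
        arm = rng.choice([a for a in range(n_arms) if index(a) == best])

        # Each pull costs one unit; its reward is revealed only after the delay.
        spent += 1
        reward = 1.0 if rng.random() < reward_prob[context][arm] else 0.0
        total_reward += reward
        pending[t + delay[arm]].append((context, arm, reward))

    observed = sum(sum(row) for row in counts)
    return {"spent": spent, "total_reward": total_reward, "observed": observed}


if __name__ == "__main__":
    print(simulate())
```

In this toy setup, a long delay means many pulls are made before any feedback for that arm is observed, so part of the budget is spent without information. That is the interaction between delay and budget that the abstract highlights.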
Acknowledgement
The work of S.M. was supported by Grant 01IS20051 and Grant 16KISK035 from the German Federal Ministry of Education and Research.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, K., Maghsudi, S., Yokoo, M. (2024). Budgeted Recommendation with Delayed Feedback. In: Rocha, Á., Adeli, H., Dzemyda, G., Moreira, F., Poniszewska-Marańda, A. (eds) Good Practices and New Perspectives in Information Systems and Technologies. WorldCIST 2024. Lecture Notes in Networks and Systems, vol 987. Springer, Cham. https://doi.org/10.1007/978-3-031-60221-4_20
DOI: https://doi.org/10.1007/978-3-031-60221-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60220-7
Online ISBN: 978-3-031-60221-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)