Abstract
Contextual bandits efficiently balance exploration and exploitation (EE) in online recommendation tasks. Most existing contextual bandit algorithms use a fixed reward mechanism, which makes it difficult to accurately capture users' preference changes in non-stationary environments and thus degrades recommendation performance. In this paper, we formalize the online recommendation task as a contextual bandit problem and propose a Thompson sampling algorithm with time-varying reward (TV-TS) that captures user preference changes from three perspectives: (1) forgetting past preferences via a functional decay method while still capturing possible periodic demands, (2) mining fine-grained preference changes from multi-behavioral implicit feedback, and (3) adaptively iterating the reward weights. We also provide a theoretical regret analysis demonstrating that the algorithm's regret is sublinear. Extensive experiments on two real-world datasets show that the proposed algorithm outperforms state-of-the-art time-varying bandit algorithms. Furthermore, the designed reward mechanism can be flexibly plugged into other bandit algorithms to improve their performance.
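The full paper sits behind the access wall, so the exact TV-TS update is not reproduced here. As a rough illustration of the general idea the abstract describes (Thompson sampling whose statistics forget old feedback through a decay function), the sketch below implements a discount-weighted linear Thompson sampling step. The class name, the simple exponential decay, and all parameters are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np


class DiscountedLinTS:
    """Illustrative discount-weighted linear Thompson sampling.

    Past observations are down-weighted by `gamma` each round, a simple
    stand-in for TV-TS's functional-decay forgetting; the paper's actual
    reward mechanism (periodic terms, multi-behavior weights) may differ.
    """

    def __init__(self, dim, gamma=0.95, lam=1.0, v=0.5, seed=0):
        self.gamma = gamma          # decay factor: how fast old feedback fades
        self.lam = lam              # ridge regularization strength
        self.v = v                  # posterior variance scale
        self.dim = dim
        self.B = lam * np.eye(dim)  # decayed design matrix
        self.b = np.zeros(dim)      # decayed reward-weighted feature sum
        self.rng = np.random.default_rng(seed)

    def select(self, contexts):
        """contexts: (n_arms, dim) array; returns the index of the chosen arm."""
        mu = np.linalg.solve(self.B, self.b)           # posterior mean
        cov = self.v ** 2 * np.linalg.inv(self.B)      # posterior covariance
        theta = self.rng.multivariate_normal(mu, cov)  # Thompson sample
        return int(np.argmax(contexts @ theta))

    def update(self, x, reward):
        # Decay old statistics, then fold in the new observation; the extra
        # (1 - gamma) * lam term keeps B well-conditioned as history decays.
        self.B = (self.gamma * self.B + np.outer(x, x)
                  + (1 - self.gamma) * self.lam * np.eye(self.dim))
        self.b = self.gamma * self.b + reward * x
```

Because `gamma < 1` shrinks the contribution of every past round geometrically, recent feedback dominates the posterior, which is what lets a decayed sampler track drifting user preferences where a fixed-reward sampler would average over stale behavior.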
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 62206046) and the Shanghai Science and Technology Innovation Action Plan (No. 22511100700).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yan, C., Xu, H., Han, H., Zhang, Y., Wang, Z. (2023). Thompson Sampling with Time-Varying Reward for Contextual Bandits. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13944. Springer, Cham. https://doi.org/10.1007/978-3-031-30672-3_4
Print ISBN: 978-3-031-30671-6
Online ISBN: 978-3-031-30672-3