Abstract
Contextual bandits efficiently balance exploration and exploitation (EE) in online recommendation tasks. Most existing contextual bandit algorithms use a fixed reward mechanism, which makes it difficult to accurately capture users' preference changes in non-stationary environments and thus degrades recommendation performance. In this paper, we formalize the online recommendation task as a contextual bandit problem and propose a Thompson sampling algorithm with time-varying reward (TV-TS) that captures user preference changes from three perspectives: (1) forgetting past preferences via a functional decay method while still capturing possible periodic demands, (2) mining fine-grained preference changes from multi-behavioral implicit feedback, and (3) adaptively iterating the reward weights. We also provide a theoretical regret analysis demonstrating that the algorithm's regret is sublinear. Extensive experiments on two real-world datasets show that the proposed algorithm outperforms state-of-the-art time-varying bandit algorithms. Furthermore, the designed reward mechanism can be flexibly plugged into other bandit algorithms to improve their performance.
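The full paper sits behind the access wall, so the exact TV-TS update is not reproduced here. As a rough illustration of the general idea the abstract describes (Thompson sampling whose statistics forget old feedback through a decay function), the sketch below implements a discount-weighted linear Thompson sampling step. The class name, the simple exponential decay, and all parameters are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np


class DiscountedLinTS:
    """Illustrative discount-weighted linear Thompson sampling.

    Past observations are down-weighted by `gamma` each round, a simple
    stand-in for TV-TS's functional-decay forgetting; the paper's actual
    reward mechanism (periodic terms, multi-behavior weights) may differ.
    """

    def __init__(self, dim, gamma=0.95, lam=1.0, v=0.5, seed=0):
        self.gamma = gamma          # decay factor: how fast old feedback fades
        self.lam = lam              # ridge regularization strength
        self.v = v                  # posterior variance scale
        self.dim = dim
        self.B = lam * np.eye(dim)  # decayed design matrix
        self.b = np.zeros(dim)      # decayed reward-weighted feature sum
        self.rng = np.random.default_rng(seed)

    def select(self, contexts):
        """contexts: (n_arms, dim) array; returns the index of the chosen arm."""
        mu = np.linalg.solve(self.B, self.b)           # posterior mean
        cov = self.v ** 2 * np.linalg.inv(self.B)      # posterior covariance
        theta = self.rng.multivariate_normal(mu, cov)  # Thompson sample
        return int(np.argmax(contexts @ theta))

    def update(self, x, reward):
        # Decay old statistics, then fold in the new observation; the extra
        # (1 - gamma) * lam term keeps B well-conditioned as history decays.
        self.B = (self.gamma * self.B + np.outer(x, x)
                  + (1 - self.gamma) * self.lam * np.eye(self.dim))
        self.b = self.gamma * self.b + reward * x
```

Because `gamma < 1` shrinks the contribution of every past round geometrically, recent feedback dominates the posterior, which is what lets a decayed sampler track drifting user preferences where a fixed-reward sampler would average over stale behavior.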
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 62206046) and the Shanghai Science and Technology Innovation Action Plan (No. 22511100700).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yan, C., Xu, H., Han, H., Zhang, Y., Wang, Z. (2023). Thompson Sampling with Time-Varying Reward for Contextual Bandits. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13944. Springer, Cham. https://doi.org/10.1007/978-3-031-30672-3_4
Print ISBN: 978-3-031-30671-6
Online ISBN: 978-3-031-30672-3