
Thompson Sampling with Time-Varying Reward for Contextual Bandits

Conference paper in Database Systems for Advanced Applications (DASFAA 2023), part of the book series Lecture Notes in Computer Science (LNCS, volume 13944).


Abstract

Contextual bandits efficiently address the exploration-exploitation (EE) trade-off in online recommendation tasks. Most existing contextual bandit algorithms rely on a fixed reward mechanism, which makes it difficult to accurately capture users' preference changes in non-stationary environments and thus degrades recommendation performance. In this paper, we formalize the online recommendation task as a contextual bandit problem and propose a Thompson sampling algorithm with time-varying reward (TV-TS) that captures user preference changes from three perspectives: (1) forgetting past preferences with a functional decay method while capturing possible periodic demands, (2) mining fine-grained preference changes from multi-behavior implicit feedback, and (3) updating the reward weights adaptively. We also provide a theoretical regret analysis establishing that the algorithm achieves sublinear regret. Extensive experiments on two real-world datasets show that the proposed algorithm outperforms state-of-the-art time-varying bandit algorithms, and the designed reward mechanism can be flexibly incorporated into other bandit algorithms to improve them.
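As a rough illustration of the time-varying idea, the sketch below implements linear Thompson sampling with an exponential forgetting factor, a standard device for non-stationary linear bandits: past observations are discounted so that stale feedback loses influence on the posterior. It is a minimal sketch, not the paper's TV-TS algorithm; the periodic decay function, multi-behavior feedback, and adaptive reward weights described in the abstract are not modeled here, and the names DecayedLinTS, gamma, and v are illustrative assumptions.

    import numpy as np

    class DecayedLinTS:
        """Linear Thompson sampling with exponentially discounted statistics."""

        def __init__(self, d, gamma=0.95, v=0.1, seed=0):
            self.gamma = gamma      # forgetting factor in (0, 1]; 1 = stationary LinTS
            self.v = v              # posterior variance scale
            self.B = np.eye(d)      # discounted, regularized design matrix
            self.f = np.zeros(d)    # discounted reward-weighted feature sum
            self.rng = np.random.default_rng(seed)

        def select(self, contexts):
            """contexts: (n_arms, d) array; returns the index of the chosen arm."""
            B_inv = np.linalg.inv(self.B)
            mu = B_inv @ self.f     # posterior mean of the parameter vector
            theta = self.rng.multivariate_normal(mu, self.v ** 2 * B_inv)
            return int(np.argmax(contexts @ theta))

        def update(self, x, reward):
            """Decay old evidence, then fold in the new observation (x, reward)."""
            d = len(x)
            # The (1 - gamma) * I term keeps B well conditioned as old evidence decays.
            self.B = self.gamma * self.B + (1 - self.gamma) * np.eye(d) + np.outer(x, x)
            self.f = self.gamma * self.f + reward * x

Setting gamma = 1 recovers standard linear Thompson sampling; smaller values forget past feedback faster, trading statistical efficiency for adaptivity.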


Notes

  1. https://tianchi.aliyun.com/dataset/dataDetail?dataId=42.

  2. https://tianchi.aliyun.com/dataset/dataDetail?dataId=649.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 62206046) and the Shanghai Science and Technology Innovation Action Plan Project (No. 22511100700).

Author information

Corresponding authors

Correspondence to Cairong Yan or Yanting Zhang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yan, C., Xu, H., Han, H., Zhang, Y., Wang, Z. (2023). Thompson Sampling with Time-Varying Reward for Contextual Bandits. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13944. Springer, Cham. https://doi.org/10.1007/978-3-031-30672-3_4


  • DOI: https://doi.org/10.1007/978-3-031-30672-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30671-6

  • Online ISBN: 978-3-031-30672-3

  • eBook Packages: Computer Science, Computer Science (R0)
