DOI: 10.1145/3604915.3608794

Research Article | Public Access

Incentivizing Exploration in Linear Contextual Bandits under Information Gap

Published: 14 September 2023

Abstract

Contextual bandit algorithms are widely used for interactive recommendation, where users are typically assumed to cooperate by exploring whatever the system recommends. In this paper, we relax this strong assumption and study incentivized exploration with myopic users, who are only interested in recommendations with the currently highest estimated reward. To attain long-term optimality, the system must therefore offer compensation to incentivize users to accept exploratory recommendations. We consider a new and practically motivated setting in which the context features used by the user are more informative than those available to the system: for example, features derived from users' private information are not accessible to the system. We develop an effective solution for incentivized exploration under such an information gap, and prove that the method achieves sublinear regret and sublinear compensation. We analyze, both theoretically and empirically, the additional compensation incurred by the information gap, compared with the case where the system observes the same context features as the user, i.e., without an information gap. Moreover, we provide a compensation lower bound for this problem.
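To make the incentive mechanism concrete, the sketch below simulates one plausible instantiation of this setting in Python/NumPy. It is our illustration, not the paper's algorithm: the system runs a LinUCB-style learner on a truncated feature vector, the myopic user maintains a least-squares estimate over the full (more informative) features, and the system pays the user's perceived reward gap whenever its exploratory recommendation differs from the user's myopic pick. All names, dimensions, and the specific compensation rule are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch only -- not the paper's algorithm. The system runs a
# LinUCB-style linear bandit on partial features, while a myopic user ranks
# arms with her own least-squares estimate over richer private features.
# The system pays the user's perceived reward gap to incentivize exploration.

rng = np.random.default_rng(0)
d_sys, d_user, K, T = 4, 6, 5, 2000    # feature dims (system/user), arms, rounds
theta = rng.normal(size=d_user)        # true parameter over the full features

A, b = np.eye(d_sys), np.zeros(d_sys)        # system's ridge statistics
A_u, b_u = np.eye(d_user), np.zeros(d_user)  # user's statistics (richer view)
alpha, total_comp = 1.0, 0.0                 # exploration weight, payments

for t in range(T):
    X_full = rng.normal(size=(K, d_user))  # full contexts, visible to the user
    X_sys = X_full[:, :d_sys]              # system observes only a subspace

    # System: upper-confidence-bound scores on its partial view.
    th_sys = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    ucb = X_sys @ th_sys + alpha * np.sqrt(
        np.einsum("ij,jk,ik->i", X_sys, A_inv, X_sys))
    a_sys = int(np.argmax(ucb))            # exploratory recommendation

    # User: myopic pick under her current estimate on the full features.
    est_user = X_full @ np.linalg.solve(A_u, b_u)
    a_user = int(np.argmax(est_user))

    # Compensation covers the user's perceived loss from following a_sys.
    total_comp += max(est_user[a_user] - est_user[a_sys], 0.0)

    # The (compensated) user plays a_sys; both sides update their estimates.
    r = X_full[a_sys] @ theta + rng.normal(scale=0.1)
    A += np.outer(X_sys[a_sys], X_sys[a_sys])
    b += r * X_sys[a_sys]
    A_u += np.outer(X_full[a_sys], X_full[a_sys])
    b_u += r * X_full[a_sys]

print(f"total compensation over {T} rounds: {total_comp:.2f}")
```

Under this rule the per-round payment equals the user's estimated loss from complying, so a myopic user is always willing to follow the recommendation; the sublinear regret and compensation guarantees stated in the abstract concern the paper's actual algorithm, not this simplified rule.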


Information

Published In

RecSys '23: Proceedings of the 17th ACM Conference on Recommender Systems
September 2023
1406 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Incentivized exploration
  2. information gap
  3. linear bandits

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

RecSys '23: Seventeenth ACM Conference on Recommender Systems
September 18 - 22, 2023
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate: 254 of 1,295 submissions, 20%
