DOI: 10.1145/3640457.3688099

Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Published: 08 October 2024

Abstract

We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset of the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from the available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L for CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce the concept of a factored action space, which allows us to decompose each subset into binary indicators. This formulation lets us distinguish between the “main effect”, derived from the main actions, and the “residual effect”, originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as shown in our theoretical analysis. Experiments demonstrate OPCB’s superior performance over typical methods in both OPE and OPL.
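To make the main/residual decomposition concrete, the sketch below assembles an OPCB-style value estimate from logged data: importance sampling (over the main actions only) corrects the main effect, while a pre-trained regression model supplies the residual effect. This is a minimal illustrative sketch based only on the abstract, not the authors' implementation; the function name opcb_estimate, the baseline-corrected form of the main-effect term, and the synthetic inputs are all assumptions.

    import numpy as np

    def opcb_estimate(rewards, pi0_main, pi_main, q_hat_logged, q_hat_target):
        """Hypothetical OPCB-style estimator (illustrative sketch only).

        rewards       : (n,) observed rewards for the logged subsets
        pi0_main      : (n,) logging-policy probability of the logged main action
        pi_main       : (n,) target-policy probability of the logged main action
        q_hat_logged  : (n,) regression estimate of the residual effect for the logged subsets
        q_hat_target  : (n,) regression estimate of the residual effect under the target policy
        """
        # Importance weights defined on the main actions only, so the weights
        # do not blow up with the exponential number of subsets.
        iw = pi_main / pi0_main
        # Importance-sampling term for the main effect, with the regression
        # estimate used as a baseline to reduce variance.
        main_effect = iw * (rewards - q_hat_logged)
        # Regression-based term for the residual effect (low variance, possibly biased).
        residual_effect = q_hat_target
        return float(np.mean(main_effect + residual_effect))

    # Toy usage with synthetic logged data (purely illustrative).
    rng = np.random.default_rng(0)
    n = 1000
    rewards = rng.normal(1.0, 0.5, size=n)
    pi0_main = rng.uniform(0.1, 0.5, size=n)
    pi_main = rng.uniform(0.1, 0.5, size=n)
    q_hat_logged = rng.normal(0.4, 0.1, size=n)
    q_hat_target = rng.normal(0.5, 0.1, size=n)
    print(opcb_estimate(rewards, pi0_main, pi_main, q_hat_logged, q_hat_target))

Under these assumptions, the estimator inherits the usual trade-off described in the abstract: the importance-sampling term keeps the main effect unbiased while the regression term handles the residual effect with low variance.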

Supplemental Material

PDF File
Appendix


Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024
1438 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2024

Author Tags

  1. Combinatorial Bandits
  2. Off-Policy Evaluation and Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Acceptance Rates

Overall Acceptance Rate 254 of 1,295 submissions, 20%
