DOI: 10.1145/3640457.3688099

Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Published: 08 October 2024

Abstract

We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset of the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from the available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L for CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce the concept of a factored action space, which allows us to decompose each subset into binary indicators. This formulation lets us distinguish between the “main effect”, derived from the main actions, and the “residual effect”, originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as shown in our theoretical analysis. Experiments demonstrate OPCB’s superior performance over typical methods in both OPE and OPL.
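To make the main/residual decomposition concrete, the sketch below assembles an OPCB-style value estimate from logged data: importance sampling (over the main actions only) corrects the main effect, while a pre-trained regression model supplies the residual effect. This is a minimal illustrative sketch based only on the abstract, not the authors' implementation; the function name opcb_estimate, the baseline-corrected form of the main-effect term, and the synthetic inputs are all assumptions.

    import numpy as np

    def opcb_estimate(rewards, pi0_main, pi_main, q_hat_logged, q_hat_target):
        """Hypothetical OPCB-style estimator (illustrative sketch only).

        rewards       : (n,) observed rewards for the logged subsets
        pi0_main      : (n,) logging-policy probability of the logged main action
        pi_main       : (n,) target-policy probability of the logged main action
        q_hat_logged  : (n,) regression estimate of the residual effect for the logged subsets
        q_hat_target  : (n,) regression estimate of the residual effect under the target policy
        """
        # Importance weights defined on the main actions only, so the weights
        # do not blow up with the exponential number of subsets.
        iw = pi_main / pi0_main
        # Importance-sampling term for the main effect, with the regression
        # estimate used as a baseline to reduce variance.
        main_effect = iw * (rewards - q_hat_logged)
        # Regression-based term for the residual effect (low variance, possibly biased).
        residual_effect = q_hat_target
        return float(np.mean(main_effect + residual_effect))

    # Toy usage with synthetic logged data (purely illustrative).
    rng = np.random.default_rng(0)
    n = 1000
    rewards = rng.normal(1.0, 0.5, size=n)
    pi0_main = rng.uniform(0.1, 0.5, size=n)
    pi_main = rng.uniform(0.1, 0.5, size=n)
    q_hat_logged = rng.normal(0.4, 0.1, size=n)
    q_hat_target = rng.normal(0.5, 0.1, size=n)
    print(opcb_estimate(rewards, pi0_main, pi_main, q_hat_logged, q_hat_target))

Under these assumptions, the estimator inherits the usual trade-off described in the abstract: the importance-sampling term keeps the main effect unbiased while the regression term handles the residual effect with low variance.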

Supplemental Material

PDF File
Appendix


Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024
1438 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2024

Author Tags

  1. Combinatorial Bandits
  2. Off-Policy Evaluation and Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Acceptance Rates

Overall Acceptance Rate 254 of 1,295 submissions, 20%
