ABSTRACT
Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone: the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this end, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem. We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than naïve approaches.
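To make the variance-minimization idea concrete, the sketch below works through the core calculation for a single context: given an existing logging policy h0 that contributed a fraction alpha of the data, choose an augmentation policy h1 so that the pooled data, which behaves like draws from the mixture alpha*h0 + (1-alpha)*h1, minimizes the h1-dependent part of the IPS variance. This is a minimal illustration under simplifying assumptions (discrete actions, known per-action second moments of the reward, alpha < 1), not the paper's actual MVAL algorithm; the function and variable names (mval_augmentation_policy, w, alpha) are our own.

```python
import numpy as np

def mval_augmentation_policy(pi, h0, w, alpha):
    """Sketch of a variance-minimizing augmentation policy for one context.

    pi    : target policy probabilities over K actions
    h0    : existing logging policy probabilities over the same actions
    w     : per-action second moments E[r^2 | x, a] (assumed known here;
            in practice these would be estimated, e.g. by a reward model)
    alpha : n0 / (n0 + n1), fraction of the final dataset already logged
            under h0 (must be < 1)

    Pooled data behaves like draws from the mixture
        h_bar = alpha * h0 + (1 - alpha) * h1,
    and the h1-dependent part of the IPS variance is
        sum_a pi[a]^2 * w[a] / h_bar[a].
    Ignoring the h1 >= 0 constraint, the optimum satisfies
        h_bar[a] proportional to pi[a] * sqrt(w[a]);
    clipping at zero and rescaling (bisection on the scale c) recovers
    the KKT solution of the constrained problem.
    """
    target = pi * np.sqrt(w)
    target = np.maximum(target, 1e-12)   # guard actions with pi or w equal to 0
    target = target / target.sum()

    def h1(c):
        # candidate augmentation policy at scale c, clipped at zero
        return np.maximum(0.0, (c * target - alpha * h0) / (1.0 - alpha))

    # h1(c).sum() is nondecreasing in c; at hi it is guaranteed >= 1
    lo, hi = 0.0, np.max((alpha * h0 + (1.0 - alpha)) / target)
    for _ in range(100):                 # bisect until sum(h1) == 1
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h1(mid).sum() < 1.0 else (lo, mid)
    out = h1(hi)
    return out / out.sum()               # remove residual bisection error

# Example: the logger h0 rarely plays the action the target pi favors,
# which is exactly the regime where IPS variance explodes.
pi = np.array([0.70, 0.15, 0.05, 0.05, 0.05])
h0 = np.array([0.05, 0.05, 0.30, 0.30, 0.30])
w = np.full(5, 0.25)      # e.g. Bernoulli rewards with means near 0.5
alpha = 0.5               # half of the final dataset is already logged

h1 = mval_augmentation_policy(pi, h0, w, alpha)

def variance_proxy(h1_new):
    # the h1-dependent term of the per-context IPS variance
    h_bar = alpha * h0 + (1.0 - alpha) * h1_new
    return float(np.sum(pi ** 2 * w / h_bar))

print("augmentation policy h1:      ", np.round(h1, 3))
print("variance proxy, uniform h1:  ", variance_proxy(np.full(5, 0.2)))
print("variance proxy, computed h1: ", variance_proxy(h1))
```

In this toy example the computed augmentation policy concentrates new logging on the under-explored action that the target policy favors, and its variance proxy is roughly a third of what uniform augmentation achieves, which matches the abstract's claim that variance-aware logging can substantially outperform naïve data gathering.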