DOI: 10.1145/3539597.3570452

Variance-Minimizing Augmentation Logging for Counterfactual Evaluation in Contextual Bandits

Published: 27 February 2023

ABSTRACT

Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone: the counterfactual estimators commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this end, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem. We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective at decreasing the variance of an estimator than naïve approaches.
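
To make the variance issue concrete, the following sketch (an illustration only, not the paper's MVAL construction) simulates off-policy evaluation with the standard inverse propensity scoring (IPS) estimator. The action set, reward means, and the specific logging and target policies below are assumptions made up for this example; the point is simply that the same estimator has much higher variance when the logging policy is far from the target policy, which is the gap that augmentation logging is designed to close.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem: one context, K discrete actions with known mean rewards.
# These numbers are illustrative assumptions, not the paper's setup.
K = 5
reward_mean = np.array([0.1, 0.2, 0.3, 0.8, 0.4])

# Target policy to evaluate (concentrated on action 3) and a near-uniform
# logging policy that is far from it.
pi_target = np.array([0.05, 0.05, 0.05, 0.80, 0.05])
pi_uniform = np.full(K, 1.0 / K)

def ips_estimate(actions, rewards, logging_probs, target_probs):
    # Inverse propensity scoring (Horvitz-Thompson) estimate of V(pi_target).
    weights = target_probs[actions] / logging_probs[actions]
    return float(np.mean(weights * rewards))

def log_data(logging_policy, n):
    # Draw n bandit-feedback samples: actions from the logging policy,
    # binary rewards drawn with the chosen action's mean.
    actions = rng.choice(K, size=n, p=logging_policy)
    rewards = rng.binomial(1, reward_mean[actions])
    return actions, rewards

n_runs, n_samples = 2000, 500
true_value = float(pi_target @ reward_mean)

for name, log_pi in [("near-uniform logger", pi_uniform),
                     ("logger close to target", 0.9 * pi_target + 0.1 * pi_uniform)]:
    estimates = np.array([
        ips_estimate(*log_data(log_pi, n_samples), log_pi, pi_target)
        for _ in range(n_runs)
    ])
    print(f"{name:24s}  bias={estimates.mean() - true_value:+.4f}  "
          f"std={estimates.std():.4f}")

Under these assumptions, both loggers give nearly unbiased IPS estimates, but the near-uniform logger produces a markedly larger standard deviation across runs; this is why the choice of augmentation logging policy directly controls the variance of the downstream evaluation.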

Supplemental Material

wsdmfp0561.mp4 (mp4, 26.2 MB)


Published in

WSDM '23: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining
February 2023, 1345 pages
ISBN: 9781450394079
DOI: 10.1145/3539597
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
