Short paper. DOI: 10.1145/3640457.3688162

Δ-OPE: Off-Policy Estimation with Pairs of Policies

Published: 08 October 2024

Abstract

The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: Δ-OPE.
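To make the variance argument concrete, recall the standard identity for the difference of two estimators computed on the same logged data (an illustration of the argument, not quoted from the paper):

\[
\operatorname{Var}\!\big(\hat{V}(\pi_A) - \hat{V}(\pi_B)\big)
  = \operatorname{Var}\!\big(\hat{V}(\pi_A)\big)
  + \operatorname{Var}\!\big(\hat{V}(\pi_B)\big)
  - 2\operatorname{Cov}\!\big(\hat{V}(\pi_A), \hat{V}(\pi_B)\big),
\]

so a positive covariance term directly yields a tighter estimate of the difference than subtracting two independently estimated policy values.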
Δ-OPE subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce Δ-OPE methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
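As a minimal sketch of how such a pairwise estimate can be computed, the following illustrative Python (not the authors' code) assumes logged rewards and propensities π_0(a_i | x_i) are available; the control-variate coefficient shown is the standard Monte Carlo choice, which may differ from the paper's variance-optimal one.

import numpy as np

def delta_ips(rewards, p_target, p_production, p_logging, beta=None):
    """Estimate V(pi_target) - V(pi_production) from a single shared log.

    rewards      : logged rewards r_i
    p_target     : pi_target(a_i | x_i) for each logged (context, action) pair
    p_production : pi_production(a_i | x_i) for each logged pair
    p_logging    : pi_0(a_i | x_i) > 0, the logging propensities
    beta         : optional additive control-variate coefficient
    """
    rewards = np.asarray(rewards, dtype=float)
    # Paired importance weights: both policies share the same denominator.
    w_diff = (np.asarray(p_target) - np.asarray(p_production)) / np.asarray(p_logging)
    if beta is None:
        # E[w_diff] = 0 under the logging policy, so w_diff is a zero-mean
        # control variate; the usual variance-minimising coefficient is
        # Cov(r * w_diff, w_diff) / Var(w_diff).
        cov = np.cov(rewards * w_diff, w_diff)
        beta = cov[0, 1] / max(cov[1, 1], 1e-12)
    terms = (rewards - beta) * w_diff
    # Point estimate of the improvement and its standard error.
    return terms.mean(), terms.std(ddof=1) / np.sqrt(len(terms))

Because both policies are reweighted with the same importance weights on the same samples, their value estimates covary positively whenever the policies overlap, which is exactly the condition under which the paired estimate is tighter than two separate IPS estimates. Fitting beta on the same log introduces a small finite-sample bias that is usually negligible.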


Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024, 1438 pages
Publisher: Association for Computing Machinery, New York, NY, United States
