Short paper. DOI: 10.1145/3640457.3688162

Δ-OPE: Off-Policy Estimation with Pairs of Policies

Published: 08 October 2024

Abstract

The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: Δ-OPE.
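To make the variance argument concrete, recall the standard identity for the difference of two estimators computed on the same logged data (an illustration of the argument, not quoted from the paper):

\[
\operatorname{Var}\!\big(\hat{V}(\pi_A) - \hat{V}(\pi_B)\big)
  = \operatorname{Var}\!\big(\hat{V}(\pi_A)\big)
  + \operatorname{Var}\!\big(\hat{V}(\pi_B)\big)
  - 2\operatorname{Cov}\!\big(\hat{V}(\pi_A), \hat{V}(\pi_B)\big),
\]

so a positive covariance term directly yields a tighter estimate of the difference than subtracting two independently estimated policy values.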
Δ-OPE subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce Δ-OPE methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
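As a minimal sketch of how such a pairwise estimate can be computed, the following illustrative Python (not the authors' code) assumes logged rewards and propensities π_0(a_i | x_i) are available; the control-variate coefficient shown is the standard Monte Carlo choice, which may differ from the paper's variance-optimal one.

import numpy as np

def delta_ips(rewards, p_target, p_production, p_logging, beta=None):
    """Estimate V(pi_target) - V(pi_production) from a single shared log.

    rewards      : logged rewards r_i
    p_target     : pi_target(a_i | x_i) for each logged (context, action) pair
    p_production : pi_production(a_i | x_i) for each logged pair
    p_logging    : pi_0(a_i | x_i) > 0, the logging propensities
    beta         : optional additive control-variate coefficient
    """
    rewards = np.asarray(rewards, dtype=float)
    # Paired importance weights: both policies share the same denominator.
    w_diff = (np.asarray(p_target) - np.asarray(p_production)) / np.asarray(p_logging)
    if beta is None:
        # E[w_diff] = 0 under the logging policy, so w_diff is a zero-mean
        # control variate; the usual variance-minimising coefficient is
        # Cov(r * w_diff, w_diff) / Var(w_diff).
        cov = np.cov(rewards * w_diff, w_diff)
        beta = cov[0, 1] / max(cov[1, 1], 1e-12)
    terms = (rewards - beta) * w_diff
    # Point estimate of the improvement and its standard error.
    return terms.mean(), terms.std(ddof=1) / np.sqrt(len(terms))

Because both policies are reweighted with the same importance weights on the same samples, their value estimates covary positively whenever the policies overlap, which is exactly the condition under which the paired estimate is tighter than two separate IPS estimates. Fitting beta on the same log introduces a small finite-sample bias that is usually negligible.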


Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024, 1438 pages
Publisher: Association for Computing Machinery, New York, NY, United States
