DOI: 10.1145/3488560.3498380

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Published: 15 February 2022

Abstract

In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies has thus attracted growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is therefore the key to successful OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator, which builds on the cascade assumption that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases than existing estimators that make stronger assumptions on user behavior. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces variance by leveraging a control variate. Comprehensive experiments on both synthetic and real-world e-commerce data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.
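As a rough illustration of the idea in the abstract, the sketch below estimates a ranking policy's value from logged data under the cascade assumption, combining prefix importance weights with a reward model as a control variate. This is a stylized reading of the abstract, not the paper's exact estimator: the array shapes, the per-position propensities `pi_b`/`pi_e`, and the reward model `q_hat` are illustrative assumptions, and the baseline term here plugs in the logged item's predicted reward rather than the expectation over the evaluation policy.

```python
import numpy as np

def cascade_dr_estimate(rewards, pi_b, pi_e, q_hat):
    """Stylized cascade doubly robust estimate of a ranking policy's value.

    rewards: (n, K) observed position-wise rewards (e.g. clicks)
    pi_b:    (n, K) behavior policy's probability of the logged item at
             each position, given the items ranked above it
    pi_e:    (n, K) evaluation policy's probability of the same items
    q_hat:   (n, K) a reward model's predicted reward at each position
             (the control variate)
    """
    n, K = rewards.shape
    # Cascade assumption: the reward at position k depends only on the
    # items ranked at positions 1..k, so the importance weight at k is
    # the product of per-position weights over that prefix.
    w = np.cumprod(pi_e / pi_b, axis=1)  # (n, K) prefix weights
    # Doubly robust correction: importance-weight only the residual
    # (reward minus model prediction), and add the model's prediction
    # back under the previous prefix's weight, reducing variance when
    # q_hat is reasonably accurate.
    w_prev = np.concatenate([np.ones((n, 1)), w[:, :-1]], axis=1)
    values = w * (rewards - q_hat) + w_prev * q_hat
    return values.sum(axis=1).mean()
```

When the evaluation policy equals the behavior policy, all weights collapse to one and the model terms cancel, so the estimate reduces to the average observed list-level reward, which is one quick sanity check on the control-variate construction.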

Supplementary Material

MP4 File (WSDM22-fp083.mp4)


Cited By

  • (2024) Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits. In Proceedings of the 18th ACM Conference on Recommender Systems, 733-741. DOI: 10.1145/3640457.3688099
  • (2024) CONSEQUENCES: The 3rd Workshop on Causality, Counterfactuals and Sequential Decision-Making for Recommender Systems. In Proceedings of the 18th ACM Conference on Recommender Systems, 1206-1209. DOI: 10.1145/3640457.3687095
  • (2024) Causal Inference in Recommender Systems: A Survey and Future Directions. ACM Transactions on Information Systems, 42(4), 1-32. DOI: 10.1145/3639048
  • (2024) On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1222-1233. DOI: 10.1145/3637528.3671687
  • (2024) Exploring the Landscape of Recommender Systems Evaluation: Practices and Perspectives. ACM Transactions on Recommender Systems, 2(1), 1-31. DOI: 10.1145/3629170
  • (2024) Counterfactual Ranking Evaluation with Flexible Click Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1200-1210. DOI: 10.1145/3626772.3657810
  • (2024) Unbiased Learning to Rank: On Recent Advances and Practical Applications. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 1118-1121. DOI: 10.1145/3616855.3636451
  • (2024) Practical Bandits: An Industry Perspective. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 1132-1135. DOI: 10.1145/3616855.3636449
  • (2024) Long-term Off-Policy Evaluation and Learning. In Proceedings of the ACM Web Conference 2024, 3432-3443. DOI: 10.1145/3589334.3645446
  • (2024) Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. In Proceedings of the ACM Web Conference 2024, 3150-3161. DOI: 10.1145/3589334.3645343

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022, 1690 pages
ISBN: 9781450391320
DOI: 10.1145/3488560

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. cascade model
      2. doubly robust
      3. inverse propensity score
      4. off-policy evaluation
      5. slate recommendation

      Qualifiers

      • Research-article

      Conference

      WSDM '22

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

