DOI: 10.1145/3460231.3474247

Pessimistic Reward Models for Off-Policy Learning in Recommendation

Published: 13 September 2021

Abstract

Methods for bandit learning from user interactions often require a model of the reward that a given context-action pair will yield – for example, the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself. Indeed, when the recommendation policy deployed at data-collection time does not pick its actions uniformly at random, the logged data suffer from a selection bias that can impede effective reward modelling. This in turn makes off-policy learning – the typical setup in industry – particularly challenging.
In this work, we propose and validate a general pessimistic reward-modelling approach for off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to derive a conservative decision rule. We show how this alleviates a well-known decision-making phenomenon known as the Optimiser’s Curse, and draw parallels with existing work on pessimistic policy learning. Leveraging the closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use case. Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance. The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
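As an illustration of the idea described above – a minimal sketch under our own assumptions, not the authors' implementation – the following Python snippet fits a Bayesian ridge reward model with its closed-form posterior mean and covariance, and then applies a pessimistic lower-confidence-bound decision rule: it recommends the action whose predicted reward, minus a multiple of its posterior standard deviation, is highest. The function names, the pessimism strength alpha, and the prior/noise hyperparameters are all hypothetical.

# Minimal, self-contained sketch (not the paper's code) of a pessimistic
# reward model: Bayesian ridge regression with closed-form posterior,
# combined with a lower-confidence-bound (LCB) decision rule.
import numpy as np

def fit_bayesian_ridge(X, y, l2=1.0, noise_var=1.0):
    """Closed-form Gaussian posterior over ridge-regression weights.

    Posterior covariance: (X^T X / sigma^2 + l2 * I)^-1
    Posterior mean:       cov @ X^T y / sigma^2
    """
    d = X.shape[1]
    precision = X.T @ X / noise_var + l2 * np.eye(d)
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov

def pessimistic_action(mean, cov, candidate_features, alpha=1.0):
    """Pick the action maximising the LCB: predicted reward minus
    alpha times the posterior standard deviation of that prediction."""
    mu = candidate_features @ mean                      # posterior mean reward per action
    var = np.einsum('ij,jk,ik->i', candidate_features, cov, candidate_features)
    return int(np.argmax(mu - alpha * np.sqrt(var)))    # conservative decision rule

# Toy usage: logged (feature, click) pairs and a small candidate action set.
rng = np.random.default_rng(0)
X_logged = rng.normal(size=(500, 8))                    # illustrative context-action features
y_logged = rng.binomial(1, 0.1, size=500).astype(float) # illustrative click labels
mean, cov = fit_bayesian_ridge(X_logged, y_logged)
A = rng.normal(size=(5, 8))                             # feature vectors of 5 candidate actions
print(pessimistic_action(mean, cov, A, alpha=2.0))

With alpha = 0 this reduces to the usual greedy rule on the posterior mean; larger alpha values penalise actions whose predicted reward the model is uncertain about, which is the conservative behaviour the abstract refers to.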

Supplementary Material

MP4 File (RecSys2021_Video_PaperA_4K.mp4)
Presentation video



Published In

RecSys '21: Proceedings of the 15th ACM Conference on Recommender Systems
September 2021
883 pages
ISBN: 9781450384582
DOI: 10.1145/3460231
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2021


Author Tags

  1. Contextual Bandits
  2. Offline Reinforcement Learning
  3. Probabilistic Models

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Conference

RecSys '21: Fifteenth ACM Conference on Recommender Systems
September 27 - October 1, 2021
Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate: 254 of 1,295 submissions (20%)


Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 94
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 15 Feb 2025


Citations

Cited By

  • (2024)Unified PAC-Bayesian study of pessimism for offline policy learning with regularized importance samplingProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702680(88-109)Online publication date: 15-Jul-2024
  • (2024)On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender SystemsACM Transactions on Information Systems10.1145/366199642:6(1-26)Online publication date: 19-Aug-2024
  • (2024)Ranking the causal impact of recommendations under collider bias in k-spots recommender systemsACM Transactions on Recommender Systems10.1145/36431392:2(1-29)Online publication date: 14-May-2024
  • (2024)Δ-OPE: Off-Policy Estimation with Pairs of PoliciesProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688162(878-883)Online publication date: 8-Oct-2024
  • (2024)Multi-Objective Recommendation via Multivariate Policy LearningProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688132(712-721)Online publication date: 8-Oct-2024
  • (2024)Optimal Baseline Corrections for Off-Policy Contextual BanditsProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688105(722-732)Online publication date: 8-Oct-2024
  • (2024)CONSEQUENCES --- The 3rd Workshop on Causality, Counterfactuals and Sequential Decision-Making for Recommender SystemsProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3687095(1206-1209)Online publication date: 8-Oct-2024
  • (2024)On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n RecommendationProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671687(1222-1233)Online publication date: 25-Aug-2024
  • (2024)Reinforcing Long-Term Performance in Recommender Systems with User-Oriented Exploration PolicyProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657714(1850-1860)Online publication date: 10-Jul-2024
  • (2024)Ad-load Balancing via Off-policy Learning in a Content MarketplaceProceedings of the 17th ACM International Conference on Web Search and Data Mining10.1145/3616855.3635846(586-595)Online publication date: 4-Mar-2024
