1 Introduction and Motivation
Recommendation systems are tools that are often used to stimulate some form of engagement between users and items on online platforms. In the early days, fuelled by the popularity of the MovieLens datasets and the Netflix Prize [
6,
32], the problem was typically framed as that of
rating prediction. When presented with a dataset consisting of explicit ratings given by users to items, the recommendation system would generate a model that predicts the rating a certain user would give to a yet unseen item. Items with higher predicted ratings would then be assumed to make up better recommendations, and these assumptions allowed the field to thrive. As the web evolved, the availability of
implicit feedback quickly outgrew that of its
explicit counterpart, and recommendation models evolved with it. The focus on
rating prediction moved towards
item prediction, where the quality of a recommendation list is determined via ranking metrics borrowed from the neighbouring field of Information Retrieval [
100,
108]. This again allowed the field to thrive for several years.
More recently, several real-world recommendation systems have moved from the
prediction to the
decision paradigm, explicitly acknowledging that the recommendations we choose to show have an influence on the state of the world.
Moreover, the goals of live systems can diverge, and need not be focused on a single metric [
76]. Indeed, many modern web services deploy machine-learnt models on their websites to help steer traffic towards certain items. Retail websites try to predict which of their recommendations might lead to a sale, music streaming platforms suggest songs in your queue to optimise engagement metrics, search engines will often rank items in decreasing estimated probability of receiving a click, et cetera. In these and in many more use-cases, the system consists of a model that estimates the consequences of its actions, and weighs them before making a decision. For example, we might model the probability of receiving a click when showing a recommendation, and decide to show recommendations that maximise the estimated expected number of clicks.
These models are generally part of a (1) collect data, (2) train model, and (3) deploy model loop, where models are iteratively retrained and earlier versions influence the training data that is used for future iterations. This correlation between the deployed model and the collected training data can impede effective learning if we are unable to somehow correct for the bias it creates. Recent work has shown how such “algorithmic confounding” leads to feedback loops when left untreated, which can be detrimental to the users, the platforms, and the models themselves [
9,
70]. Traditional recommendation research that falls under the
prediction paradigm bypasses these feedback loops by (often implicitly) assuming
organic user-item interaction data that was collected independently of any existing recommendation process. Nevertheless, this assumption is often questionable, and can have a significant impact on evaluation results when violated [
38]. In deployed systems, feedback loops cannot be dismissed. In this work, we wish to learn directly from the logs of the deployed recommender system, casting the recommendation task in a bandit learning framework [
11,
109]. Here, the feedback loop is a
feature, as it allows us to directly optimise online reward metrics in an offline manner [
40]. Notwithstanding this, the biased nature of data collected by deployed recommendation policies should be taken into account appropriately.
Learning from biased data is not a novel problem, and many
unbiased learning procedures have proven to be effective in counteracting
position,
presentation,
trust, and
selection bias [
2,
3,
11,
52,
83]. These methods typically make use of importance sampling or
Inverse Propensity Score (IPS) weighting, in order to obtain an unbiased estimate of the counterfactual value-of-interest [
86]. They aim to answer questions of the form: “
What click-through-rate would this new policy have obtained, if it were deployed instead of the old policy?” The policy that maximises the answer to this question is the policy we want to deploy. Answering this question effectively and efficiently, however, is not an easy feat.
IPS is the cornerstone of counterfactual reasoning [
8], but by no means a silver bullet. It is plagued by variance issues that are exacerbated at scale, often making it hard to deploy these systems reliably in the real world [
29]. Furthermore, the randomisation requirements for IPS to remain unbiased are often unrealistic or simply unattainable. Recent work explores the effectiveness of counterfactual models in cases where IPS assumptions in the training data are violated, highlighting an interesting area for future research and a commonly-encountered yet understudied problem [
47,
89].
An alternative family of approaches is that of so-called “value-based” models. These methods rely on an explicit model of the reward conditioned on a context-action pair – for example, the probability of a user clicking on a given recommendation when it is shown [
33,
75]. When prompted, the model then simply takes the action that maximises the probability of a positive reward, given the presented context and the learnt model. Aside from the typical problems of model misspecification in supervised learning [
71], another issue with value-based methods is that learning an accurate model of the reward is not straightforward when the collected training data is heavily influenced by the model that was deployed in production at the time. Methods exist that use IPS to re-weight the data so that it resembles an unbiased sample [
97], but their performance when deployed as recommendation policies is often disappointing in comparison with policy-based methods or even reward models that do not re-weight the data [
47,
80]. Furthermore, the logging policy is not always known beforehand, several logging policies might be at play concurrently [
1,
53], and even when we do manage to obtain unbiased value estimates we should expect the true obtained reward from acting on them to be disappointing. Indeed, only considering the action with the highest estimated reward can be a flawed decision procedure in and of itself – a phenomenon known as “the Optimiser’s Curse” [
99].
Contributions. In this paper, we focus on improving the recommendation performance of policies that rely on value-based models of expected reward. We propose and validate a general pessimistic reward modelling framework, with a focus on the task of off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can then in turn be used to generate conservative decision rules based on the resulting reward predictions – instead of the usual ones based on
Maximum Likelihood (ML) or
Maximum A Posteriori (MAP) estimates. We show how closed-form expressions for both the posterior mean and variance can be leveraged to express pessimism when a ridge regressor models the reward, and how to apply them effectively and efficiently to an off-policy recommendation use-case. Our approach is agnostic to the logging policy, and does not require (a model of) propensity scores to quantify selection bias. As a result, we are not bound to the strict assumptions that make IPS work, and abide by statistical conjectures such as the likelihood principle [
7]. Pessimistic decision-making not only significantly increases the reward obtained by the learnt policy’s recommendations, but we additionally show how our proposed framework lifts the Optimiser’s Curse. By essentially accepting an increase in model bias over the
full action space for reduced variance in the
topmost actions, we significantly improve the recommender’s ability to forecast its own performance. That is, we limit “post-decision disappointment”, defined as the difference between the estimated expected reward and the true obtained reward. We discuss the important consequences this has for offline evaluation and downstream applications such as computational advertising.
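As a brief illustration of this decision rule, consider the following sketch (assuming a Gaussian posterior over the weights of a ridge regressor with a fixed noise scale; the function names, toy data, and the use of a simple multiple of the posterior standard deviation in place of a posterior quantile are illustrative rather than prescriptive):

```python
import numpy as np

def fit_ridge_posterior(X, y, lam=1.0, sigma2=1.0):
    """Closed-form posterior for a Bayesian ridge regressor.

    Posterior mean:       (X^T X + lam * I)^-1 X^T y
    Posterior covariance: sigma2 * (X^T X + lam * I)^-1
    """
    d = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    return A_inv @ X.T @ y, sigma2 * A_inv

def lcb_scores(X_candidates, mean, cov, alpha=1.0):
    """Lower-confidence-bound reward estimates for candidate context-action features."""
    mu = X_candidates @ mean                                           # posterior mean estimate
    var = np.einsum("ij,jk,ik->i", X_candidates, cov, X_candidates)    # predictive variance
    return mu - alpha * np.sqrt(var)                                   # pessimistic estimate

# Toy usage: pick the action whose *pessimistic* estimate is highest.
rng = np.random.default_rng(0)
X_logged = rng.normal(size=(500, 8))         # logged context-action features
y_logged = rng.binomial(1, 0.05, size=500)   # observed clicks
mean, cov = fit_ridge_posterior(X_logged, y_logged)
X_cand = rng.normal(size=(10, 8))            # feature vectors for candidate actions
best_action = int(np.argmax(lcb_scores(X_cand, mean, cov, alpha=1.0)))
```

Acting on the pessimistic scores rather than on the posterior mean is what distinguishes this conservative decision rule from MAP-based decision-making.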
Bias-variance trade-offs and pessimism are not new to the machine learning field. Especially in general
Reinforcement Learning (RL) research, estimators are often replaced by conservative lower-bounds, albeit for slightly different reasons [
92,
93,
112].
In the case of policy-based methods in particular, IPS’ well-known variance issues have caused a plethora of extensions to sprout in recent years. Some of these explicitly recognise their pessimistic nature; others do not. We provide an overview of those relevant to the recommendation use-case, draw parallels with our proposed value-based pessimism, and highlight connections and differences.
One further connection that cannot be overlooked is how our proposed approach for off-policy learning seemingly goes against the “optimism in the face of uncertainty” adage adopted by decades of work on on-policy bandits [
4,
10,
59]. We show how the insights presented in this work are complementary to theirs, and propose a research agenda for future work connecting off- and on-policy learning with hybrid approaches.
Evaluation. The empirical performance of counterfactual learning methods is often reported with a supervised-to-bandit conversion on existing multi-class or multi-label classification datasets [
51,
69,
102]. As publicly available datasets with propensity information are scarce, this inhibits robust and reproducible evaluation of such methods on off-policy
recommendation tasks. In line with recent work [
41,
45,
47,
91], we adopt the RecoGym simulation environment in our experiments to yield reproducible results that are aligned with the specifics of real-world recommendation scenarios, such as stochastic rewards, limited randomisation, and small effect sizes [
87]. An added advantage of adopting such a simulation framework is the freedom gained to change environmental parameters and better understand how these changes affect the trade-offs between different methods. This allows us to present insights into our proposed method that offline datasets would not be able to uncover. Empirical observations for a wide range of configurations show that our proposed approach of pessimistic decision-making leads to a significant and robust increase in recommendation performance. The merits of our method are most pronounced in realistic settings where the amount of randomisation in the logging policy is limited, training sample sizes are small, and action spaces are large. Indeed, these are exactly the cases where selection bias will be strong, and over-estimation is likely to occur. All source code to reproduce the reported results is available at
github.com/olivierjeunen/pessimism-recsys-2021.
To summarise, the main contributions we present in this work are:
(1)
We propose the use of explicit pessimism in reward models for off-policy recommendation use-cases.
(2)
We introduce the decision-making phenomenon known as the Optimiser’s Curse in the context of recommendation, and show how naïve reward models suffer from it. In contrast, principled pessimism lifts the curse.
(3)
We show how to leverage closed-form estimates for the posterior mean and variance of a ridge regressor to express pessimism, and how to apply this effectively and efficiently to an off-policy recommendation use-case.
(4)
Empirical observations from reproducible simulation experiments highlight that explicit pessimism significantly and robustly improves online recommendation performance, compared to ML or MAP-based decision-making.
(5)
We draw parallels with existing work in general on-policy bandits as well as pessimistic policy-learning, and present a scope for future research connecting these problem settings and research areas.
Extensions to [42]. The introduction and motivation of our work have been extended to explicitly highlight connections with related research areas, both within and outside of the Recommender Systems field. The background and related work section more clearly depicts what the data-generating process looks like in our use-case – additionally providing more detail on doubly robust methods and their relevance. We have included a stream of related work on using Markov Decision Processes (model-based reinforcement learning) for recommendation, motivating how their use-case differs from ours. The core methodological section of the paper provides more details on reward estimation and decision-making in policy learning, allowing us to highlight some examples of pessimism in these cases, and draw connections with existing well-known reinforcement learning methods such as TRPO and PPO [
92,
93]. Further detail on applications in related areas such as computational advertising has been included, where the impact of pessimistic decision-making will be especially tangible. Deeper connections between optimism in on-policy bandits and our proposed pessimism in off-policy bandits are discussed. We highlight their differences and commonalities, and propose a scope for future work on hybrid approaches. More detail on the experimental setup is included, and we further motivate the simulation framework we have adopted to empirically validate our proposed method. We have further extended the experiments with respect to research questions 1–3 over action spaces with varying sizes, and significantly expanded the discussion of the results and their impact. The conclusion has been extended and rewritten to include a scope for future research, and the abstract reflects the new contents of the paper.
2 Background and Related Work
We are interested in modelling recommendation systems following the
“Batch Learning from Bandit Feedback” (BLBF) paradigm [
104]: a general machine learning setting that properly characterises the off-policy recommendation use-case as it widely occurs in practice. A recommender system is modelled as a stochastic policy
\(\pi\) that samples its recommendations from a probability distribution over actions
A conditioned on contexts
C:
\(\mathsf {P}(A|C, \pi)\), often denoted
\(\pi (A|C)\).
Note that
\(\pi\) is modelled to be stochastic for generality, but that deterministic systems are implied when
\(\mathsf {P}(A|C,\pi)\) is a degenerate distribution. Contexts are drawn from some unknown marginal distribution
\(\mathsf {P}(C)\) and can represent a variety of information about the user visiting the system, such as their consumption history, the time of day and the device they are using. When talking about the feature vector for a specific context, we denote it as
\(\mathbf {c}\). Analogously, feature vectors for specific actions are represented as
\(\mathbf {a}\), which can include discrete identifiers as well as information about interactions with the item or its content. The sets of all possible contexts and actions are
\(\mathcal {C}\) and
\(\mathcal {A}\), respectively. The combined feature representation of a context-action pair is
\(\mathbf {x} := \Phi (\mathbf {c},\mathbf {a})\), where
\(\Phi\) is a function that maps context- and action-features to a joint space. Note that this step – including interaction terms between contexts and actions – is necessary to allow linear models to learn personalised treatments.
\(\Phi\) can be anything from a simple Kronecker product between one-hot-encoded contexts and actions [
80], to a specialised neural network architecture that learns a shared embedding for multi-task learning [
67,
106,
115]. In the off-policy or counterfactual setting, we have access to a dataset consisting of logged context-action pairs and their associated rewards:
\(\mathcal {D} := \lbrace (c, a, r)_{t}\rbrace _{t=1}^{t_{\max }}\), where
\(c \sim \mathsf {P}(C)\),
\(a \sim \pi _{0}(a|c)\) and
\(r \sim \mathsf {P}(R|C,A)\). Here,
r represents the immediate reward that the system obtained from recommending
a when presented with context
c at a given time
t. In the general case this reward can be binary (e.g., clicks), real-valued (e.g., dwell time or profit), or higher-dimensional to support multiple objectives (e.g., fairness and relevance) [
77,
78]. The policy that was deployed at data collection time is called the logging policy (
\(\pi _0\)). This type of setting is called “bandit feedback”, as we only observe the reward of the actions chosen by the contextual bandit
\(\pi _0\). It is referred to as being “off-policy”, as we have no control over
\(\pi _0\) or its exploration strategy. We place this paradigm at the focal point of our work, as it is the most closely aligned with the recommendation use-case that practitioners typically face in industry. Indeed, truly on-policy methods are often prohibitively costly to implement due to the need for frequent real-time updates [
11], and continuous experimentation practices lead to multiple logging policies that interact and give rise to selection bias that needs to be addressed [
1,
24,
53]. We discuss deeper connections between typical on- and off-policy approaches, as well as a need for hybrid methods in Section
3.5. Figure
1 visualises our interactive data-generating process on the left-hand side, with the learning objective on the right-hand side.
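As a concrete (and purely illustrative) example of this setup, the sketch below instantiates \(\Phi\) as a Kronecker product between context features and a one-hot action encoding, and stores logged bandit feedback as \((c, a, r)\) tuples; the variable names and toy values are hypothetical:

```python
import numpy as np

def phi(context_vec: np.ndarray, action_id: int, num_actions: int) -> np.ndarray:
    """Joint feature map Phi(c, a): Kronecker product of context features with a
    one-hot encoding of the action, yielding explicit context-action interactions."""
    action_one_hot = np.zeros(num_actions)
    action_one_hot[action_id] = 1.0
    return np.kron(context_vec, action_one_hot)

# A logged dataset D of (context, action, reward) tuples, collected under pi_0.
num_actions = 5
logged_data = [
    (np.array([2.0, 0.0, 1.0]), 3, 1),  # organic counts as context, shown action, click
    (np.array([0.0, 1.0, 4.0]), 0, 0),
]
X = np.stack([phi(c, a, num_actions) for c, a, _ in logged_data])
r = np.array([reward for _, _, reward in logged_data])
```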
Learning to recommend from organic user-item interactions. Most traditional approaches to recommendation do not make use of this type of experimental data tying recommendations to observed outcomes. Instead, they typically adopt observational datasets consisting of “organic” interactions between users and items, such as product views on retail websites. By framing the recommendation task as next-item prediction in such a setting, the goal of these systems is no longer that of learning optimal interventions. Maybe unsurprisingly, offline evaluation results in such environments are notoriously uncorrelated with online success metrics based on shown recommendations, making it harder to discern
true progress with regard to online gains [
17,
28,
39,
88]. Nevertheless, it is a very active research area that yields many interesting publications and results every year. Recent trends are geared towards the use of Bayesian techniques that explicitly model uncertainty [
23,
61,
66,
96], and linear item-based models that achieve state-of-the-art performance whilst being highly efficient to compute [
14,
16,
48,
49,
81,
101].
Off-policy learning from bandit feedback. The bandit feedback setup described above finds its roots in the field of offline
reinforcement learning (RL), with the additional simplifying assumption that past actions do not influence future states (more formally, the underlying Markov Decision Process consists of a single time-step) [
58]. This type of learning setup is not specific to the recommendation task, and many learning methods are evaluated on simulated bandit feedback scenarios using general purpose multi-class or multi-label datasets. Approaches for off-policy learning optimise a parametric policy for some counterfactual estimate of the reward it would have obtained, if deployed.
The go-to technique that enables this type of counterfactual reasoning is importance sampling [
8,
86]. Equation (
1) shows how it obtains an empirical estimate for the value of a policy
\(\pi\), using data
\(\mathcal {D}\), and a model of the logging policy
\(\widehat{\pi _{0}}\) (which can be exact and known, or learnt from data):
\[
\widehat{V}_{\text{IPS}}(\pi, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(c,a,r) \in \mathcal{D}} r \cdot \frac{\pi(a|c)}{\widehat{\pi_{0}}(a|c)}. \tag{1}
\]
Many learning algorithms in this family aim to mitigate the increased variance that is a consequence of the IPS weights. Capping the probability ratio to a fixed value [
37], self-normalising the weights [
51,
105], imposing variance regularisation [
72,
104], imitation learning [
69], or distributional robustness [
26,
98] on the learnt policy are commonly used tools to trade off the unbiasedness of IPS for improved variance properties in finite sample scenarios. Many of these techniques can be interpreted as a form of principled
pessimism, where we would rather be conservative with the IPS weights than over-estimate the value of an action to a policy.
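The following sketch illustrates the estimator in Equation (1) together with two of the variance-reduction techniques mentioned above, weight capping and self-normalisation; the function signature and the synthetic data are illustrative only:

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs, cap=None, self_normalise=False):
    """Importance-sampling estimate of a target policy's value from logged bandit feedback.

    rewards:        observed rewards r for the logged actions
    target_probs:   pi(a|c) under the policy being evaluated
    logging_probs:  (estimated) pi_0(a|c) under the logging policy
    cap:            optional maximum value for the importance weights
    self_normalise: divide by the sum of weights instead of the sample size (SNIPS)
    """
    w = target_probs / logging_probs          # importance weights
    if cap is not None:
        w = np.minimum(w, cap)                # capped IPS trades bias for variance
    if self_normalise:
        return np.sum(w * rewards) / np.sum(w)
    return np.mean(w * rewards)

# Toy usage on synthetic logged data.
rng = np.random.default_rng(0)
r = rng.binomial(1, 0.05, size=1_000)
p_log = rng.uniform(0.05, 0.5, size=1_000)
p_new = rng.uniform(0.05, 0.5, size=1_000)
v_ips = ips_estimate(r, p_new, p_log)
v_clip = ips_estimate(r, p_new, p_log, cap=10.0)
v_snips = ips_estimate(r, p_new, p_log, self_normalise=True)
```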
A conceptually simpler family of approaches is that of value-based methods, often referred to as Q-learning in the RL community, or the
“Direct Method” (DM) in the bandit literature. Equation (
2) shows how DM obtains an empirical estimate of policy
\(\pi\)’s value w.r.t. a dataset of logged bandit feedback
\(\mathcal {D}\):
\[
\widehat{V}_{\text{DM}}(\pi, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(c,a,r) \in \mathcal{D}} \sum_{a^{\prime} \in \mathcal{A}} \widehat{r}(a^{\prime}, c)\, \pi(a^{\prime}|c). \tag{2}
\]
Value-based counterfactual estimators do not rely on a model of the logging policy, but rather learn a model for the context-specific immediate reward of an action:
\(\widehat{r}(a,c) \approx \mathbb {E}[R|C = c, A = a]\). In practice, the available bandit feedback
\(\mathcal {D}\) is split into disjoint training sets for the optimisation of the reward model and the resulting policy, respectively. Nevertheless, it is easy to see that the optimal policy
\(\pi ^{*}_{\text{DM}}\) with respect to a given reward model places all its probability mass on the action with the highest estimated reward:
\[
\pi^{*}_{\text{DM}}(a|c) = \begin{cases} 1 & \text{if } a = \operatorname*{arg\,max}_{a^{\prime} \in \mathcal{A}} \widehat{r}(a^{\prime}, c),\\ 0 & \text{otherwise}. \end{cases} \tag{3}
\]
As a consequence, we can directly obtain a decision rule from the reward estimates and train the reward estimator on all available data [
47]. Note that this simple decision rule leads to a deterministic policy, but stochastic value-based policies can be obtained by explicitly optimising Equation (
2) with an entropy regularisation term [
31]. Value-based methods as laid out above are typically biased, but exhibit more favourable variance properties than IPS-based models. While policy-based methods for learning from bandit feedback need (a model of) the logging propensities [
11,
109], this is not a constraint for the value-based family. When multiple logging policies are at play (e.g., during an A/B-test), this complicates the use of standard importance sampling techniques even further [
1,
24,
53].
A unifying family of
Doubly Robust (DR) methods aims to marry these two types of approaches in an attempt to get the best of both worlds [
19], as shown in Equation (
4):
\[
\widehat{V}_{\text{DR}}(\pi, \mathcal{D}) = \widehat{V}_{\text{DM}}(\pi, \mathcal{D}) + \frac{1}{|\mathcal{D}|} \sum_{(c,a,r) \in \mathcal{D}} \frac{\pi(a|c)}{\widehat{\pi_{0}}(a|c)} \bigl(r - \widehat{r}(a, c)\bigr). \tag{4}
\]
DR essentially augments DM with an IPS-term that is weighted by the error of the value-based model. When either the propensities
\(\widehat{\pi _{0}}\) or the reward model
\(\widehat{r}\) are correct, this estimator is provably unbiased. However, as it requires a separate model for the policy and the reward, we leave an explicit analysis of pessimism in doubly robust estimators for future work.
Recent advances in doubly robust learning typically optimise the trade-off between DM and IPS [
103], optimise the reward model to minimise the overall variance of the estimator [
25], or transform the IPS weights to minimise bounds on the expected error of the estimate [
102]. Nevertheless, the performance of the reward model remains paramount for doubly robust approaches to attain competitive performance – and it is not uncommon for DR to be outperformed by either DM or IPS [
41].
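For completeness, a minimal sketch of the doubly robust estimate in Equation (4) is given below; the callables for the target policy and the reward model are placeholders and not part of any specific implementation:

```python
import numpy as np

def dr_estimate(logged, target_policy, logging_probs, reward_model, actions):
    """Doubly robust value estimate: direct method plus an IPS-weighted
    correction for the reward model's error on the logged actions.

    logged:        list of (context, action, reward) tuples
    target_policy: callable pi(a, c) -> probability of action a in context c
    logging_probs: array with pi_0(a|c) for each logged (c, a) pair
    reward_model:  callable r_hat(a, c) -> estimated reward
    actions:       the full action space A
    """
    values = []
    for i, (c, a, r) in enumerate(logged):
        dm_term = sum(reward_model(a_prime, c) * target_policy(a_prime, c) for a_prime in actions)
        weight = target_policy(a, c) / logging_probs[i]
        values.append(dm_term + weight * (r - reward_model(a, c)))
    return float(np.mean(values))
```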
Reinforcement learning for recommendation. The methods introduced above are bandit-based: they aim to learn which actions to take based on the outcomes of previously logged actions. They focus on
immediate rewards, and include no notion of
planning to improve future rewards nor notions of long-term value. Indeed, because bandit-based approaches do not explicitly model
state transitions, they implicitly assume that current actions do not influence the distribution of future contexts or rewards. This assumption significantly simplifies modelling and learning, but can be limiting in more complex scenarios where the optimisation of rewards over a sequence of actions is required. Several works use
Markov Decision Processes (MDPs) to incorporate such notions of long-term rewards into recommendation use-cases [
36,
82,
95]. We do not explicitly model long-term value in this work, and focus on the bandit setting as opposed to a full RL setting – but note that ideas of pessimism have recently been adopted in general RL use-cases [
54,
56,
63,
114]. Exploring whether our presented insights extend to RL recommendation settings is an open and interesting area for future research.
Off-policy learning for recommendation. Methods that apply ideas from the bandit and RL literature to recommendation problems have seen increased research interest in recent years – typically in off-policy settings. Chen et al. extend a policy gradient-based method with a top-
K IPS estimator and show significant gains from exploiting bandit feedback in online experiments [
11]. In the top-1 use-case we consider with the additional bandit assumption, their method yields a policy that is analogous to one optimised for
\(\widehat{V}_{\text{IPS}}\) (Equation (
1)). This work has been extended to deal with two-stage recommender systems pipelines that are typically adopted to deal with large action spaces [
68]. Xin et al. adopt a Q-learning perspective to deal with sequential recommendation tasks, exploiting both self-supervised (
organic) and reinforcement (
bandit) signals [
113]. Analogously, Sakhi et al. propose a probabilistic latent model that combines organic and bandit signals in a Bayesian value-based manner [
91]. The work of Jeunen et al. studies the performance of both value- and policy-based approaches when the organic data is only used to describe the context, proposing a joint policy-value approach that outperforms stand-alone methods without the need for an external reward model [
47]. Their experimental set-up is the closest to the one we tackle in this work.
On-policy learning for recommendation. Off-policy methods learn from data that was collected under a different policy. In contrast, on-policy methods learn from data that they themselves collect. In such cases, the well-known exploration-exploitation trade-off becomes important, as the policy needs to balance the immediate reward with the informational value of an action [
60,
74]. Successful methods use variants of Thompson sampling [
10,
20,
73] or confidence bounds [
59]; recent work benchmarks a number of different exploration approaches to predict clicks on advertisements when the reward model is parameterised as a neural network [
30]. Although the use-case we tackle in this work does not include any interactive component, we draw upon existing work in learning from on-policy bandit feedback to obtain improved, uncertainty-aware decision strategies in the off-policy setting.
Uncertainty estimation. Both Thompson sampling and confidence-bound-based methods make use of a posterior distribution for the reward estimates, instead of the usual point estimate that is obtained from uncertainty-agnostic models. Principled Bayesian methods can be used to obtain closed-form expressions for exact or approximate posteriors, but they are often restricted to specific model classes [
10,
59]. The Bootstrap principle [
21], its extensions [
85] (originally proposed in the context of Q-learning), and Monte Carlo Dropout [
27] can provide practical uncertainty estimates for general neural network models. The work of Guo et al. proposes a hybrid Bootstrap-Dropout approach, and validates the effectiveness of the obtained uncertainty estimates in an on-policy recommendation scenario [
30]. Finally, other recent work shows promising results in inferring model uncertainty from neuron activation strength [
15]. All these uncertainty estimation methods are complementary to the framework we propose in this paper, and can be used to explicitly express either
optimism in on-policy settings, or our proposed
pessimism for off-policy learning.
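To give a flavour of how such posterior-free uncertainty estimates can be obtained, the sketch below builds a bootstrap ensemble of ridge regressors and turns the spread of its predictions into an optimistic or pessimistic score; it is a simplified stand-in for the methods cited above, with illustrative function names and hyperparameters:

```python
import numpy as np

def bootstrap_reward_models(X, y, num_models=20, lam=1.0, seed=0):
    """Fit ridge regressors on bootstrap resamples of the logged bandit feedback."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = []
    for _ in range(num_models):
        idx = rng.integers(0, n, size=n)              # sample with replacement
        Xb, yb = X[idx], y[idx]
        w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ yb)
        weights.append(w)
    return np.stack(weights)                          # shape: (num_models, d)

def ensemble_scores(X_candidates, weights, alpha=1.0, pessimistic=True):
    """Mean prediction across the ensemble, shifted down (LCB) or up (UCB)
    by alpha times the ensemble standard deviation."""
    preds = X_candidates @ weights.T                  # (num_candidates, num_models)
    mu, sd = preds.mean(axis=1), preds.std(axis=1)
    return mu - alpha * sd if pessimistic else mu + alpha * sd
```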
4 Experimental Results and Discussion
A key component of recommender systems is their interactive nature: evaluating recommendation policies on offline datasets is not a straightforward task, and conclusions drawn from offline results often contrast with the online metrics that we care about [
28,
39,
88]. Indeed, this is a strong motivation for casting recommendation as a bandit learning problem, allowing for
offline optimisation of
online objectives [
40,
46].
Methods for BLBF are often evaluated using supervised-to-bandit conversions on multi-class or multi-label classification datasets [
26,
104,
105]. This type of empirical validation is warranted in general machine learning use-cases, but it is unclear how these results translate to improved recommendations [
47]. Recent work on BLBF
for recommenders either shows empirical success by adopting the same supervised-to-bandit conversions on
organic user-item datasets [
68], by running live experiments [
11,
74,
78], or by adopting open-source simulation environments [
47,
91] (which have seen a growing interest in the Recommender Systems community as of late [
22,
35], among related research areas [
44]).
To aid in the reproducibility of our work, we make use of the RecoGym simulation environment [
87]. RecoGym provides functionality to simulate organic user-item interactions (e.g., users viewing products on a retail website), as well as bandit interactions under a given logging policy (users clicking on shown recommendations). Publicly available datasets that contain both types of data (observational
and experimental) are scarce, and still insufficient for reliable counterfactual evaluation. A considerable advantage of RecoGym is the opportunity to simulate online experiments such as A/B-tests, which can then be used to reliably estimate the online performance of an intervention policy in the synthetic environment. We refer the interested reader to the source code of the simulator
or the reproducibility appendix of [
47] for an overview of the inner workings of the simulation environment, while pointing out that the underlying RecoGym reward model adopts a latent factor model assumption that is often made in recommender systems research [
55]. The source code to reproduce our experiments is publicly available at
github.com/olivierjeunen/pessimism-recsys-2021. The research questions we wish to answer are the following:
RQ1
Can we find empirical evidence of the Optimiser’s Curse in off-policy recommendation environments?
RQ2
Can our proposed LCB decision-making strategy effectively limit post-decision disappointment?
RQ3
Can we increase online performance with a recommendation policy using a reward model with LCB predictions?
RQ4
How are these methods influenced by the amount of randomisation in the logging policy?
RQ5
How are these methods influenced by the number of training samples and the size of the action space?
4.1 Logging Policies
An important factor to take into account when learning from bandit feedback is the logging policy that was deployed at the time of data collection. Deterministic policies make bandit learning nearly impossible, whereas a uniformly random logging policy generates unbiased data, but is an idealised case in practice. Realistic logging policies will aim to show recommendations that they perceive to be relevant, whilst allowing other actions to be taken in an explorative manner. We adopt a simple but effective personalised popularity policy based on the organic user-item interactions that have preceded the impression opportunity. For a context
c consisting of historical counts of organic interactions with items (as laid out in the parameterisation in Section
3.4), the logging policy
\(\pi _{\text{pop}}\) samples actions proportionately to their organic occurrences. This policy is deficient, as it does not assign a non-zero probability mass to every possible action in every possible context [
89]. Deficient logging policies violate the assumptions made by IPS to yield an unbiased reward estimate [
86], which poses a significant hurdle for policy-based methods. Nevertheless, they are realistic to consider in real-world off-policy recommendation scenarios. This extreme form of selection bias impedes effective reward modelling as well, as we will show in the following section. Indeed, when a context-action pair has zero probability of occurring in the training sample, we
need to resort to appropriate priors or conservative decision making. The deficiency of
\(\pi _{\text{pop}}\) can be mitigated easily by adopting an
\(\epsilon\)-greedy exploration mechanism, where we resort to the uniform policy with probability
\(\epsilon \in [0,1]\). Naturally, this recovers both
\(\pi _{\text{pop}}\) and
\(\pi _{\text{uni}}\) when
\(\epsilon\) is 0 or 1, respectively. For arbitrarily small values of
\(\epsilon\),
\(\pi _{0}\) is no longer deficient in theory, but extremely unlikely to explore the full context-action space within finite samples.
We vary
\(\epsilon \in \lbrace 0, 10^{-6}, 10^{-4}, 10^{-2}, 1\rbrace\) in our experimental setup. Note that this type of logging policy is equivalent to the ones used in previous works [
41,
45,
47,
80,
91], but that we explore a wider range of logging policy randomisation to highlight the effects on naïve reward modelling procedures.
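The sketch below mirrors the textual description of this logging policy – organic-count-proportional probabilities mixed with the uniform policy through \(\epsilon\) – rather than RecoGym’s internal implementation; names and the toy example are illustrative:

```python
import numpy as np

def epsilon_greedy_popularity(organic_counts: np.ndarray, epsilon: float) -> np.ndarray:
    """Logging-policy distribution pi_0(a|c): personalised popularity mixed with uniform.

    organic_counts: per-item counts of the user's organic interactions (the context c)
    epsilon:        probability of falling back to the uniform policy
    """
    num_actions = len(organic_counts)
    uniform = np.full(num_actions, 1.0 / num_actions)
    total = organic_counts.sum()
    if total == 0:
        pop = uniform                                  # no organic history: fall back to uniform
    else:
        pop = organic_counts / total                   # proportional to organic occurrences
    return (1.0 - epsilon) * pop + epsilon * uniform

# Example: epsilon = 0 recovers pi_pop, epsilon = 1 recovers pi_uni.
probs = epsilon_greedy_popularity(np.array([3.0, 0.0, 1.0, 0.0]), epsilon=1e-2)
action = np.random.default_rng(0).choice(len(probs), p=probs)
```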
4.2 Optimiser’s Curse (RQ1-3)
To validate whether the theoretical concept of the Optimiser’s Curse actually occurs when reward models are learnt in off-policy recommendation settings, we adopt the following procedure:
(1)
Generate a dataset containing organic and bandit feedback,
(2)
train a reward model as described in Section
3.4 – optimising the regularisation strength
\(\lambda\) to minimise
Mean Squared Error (MSE) on a validation set of 20%,
(3)
simulate an A/B-test and log the difference between the reward estimates \(\widehat{p}_{i^{*}}\) and the true reward probability \(p^{*}_{i^{*}}\) for the actions selected by the competing decision strategies.
We then vary the logging policy in (1), and repeat this process five times to ensure statistically robust and significant results. Every generated training set and every simulated A/B-test consists of 10 000 distinct users, leading to approximately 800 000 bandit opportunities in the training set as well as 800 000 online impressions per evaluated policy.
The Optimiser’s Curse states that we should expect to be disappointed with respect to our reward estimates. As such, we define the average empirical disappointment as the difference between the expected reward estimated by the reward model and the true expected reward:
\(\widehat{p_{i^{*}}}-p^{*}_{i^{*}}\). As we have argued earlier, a simple bias term on the estimates
\(\widehat{p_{i^{*}}}\) can be tuned to bring the average empirical disappointment to zero. This, however, has no impact on the decision making strategy, and therefore does not solve our problems. Indeed, our goal is two-fold: we wish to
decrease absolute disappointment, whilst
increasing the online reward our recommendation policy obtains. Figure
3 plots these two quantities for competing decision strategies, varying the amount of selection bias in the logging policy per column, and increasing the size of the action space over the rows. Plots in the upper right quadrant of the figure correspond to less realistic environments, where the size of the action space and the cost of randomisation are limited. In real-world scenarios, the opposite will often be true. The plots on the lower left side of the figure reflect these constraints. The x-axes show disappointment (closer to zero is better), and the y-axes show a 95% credible interval for the obtained
click-through-rate (CTR) per recommendation policy in the simulated A/B-test (higher is better). Maximum Likelihood Estimates are consistently so far off that we do not include them in this analysis. The baseline and widely adopted decision strategy of taking the highest MAP action is shown (
\(\alpha =0\)), along with our pessimistic lower-confidence-bound strategy, varying the lower posterior quantile
\(\alpha \in \lbrace 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45\rbrace\). Increased values of
\(\alpha\) strictly decrease disappointment, and
\(\alpha =0\) always corresponds to the rightmost measurement within a plot, whereas
\(\alpha =0.45\) is shown leftmost. Note that this hyperparameter
\(\alpha\) plays an important role, and that it can always be increased to achieve zero post-decision disappointment (and even lower – indicating that we are being overly pessimistic). While this sets more realistic expectations for the performance of the reward model (and hence is important when the model is used for offline evaluation or computational advertising), this provides no theoretically guaranteed improvement in the online metrics we care about. Also note that this type of experimental procedure would not be feasible without the use of a simulation environment, as we usually do not have access to the true reward probability
\(p^{*}_{i}\). In such cases, we would need to resort to empirical averages based on the observed reward.
Empirical Observations. First, we see clear empirical evidence of the Optimiser’s Curse in action: when acting based on the MAP estimate (
\(\alpha =0\)), we encounter post-decision disappointment regardless of the logging policy. As our trained reward models are even slightly under-calibrated w.r.t. the empirical training sample (i.e. negative mean error), this result can seem counter-intuitive and is not straightforward to mitigate with a bias term tuned on offline data. Second, we observe that pessimistic decision-making based on predictive uncertainty consistently decreases disappointment, and that it can significantly increase the policy’s attained CTR in A/B-tests. The optimal value of
\(\alpha\) with respect to online performance also brings the average empirical disappointment closer to zero, indicating that these values are closely related.
The interpretation of \(\alpha\) in terms of the coverage of the approximate posterior of
\(\widehat{r}\) helps when tuning it [
111]. Naturally, when the variance on the reward estimates is homoscedastic w.r.t. the actions, LCB does not affect the ordering of the reward estimates or the resulting policy. This explains why online performance is not significantly impacted when the logging policy is uniform, while post-decision disappointment can consistently be alleviated. We observe that the expected benefits of pessimism, both in terms of decreased disappointment and in terms of increased online reward, are lower in the upper right quadrant. This is to be expected, as the context-action space is more likely to be well explored in these cases, and the MAP estimate achieves good performance. In the more realistic settings in the lower left, the improvements are significant and consistent. Indeed, we observe that MAP estimates consistently over-estimate the expected reward, by a large margin. In the case of
\(\epsilon =0\) and
\(|\mathcal {A}|=250\), the MAP strategy obtains a CTR of 1.6%, with a disappointment of 5.2%: over-estimating the reward by a factor of 3.25. Our proposed explicitly pessimistic decision-making strategy removes all empirical disappointment while improving CTR by 28%.
4.3 Performance Comparison (RQ3-5)
To further assess when our proposed pessimistic decision-making procedure can lead to an offline learnt policy with improved online performance, we train models on a range of datasets generated under different environmental conditions and report results from several simulated A/B-tests. The resulting CTR estimates with their 95% credible intervals are shown in Figure
4. Every row corresponds to a differently sized action space (
\(|\mathcal {A}| \in \lbrace 10, 25, 50, 100, 250\rbrace\)), every column shows results for a different amount of randomisation in the logging policy. The amount of available training data for the reward model increases over the x-axis for every plot. We report CTR estimates for policies that act according to reward models based on ML or MAP estimates, and those that use lower confidence bounds with a tuned
\(\alpha\). Additionally, we show the CTR attained by the logging policy
\(\pi _0\), and an unattainable skyline policy
\(\pi ^{*}\) that acts based on the true reward probabilities
\(p^{*}\). This provides an upper bound on the expected CTR that any decision-making strategy can obtain. Every measurement shown in Figure
4 represents a 95% credible interval over five runs with 10 000 evaluation users, totalling
\(1\,000\) simulated A/B-tests with five competing policies each, or more than three billion impressions in total. As our reward models are agnostic to the logging propensities, we do not include policy-based approaches that would require them (either purely based on IPS [
8], hybrid [
47], or doubly robust [
19]). We do note that our results are directly comparable to those presented in [
47,
80], and both our novel LCB method and MAP baseline show significant improvements over all their policy- and value-based competitors.
Empirical Observations. In line with our observations from Figure
3, we see that LCB decision-making yields a robust and significant improvement over naïvely acting on ML or MAP estimates. This result is consistent over varying training sample sizes, action spaces and logging policies, but most pronounced in cases where the amount of randomisation and the number of available training samples are limited, and the action space is larger. As explicit randomisation and data collection can be expensive in practice, the environments where LCB excels are the ones that are most commonly encountered in real-world systems. Additionally, we observe more consistent and robust behaviour for policies that use LCB decisions compared to those that do not. This decreased variance in online performance can also be attributed to pessimistic decision-making: because we no longer take our chances with high-uncertainty predictions, we fall back to more robust alternatives. We know what the reward model does not know, and this gained knowledge significantly benefits the interpretation of reward predictions, and the resulting decisions.
Limitations of the Study Design. Off-policy approaches for learning from bandit feedback are typically evaluated in set-ups where the size of the action space is a few dozen at most [
51,
102,
104]. As a result, methods for counterfactual learning in recommendation are often evaluated in modestly sized action spaces too [
47,
80,
91]. Therefore, the reported results are most relevant to personalisation use-cases where the number of alternatives is limited, such as personalising tiles or rows on a homepage, recommending news articles from a set of recently published ones, or predicting clicks within a slate. The size of the item catalogue in general purpose recommendation scenarios can be in the hundreds of thousands, warranting further research into off-policy recommendation for very large action spaces [
65]. In such environments, learning continuous item embeddings as opposed to the discrete representation we have adopted can provide a way forward. Moreover, the lack of publicly available datasets for the off-policy recommendation task can be prohibitive for reproducible empirical validation of newly proposed methods. The few alternatives that do exist [
57,
90] still deal with comparatively small action spaces and need to resort to counterfactual evaluation procedures with high variance and limited statistical power (compared to simulated online experiments). Furthermore, a single dataset would be comparable to a single measurement in Figure
4, limiting the range of environmental parameters we can change to observe effects on the online performance of competing methods. For these reasons, we believe the RecoGym environment to be an appropriate choice for the experimental validation of our methods [
87].
5 Conclusions and Future Work
Recommender systems are evolving, turning from prediction-based systems into decision-based systems. Under this new paradigm, effective and efficient learning from bandit feedback is crucial for such systems to flourish. One problematic aspect is that bandit feedback is typically collected under some logging policy, which leads to selection biases that can be difficult to deal with. Policy-based methods based on importance sampling are often adopted in these cases – and pessimistic variants have been known to improve empirical performance. Nevertheless, they often rely on strict randomisation assumptions and their high variance remains especially troublesome. Moreover, several application areas rely on calibrated predictions for the probability of the outcome conditional on the action that the system takes, which is exactly what policy-based methods avoid modelling.
In this work, we aim to increase the reward obtained through value-based recommendation methods that rely on explicit reward models. We have argued that in the off-policy setting, selection bias is especially prominent and problematic, and have introduced the decision-making phenomenon of the “Optimiser’s Curse”. In order to lift the curse, we have proposed a general framework for the use of principled pessimism. For the specific case where a ridge regressor models the reward, we have shown how to translate closed-form uncertainty estimates into a conservative decision rule. Extensive experiments with synthetic data show that our proposed method lifts the Optimiser’s Curse whilst achieving a significant and robust boost in recommendation performance for a variety of settings. When randomisation in the logging policy is limited, the action space is large, and the training sample is small, our Lower-Confidence-Bound approach yields the highest improvements over decision-making alternatives. This is a promising and encouraging result, as these settings are exactly those that widely occur in practice.
Pessimism has widely been implicitly accepted as a tool to improve policy learning performance for recommendation problems. We draw parallels with existing work and highlight key differences and overlap. Furthermore, we explore connections with on-policy use-cases where
optimistic decision-making reigns supreme, emphasising that our novel insights are not in conflict with those presented in earlier work. Indeed, the goals of deployed recommender systems might not be best measured in terms of cumulative regret, but rather in terms of obtained reward. How best to incorporate efficient exploration in these settings is a largely open problem, although its value is clear [
13].
Further directions for future work include investigating whether pessimistic reward predictions can lead to improved doubly robust learning [
41], whether our results can be generalised to larger action spaces, and how scepticism affects the informational value of data collected under such a policy. Moving to realistic settings with multiple iterations of logging and learning, we wish to make our proposed decision-making method more widely applicable in real-world deployments. In order for this to be successful, we need a notion of long-term consequences of actions, and may need to balance optimism with pessimism when appropriate.