Neurocomputing, Volume 518, 21 January 2023, Pages 401-407

Optimal distributions of rewards for a two-armed slot machine

https://doi.org/10.1016/j.neucom.2022.11.019

Abstract

In this paper we consider the continuous-time two-armed bandit (TAB) problem in which the slot machine has two different arms, in the sense that the two arms have different expected rewards and variances. We explore the optimal distribution of rewards for two-armed bandit problems and obtain the explicit distribution function as well as the searching rules of the optimal strategy. As a by-product, we find two new counter-intuitive phenomena in the nonlinear probability (optimal strategic) framework. The first is that combining the losing arm with the winning arm can give the winning arm a greater coverage probability of attaining the expected reward, which we summarize as "good + bad = better". This discovery implies that the traditional advice of always pursuing the arm with the larger expected reward (i.e., the stay-on-a-winner rule) is not optimal in the probability framework. The second is that the sequence combined from two independent, normally distributed arms is not normally distributed if the two arms are different, which can be summarized as "mutually independent normal + normal = non-normal". Furthermore, we provide the optimal sequential strategy to construct the "combination" arm and numerically examine the underlying mechanism.

Introduction

In many engineered and natural systems, agents face situations in which they must make decisions under uncertainty, i.e., choose among alternatives while still learning about those options. This exploration–exploitation trade-off can be formally investigated within the context of the two-armed bandit (TAB) problem. In a two-armed bandit problem, an agent applies some strategy to choose between the two arms, receives the corresponding rewards, and aims to maximize the total expected reward in the long run. A commonly used measure of an arm-selection policy is the so-called regret, or cost of learning, defined as the reward loss relative to the case with a known reward model. It is clear that under this measure the agent should always choose the arm with the larger expected reward.

The above selection problem is sequential in the sense that the arm selected at a particular stage is a function of the results of previous selections as well as of prior information on the two arms. The reward distributions considered for TAB include the Bernoulli, Poisson, Gaussian, and Laplace distributions, among others. For the Bernoulli case, Robbins [20] constructed an optimal strategy without a prior probability based on the law of large numbers; Bradt, Johnson and Karlin [6] considered the TAB problem where the maximal and minimal mean rewards are known and assigned a prior distribution to the means of the two arms; Bellman [3] used techniques from dynamic programming theory to determine the structure of the optimal policy; Feldman [11] studied TAB with a prior probability and showed that the policy of always pulling the arm with the largest posterior expected profit is optimal; Berry [4] studied the strategy maximizing the expected number of successes from n selections and showed that the stay-on-a-winner rule is optimal. Chernoff [10] first considered a continuous-time TAB process in which the rewards of the two arms are independent Wiener processes with unknown means and known variances. Yushkevich [24] obtained explicit formulae for the value and a stationary optimal policy in some cases of the continuous-time two-armed bandit problem with expected discounted reward.

As is well known, the TAB problem is also used in reinforcement learning to formalize the notion of decision-making under uncertainty. In light of the law of large numbers, classical two-armed bandit algorithms apply the sample-average method to estimate action values, since each estimate is a simple average of the relevant sample of rewards. Moreover, when the regret measure is adopted, the essence of the problem is to identify the best arm without engaging the other arm too often. Lai and Robbins [15] showed that the regret grows at least at a logarithmic rate in time, and they explicitly constructed an optimal policy achieving the minimum regret growth rate for several reward distributions. Auer, Cesa-Bianchi and Fischer [2] proposed the upper confidence bound 1 (UCB1) algorithm, which achieves logarithmic regret for any reward distribution with bounded support. There is a large literature on algorithms that maximize the average reward so as to approach the winning arm; we refer readers to [1], [12], [16], [17], [21], [22], [23] for more information.
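To make this classical baseline concrete, the following is a minimal sketch of the sample-average estimate combined with the UCB1 index of [2] for a two-armed bandit. The Gaussian reward model, the function name, and the parameter values (means -1 and 1, unit standard deviation, a horizon of 10,000 pulls) are illustrative assumptions rather than the paper's setting; note also that UCB1's logarithmic-regret guarantee is stated for rewards with bounded support.

```python
import numpy as np

def ucb1_two_armed(means=(-1.0, 1.0), sigma=1.0, horizon=10_000, seed=0):
    """Sample-average value estimates with the UCB1 index (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(2)          # number of pulls of each arm
    values = np.zeros(2)          # sample-average reward estimates
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= 2:                # pull each arm once to initialise the estimates
            arm = t - 1
        else:                     # UCB1 index: sample average + exploration bonus
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(values + bonus))
        reward = rng.normal(means[arm], sigma)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
        total_reward += reward

    regret = horizon * max(means) - total_reward  # reward loss vs. always pulling the best arm
    return values, counts, regret

if __name__ == "__main__":
    values, counts, regret = ucb1_two_armed()
    print("estimated means:", values, "pull counts:", counts, "regret:", regret)
```

As expected under the regret criterion, such an algorithm concentrates its pulls on the arm with the larger estimated mean; the point of the present paper is that this is not the right target when one cares about the distribution of the reward rather than its average.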

In short, the classical algorithms in the methodological and applied literature aim at maximizing the average reward after infinitely many experiments, relying on the law of large numbers and essentially approaching the winning arm, i.e., the one generating the larger expected reward. Although the optimal strategy for the average reward is well studied, to the best of our knowledge there are no results on the optimal distribution of rewards of TAB since Chernoff [10] proposed the continuous-time TAB model in 1968. In this paper, we show that the optimal reward of a continuous-time TAB model obeys a new distribution (see (9) below) with density (see (10) below). The derived optimal distribution is related to nonlinear probability theory; we refer to Chen and Epstein [7], Chen, Epstein and Zhang [8], Chen, Feng and Zhang [9], and Peng [18] for more information on nonlinear probability/expectation. As an application of the optimal distribution, we demonstrate that a new counterintuitive phenomenon can occur in TAB: a winning arm combined with a losing arm can achieve a greater win by enhancing the probability of attaining the maximum expected reward, or equivalently by generating a greater coverage probability around the maximum expected reward. This result implies that the traditional advice of always pursuing the classical "winning" arm with the larger expected reward is not optimal; in other words, in some cases "loss can make a win more likely".

More specifically, suppose the expected rewards of the left and right arms are given by μL and μR, respectively, and let μ̄ = μL ∨ μR and μ̲ = μL ∧ μR denote the larger and smaller expected rewards. If we know that the population average rewards satisfy μR > 0 > μL, it is natural to choose the right arm all the time, since by the law of large numbers the average reward then converges to μR as the number of experiments goes to infinity. However, we construct a "combination" arm and show that it always generates a higher coverage probability on the interval [μR-δ1, μR+δ2] than the right arm, for any positive constants δ1 and δ2 such that μR-δ1 > 0. That is to say, including the left arm does not necessarily have a bad influence on the TAB process; on the contrary, combining it with the right arm induces a greater chance of success.

The main reason for the above counter-intuitive finding is that the classical algorithms for the two-armed bandit problem focus only on the expectation of the total reward, whereas we concentrate on the maximal probability that the total reward lies in a given interval [μ̄-δ1, μ̄+δ2]. According to Theorem 3.1 below, we construct a new distribution (which we refer to as the optimal distribution with strategy) and the corresponding optimal strategy for building the "combination" arm so that it achieves a greater coverage probability. Moreover, through the joint density function of Brownian motion and its local time, we can give the explicit form of the probability that the "combination" arm's reward falls in the interval; that is, we can quantify explicitly the improvement of the "combination" arm in terms of coverage probability.
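To see the coverage-probability comparison numerically, the following Monte Carlo sketch simulates a discretized version of the reward process under (i) always pulling the right arm and (ii) a bang-bang switching rule that steers the forecast terminal reward toward the midpoint c of the target interval. The switching rule, the Euler discretization, and all parameter values (μL = -1, μR = 1, σ = 1, δ1 = δ2 = 1) are illustrative assumptions consistent with the shape of the density (10); the exact optimal rule is given by Theorem 3.1 in the full text and is not reproduced here.

```python
import numpy as np
from scipy.stats import norm

def coverage_simulation(mu_L=-1.0, mu_R=1.0, sigma=1.0,
                        delta1=1.0, delta2=1.0,
                        n_paths=100_000, n_steps=500, seed=1):
    """Monte Carlo sketch of the 'good + bad = better' effect (illustrative rule)."""
    rng = np.random.default_rng(seed)
    mu_bar, mu_low = max(mu_L, mu_R), min(mu_L, mu_R)
    h = 0.5 * (mu_bar + mu_low)             # average drift of the two arms
    c = mu_bar + 0.5 * (delta2 - delta1)    # midpoint of the target interval
    lo, hi = mu_bar - delta1, mu_bar + delta2

    dt = 1.0 / n_steps
    Z = np.zeros(n_paths)                   # running reward of each simulated path
    for step in range(n_steps):
        t = step * dt
        predicted = Z + h * (1.0 - t)       # terminal forecast under the average drift
        # pull the larger-mean arm while the forecast sits below c, otherwise the other arm
        drift = np.where(predicted < c, mu_bar, mu_low)
        Z += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

    combo_cov = np.mean((Z >= lo) & (Z <= hi))
    right_cov = norm.cdf(delta2 / sigma) - norm.cdf(-delta1 / sigma)  # Z_1 ~ N(mu_bar, sigma^2)
    return combo_cov, right_cov

if __name__ == "__main__":
    combo, right = coverage_simulation()
    print(f"coverage, switching rule : {combo:.3f}")
    print(f"coverage, right arm only : {right:.3f}")
```

Under these assumptions the empirical coverage of the switching rule comes out well above the right arm's Φ(1) - Φ(-1) ≈ 0.68, which is the "good + bad = better" effect in miniature.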

The main contributions of the present paper may be summarized as follows:

  • We introduce a new distribution, the optimal distribution with strategy, which is the distribution of the optimal reward of TAB in the sense that the optimal strategy generates a higher coverage probability on the interval [μ̄-δ1, μ̄+δ2] than any other strategy (including always pulling the right arm or the left arm), for any positive constants δ1 and δ2 such that μ̄-δ1 > 0. The density function of the optimal distribution with strategy (w.r.t. [μ̄-δ1, μ̄+δ2]) is
$$f(z)=\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(z-h)^2+2k\left(|z-c|-|c-h|\right)+k^2}{2}\right\}+k\,\Phi\!\left(k-|c-h|-|z-c|\right)e^{-2k|z-c|},$$
where Φ(·) is the distribution function of the standard normal distribution, h = (μ̄+μ̲)/2, k = (μ̄-μ̲)/2, and c = μ̄ + (δ2-δ1)/2 is the midpoint of the interval. To the best of our knowledge, this is the first work to explore the optimal distribution of rewards for two-armed bandit problems, together with an explicit distribution function and the searching rules of the optimal strategy. A numerical check of this density is sketched after this list.

  • As the first application of the optimal distribution with strategy, we find a new counterintuitive phenomenon: for TAB, combining the winning arm with the losing arm can achieve a greater coverage probability around the maximum expected reward.

  • Note that the optimal distribution reduces to a normal distribution if and only if the left arm and the right arm have the same expected reward. Therefore, as the second application, the construction of the optimal reward reveals another counterintuitive phenomenon: the combination of two independent sequences of normally distributed random variables need not be normally distributed, but may instead induce a new distribution.
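Since these contributions hinge on the displayed density, a small numerical check is useful. The sketch below (Python with SciPy; all parameter values are illustrative) evaluates the density, checks that it integrates to one, confirms that it collapses to the N(h, 1) density when the two arms share the same mean (k = 0), and compares its coverage of [μ̄-δ1, μ̄+δ2] with that of N(μ̄, 1). The sign of the exponential factor in the second term is an editorial reconstruction from a garbled source, so the integral check also serves as a sanity check on that reconstruction.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def optimal_density(z, mu_bar, mu_low, delta1, delta2):
    """Reconstructed density (10):
    f(z) = (1/sqrt(2*pi)) * exp(-[(z-h)^2 + 2k(|z-c|-|c-h|) + k^2]/2)
           + k * Phi(k - |c-h| - |z-c|) * exp(-2k|z-c|),
    with h = (mu_bar+mu_low)/2, k = (mu_bar-mu_low)/2, c = mu_bar + (delta2-delta1)/2."""
    h = 0.5 * (mu_bar + mu_low)
    k = 0.5 * (mu_bar - mu_low)
    c = mu_bar + 0.5 * (delta2 - delta1)
    first = np.exp(-((z - h) ** 2 + 2 * k * (np.abs(z - c) - np.abs(c - h)) + k ** 2) / 2) \
            / np.sqrt(2 * np.pi)
    second = k * norm.cdf(k - np.abs(c - h) - np.abs(z - c)) * np.exp(-2 * k * np.abs(z - c))
    return first + second

if __name__ == "__main__":
    mu_bar, mu_low, d1, d2 = 1.0, -1.0, 1.0, 1.0     # illustrative parameters

    # 1) f should be a probability density, i.e., integrate to (approximately) one.
    total, _ = quad(optimal_density, -15, 15, args=(mu_bar, mu_low, d1, d2))
    print("integral of f          :", round(total, 4))

    # 2) With equal means (k = 0) the second term vanishes and f is the N(h, 1) density.
    z = np.linspace(-4, 4, 9)
    print("max |f - N(0,1) pdf|   :",
          np.max(np.abs(optimal_density(z, 0.0, 0.0, d1, d2) - norm.pdf(z))))

    # 3) Coverage of [mu_bar - d1, mu_bar + d2]: optimal distribution vs. right arm N(mu_bar, 1).
    cov_f, _ = quad(optimal_density, mu_bar - d1, mu_bar + d2, args=(mu_bar, mu_low, d1, d2))
    cov_gauss = norm.cdf(d2) - norm.cdf(-d1)
    print("coverage under f       :", round(cov_f, 4))
    print("coverage under N(1, 1) :", round(cov_gauss, 4))
```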

The rest of the paper is organized as follows. In Section 2, we present the framework of the continuous-time two-armed bandit process. In Section 3, we give the main results of the paper, i.e., we construct the optimal strategy and establish the distribution of the optimal reward. The proofs of the main results are given in Section 4. In Section 5, we report experimental studies illustrating the optimal distribution with strategy. Section 6 concludes.


Framework

In this section we give the preliminaries of the mathematical framework for the two-armed bandit process. Assume that (Ω,F,P) is a probability space on which a two-dimensional Brownian motion {B1,B2} is defined, and let {Ft} be the minimal augmented filtration generated by {B1,B2} such that F0 contains all the P-null subsets of F. Let Xt and Yt represent the rewards from the left arm L and the right arm R of the continuous-time two-armed bandit process, respectively. In this paper, we

Main results

In this section, we explore the sequential strategy for constructing an optimal arm, referred to as the "combination" arm, in the sense that it achieves a greater win with respect to the maximum coverage probability on any reward interval. Furthermore, we obtain the explicit forms of the distribution function and the density function of the "combination" arm.

Therefore, the proposed counter-intuitive finding can be explicitly understood through a new distribution, which is different from the classical sample-average

Proof of main results

Remark 4.1

In this paper we use E[·] to denote expectation with respect to the probability measure P. For expectation with respect to another probability measure, we use a subscript to denote the corresponding measure, e.g., E_{P̂}[·].

By Remark 2.2, in this section we assume that the rewards Xt and Yt satisfy (5) and (6), i.e., the equations of Xt and Yt are driven by the generic Brownian motion Bt.

Experiment study

In this section, some experimental studies are designed to verify the theoretical conclusions and illustrate the underlying analytical mechanism.

Specifically, in the setting of Fig. 2, we assume the reward distributions of the left and right arms are two normal distributions with unequal means μL=-1 (left arm) and μR=1 (right arm) and common variance σL=σR=σ=1. With the constructed "combination" arm R10,π under the three cases a=0, b=1 (left panel), a=0.5, b=1.5 (middle panel), a=1, b=2

Conclusion

We have discussed the continuous-time two-armed bandit (TAB) problem in which the slot machine has two different arms, in the sense that the two arms have different expected rewards and variances. In contrast to the current literature, which focuses on average rewards, we explore the optimal distribution of rewards for two-armed bandit problems. Moreover, the explicit distribution function as well as the searching rules of the optimal strategy are established. There are still some interesting topics

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (24)

  • Z. Chen et al., A central limit theorem for sets of probability measures, Stoch. Process. Their Appl. (2022).
  • T.L. Lai et al., Asymptotically efficient adaptive allocation rules, Adv. Appl. Math. (1985).
  • R. Agrawal, Sample mean based index policies by O(log n) regret for the multi-armed bandit problem, Adv. Appl. Prob. (1995).
  • P. Auer et al., Finite-time analysis of the multiarmed bandit problem, Mach. Learn. (2002).
  • R. Bellman, A problem in the sequential design of experiments, Sankhyā (1956).
  • D.A. Berry, A Bernoulli two-armed bandit, Ann. Math. Stat. (1972).
  • A.N. Borodin et al., Handbook of Brownian Motion: Facts and Formulae (2015).
  • R.N. Bradt et al., On sequential designs for maximizing the sum of n observations, Ann. Math. Stat. (1956).
  • Z. Chen, L. Epstein and G. Zhang, A central limit theorem, loss aversion and multi-armed bandits, arXiv preprint...
  • Z. Chen, S. Feng and G. Zhang, Strategy-driven limit theorems associated with bandit problems, arXiv preprint...
  • H. Chernoff, Optimal stochastic control, Sankhyā: The Indian Journal of Statistics, Series A, 30 (1968),...
  • D. Feldman, Contributions to the "Two-Armed Bandit" Problem, Ann. Math. Stat. (1962).

Zengjing Chen Professor of Shandong University. The work of Zengjing Chen is supported by National Key R&D Program of China (grant No. 2018YFA0703900) and Taishan Scholars Project.

Xinwei Feng Professor of Shandong University, Post Doctoral Fellow of the Chinese University of Hong Kong and Hong Kong Polytechnic University. The work of Xinwei Feng is supported by National Natural Science Foundation of China (No. 12001317), Shandong Provincial Natural Science Foundation (No. ZR2020QA019) and QILU Young Scholars Program of Shandong University.

Shuhui Liu Joint PhD Candidate of Shandong University and University of Alberta.

Xiaodong Yan Associate Professor of Shandong University, Post Doctoral Fellow of the University of Alberta and Joint PhD of Yunnan University and Hong Kong Polytechnic University. The work of Xiaodong Yan is supported by National Natural Science Foundation of China (No. 11901352), National Statistical Science Research Project (No. 2022LY080) and Jinan Science and Technology Bureau (No. 2021GXRC056).
