Neurocomputing

Volume 258, 4 October 2017, Pages 13-29

Finite budget analysis of multi-armed bandit problems

https://doi.org/10.1016/j.neucom.2016.12.079

Abstract

In the budgeted multi-armed bandit (MAB) problem, a player receives a random reward and pays a cost after pulling an arm, and can pull no more arms once the budget is exhausted. In this paper, we give an extensive study of upper confidence bound based algorithms and a greedy algorithm for budgeted MABs. We perform theoretical analysis on the proposed algorithms and show that they all enjoy sublinear regret bounds with respect to the budget B. Furthermore, by carefully choosing their hyperparameters, they can even achieve regret bounds of O(ln B). We also prove that the asymptotic lower bound for budgeted Bernoulli bandits is Ω(ln B). Our proof technique can be used to improve the theoretical results for fractional KUBE [26] and Budgeted Thompson Sampling [30].

Introduction

Multi-armed bandits (MAB) form a typical sequential decision problem, in which a player receives a random reward by playing one of K arms of a slot machine at each round and wants to maximize his cumulative reward. Many real-world applications can be modeled as MAB problems, such as auction mechanism design [24], search advertising [27], UGC mechanism design [17] and personalized recommendation [22]. Many algorithms have been designed for MAB problems and studied from both theoretical and empirical perspectives, such as UCB1, ϵn-GREEDY [6], UCB-V [4], LinRel [5], DMED [18], and KL-UCB [16]. A survey on MAB can be found in [11].

Most of the aforementioned works assume that playing an arm is costless. However, in many real applications, including real-time bidding in ad exchanges [12], bid optimization in sponsored search [10], on-spot instance bidding in Amazon EC2 [9], and cloud service provider selection in IaaS [2], one needs to pay some cost to play an arm, and the number of plays is constrained by a budget. To model these applications, a new kind of MAB problem, called budgeted MAB, has been proposed and studied in recent years, in which the play of an arm is associated with both a random reward and a cost. Depending on the setting of budgeted MAB, the cost can be either deterministic or random, and either discrete or continuous.

In the literature, a few algorithms have been developed for particular settings of the budgeted MAB problem. The setting of deterministic costs was studied in [26], where two algorithms named KUBE and fractional KUBE were proposed; they learn the probability of pulling an arm by solving an integer programming problem. It has been proven that these two algorithms lead to a regret bound of O(ln B). The setting of random discrete costs was studied in [13], where two upper confidence bound (UCB) based algorithms with specifically designed (complex) indexes were proposed, and O(ln B) regret bounds were derived for both.

The above algorithms address only some (but not all) settings of budgeted MAB. Given the tight connection between budgeted MAB and standard MAB problems, an interesting question is whether extensions of algorithms originally designed for standard MAB (without a budget constraint) could be good enough for budgeted MAB tasks, perhaps in a more general way. Xia et al. [30] showed that a simple extension of the Thompson sampling algorithm, designed for standard MAB, works quite well for budgeted MAB under very general settings. Inspired by that work, we are interested in whether we can handle budgeted MAB by extending other algorithms designed for standard MAB.

To answer this question, we study the following natural extensions of the UCB1 and ϵn-GREEDY algorithms [6] in this paper (these extensions do not need to know the budget B in advance). We first propose four basic algorithms for budgeted MABs: (1) i-UCB, which replaces the average reward of an arm in the exploitation term of UCB1 by the average reward-to-cost ratio; (2) c-UCB, which further incorporates the average cost of an arm into the exploration term of UCB1 (c-UCB can be regarded as an adaptive version of i-UCB); (3) m-UCB, which mixes an upper confidence bound of the reward with a lower confidence bound of the cost; (4) b-GREEDY, which replaces the average reward in ϵn-GREEDY with the average reward-to-cost ratio.
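To make the three UCB-style indexes concrete, the sketch below writes them out from the verbal descriptions above. It is a minimal sketch, not the paper's exact formulas: the exploration constant α and the exact radius form are assumptions, and n, rbar, cbar denote the pull count and empirical mean reward/cost of an arm, as in UCB1.

    import math

    def i_ucb_index(rbar, cbar, n, t, alpha=2.0):
        # i-UCB: UCB1 with the average reward replaced by the reward-to-cost ratio.
        return rbar / cbar + math.sqrt(alpha * math.log(t) / n)

    def c_ucb_index(rbar, cbar, n, t, alpha=2.0):
        # c-UCB: the average cost also enters the exploration term (adaptive i-UCB).
        return rbar / cbar + math.sqrt(alpha * math.log(t) / n) / cbar

    def m_ucb_index(rbar, cbar, n, t, alpha=2.0, eps=1e-12):
        # m-UCB: an upper confidence bound on the reward divided by a lower
        # confidence bound on the cost, both clipped to the support [0, 1].
        e = math.sqrt(alpha * math.log(t) / n)
        return min(rbar + e, 1.0) / max(cbar - e, eps)

At each round, the arm with the largest index is pulled (after every arm has been pulled once so that the empirical means are defined).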

We conduct theoretical analysis on these algorithms and show that they all enjoy sublinear regret bounds with respect to B. By carefully setting the hyperparameter in each algorithm, the regret bounds can be further improved to O(ln B). Although the basic idea of regret analysis for budgeted MAB is similar to that for standard MAB, i.e., we only need to bound the number of plays of the suboptimal arms (the arms whose expected-reward-to-expected-cost ratios are not the maximum), there are two challenges compared with standard MAB (without budgets). First, two random factors, rewards and costs, influence the choice of arm simultaneously, which brings difficulties when decomposing the probabilities that suboptimal arms are pulled. Second, the stopping time (i.e., the round at which a pulling algorithm runs out of budget and stops) is a random variable related to the costs of each arm, which destroys the independence of the costs across rounds and makes it difficult to apply concentration inequalities. To address the first challenge, we introduce the δ-gap (5), with which we can separate terms involving the ratio of rewards to costs into terms related to rewards only and costs only. To address the second, we make a decomposition like (7): before round 2B/μ^c_min (where μ^c_min is the minimum expected cost among all the arms), we only consider the event of pulling a suboptimal arm; after round 2B/μ^c_min, we only consider the event that there is budget remaining at each round. Combining these two techniques with those for standard MAB, we derive regret bounds for the four algorithms.
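One way to write the decomposition idea sketched above (a hedged reading of the argument, not the paper's equation (7) verbatim): let T_B denote the random round at which the budget B runs out, I_t the arm pulled at round t, and τ = ⌈2B/μ^c_min⌉. Then

    \mathbb{E}\Big[\sum_{t=1}^{T_B} \mathbf{1}\{I_t = i\}\Big]
      \;\le\; \sum_{t=1}^{\tau} \Pr\{I_t = i\} \;+\; \sum_{t > \tau} \Pr\{T_B \ge t\}.

The first sum is handled as in the standard UCB analysis, while the second is small because, once τ rounds have been played, the total cost paid exceeds B with high probability by a concentration inequality applied to the (per-arm i.i.d.) costs.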

Furthermore, we give a lower bound for budgeted Bernoulli MAB (where the rewards and costs of a budgeted Bernoulli bandit are either 0 or 1), and show that our proposed algorithms can match the lower bound (by carefully setting the hyperparameters).

In addition to the theoretical analysis, we also conduct an empirical study of these algorithms. We simulate two bandits, one with 10 arms and the other with 50 arms. For each bandit, we consider two sub-cases with different distributions for rewards and costs: the Bernoulli distribution in the first sub-case and the beta distribution in the second. The simulation results show that our extensions work surprisingly well for both bandits and both sub-cases, achieving performance comparable to or better than existing algorithms.

The Budget-UCB algorithm proposed in [29] can be seen as a combination of c-UCB and m-UCB. With our proposed δ-gap, the regret bound of Budget-UCB can be improved.

Besides the literature mentioned above, there are also works studying MAB problems with multiple budget constraints. For example, [7] studies the bandits-with-knapsacks (BwK) setting, in which the total number of plays is constrained by a predefined number T and the total cost of the plays is constrained by a monetary budget B. [7] derives distribution-free regret bounds for BwK, while [15] provides distribution-dependent bounds. In [1], this setting is extended to arms with concave rewards, and [8] further considers contextual bandits. At first glance, these settings seem more general than the budgeted MAB problem defined above. However, their algorithms and theoretical analysis cannot be directly applied here: the total number of plays T is critical to their algorithms and analysis, whereas no such T exists in the setting under our investigation.

Section snippets

Problem setup

A budgeted MAB problem can be described as follows. A player faces a slot machine with K arms (K ≥ 2). At round t, he pulls an arm i ∈ [K] (for ease of reference, let [K] denote the set {1, 2, …, K}), receives a random reward r_i(t), and pays a cost c_i(t), until his budget B runs out. Both the rewards and the costs are supported in [0, 1]. There can be different settings depending on the costs, which could be either deterministic or random, either discrete or continuous. In this
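As a concrete illustration of this protocol, the sketch below simulates a budgeted bandit with Bernoulli rewards and costs. It is a minimal sketch under the assumptions above; the function and parameter names are ours, not the paper's.

    import random

    def play_budgeted_bandit(reward_means, cost_means, B, choose_arm, seed=0):
        """Run one pulling policy until the budget B is exhausted."""
        rng = random.Random(seed)
        K = len(reward_means)
        budget, total_reward, t = float(B), 0.0, 0
        history = []                        # (arm, reward, cost) per round
        while budget > 0:
            t += 1
            i = choose_arm(t, history, K)   # any pulling policy
            r = float(rng.random() < reward_means[i])   # Bernoulli reward in [0, 1]
            c = float(rng.random() < cost_means[i])     # Bernoulli cost in [0, 1]
            budget -= c                     # the cost is paid after pulling
            total_reward += r
            history.append((i, r, c))
        return total_reward, t              # cumulative reward and stopping round

The stopping round t returned here is exactly the random stopping time discussed in the regret analysis.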

Algorithms

We present two kinds of algorithms for budgeted multi-armed bandit problems. As pointed out in Section 2, although the optimal policy for budgeted MAB is quite complex, always pulling the optimal arm brings almost the same expected reward as the optimal policy. Therefore, our proposed algorithms aim to pull the optimal arm as frequently as possible, with some tradeoff between exploration (of the less pulled arms) and exploitation (of the empirically best arms).
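As an example of how this exploration-exploitation tradeoff looks in the budgeted setting, here is a sketch of the b-GREEDY idea (ϵn-GREEDY with the average reward replaced by the reward-to-cost ratio). The schedule ϵ_t = min{1, cK/t} and its constant c follow the usual ϵn-GREEDY form and are assumptions here, not values taken from the paper.

    import random

    def b_greedy(reward_means, cost_means, B, c=5.0, seed=0):
        rng = random.Random(seed)
        K = len(reward_means)
        n = [0] * K                 # pull counts
        rsum = [0.0] * K            # cumulative rewards per arm
        csum = [1e-12] * K          # cumulative costs per arm (tiny prior avoids 0-division)
        budget, total_reward, t = float(B), 0.0, 0
        while budget > 0:
            t += 1
            eps = min(1.0, c * K / t)
            if t <= K:                        # pull each arm once first
                i = t - 1
            elif rng.random() < eps:          # explore a random arm
                i = rng.randrange(K)
            else:                             # exploit the best empirical ratio
                i = max(range(K), key=lambda j: rsum[j] / csum[j])
            r = float(rng.random() < reward_means[i])
            cost = float(rng.random() < cost_means[i])
            n[i] += 1; rsum[i] += r; csum[i] += cost
            budget -= cost; total_reward += r
        return total_reward, n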

Upper bounds of the regrets

For any suboptimal arm i ≥ 2, the weighted ratio gap Δ_i, the asymmetric δ-gap δ_i(γ) with γ ≥ 0, and the symmetric ϱ-gap ϱ_i play important roles throughout this work. They are defined as follows:

    \Delta_i = \mu^c_i\,\frac{\mu^r_1}{\mu^c_1} - \mu^r_i, \qquad
    \delta_i(\gamma) = \frac{\Delta_i}{\gamma\,\mu^r_1/\mu^c_1 + 1}, \qquad
    \varrho_i = \frac{\mu^c_1\,\Delta_i}{\mu^r_1 + \mu^c_1 + \mu^r_i + \mu^c_i}.    (5)

One can verify that

    \frac{\mu^r_1}{\mu^c_1} = \frac{\mu^r_i + \delta_i(\gamma)}{\mu^c_i - \gamma\,\delta_i(\gamma)}, \qquad
    \frac{\mu^r_1 - \varrho_i}{\mu^c_1 + \varrho_i} = \frac{\mu^r_i + \varrho_i}{\mu^c_i - \varrho_i}.    (6)

(6) shows two kinds of gaps between arm 1 and arm i: (i) to make a suboptimal arm i an optimal arm, we can increase μ^r_i by δ_i(γ) while decreasing μ^c_i by γδ_i(γ). When γ = 1,
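A quick numeric sanity check of the identities in (6), with illustrative means chosen by us (arm 1 has the larger reward-to-cost ratio); the specific values are not from the paper.

    # mu1r/mu1c = 1.6 is the optimal ratio; arm i has ratio 0.75.
    mu1r, mu1c, muir, muic, gamma = 0.8, 0.5, 0.3, 0.4, 1.0

    Delta = muic * mu1r / mu1c - muir                      # weighted ratio gap
    delta = Delta / (gamma * mu1r / mu1c + 1.0)            # asymmetric delta-gap
    rho   = mu1c * Delta / (mu1r + mu1c + muir + muic)     # symmetric rho-gap

    # Both identities in (6) hold up to floating-point error.
    assert abs(mu1r / mu1c - (muir + delta) / (muic - gamma * delta)) < 1e-9
    assert abs((mu1r - rho) / (mu1c + rho) - (muir + rho) / (muic - rho)) < 1e-9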

Empirical evaluations

The theoretical results in the previous section guarantee the asymptotic performance of the algorithms when the budget is very large. In this section, we empirically evaluate the practical performance of the algorithms when the budget is limited. For the time being, we do not evaluate the variants of the UCB algorithms.

Specifically, we first simulate a 10-armed Bernoulli bandit, in which the rewards and costs of each arm follow Bernoulli distributions, and a 10-armed Beta

Better regret bound for Budget-UCB

The Budget-UCB algorithm proposed in [29] can be seen as a combined version of our c-UCB and m-UCB that specializes α to 2. Mathematically, the index of Budget-UCB is: for any i ∈ [K], t > K,

    D^{\mathrm{Budget\text{-}UCB}}_{i,t} = \frac{\bar{r}_{i,t}}{\bar{c}_{i,t}}
      + \frac{E^{\alpha}_{i,t}}{\bar{c}_{i,t}}
      + \frac{E^{\alpha}_{i,t}}{\bar{c}_{i,t}} \cdot
        \frac{\min\{\bar{r}_{i,t} + E^{\alpha}_{i,t},\, 1\}}{\max\{\bar{c}_{i,t} - E^{\alpha}_{i,t},\, 0\}}.
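Read as code, this index looks as follows (our reading of the display above; the form of the exploration radius E^α_{i,t} is an assumption, and a small ε guard replaces the zero in the max so the last term stays finite):

    import math

    def budget_ucb_index(rbar, cbar, n, t, alpha=2.0, eps=1e-12):
        E = math.sqrt(alpha * math.log(t) / n)   # assumed form of the radius E^alpha
        return (rbar / cbar
                + E / cbar
                + (E / cbar) * min(rbar + E, 1.0) / max(cbar - E, eps))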

With the δ_i(1)-gap, we can obtain an improved regret bound compared to that in [29]:

Conclusions

In this work, we have studied budgeted MAB problems. We show that simple extensions of algorithms designed for standard MAB work surprisingly well for budgeted MAB: they enjoy sublinear regret bounds with respect to the budget and perform comparably to or even better than the baseline methods in our simulations. We also give a lower bound for budgeted Bernoulli MAB.

There are many directions to explore in the future. First, in addition to the simulations, it would make more sense to test


References (32)

  • J.-Y. Audibert et al., Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theor. Comput. Sci. (2009)
  • T.L. Lai et al., Asymptotically efficient adaptive allocation rules, Adv. Appl. Math. (1985)
  • S. Martello et al., Knapsack Problems: Algorithms and Computer Implementations (1990)
  • S. Agrawal et al., Bandits with concave rewards and convex knapsacks, Proceedings of the Fifteenth ACM Conference on Economics and Computation (2014)
  • D. Ardagna et al., A game theoretic formulation of the service provisioning problem in cloud systems, Proceedings of the 20th International Conference on World Wide Web (2011)
  • J.-Y. Audibert et al., Best arm identification in multi-armed bandits, COLT - 23rd Conference on Learning Theory (2010)
  • P. Auer, Using confidence bounds for exploitation-exploration trade-offs, J. Mach. Learn. Res. (2003)
  • P. Auer et al., Finite-time analysis of the multiarmed bandit problem, Mach. Learn. (2002)
  • A. Badanidiyuru et al., Bandits with knapsacks, IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS) (2013)
  • A. Badanidiyuru et al., Resourceful contextual bandits, 27th Conference on Learning Theory (COLT) (2014)
  • O. Ben-Yehuda et al., Deconstructing Amazon EC2 spot instance pricing, IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom) (2011)
  • C. Borgs et al., Dynamics of bid optimization in online advertisement auctions, WWW (2007)
  • S. Bubeck, N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, arXiv preprint...
  • T. Chakraborty et al., Selective call out and real time bidding, Internet and Network Economics (2010)
  • W. Ding et al., Multi-armed bandit with budget constraint and variable costs, Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)
  • M.L. Fisher, Worst-case analysis of heuristic algorithms, Manag. Sci. (1980)

    Yingce Xia is currently a 3rd-year PhD student at the University of Science and Technology of China. His research interests include machine learning and optimization algorithms. He received his B.S. from the University of Science and Technology of China in June 2013.

    Dr. Tao Qin is currently a Lead Researcher at Microsoft Research Asia. His research interests include machine learning (with a focus on deep learning and reinforcement learning), artificial intelligence (with applications to robotics), game theory (with applications to cloud computing, online and mobile advertising, and e-commerce), information retrieval, and computational advertising. He received both his PhD and Bachelor's degrees from Tsinghua University. He is a member of the ACM and the IEEE, and an Adjunct Professor (PhD advisor) at the University of Science and Technology of China.

    Dr. Wenkui Ding received both his PhD and Bachelor's degrees from Tsinghua University. He currently works at Hulu LLC.

    Haifang Li received the B.Sc. degree in Statistics from Shandong University, China, in 2011. In 2016, she completed a Ph.D. degree in Probability and Mathematical Statistics from University of Chinese Academy of Sciences. She is currently an assistant researcher at Institute of Automation, Chinese Academy of Sciences. Her research interests include deep learning and probabilistic graph model.

    Xudong Zhang is a Full Professor in the Department of Electronic Engineering, Tsinghua University. He received his Ph.D. degree from Tsinghua University in 1997 and has been with the Department of Electronic Engineering, Tsinghua University, since 1997. His research interests include statistical signal processing, machine learning theory and multimedia signal processing. He has published more than 130 papers and three books in the field of signal processing.

    Yu Nenghai is a professor, Ph.D. supervisor, director of the Information Processing Center of USTC, deputy director of the academic committee of the School of Information Science and Technology, director of the multimedia and communication lab, deputy director of the Ministry of Education-Microsoft Key Laboratory of Multimedia Computing and Communications (2004-2010), a standing member of the council of the Image and Graphics Society of China, a member of the Expert Committee of Cloud Computing of the Chinese Institute of Electronics, a member of the Expert Committee of IP Applications and Value-added Telecommunications Technology of the Chinese Institute of Communications, and a member of the Expert Committee of Multimedia Safety of the Division of Communications of the Chinese Institute of Electronics. He was a visiting scholar at the Institute of Production Technology, Faculty of Engineering, University of Tokyo, in 1999, and did cooperative research as a senior visiting scholar in the Dept. of Electrical Engineering, Columbia University, from Apr. to Oct. 2008.

    Tie-Yan Liu is a principal researcher of Microsoft Research Asia, leading the machine learning group. His research interests include artificial intelligence, machine learning, information retrieval, data mining, and computational economics. As a researcher in an industrial lab, Tie-Yan is making unique contributions to the world. On one hand, many of his technologies have been transferred to Microsoft's products and online services, such as Bing, Microsoft Advertising, and Azure, and he has received many recognitions and awards in Microsoft for his significant product impact. On the other hand, he has been actively contributing to the academic community. He is an adjunct professor at CMU and several universities in China, and an honorary professor at Nottingham University. He is frequently invited to chair or give keynote speeches at major machine learning and information retrieval conferences. He is a senior member of the IEEE and the ACM, as well as a senior member, distinguished speaker, and academic committee member of the CCF.
