1 Introduction

The pure exploration problem can be described as a game between a player (or a learner) and the environment. The player has m arms to pull. After an arm i is pulled, the environment returns an observation \(X_i\), sampled independently from a fixed but unknown distribution \(D_i\) corresponding to arm i. There is a known reward function H on the distributions \(D_i\)’s. The player needs to design a strategy for selecting arms to pull, and for deciding when to stop and which arm to output based on the observations so far, such that the output arm i has the largest reward value \(H(D_i)\).

In the pure exploration model, the player cannot guarantee to always output the correct answer if he only gets a finite number of observations. One type of strategy maximizes the success probability under a constraint on the number of observations, and is referred to as the fixed-budget model (Audibert et al., 2010; Bubeck et al., 2013). Another type of strategy constrains the error probability to \(\delta\) and tries to minimize the number of samples that the learner needs, and is referred to as the fixed-confidence model (Evendar et al., 2006).

The pure exploration model can be used to solve many online optimization problems, and thus it has received increasing attention in recent years. Most existing studies take the reward function H to be the mean function, i.e., \(H(D) = \mathbb {E}_{X\sim D}[X]\). However, in many real-world applications, the reward H(D) depends not only on the mean but on the entire distribution D. One example is that H(D) measures the similarity of the distribution D to a known distribution G, and the player aims to find, among unknown arms, the one whose distribution is most similar to G. Such reward functions can be used in target selection, e.g., search and rescue. Consider the scenario in which we are using satellite images to look for a plane that crashed in the ocean: we divide the ocean into small areas, and try to figure out which area contains the target (based on the known flight track of the plane, we are sure that it should be in one of these areas). From existing photo images of air crashes, we can construct a known probability distribution G for photos of the area that contains the target. However, due to the weather and other geographical conditions, the real probability distribution of photos in that area can be a little different from G. Therefore, we want to take photos of different areas efficiently to find out the area whose corresponding distribution is the most similar to G.

While the mean function as the reward function H has been well studied, the case of general reward functions has not been addressed in the literature. In this paper, we study the novel problem setting where the reward function depends on the entire underlying unknown distribution \(D_i\) (including but not limited to its mean value). To solve this problem, we consider two frameworks in the fixed-confidence setting: the elimination framework (Maron & Moore, 1997; Evendar et al., 2006; Kaufmann & Kalyanakrishnan, 2013) and the LUCB framework (Kalyanakrishnan et al., 2012). Our main contribution is to design proper algorithms to estimate the reward \(H(D_i)\), given a set of observations from distribution \(D_i\) and the reward function H. We divide possible H functions into three types based on their continuity properties, and concentrate on the case that H is Lipschitz continuous with respect to the total variation distance. We show that the sample complexity of using our estimation algorithms in those frameworks is upper bounded by \(O(m\log (1/\delta ) / \varDelta ^2 + c)\), where \(\delta\) is the error tolerance, \(\varDelta\) is the minimal gap between the largest \(H(D_i)\) and the other \(H(D_j)\)’s, and c is a constant that may depend on \(\varDelta ,m\) but does not depend on \(\delta\). We also show that the complexity lower bound of any algorithm with the correctness guarantee is \(\varOmega (m\log (1/\delta ) / \varDelta ^2)\), which means that the complexity of our algorithms is asymptotically optimal.

We highlight here the difference between our general model and the existing pure exploration model. Our model concentrates on learning not only a single parameter, but the entire shape of every distribution. In the traditional pure exploration model, which only concerns finding an arm with the best mean reward, each observation of a distribution is an independent and unbiased estimator of its mean reward, so the learner only needs to repeat the sampling procedure and take the average to learn the expected reward precisely. In our model, such independent and unbiased estimators may not exist. For example, when we want to look for the distribution with the minimal distance to the Gaussian distribution \(\mathcal {N}(0,1)\), a single observation says almost nothing about the distribution distance, and one cannot obtain an unbiased estimator of that distance based on a single observation. Only when we take a large set of observations into consideration can we obtain an estimator (with a small bias) and a corresponding confidence interval for that distance.

1.1 Related work

The stochastic multi-armed bandit (MAB) model (Berry & Fristedt, 1985; Sutton & Barto, 1998) describes the trade-off between exploitation and exploration, whereas pure exploration concentrates on exploration only. It is the origin of the pure exploration model, and algorithms for pure exploration such as the elimination algorithm and the LUCB algorithm follow the idea of the UCB algorithm (Gittins et al., 2011; Auer et al., 2002) for MAB problems.

The pure exploration model was first proposed by Evendar et al. (2006) and Audibert et al. (2010). Since then, much effort has been devoted to the complexity bounds of this problem. Kalyanakrishnan et al. (2012) first prove that the complexity lower bound is \(\varOmega (m\log (1/\delta ) / \varDelta ^2)\) (here \(\varDelta\) represents the minimum gap between the largest mean value and any other one, since in the classic pure exploration problem, H(D) is the mean value of D). Then Kalyanakrishnan et al. (2012) and Kaufmann & Kalyanakrishnan (2013) provide learning algorithms that achieve a complexity upper bound of \(O(m\log (m/\delta ) / \varDelta ^2)\), which has the same order in \(\delta\) as the complexity lower bound, i.e., their algorithms are asymptotically optimal. Kaufmann et al. (2014) provide tight complexity bounds (i.e., the constant factor in the complexity upper bound also matches the complexity lower bound) for the case that \(m = 2\). Garivier and Kaufmann (2016) consider the case that the random variables follow an exponential family. The authors propose a tight complexity lower bound, as well as an algorithm with matching complexity upper bound. Degenne et al. (2019) use a game-based approach to design another optimal learning algorithm for this case. Kaufmann et al. (2016) improve the complexity lower bound in the general case, but the authors do not provide matching complexity upper bounds. Jamieson et al. (2014) propose an algorithm whose complexity is upper bounded by \(O(m\log (1/\delta ) / \varDelta ^2)\), i.e., it removes an additive term of \(O(m\log m / \varDelta ^2)\) from the complexity upper bounds of previous works. Chen et al. (2017) then further tighten the constant term in the complexity bounds for the case that the random variables are Gaussian.

As we have mentioned, most existing works (e.g., all the above literature) focus on using the mean value as the reward of a given distribution. However, there is also research that focuses on other kinds of reward, e.g., the \(\tau\)-quantile (Galichet et al., 2013; Szorenyi et al., 2015; David & Shimkin, 2016) and the extreme value (Carpentier & Valko, 2014). Another example is risk-averse multi-armed bandits, in which the reward of a random distribution equals its mean minus its variance (Sani et al., 2012; Vakili & Zhao, 2016). Compared with these works, our setting is more general: each of these results focuses on one specific measure of random distributions, while our algorithms can be used to solve any of them.

In Chen et al. (2016), the authors consider a special kind of general reward function in combinatorial MAB models. Besides the difference that they work on the cumulative regret objective while we work on the pure exploration objective, our setting is still very different from theirs. Chen et al. (2016) focus on looking for the best combinatorial arm set under some restricted reward functions, while our main concern is more general distribution reward functions. For example, we can deal with the case of finding the distribution with the minimal total variation distance to a target, but they cannot achieve the same goal.

Another similar topic is the Chernoff test (Chernoff, 1959). The original Chernoff test is a game with two hypotheses \(H_1,H_2\) and two unknown distributions \(D_1,D_2\). The player knows that under the first hypothesis \(D_1 = G\) and \(D_2 = G'\), while under the second \(D_1 = G'\) and \(D_2 = G\), but he does not know which hypothesis is true. The distributions G and \(G'\) are known to the player. He needs to observe the distributions \(D_1\) and \(D_2\) multiple times, and decide which hypothesis is correct. In our setting, we can choose the distribution function \(H(D) = \mathbb {E}_{X\sim D}[\log (g(X)/g'(X))]\), where g and \(g'\) are the probability density (mass) functions of G and \(G'\). Then \(H(G) - H(G') = \mathsf{KL}(G,G') + \mathsf{KL}(G',G) > 0\), where \(\mathsf{KL}\) denotes the KL-divergence between two distributions. If the output is \(D_1\), we know that \(H(D_1) > H(D_2)\), which means hypothesis \(H_1\) is correct. Using our method would achieve the same sample complexity as the Chernoff test. However, it is unclear how to extend the Chernoff test to deal with the general scenario in our paper, because our model selects arms among arbitrary unknown distributions, not known distributions.

2 Preliminaries

2.1 Models and definitions

A pure exploration problem with general distribution functions can be modelled as a tuple \((A,D,H,\delta )\). \(A = \{1,2,\cdots ,m\}\) is the set of all arms. \(D = \{D_1,\cdots ,D_m\}\) is the set of corresponding probability distributions of the arms in A. H is a reward function \(\varPhi \rightarrow \mathbb {R}\), where \(\varPhi\) is the set of all possible probability distributions, and \(\delta\) is the error probability. At each time slot t, a policy \(\pi\) needs to choose an arm i(t) based on the previous observations, and then it observes a random variable \(X(t) \sim D_{i(t)}\). The random variables \(\{X(\tau )\}_{\tau =1}^t\) are independent. The policy \(\pi\) can also choose to stop the game and output a target arm whenever it wants. We use \(T_\pi\) to denote the random variable of the stopping time of policy \(\pi\), and \(S_\pi\) the random variable of the output arm. The goal of the player is then to design a policy \(\pi\) such that with high probability it finds the arm i with the maximum value \(H(D_i)\), i.e., \(\Pr [S_\pi \in {\mathrm{argmax}}_i H(D_i)] \ge 1 - \delta\) (such a policy is also called a \(\delta\)-PAC algorithm). Under this constraint, the player wants the sample complexity \(T_\pi\) to be as small as possible.
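To make the interaction protocol concrete, the following minimal Python sketch shows how a policy interacts with the arms; the `next_arm`/`output` interface of the policy object is an illustrative assumption rather than part of the formal model.

```python
# Minimal sketch of the interaction protocol.  Each arm is a callable that
# returns one sample from its (unknown) distribution D_i; the policy object
# (with next_arm/output methods, an assumed interface) decides which arm to
# pull next, when to stop, and which arm to output.
def run(policy, arms):
    history = []                               # list of (arm index, observation) pairs
    while True:
        action = policy.next_arm(history)
        if action is None:                     # the policy chose to stop
            return policy.output(history), len(history)   # (S_pi, T_pi)
        history.append((action, arms[action]()))
```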

As commonly assumed in pure exploration problems, we assume that there is a unique optimal arm \(i^*\), i.e., \(H(D_{i^*}) = \max _i H(D_{i})\) and \(H(D_{i^*}) > \max _{j\ne i^*} H(D_{j})\). Under this assumption, we can define the gap \(\varDelta _i\) as follows:

$$\begin{aligned} \varDelta _i = \left\{ \begin{array}{ll} H(D_{i^*}) - H(D_i) & \hbox {if } i\ne i^*,\\ H(D_{i^*}) - \max _{j\ne i^*} H(D_j) & \hbox {if } i = i^*. \end{array} \right. \end{aligned}$$

Since the optimal solution is unique, \(\forall i \in A, \varDelta _i > 0\).

Assumption 1

There exists a constant B, such that for any distributions G and \(G'\), we have

$$\begin{aligned} |H(G)-H(G')| \le B \cdot \varLambda (G,G'), \end{aligned}$$

where \(\varLambda (\cdot , \cdot )\) denotes some type of distribution distance, which is nonnegative, symmetric, and satisfies the triangle inequality.

Assumption 1 ensures the continuity of the reward function H. Most reward functions used in real applications satisfy this assumption.

Remark 1

In this paper, we assume that the player knows the reward function H at the beginning, which means that the constant factor B and the type of distribution distance \(\varLambda\) are also known. This is common in real applications, since the reward function H is chosen carefully to fit the goal of the learning algorithm. For example, if our goal is to look for the distribution that is the most similar to G among several candidates (as in the example of search and rescue), then we can set H(D) to be the negative of the total variation distance between D and G. In this case, the distribution \(D_i\) that we are looking for has the highest reward value \(H(D_i)\). Therefore, we can use this specific H function in our model setting to achieve the goal.

2.2 Complexity lower bound

Here we state the complexity lower bound of our problem setting in the following proposition, which is similar to Theorem 4 in Kaufmann et al. (2016) and helps to understand the asymptotic optimality of our algorithms.

Proposition 1

For a pure exploration problem instance with general reward functions, the expected complexity of any \(\delta\)-PAC algorithm \(\pi\) is lower bounded by

$$\begin{aligned} \mathbb {E}[T_\pi ] \ge \left( \max _{i\ne i^*} {1\over \mathsf{KL}(D_{i^*}, D_{i})} + \sum _{i\ne i^*} {1\over \mathsf{KL}(D_i, D_{i^*})} \right) \log {1\over 2.4\delta }, \end{aligned}$$
(1)

where \(\mathsf{KL}(D, D')\) denotes the KL-divergence between D and \(D'\).

In fact, the analysis of the complexity lower bound in Kaufmann et al. (2016) does not require the reward function H to be the mean function. Because of this, the lower bound analysis still works in our general reward function setting. The proof of Proposition 1 is deferred to Appendix A.1.

2.3 Example of applications

2.3.1 Selecting the closest distribution to the target distribution

Finding a target distribution among several candidates is an important problem in hypothesis testing. Many frameworks have been proposed, such as the Chernoff test and target scanning (Chernoff, 1959; Bessler, 1960; Zigangirov, 1966; Dragalin, 1996), and they are widely used in quality control (Pochampally & Gupta, 2014) and medical problems (Larsen, 1976). In addition, looking for the distribution that is the closest to a target is also an important question, which arises in applications such as target selection or target searching (e.g., search and rescue). Our results provide a novel solution to these problems.

2.3.2 Selecting the distribution that follows the target type

People are also interested in the question of finding distributions of a particular type, e.g., finding a Gaussian distribution or an exponential distribution among several candidates. This problem is another kind of hypothesis test and is commonly used in data verification (Avenhaus & Canty, 1996; Bensefia et al., 2004). The traditional solution uses the Kolmogorov-Smirnov test to check which distribution is more likely to be a Gaussian one (Lilliefors, 1967). However, the Kolmogorov-Smirnov test is based on the similarity of cumulative distribution functions and may be inefficient in some problem instances (see details in Sect. 5). Our model setting and solutions, on the other hand, can be used to solve this problem by finding the distribution whose total variation distance to a Gaussian distribution is the smallest, which can be more efficient and precise in most cases.

3 Algorithmic frameworks

In this section, we present the elimination framework (Maron & Moore, 1997; Evendar et al., 2006; Kaufmann & Kalyanakrishnan, 2013) and the LUCB framework (Kalyanakrishnan et al., 2012). We make adjustments to these frameworks in order to support general reward functions depending on the full distributions. In particular, we replace the estimation function in these frameworks with \(\mathbf{Estimate}\), and abstract the sample-size function \(n_H(\delta ,\varDelta )\) and the gap function \(\varDelta _H(\delta ,n)\), which will be instantiated for different H functions in Sect. 4. We prove that our adjustment does not influence the correctness guarantee if \(n_H(\delta ,\varDelta )\) and \(\varDelta _H(\delta ,n)\) are chosen appropriately.

Note that the key requirement in pure exploration problems is the high-probability error constraint on \(\mathbf{Estimate}\), i.e.,

$$\begin{aligned} \Pr [|H(D) - \mathbf{Estimate} (O(D,n))| \ge \varDelta ] \le \delta , \end{aligned}$$
(2)

where O(D, n) is a set of n i.i.d. samples drawn from D. We can see that there are three parameters \(n, \varDelta , \delta\) in the above inequality. While \(\delta\) is usually known at the beginning in the fixed-confidence setting, each of the remaining two parameters \((n,\varDelta )\) can be deduced from the other: for a fixed \(\varDelta\), we can find an \(n = n_H(\delta , \varDelta )\) such that Eq. (2) holds; for a fixed n, we can also find a \(\varDelta = \varDelta _H(\delta , n)\) such that Eq. (2) holds. One of the reasons that we are interested in the elimination framework and the LUCB framework is that they show the roles of these two functions intuitively. In the elimination framework, the algorithm first chooses a fixed \(\varDelta\) in each phase, and then pulls all the remaining arms enough times such that Eq. (2) holds with the fixed \(\varDelta\). This directly leads to the formulation of \(n_H(\delta , \varDelta )\). In the LUCB framework, the algorithm needs to use the current n observations of distribution D to estimate the confidence interval of H(D) in each time step, which means that it needs to find the confidence radius such that Eq. (2) holds for the fixed n. This directly leads to the formulation of \(\varDelta _H(\delta , n)\).

Another reason that we are interested in the elimination framework and the LUCB framework is that these algorithm frameworks are very general and can deal with most pure exploration problem settings. Though there exist other algorithm frameworks that achieve better performance in pure exploration problems, they require further assumptions on the problem instance and may not suit our setting well. For example, the T-a-S algorithm framework in Garivier and Kaufmann (2016) requires that the distributions are parameterized by their means as in one-parameter exponential families (e.g., they are all Gaussian distributions with unit variance), and the theoretical analysis of the tight complexity upper bound of T-a-S fails without such an assumption. This assumption makes sense in the classic pure exploration setting, since in that case the algorithm only cares about the mean of each distribution. However, in the pure exploration problem with general reward functions, the algorithm cares about the entire shape of the distribution, and this assumption can be impractical in many situations. As a concrete example, consider the scenario in which we are looking for a Gaussian distribution among a set of candidates; assuming that all the distributions are Gaussian is then inappropriate. On the other hand, if we do have such an assumption in the problem instance, then there is no need to estimate the whole distribution: since the distributions are parameterized by their means, estimating their means is enough to solve the pure exploration problem instance. Therefore, the T-a-S framework (Garivier & Kaufmann, 2016) does not suit our problem setting well.

3.1 Elimination framework

The elimination framework is shown in Algorithm 1. The basic idea of the elimination framework is to divide the game into several phases. In each phase, the remaining arms are pulled the same number of times so that the size of their confidence intervals on the \(H(D_i)\)’s is halved. Then at the end of each phase, the arms with larger gaps \(\varDelta _i\) can be eliminated, since their upper confidence bounds are smaller than the lower confidence bound of some other arm.

As we have mentioned, an important parameter in the elimination framework is the number of times that an arm should be pulled in the k-th phase. In this paper, we define \(n_H(\delta , \varDelta )\) to be the value such that for any distribution D,

$$\begin{aligned} \forall n \ge n_H(\delta , \varDelta ), \Pr [|H(D) - \mathbf{Estimate} (O(D,n))| \ge \varDelta ] \le \delta . \end{aligned}$$
(3)

If at the end of the k-th phase, we want the confidence radius of the estimated \(H(D_i)\) to be \(\mathrm{rad}(k)\) and the confidence level to be \(\delta (k)\), then we only need \(n_H(\delta (k), \mathrm{rad}(k))\) observations for each of the remaining arms.

[Algorithm 1: the elimination framework]
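Since the original pseudocode figure is not reproduced here, the following Python sketch illustrates one way to instantiate the elimination framework; the phase schedule \(\mathrm{rad}(k) = 2^{-k}\) and \(\delta (k) = \delta /(2mk^2)\) are assumptions chosen to match the form of Proposition 2 below, and `estimate` and `n_H` are the components instantiated in Sect. 4.

```python
import math

def elimination(arms, estimate, n_H, delta):
    """Sketch of the elimination framework (Algorithm 1).

    arms     : list of callables; arms[i]() returns one sample from D_i
    estimate : Estimate(observations) -> estimate of H(D_i)
    n_H      : n_H(delta, Delta) -> number of samples needed for Eq. (3)
    delta    : overall error probability
    """
    m = len(arms)
    active = list(range(m))
    obs = {i: [] for i in active}
    k = 0
    while len(active) > 1:
        k += 1
        rad_k = 2.0 ** (-k)                  # target confidence radius in phase k
        delta_k = delta / (2 * m * k ** 2)   # per-phase confidence level (assumed schedule)
        target = math.ceil(n_H(delta_k, rad_k))
        for i in active:
            while len(obs[i]) < target:      # pull until n_H(delta_k, rad_k) observations
                obs[i].append(arms[i]())
        est = {i: estimate(obs[i]) for i in active}
        best_lcb = max(est[i] - rad_k for i in active)
        # eliminate arms whose upper confidence bound falls below the best lower bound
        active = [i for i in active if est[i] + rad_k >= best_lcb]
    return active[0]
```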

Proposition 2

With probability at least \(1-\delta\), Algorithm 1 works correctly, and the sample complexity \(T_E\) satisfies:

$$\begin{aligned} T_E \le \sum _{i=1}^m n_H\left( {\delta \over 2m\log ^2\left( {8\over \varDelta _i}\right) }, {\varDelta _i \over 8}\right) . \end{aligned}$$

The proof of Proposition 2 is deferred to Appendix A.2.

3.2 LUCB framework

The LUCB framework is shown in Algorithm 2. The idea is, between the empirically best arm and the arm with the largest potential (i.e., the largest upper confidence bound among the rest), to pull the one with the larger uncertainty, and to stop only when there exists an arm whose lower confidence bound is larger than the upper confidence bounds of all other arms.

In the LUCB framework, we define \(\varDelta _H(\delta ,n)\) to be the value such that for any distribution D,

$$\begin{aligned} \Pr [|H(D) - \mathbf{Estimate} (O(D,n))| \ge \varDelta _H(\delta ,n)] \le \delta . \end{aligned}$$
(4)

Then in any time slot t, we can obtain the confidence bounds for \(H(D_i)\) by \(\mathbf{Estimate} (O_i(t))\) and \(\varDelta _H(\delta (t), N_i(t))\), where \(O_i(t)\) is the set of all observations of arm i until time step t, \(\delta (t)\) is the confidence level, and \(N_i(t) = |O_i(t)|\) is the number of observations of arm i until time t.

[Algorithm 2: the LUCB framework]
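Analogously, a minimal Python sketch of the LUCB framework is given below; the per-step confidence level \(\delta (t) = \delta /(2mt^2)\) is an assumption matching the form of Proposition 3 below, and pulling only the more uncertain of the two candidate arms follows the description above.

```python
def lucb(arms, estimate, Delta_H, delta):
    """Sketch of the LUCB framework (Algorithm 2).

    arms     : list of callables; arms[i]() returns one sample from D_i
    estimate : Estimate(observations) -> estimate of H(D_i)
    Delta_H  : Delta_H(delta, n) -> confidence radius for n observations (Eq. (4))
    delta    : overall error probability
    """
    m = len(arms)
    obs = [[arms[i]()] for i in range(m)]      # initialise with one pull per arm
    t = m
    while True:
        t += 1
        delta_t = delta / (2 * m * t ** 2)     # per-step confidence level (assumed schedule)
        est = [estimate(o) for o in obs]
        rad = [Delta_H(delta_t, len(o)) for o in obs]
        best = max(range(m), key=lambda i: est[i])
        challenger = max((i for i in range(m) if i != best),
                         key=lambda i: est[i] + rad[i])
        # stop when the best arm's lower bound dominates the strongest challenger's upper bound
        if est[best] - rad[best] >= est[challenger] + rad[challenger]:
            return best
        # otherwise pull the more uncertain of the two candidates
        pull = best if rad[best] >= rad[challenger] else challenger
        obs[pull].append(arms[pull]())
```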

Proposition 3

With probability at least \(1-\delta\), Algorithm 2 works correctly, and the sample complexity \(T_L\) satisfies:

$$\begin{aligned} T_L \le \min \left\{ t> 0 : t > \sum _{i=1}^m n_H\left( {\delta \over 2mt^2}, {\varDelta _i\over 4}\right) \right\} . \end{aligned}$$

The proof of Proposition 3 is deferred to Appendix A.3.

4 Estimation methods for different types of H functions

In Sect. 3, we saw that the keys to solving our problem are the function \(\mathbf{Estimate}\) and the corresponding functions \(n_H(\delta ,\varDelta )\) and \(\varDelta _H(\delta ,n)\). The instantiation of these functions depends on the hardness of estimating H, which in turn depends on the hardness of the distance measure given in Assumption 1. We consider three types of distance measures below, from easiest to hardest. Our technical analysis will focus on the hardest one, the total variation distance.

Definition 1

\(\varLambda _{m}\) is the distance between the means of two distributions, i.e., \(\varLambda _{m}(D,D') = |\mathbb {E}_{X\sim D}[X] - \mathbb {E}_{X'\sim D'}[X']|\).

\(\varLambda _{K}\) is the Kolmogorov-Smirnov distance between two cumulative distribution functions, i.e., \(\varLambda _{K}(D,D') = \sup _x |F_D(x) - F_{D'}(x)|\), where \(F_D(x)\) is the cumulative distribution function of D.

\(\varLambda _{ TV}\) is the total variation distance, i.e., \(\varLambda _{ TV}(D,D') = \sup _{A \subseteq S}|D(A) - D'(A)|\), where S is the support of D and \(D'\), and D(A) is the probability mass of set A for D. Moreover, if the probability density functions of D and \(D'\) exist and are integrable, then \(\varLambda _{ TV}(D,D') = {1\over 2}\int _x |f_D(x) - f_{D'}(x)|dx\), where \(f_D(x)\) is the probability density function of D.

4.1 The case of \(\varLambda _{m}\) or \(\varLambda _{K}\)

The case of \(\varLambda _{m}\) is almost the same as the classic pure exploration problem, since we only need to ensure that the estimated distribution does not have a large bias on its mean. We can define the empirical distribution \(\mathbf{p} (O)\) as follows: \(p(x) = {\#(x,O)\over |O|}\) for any x, where \(\#(x,O)\) is the number of occurrences of x in the observation set O, and then set \(\mathbf{Estimate} (O) = H(\mathbf{p} (O))\). When the distributions are bounded in [0, 1], using the Chernoff-Hoeffding inequality, we know that Eq. (3) holds with \(n_H(\delta ,\varDelta ) = {B^2\over 2\varDelta ^2} \log {2\over \delta }\) and Eq. (4) holds with \(\varDelta _H(\delta ,n) = B\sqrt{\log {2\over \delta }\over 2n}\). In this case, our solutions reduce to the elimination algorithm (Kaufmann & Kalyanakrishnan, 2013) and the LUCB algorithm (Kalyanakrishnan et al., 2012) for pure exploration in classic multi-armed bandits, which shows the flexibility of our solutions.

As for the case of \(\varLambda _{K}\), it becomes a little harder, but the complexity does not change since we have the DKW inequality (Dvoretzky et al., 1956; Massart, 1990) (see details in Appendix B). Here the estimation method is the same, i.e., to use \(\mathbf{Estimate} (O) = H(\mathbf{p} (O))\), where \(\mathbf{p} (O)\) is the empirical distribution. In this case, we still have that Eq. (3) holds with \(n_H(\delta ,\varDelta ) = {B^2\over 2\varDelta ^2} \log {2\over \delta }\) and Eq. (4) holds with \(\varDelta _H(\delta ,n) = B\sqrt{\log {2\over \delta }\over 2n}\).
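For concreteness, the two closed forms above translate directly into code; the helper names below are illustrative.

```python
import math

def n_H_hoeffding(delta, Delta, B=1.0):
    """Sample size making Eq. (3) hold under Lambda_m or Lambda_K
    (Chernoff-Hoeffding / DKW): B^2 log(2/delta) / (2 Delta^2)."""
    return math.ceil(B ** 2 * math.log(2.0 / delta) / (2.0 * Delta ** 2))

def Delta_H_hoeffding(delta, n, B=1.0):
    """Confidence radius making Eq. (4) hold: B * sqrt(log(2/delta) / (2n))."""
    return B * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```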

Proposition 4

If Assumption 1 holds with distribution distance \(\varLambda _{m}\) or \(\varLambda _{K}\), then using the empirical distribution \(\mathbf{p} (O)\) to estimate \(H(D_i)\) (i.e., Algorithm 3) in the elimination framework with \(n_H(\delta ,\varDelta ) = {B^2\over 2\varDelta ^2} \log {2\over \delta }\) has sample complexity \(O(\sum _{i=1}^m {B^2\over \varDelta _i^2} (\log {m\over \delta }+\log \log {1\over \varDelta _i}))\), and using the empirical distribution \(\mathbf{p} (O)\) to estimate \(H(D_i)\) in the LUCB framework with \(\varDelta _H(\delta ,n) = B\sqrt{\log {2\over \delta }\over 2n}\) has sample complexity \(O(B^2 h \log {Bh\over \delta })\), where \(h = \sum _{i=1}^m {1\over \varDelta _i^2}\).

The proof of Proposition 4 is deferred to Appendix A.4.

4.2 The case of \(\varLambda _{ TV}\)

In the cases of \(\varLambda _{m}\) and \(\varLambda _{K}\), we only need to care about the mean value and the cumulative distribution function of each distribution, which does not depend on whether the distribution itself is discrete or continuous. However, when we come to the total variation distance, we need to construct the probability mass (density) function, which can differ substantially between the discrete case and the continuous case. Because of this, we address each of these cases in a separate section below; these are the main contributions of our paper.

The reasons that we are interested in the total variation distance are: i) using \(\varLambda _{ TV}\) to measure the similarity between distributions can achieve a lower sample complexity than traditional solutions that use \(\varLambda _{K}\) (see a detailed example in Sect. 5), and if we use \(\varLambda _{ TV}\) to measure the similarity between distributions, then Assumption 1 holds with \(\varLambda _{ TV}\) naturally; ii) the total variation distance is related to many other distance metrics via upper and lower bounds (Gibbs & Su, 2002), and convergence with respect to the total variation distance implies convergence with respect to those other distance metrics (Charalambous et al., 2014). Because of this, our study of the total variation distance is general, and extending our algorithms to problem settings with other distance metrics can be straightforward.

4.2.1 Discrete distributions with finite support

First, we consider the case that \(D_i\)’s are discrete distributions with finite support S. In this case, we can still use the empirical distribution to estimate \(H(D_i)\) (Algorithm 3).

[Algorithm 3: estimating H(D) with the empirical distribution]
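In place of the pseudocode figure, a minimal sketch of this estimator is given below; it assumes the reward function H is available as a routine that accepts an empirical probability mass function (here a Python dict).

```python
from collections import Counter

def estimate_finite_support(observations, H):
    """Sketch of Algorithm 3: plug the empirical distribution
    p(x) = #(x, O) / |O| into the reward function H."""
    n = len(observations)
    counts = Counter(observations)
    empirical = {x: c / n for x, c in counts.items()}   # empirical pmf over the observed support
    return H(empirical)
```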

Theorem 1

If H follows Assumption 1 with \(\varLambda _{ TV}\) and constant B, and \(D_i\)’s are discrete distributions with finite support S, then Algorithm 3 satisfies that: Eq. (3) holds with \(n_H(\delta , \varDelta ) = {B^2\over \varDelta ^2}(\log {1\over \delta } + {|S|\over 2})\), and Eq. (4) holds with \(\varDelta _H(\delta ,n) = B\sqrt{\log {1\over \delta }\over 2n} + B\sqrt{|S| \over 4n}\).

Proof

From Theorem 2 in Berend and Kontorovich (2012), we know that for any \(\epsilon \ge \sqrt{|S|\over 4n}\),

$$\begin{aligned} \Pr [\varLambda _{ TV}(\mathbf{p} , D) \ge \epsilon ] \le \exp \left( -2n\left( \epsilon - \sqrt{|S|\over 4n}\right) ^2\right) , \end{aligned}$$

where n is the number of observations.

Setting \(\delta = \exp \left( -2n\left( \epsilon - \sqrt{|S|\over 4n}\right) ^2\right)\), we find that \(\epsilon = \sqrt{\log {1\over \delta }\over 2n} + \sqrt{|S| \over 4n}\), which means that Eq. (4) holds with \(\varDelta _H(\delta ,n) = B\sqrt{\log {1\over \delta }\over 2n} + B\sqrt{|S| \over 4n}\).

On the other hand, since \((\sqrt{a}+\sqrt{b})^2 \le 2(a+b)\), we have \(\sqrt{\log {1\over \delta }\over 2n} + \sqrt{|S| \over 4n} \le \sqrt{{1\over n}\left( \log {1\over \delta } + {|S|\over 2}\right) }\), so any \(n \ge {1\over \epsilon ^2}(\log {1\over \delta } + {|S|\over 2})\) guarantees a radius of at most \(\epsilon\). This means that \(n_H(\delta , \varDelta ) = {B^2\over \varDelta ^2}(\log {1\over \delta } + {|S|\over 2})\) is enough to make Eq. (3) hold. \(\square\)

By Theorem 1, we have the following corollary.

Corollary 1

With \(\varDelta _H(\delta ,n), n_H(\delta , \varDelta )\) set as in Theorem 1, using Algorithm 3 in the elimination framework has sample complexity \(O(\sum _{i=1}^m {B^2\over \varDelta _i^2} (\log {m\over \delta }+\log \log {1\over \varDelta _i})+ \sum _{i=1}^m {B^2|S| \over \varDelta _i^2})\), and using Algorithm 3 in the LUCB framework has sample complexity \(O(B^2h \log {Bh\over \delta } + \sum _{i=1}^m {B^2|S| \over \varDelta _i^2})\), where \(h = \sum _{i=1}^m {1\over \varDelta _i^2}\).

Note that there are always some pure exploration problem instances with general reward functions such that \(\mathsf{KL}(D_i, D_{i^*}) = \varTheta (\varLambda _{ TV}(D_i, D_{i^*})^2) = \varTheta ({\varDelta _i^2 \over B^2})\) (in fact, \(\mathsf{KL}(D_i, D_{i^*}) = \varTheta (\varLambda _{ TV}(D_i, D_{i^*})^2)\) holds in many cases, regardless of whether the distributions are discrete or continuous, especially when \(\varLambda _{ TV}(D_i, D_{i^*})\) is small). Therefore, by Proposition 1, we know that one cannot expect a \(\delta\)-PAC algorithm to achieve a complexity of \(o(\sum _i {B^2\over \varDelta _i^2}\log {1\over \delta })\) on every pure exploration problem instance (with general reward functions). Besides, the gap between the complexity upper bounds and the complexity lower bound (\(O(B^2\sum _{i=1}^m {|S| +\log m +\log \log {1\over \varDelta _i}\over \varDelta _i^2})\) for the elimination framework and \(O(B^2\sum _{i=1}^m {|S| +\log (Bh)\over \varDelta _i^2})\) for the LUCB framework) does not depend on \(\delta\). This means that our algorithms are asymptotically optimal.

4.2.2 Discrete distributions with infinite support

Now we consider the case that the \(D_i\)’s are discrete distributions with infinite (but countable) support. Without loss of generality, we assume that the support is \(\mathbb {N} = \{0,1,2,\cdots \}\). In this case, there must still be a bounded interval that contains most of the probability mass. To make this idea precise, we assume that the distributions \(D_i\) are sub-exponential, as stated in the following assumption:

Assumption 2

There exist constants \(\beta ,\lambda\) such that for any \(i \in [m]\) and \(z \in \mathbb {N}\), \(d_i(z) \le \beta \exp (-\lambda z)\), where \(d_i\) is the probability mass function of \(D_i\).

In this case, we can also use the empirical distribution to estimate \(H(D_i)\) (i.e., Algorithm 3).

Theorem 2

If H follows Assumption 1 with \(\varLambda _{ TV}\) and constant B, and \(D_i\)’s are discrete distributions with infinite support \(\mathbb {N}\), then under Assumption 2, Algorithm 3 satisfies that: Eq. (3) holds with

$$\begin{aligned} n_H(\delta , \varDelta ) = {32B^2\log {1\over \delta } \over \varDelta ^2} + {16B^2 \over \lambda \varDelta ^2}\log {16B^2\beta ^2\lambda \over (1-e^{-\lambda })^2\varDelta ^2}, \end{aligned}$$

and Eq. (4) holds with

$$\begin{aligned} \varDelta _H(\delta ,n) = 2B\sqrt{2\log {1\over \delta } \over n} + B\sqrt{{2\over \lambda n}\log {2\beta ^2\lambda n \over (1 - e^{-\lambda })^2}}. \end{aligned}$$

Proof

Denote \(z = {1\over \lambda } \log {2\beta \over \epsilon (1 - e^{-\lambda })}\), then we have that

$$\begin{aligned} \sum _{z'=z}^\infty d_i(z') \le \sum _{z'=z}^\infty \beta \exp (-\lambda z') =\beta \exp (-\lambda z)\sum _{z' = z}^\infty e^{-(z'-z)\lambda } = {\epsilon \over 2}. \end{aligned}$$

For a set of observations O, let \(\varvec{p}\) denote the corresponding empirical distribution, then

$$\begin{aligned} \varLambda _{ TV}(\varvec{p}, D_i) &= {1\over 2} \sum _{x \ge 0} |p(x) - d_i(x)|\\ &= {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \sum _{x \ge z} |p(x) - d_i(x)|\right) \\ &\le {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \sum _{x \ge z} (p(x) + d_i(x)) \right) \\ &= {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \sum _{x \ge z} (p(x) - d_i(x) + 2d_i(x))\right) \\ &= {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \sum _{x \ge z} (p(x) - d_i(x)) + 2 \sum _{x \ge z}d_i(x)\right) \\ &\le {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \sum _{x \ge z} (p(x) - d_i(x)) + \epsilon \right) \\ &\le {1\over 2}\left( \sum _{x < z}|p(x) - d_i(x)| + \left| \sum _{x \ge z} (p(x) - d_i(x))\right| \right) + {\epsilon \over 2}. \end{aligned}$$

Therefore,

$$\begin{aligned}&\Pr [\varLambda _{ TV}(\varvec{p}, D_i) \ge \epsilon ]\\&\quad \le \Pr \left[ {1\over 2}\left( \sum _{x< z}|p(x) - d_i(x)| + \left| \sum _{x \ge z} (p(x) - d_i(x))\right| \right) + {\epsilon \over 2} \ge \epsilon \right] \\&\quad = \Pr \left[ {1\over 2}\left( \sum _{x < z}|p(x) - d_i(x)| + \left| \sum _{x \ge z} (p(x) - d_i(x))\right| \right) \ge {\epsilon \over 2}\right] . \end{aligned}$$

On the other hand, \({1\over 2}\left( \sum _{x < z}|p(x) - d_i(x)| + |\sum _{x \ge z} (p(x) - d_i(x))|\right)\) can be viewed as the total variation distance between \(\varvec{p}\) and \(D_i\) when we regard all the elements \(x \ge z\) as a new element. Thus by Theorem 2 in Berend and Kontorovich (2012), we have

$$\begin{aligned}&\Pr [\varLambda _{ TV}(\varvec{p}, D_i) \ge \epsilon ]\\&\quad \le \Pr \left[ {1\over 2}\left( \sum _{x < z}|p(x) - d_i(x)| + \left| \sum _{x \ge z} (p(x) - d_i(x))\right| \right) \ge {\epsilon \over 2}\right] \\&\quad \le \exp \left( -2n\left( {\epsilon \over 2} - \sqrt{z\over 4n}\right) ^2\right) . \end{aligned}$$

Therefore, if \(\epsilon \ge \sqrt{2\log {1\over \delta } \over n} + \sqrt{z\over n}\), we must have \(\Pr [\varLambda _{ TV}(\varvec{p}, D_i) \ge \epsilon ] \le \delta\). Note that z is a function of \(\epsilon\) (\(z = {1\over \lambda } \log {2\beta \over \epsilon (1 - e^{-\lambda })}\)), hence we consider \(\epsilon _1, \epsilon _2\) such that \({\epsilon _1 \over 2} \ge \sqrt{2\log {1\over \delta } \over n}\) and \({\epsilon _2 \over 2} \ge \sqrt{ {1\over n\lambda } \log {2\beta \over \epsilon _2(1 - e^{-\lambda })}}\). In this case \(\epsilon = \epsilon _1 + \epsilon _2\) satisfies

$$\begin{aligned} {\epsilon \over 2}={\epsilon _1 + \epsilon _2 \over 2}\ge {\epsilon _1 \over 2}\ge \sqrt{2\log {1\over \delta } \over n}, \end{aligned}$$

and

$$\begin{aligned} {\epsilon \over 2}={\epsilon _1 + \epsilon _2 \over 2}\ge {\epsilon _2 \over 2}\ge \sqrt{ {1\over n\lambda } \log {2\beta \over \epsilon _2(1 - e^{-\lambda })}}\ge \sqrt{ {1\over n\lambda } \log {2\beta \over \epsilon (1 - e^{-\lambda })}}, \end{aligned}$$

which implies that \(\epsilon \ge \sqrt{2\log {1\over \delta } \over n} + \sqrt{z\over n}\).

After some basic computations, we have that \(\epsilon _1 = 2\sqrt{2\log {1\over \delta } \over n}\) and \(\epsilon _2 = \sqrt{{2\over \lambda n}\log {2\beta ^2\lambda n \over (1 - e^{-\lambda })^2}}\), i.e., Eq. (4) holds with \(\varDelta _H(\delta ,n) = 2B\sqrt{2\log {1\over \delta } \over n} + B\sqrt{{2\over \lambda n}\log {2\beta ^2\lambda n \over (1 - e^{-\lambda })^2}}\).

To bound \(n_H(\delta , \varDelta )\), we use a similar trick, i.e., we define two values \(n_1,n_2\) as follows: \(2B\sqrt{2\log {1\over \delta }\over n_1} \le {\varDelta \over 2}\), \(B\sqrt{{2\over \lambda n_2}\log {2\beta ^2\lambda n_2 \over (1 - e^{-\lambda })^2}} \le {\varDelta \over 2}\). Then we know that

$$\begin{aligned}&2B\sqrt{2\log {1\over \delta } \over (n_1+n_2)} + B\sqrt{{2\over \lambda (n_1+n_2)}\log {2\beta ^2\lambda (n_1+n_2) \over (1 - e^{-\lambda })^2}}\\&\quad \le 2B\sqrt{2\log {1\over \delta }\over n_1} +B\sqrt{{2\over \lambda n_2}\log {2\beta ^2\lambda n_2 \over (1 - e^{-\lambda })^2}} \\&\quad \le {\varDelta \over 2} + {\varDelta \over 2}\\&\quad = \varDelta , \end{aligned}$$

which means that we can use \(n_1+n_2\) as an upper bound of \(n_H(\delta , \varDelta )\). After some basic computations, we have that \(n_1 = {32B^2\log {1\over \delta } \over \varDelta ^2}\) and \(n_2 = {16B^2 \over \lambda \varDelta ^2}\log {16B^2\beta ^2\lambda \over (1-e^{-\lambda })^2\varDelta ^2}\). \(\square\)

By Theorem 2, we have the following corollary.

Corollary 2

With \(\varDelta _H(\delta ,n), n_H(\delta , \varDelta )\) set as in Theorem 2, using Algorithm 3 in the elimination framework has sample complexity \(O(\sum _{i=1}^m {B^2\over \varDelta _i^2} (\log {m\over \delta }+\log \log {1\over \varDelta _i}) + \sum _{i=1}^m {B^2 \over \lambda \varDelta _i^2}\log {B\beta \lambda \over (1-e^{-\lambda })\varDelta _i})\), and using Algorithm 3 in the LUCB framework has sample complexity \(O(B^2h\log {Bh\over \delta } + \sum _{i=1}^m {B^2 \over \lambda \varDelta _i^2}\log {B\beta \lambda \over (1-e^{-\lambda })\varDelta _i})\), where \(h = \sum _{i=1}^m {1\over \varDelta _i^2}\).

Similarly, the gap between the complexity lower bound in Proposition 1 and the complexity upper bounds in Corollary 2 is an additive term that does not depend on \(\delta\), so our algorithms are asymptotically optimal.

4.2.3 Continuous distributions with bounded support

Next, we consider the case that the \(D_i\)’s are continuous distributions with bounded support. Without loss of generality, we assume that the support is [0, 1]. Since the observations form a finite discrete sample, we cannot handle probability density functions that are not continuous. Because of this, we need the following two assumptions.

[Algorithm 4: estimating H(D) for continuous distributions with bounded support]

Assumption 3

The probability density function \(d_i\) exists for every \(D_i\), and \(d_i\) is integrable on the support [0, 1].

Assumption 4

There exists a constant C such that

$$\begin{aligned} \forall i \in \{1,\cdots ,m\}, \forall x,y , |d_i(x)-d_i(y)| \le 2C|x-y|. \end{aligned}$$

Assumption 3 states that the probability density functions of all the distributions \(D_i\) exist, and Assumption 4 ensures the (Lipschitz) continuity of these probability density functions. Based on these two assumptions, we can obtain the following corollary, and estimate the value of \(H(D_i)\) by Algorithm 4.

Corollary 3

Under Assumptions 3 and 4, for all \(i \in A\), \(0 \le x \le y \le 1\), we have

$$\begin{aligned} {1\over 2}\int _x^y |d_i(z)-\mu (x,y,d_i)|dz \le {C(x-y)^2\over 4}, \end{aligned}$$

where \(\mu (x,y,d_i) = {\int _x^y d_i(z)dz \over y-x}\) is the average density of distribution \(d_i\) in the interval [x, y] or (x, y].

The basic idea of Algorithm 4 is to build a discrete distribution that simulates a continuous one, i.e., we divide the continuous distribution into several intervals, and treat each interval as an element in the support of the discrete distribution. Specifically, we use \(\ell\) to denote the size of the intervals, i.e., the support [0, 1] is divided into \({1\over \ell }\) small intervals \(\{I_1, I_2, I_3, \cdots , I_{1\over \ell }\} = \{[0,\ell ], (\ell , 2\ell ], (2\ell , 3\ell ], \cdots , (1-\ell , 1]\}\). Then the support of the built discrete distribution is \(\{I_1, I_2, I_3, \cdots , I_{1\over \ell }\}\). For an observation set O, the algorithm first computes the corresponding empirical discrete distribution \(\varvec{q} = [q_1,\cdots , q_{1\over \ell }]\), where \(q_s\) represents the proportion of observations that lie in interval \(I_s\). Then it uses the empirical discrete distribution \(\varvec{q}\) to build an estimated continuous distribution with probability density function p, i.e., for any \(z \in I_s\), \(p(z) = {q_s \over \ell }\).

The difficulty is that we do not know in advance how many intervals are enough. If we choose a small \(\ell\), then the support of the discrete distribution becomes too large, which leads to a high complexity bound. But if the number of intervals is too small, then there will be a large gap between \(D_i\) and the estimated distribution within each interval. To deal with this challenge, we use an adaptive method to choose \(\ell\) in Algorithm 4, which is described as follows.

Note that there are two possible errors between the estimated continuous distribution and the real one. Firstly, there is a gap between the empirical discrete distribution \(\varvec{q}\) and the real discrete distribution (i.e., the probability mass of the random variable in each interval). By the analysis in Sect. 4.2.1, this gap is upper bounded by \(\sqrt{\log {1\over \delta } \over 2n} + \sqrt{1\over 4n\ell }\) (here n is the size of the observation set O), which increases when we decrease \(\ell\). Secondly, even if the empirical discrete distribution is the same as the real discrete distribution, there is also a gap between the real probability density function (a curve) and the estimated probability density function (several horizontal line segments). From Corollary 3, we know that the difference between the curve and the horizontal line segments is upper bounded by \({1\over \ell } \cdot {C\ell ^2 \over 4} = {C\ell \over 4}\), which decreases when we decrease \(\ell\). To trade off between these two errors, we shrink the size \(\ell\) by half until \(\sqrt{1\over 4n\ell } \ge {C\ell \over 4}\). This makes sure that the final \(\ell\) is neither too large nor too small. A sketch of this estimation procedure is given below.
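The sketch assumes an initial interval size of 1/2 and a reward function H that accepts a piecewise-constant density given as a list of (left endpoint, right endpoint, height) triples.

```python
import math

def estimate_bounded_continuous(observations, H, C, ell0=0.5):
    """Sketch of Algorithm 4: estimate H(D) for a continuous distribution on [0, 1].

    The interval size ell is halved while the statistical error sqrt(1/(4*n*ell))
    is still smaller than the binning error C*ell/4 (Assumption 4 gives C).
    """
    n = len(observations)
    ell = ell0
    while math.sqrt(1.0 / (4.0 * n * ell)) < C * ell / 4.0:
        ell /= 2.0                                # binning error still dominates: refine the grid
    k = int(round(1.0 / ell))                     # number of intervals
    counts = [0] * k
    for x in observations:
        counts[min(int(x / ell), k - 1)] += 1     # interval index of observation x
    # piecewise-constant density estimate: height q_s / ell on the s-th interval
    pieces = [(s * ell, (s + 1) * ell, counts[s] / (n * ell)) for s in range(k)]
    return H(pieces)
```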

Theorem 3

If H follows Assumption 1 with \(\varLambda _{ TV}\) and constant B, and \(D_i\)’s are bounded continuous distributions that follow Assumptions 3 and 4, then Algorithm 4 satisfies that: Eq. (3) holds with

$$\begin{aligned} n_H(\delta ,\varDelta ) = {2B^2\log {1\over \delta } \over \varDelta ^2} + {8\sqrt{2}B^3C \over \varDelta ^3}, \end{aligned}$$

and Eq. (4) holds with

$$\begin{aligned} \varDelta _H(\delta ,n) = B\sqrt{\log {1\over \delta }\over 2n} + B\left( {\sqrt{2}C \over n}\right) ^{1\over 3}. \end{aligned}$$

Proof

Let \(d_i(\ell ,s) \triangleq \int _{(s-1)\ell }^{s\ell } d_i(z)dz\) be the probability mass for distribution \(D_i\) in interval \(((s-1)\ell ,s\ell ]\). Then we can bound the gap \(\varLambda _{ TV}(\mathbf{p} ,D_i)\) as follows:

$$\begin{aligned} \varLambda _{ TV}(\mathbf{p} ,D_i) &= {1\over 2} \sum _{s=1}^{1\over \ell } \int _{(s-1)\ell }^{s\ell } |p(z) - d_i(z)| dz\nonumber \\ &\le {1\over 2} \sum _{s=1}^{1\over \ell } \int _{(s-1)\ell }^{s\ell } \left| p(z) - {d_i(\ell ,s) \over \ell } \right| dz + {1\over 2} \sum _{s=1}^{1\over \ell } \int _{(s-1)\ell }^{s\ell } \left| {d_i(\ell ,s) \over \ell } - d_i(z)\right| dz\nonumber \\ &\le {1\over 2} \sum _{s=1}^{1\over \ell } |q_s - d_i(\ell ,s)| + {C\ell \over 4}, \end{aligned}$$
(5)

where Eq. (5) is given by Corollary 3.

Similar to the proof of Theorem 1, the first term in (5) can be upper bounded by \(\sqrt{\log {1\over \delta }\over 2n} + \sqrt{1 \over 4n\ell }\) with probability at least \(1-\delta\). As for the second term, in lines 5-8 of Algorithm 4, we know that \({C\ell \over 4} \le \sqrt{1\over 4n\ell }\). Thus, \(\varLambda _{ TV}(\mathbf{p} ,D_i) \le \sqrt{\log {1\over \delta }\over 2n} + \sqrt{1 \over n\ell }\) with probability at least \(1-\delta\).

Note that \(\ell\) is not an input of \(\varDelta _H\) or \(n_H\), thus we still need a bound on it. This is given by lines 5-8 as well. If we shrink the size of intervals to \(\ell\), then \(2\ell\) must satisfy that \(\sqrt{1\over 4n(2\ell )} \le {C(2\ell ) \over 4}\), which implies \(\ell \ge \left( {1\over 2C^2n}\right) ^{1\over 3}\).

Then we have that \(\varDelta _H(\delta ,n) = B\left( \sqrt{\log {1\over \delta }\over 2n} + ({\sqrt{2}C \over n})^{1\over 3}\right)\).

To bound \(n_H(\delta , \varDelta )\), we define two values \(n_1,n_2\) as follows: \(B\sqrt{\log {1\over \delta }\over 2n_1} = {\varDelta \over 2}\), \(B({\sqrt{2}C \over n_2})^{1\over 3} = {\varDelta \over 2}\). Then we know that \(B\left( \sqrt{\log {1\over \delta }\over 2(n_1+n_2)} + ({\sqrt{2}C \over n_1+n_2})^{1\over 3}\right) \le B\sqrt{\log {1\over \delta }\over 2n_1}+B({\sqrt{2}C \over n_2})^{1\over 3} = \varDelta\), which means that we can use \(n_1+n_2\) as an upper bound of \(n_H(\delta , \varDelta )\). After some basic computations, we have that \(n_1 = {2B^2\log {1\over \delta } \over \varDelta ^2}\) and \(n_2 = {8\sqrt{2}B^3C \over \varDelta ^3}\). \(\square\)

By Theorem 3, we have the following corollary.

Corollary 4

With \(\varDelta _H(\delta ,n), n_H(\delta , \varDelta )\) set as in Theorem 3, using Algorithm 4 in the elimination framework has sample complexity \(O(\sum _{i=1}^m {B^2\over \varDelta _i^2} (\log {m\over \delta }+\log \log {1\over \varDelta _i}) + B^3C\sum _{i=1}^m {1\over \varDelta _i^3})\), and using Algorithm 4 in the LUCB framework has sample complexity \(O(B^2h\log {Bh\over \delta } + B^3C\sum _{i=1}^m {1\over \varDelta _i^3})\), where \(h = \sum _{i=1}^m {1\over \varDelta _i^2}\).

Compared with Corollary 1, the additive constant in Corollary 4 is \(O(B^3C\sum _{i=1}^m {1\over \varDelta _i^3})\) (which again does not depend on \(\delta\), indicating that our algorithms are still asymptotically optimal). The reason is that when applying Algorithm 4 in the elimination framework or the LUCB framework, the game stops when \(\ell _i = \varTheta ({\varDelta _i \over BC})\). Thus, the estimated discrete distribution for arm i has support size \(\varTheta ({BC \over \varDelta _i})\), which means that its corresponding additive constant term becomes \(O({B^2|S_i|\over \varDelta _i^2}) = O({B^3C\over \varDelta _i^3})\).

4.2.4 Continuous distributions with unbounded support

Now we consider the most complicated case: the \(D_i\)’s are unbounded continuous distributions. We still need to assume that the probability density function exists and is integrable for every distribution \(D_i\). Moreover, we also assume that the \(D_i\)’s are sub-exponential. All these conditions are summarized in the following assumption.

Assumption 5

The probability density function \(d_i\) exists for every \(D_i\), and all the \(d_i\)’s are integrable. There exist constants \(\beta ,\lambda\) such that for any \(i\in [m]\) and \(z \in \mathbb {R}\), \(d_i(z) \le \beta \exp (-\lambda |z|)\).

Here we use Algorithm 5 to estimate the value of \(H(D_i)\). Similar to Algorithm 4, Algorithm 5 also divides the continuous distribution into several intervals with length \(\ell\), and treats each interval as an element in the support of the discrete distribution. However, we cannot divide the whole support into small intervals, since it is unbounded. Therefore, we also need a bounded interval \([-L, L]\), and we only divide this interval into small ones. Each of the intervals \((-\infty , -L)\) and \((L, \infty )\) is treated as a single element. Specifically, there are \({2L \over \ell } + 2\) intervals \(\{I_-, I_1,I_2, \cdots , I_{2L\over \ell }, I_+\} = \{(-\infty , -L), [-L,-L+\ell ], (-L + \ell , -L + 2\ell ], \cdots , (L-\ell , L], (L, \infty )\}\), and we still use \(\varvec{q} = [q_-, q_1, q_2, \cdots , q_{2L\over \ell }, q_+]\) to denote the empirical discrete distribution. To estimate the probability density function p, we use the following method: for \(z \in I_s\) with \(1 \le s \le {2L \over \ell }\), we choose \(p(z) = {q_s \over \ell }\) (the same as Algorithm 4); for the intervals \(I_-\) and \(I_+\), we use two exponential tails, i.e., \(p(z) = q_- \lambda e^{\lambda (z+L)}\) for \(z \in I_-\) and \(p(z) = q_+ \lambda e^{-\lambda (z-L)}\) for \(z \in I_+\).

Now there are three possible errors between the estimated distribution and the real one. Firstly, since there are \({2L \over \ell } + 2\) intervals, the gap between the real discrete distribution (i.e., the probability mass of the random variable lies in each interval) and the empirical discrete distribution \(\varvec{q}\) is upper bounded by \(\sqrt{\log {1\over \delta }\over 2n} + \sqrt{{{2L \over \ell } + 2 \over 4n}} \le \sqrt{\log {1\over \delta }\over 2n} + \sqrt{ L \over n\ell }\). Secondly, the difference between a curve and several horizontal line segments in the bounded interval \([-L, L]\) is upper bounded by \({2L \over \ell } \cdot {C\ell ^2 \over 4} = {CL\ell \over 2}\). Lastly, the difference in the two unbounded intervals \((-\infty , -L)\) and \((L, \infty )\) is upper bounded by \({2\beta \over \lambda }e^{-\lambda L}\) (by Assumption 5).

From the above explanations, we come up with the following adaptive method to find proper L and \(\ell\). If \(\sqrt{\log {1\over \delta } \over 2n} < {2\beta \over \lambda } e^{-\lambda L}\), i.e., the last difference is too large, then we double the size of L to make that difference smaller. This makes sure that the final L is neither too large nor too small. After we find a proper L, we check whether \(\sqrt{L \over n\ell } < {CL\ell \over 2}\), i.e., whether the second difference is too large. If it is too large, then we shrink the size \(\ell\) by half. This makes sure that the final \(\ell\) is neither too large nor too small as well.

We can see that these adaptive steps achieve a good trade-off between the above three errors, and therefore Algorithm 5 also enjoys good performance, which is stated in the next theorem.

[Algorithm 5: estimating H(D) for continuous distributions with unbounded support]
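In place of the pseudocode figure, the following sketch illustrates the estimator; the initial values \(L_0 = 1\) and \(\ell _0 = 1/2\) are assumptions, and H is assumed to accept the piecewise-constant part on \([-L, L]\) together with the two exponential tails.

```python
import math

def estimate_unbounded_continuous(observations, H, delta, C, beta, lam, L0=1.0, ell0=0.5):
    """Sketch of Algorithm 5: estimate H(D) for a sub-exponential continuous distribution."""
    n = len(observations)
    L = L0
    # double L while the tail error (2*beta/lam)*exp(-lam*L) is still too large
    while math.sqrt(math.log(1.0 / delta) / (2.0 * n)) < (2.0 * beta / lam) * math.exp(-lam * L):
        L *= 2.0
    ell = ell0
    # halve ell while the binning error C*L*ell/2 still dominates sqrt(L/(n*ell))
    while math.sqrt(L / (n * ell)) < C * L * ell / 2.0:
        ell /= 2.0
    k = int(round(2.0 * L / ell))                     # number of inner intervals
    counts, q_minus, q_plus = [0] * k, 0, 0
    for x in observations:
        if x < -L:
            q_minus += 1
        elif x > L:
            q_plus += 1
        else:
            counts[min(int((x + L) / ell), k - 1)] += 1
    inner = [(-L + s * ell, -L + (s + 1) * ell, counts[s] / (n * ell)) for s in range(k)]
    tails = {"left_mass": q_minus / n, "right_mass": q_plus / n, "rate": lam, "L": L}
    return H(inner, tails)                            # H must understand this representation
```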

Theorem 4

If H follows Assumption 1 with \(\varLambda _{ TV}\) and constant B, and \(D_i\)’s are unbounded continuous distributions that follow Assumptions 4 and 5, then Algorithm 5 satisfies that: Eq. (3) holds with

$$\begin{aligned} n_H(\delta , \varDelta ) = {8B^2\log {1\over \delta } \over \varDelta ^2}+ {128B^3C \over \lambda ^2\varDelta ^3} \log ^2 {1024\beta ^2 B^3C\over \lambda ^4\varDelta ^3}, \end{aligned}$$

and Eq. (4) holds with

$$\begin{aligned} \varDelta _H(\delta ,n) = B\sqrt{2\log {1\over \delta }\over n} + B\left( {{8\sqrt{2}}C \log ^2({\beta \over \lambda }\sqrt{8n\over \log {(1/\delta )}})\over n\lambda ^2}\right) ^{1/3}. \end{aligned}$$

Proof

Similar to the proof of Theorem 3, we let \(d_i(L,\ell ,s) = \int _{-L+(s-1)\ell }^{-L+s\ell } d_i(z) dz\) be the probability mass for distribution \(D_i\) in interval \((-L+(s-1)\ell , -L+s\ell ]\), while \(d_i(L,\ell ,-) = \int _{-\infty }^{-L} d_i(z)dz\) and \(d_i(L,\ell ,+) = \int _{L}^\infty d_i(z)dz\).

First consider \(\int _{-\infty }^{-L} |p(z) - d_i(z)|dz\), and we have:

$$\begin{aligned} \int _{-\infty }^{-L} |p(z) - d_i(z)|dz &\le \int _{-\infty }^{-L} (p(z) + d_i(z))dz \\ &= \int _{-\infty }^{-L} (2d_i(z) + p(z) - d_i(z))dz \\ &\le 2 d_i(L,\ell ,-) + |q_- -d_i(L,\ell ,-) |. \end{aligned}$$

Similarly, \(\int _{L}^{\infty } |p(z) - d_i(z)|dz \le 2 d_i(L,\ell ,+) + |q_+ -d_i(L,\ell ,+)|\). Thus, we have that

$$\begin{aligned} \varLambda _{ TV}(\mathbf{p} ,D_i) &\le d_i(L,\ell ,-) + d_i(L,\ell ,+) + {CL\ell \over 2} + {1\over 2}\sum _{s\in \{-,1,\cdots ,{2L\over \ell },+\}}|q_s - d_i(L,\ell ,s)| \end{aligned}$$
(6)
$$\begin{aligned} &\le {2\beta \over \lambda }e^{-\lambda L} + {CL\ell \over 2} + {1\over 2}\sum _{s\in \{-,1,\cdots ,{2L\over \ell },+\}}|q_s - d_i(L,\ell ,s)|. \end{aligned}$$
(7)

The proof of Eq. (6) is the same as that of Eq. (5), which uses Corollary 3. Eq. (7) follows from Assumption 5.

By lines 6-8 and 9-11 in Algorithm 5, we always have \({2\beta \over \lambda }e^{-\lambda L} \le \sqrt{\log {1\over \delta }\over 2n}\) and \({CL\ell \over 2} \le \sqrt{L \over n\ell }\). Thus,

$$\begin{aligned} {2\beta \over \lambda }e^{-\lambda L} + {CL\ell \over 2} \le \sqrt{\log {1\over \delta }\over 2n} + \sqrt{L \over n\ell }. \end{aligned}$$

By results in Berend and Kontorovich (2012), the third term in Eq. (7) has the following high probability upper bound (with probability \(1-\delta\))

$$\begin{aligned} \sqrt{\log {1\over \delta }\over 2n} + \sqrt{{1 \over 4n}\cdot \left( {2L \over \ell } + 2\right) } \le \sqrt{\log {1\over \delta }\over 2n} + \sqrt{L \over n\ell }. \end{aligned}$$

Thus \(\varLambda _{ TV}(\mathbf{p} ,D_i) \le \sqrt{2\log {1\over \delta }\over n} + \sqrt{4L \over n\ell }\) with probability \(1 - \delta\).

Now we consider L and \(\ell\). From lines 6-8 in Algorithm 5, we know that \({2\beta \over \lambda }e^{-\lambda {L\over 2}} > \sqrt{\log {1\over \delta }\over 2n}\), which implies \(L < {2\over \lambda } \log ({\beta \over \lambda }\sqrt{8n\over \log {1\over \delta }})\). From lines 9-11, we know that \(\sqrt{L\over n(2\ell )} < {CL(2\ell ) \over 2}\), which implies \(\ell > ({1\over 2C^2nL})^{1\over 3}\).

Then \(\varDelta _H(\delta ,n) = B\sqrt{2\log {1\over \delta }\over n} + B\left( {{8\sqrt{2}}C \log ^2({\beta \over \lambda }\sqrt{8n\over \log {(1/\delta )}})\over n\lambda ^2}\right) ^{1/3}\). As for \(n_H(\delta , \varDelta )\), we still choose \(n_1, n_2\) such that \(B\sqrt{2\log {1\over \delta }\over n_1} = {\varDelta \over 2}\) and \(B\left( {{8\sqrt{2}}C \log ^2({\beta \over \lambda }\sqrt{8n_2\over \log {(1/\delta )}})\over n_2\lambda ^2}\right) ^{1/3} = {\varDelta \over 2}\). This implies that \(n_H(\delta , \varDelta ) = {8B^2\log {1\over \delta } \over \varDelta ^2}+ {128B^3C \over \lambda ^2\varDelta ^3} \log ^2 {1024\beta ^2 B^3C\over \lambda ^4\varDelta ^3}\). \(\square\)

Corollary 5

With \(\varDelta _H(\delta ,n), n_H(\delta , \varDelta )\) set as in Theorem 4, using Algorithm 5 in the elimination framework has sample complexity \(O(\sum _{i=1}^m {B^2\over \varDelta _i^2} (\log {m\over \delta }+\log \log {1\over \varDelta _i}) + \sum _{i=1}^m {B^3C \over \lambda ^2\varDelta _i^3} \log ^2 {\beta BC\over \lambda \varDelta _i})\), and using Algorithm 5 in the LUCB framework has sample complexity \(O(B^2h \log {Bh\over \delta } + \sum _{i=1}^m {B^3C \over \lambda ^2\varDelta _i^3} \log ^2 {\beta BC\over \lambda \varDelta _i})\), where \(O(\sum _{i=1}^m {B^3C \over \lambda ^2\varDelta _i^3} \log ^2 {\beta BC\over \lambda \varDelta _i})\) is a constant gap that does not depend on \(\delta\) and \(h = \sum _{i=1}^m {1\over \varDelta _i^2}\).

By Corollary 5, we know that in this case, our algorithms are also asymptotically optimal, since the gap between the complexity upper bounds and the complexity lower bound does not depend on \(\delta\).

4.3 Applications of our solutions

Here we briefly explain how to apply our solutions in reality. Consider the scenario that we are selecting the closest distribution to the target distribution (as explained in Sect. 2.3.1, it is widely adopted in real applications such as quality control and medical problems). In this case, we can set \(H(D) = -\varLambda _{ TV}(D, G)\), where G is the target distribution. Note that this reward function H satisfies Assumption 1 with the total variation distance \(\varLambda _{ TV}\) and \(B = 1\) naturally (by triangle inequality, \(|H(D) - H(D')| = |\varLambda _{ TV}(D, G) - \varLambda _{ TV}(D', G)| \le \varLambda _{ TV}(D,D')\)). Therefore, for different kinds of distributions, we can use different algorithms in Sect. 4.2 and their corresponding \(n_H(\delta , \varDelta )\)’s or \(\varDelta _H(\delta ,n)\)’s with \(B = 1\) to solve this problem (by applying either the elimination framework or the LUCB framework).
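As an illustration, a reward of this form can be evaluated on the piecewise-constant density estimates produced by the Algorithm 4 sketch above; the midpoint approximation of the target density within each bin is a simplification introduced here for the example.

```python
def tv_similarity_reward(target_density):
    """H(D) = -Lambda_TV(D, G), evaluated on a piecewise-constant density estimate
    (a list of (left, right, height) triples), so Assumption 1 holds with B = 1.
    `target_density(x)` is the known density of the target distribution G."""
    def H(pieces):
        total = 0.0
        for left, right, height in pieces:
            mid = 0.5 * (left + right)              # one-point approximation of G on this bin
            total += abs(height - target_density(mid)) * (right - left)
        return -0.5 * total
    return H
```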

Our solutions have the advantage that they can deal with the case of looking for the closest distribution under the total variation distance \(\varLambda _{ TV}\), while existing algorithms (e.g., the Kolmogorov-Smirnov test) only work for the Kolmogorov-Smirnov distance \(\varLambda _{K}\). Compared with using the Kolmogorov-Smirnov distance to make decisions, using the total variation distance can achieve a lower complexity bound. This is because \(\varLambda _K(D,D') \le \varLambda _{ TV}(D,D')\), i.e., a small total variation distance leads to a small Kolmogorov-Smirnov distance, but not vice versa. Specifically, when we use the Kolmogorov-Smirnov distance to measure the similarity of distribution D to target G (i.e., \(H_K(D) = -\varLambda _K(D,G)\)), the reward gap is \(\varDelta _K = H_K(G) - H_K(D) = \varLambda _K(D,G)\), and when we use the total variation distance to measure the similarity of distribution D to target G (i.e., \(H_{ TV}(D) = -\varLambda _{ TV}(D,G)\)), the reward gap is \(\varDelta _{ TV} = H_{ TV}(G) - H_{ TV}(D) = \varLambda _{ TV}(D,G)\). Since \(\varLambda _K(D,G) \le \varLambda _{ TV}(D,G)\), we know that \(\varDelta _K \le \varDelta _{ TV}\), which means that using the Kolmogorov-Smirnov distance leads to a higher sample complexity (see a concrete example in Sect. 5).

5 Experiments

Fig. 1: The probability density function (blue line) of \(\mathsf{Pert}(n)\) (here \(n = 9\))

Here we use some experiments to demonstrate the effectiveness of our algorithms, by considering the following problem instance.

Problem 1

There are two arms \(\{1,2\}\). The observation distribution of arm 1 is \(\mathsf{Unif}\), the uniform distribution on [0, 1], and the observation distribution of arm 2 is \(\mathsf{Pert}(n)\) with parameter \(n \in \mathbb {N}^+\). \(\mathsf{Pert}(n)\) is also supported on [0, 1], and has probability density function \(f_n(x) = 1 + (-1)^{k(x,n)}({1\over 2n} - r(x,n))\) for \(x \in [0,1]\), where \(k(x,n) = \lfloor xn \rfloor\) and \(r(x,n) = x - {\lfloor xn \rfloor \over n}\) (as shown in Fig. 1). The goal of the player is to identify the uniform distribution between these two arms. A possible way to instantiate this environment is sketched below.
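The following sketch shows one possible way to simulate the two arms of Problem 1; the rejection sampler is an assumption made for the illustration (the density is bounded by \(1 + {1\over 2n}\), so a uniform proposal suffices).

```python
import math
import random

def pert_density(x, n):
    """Density f_n of Pert(n) on [0, 1] as defined in Problem 1."""
    k = math.floor(x * n)
    r = x - k / n
    return 1.0 + (-1.0) ** k * (1.0 / (2.0 * n) - r)

def sample_pert(n):
    """Draw one sample from Pert(n) by rejection against a uniform proposal."""
    bound = 1.0 + 1.0 / (2.0 * n)                 # upper bound on the density
    while True:
        x = random.random()
        if random.random() * bound <= pert_density(x, n):
            return x

# The two arms of Problem 1: arm 1 is Unif[0, 1], arm 2 is Pert(n).
arms = [random.random, lambda: sample_pert(9)]
```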

Here we compare the sample complexity of applying the LUCB framework with the traditional solution (i.e., using the Kolmogorov-Smirnov distance \(\varLambda _K\)) and with our solution (i.e., using the total variation distance \(\varLambda _{ TV}\)). The reason that we consider the LUCB framework is that the complexity curve under the elimination framework has a stair-step shape, which makes it hard to see how fast the complexity increases when n or \(\log {1\over \delta }\) increases.

Fig. 2: The complexity of applying the LUCB framework for different parameters n and different error constraints \(\delta\)

Fig. 2 shows the complexity of applying the LUCB framework for different parameters n and different error constraints \(\delta\). The red line shows the sample complexity of using the Kolmogorov-Smirnov distance \(\varLambda _K\), and the blue line shows the sample complexity of using the total variation distance \(\varLambda _{ TV}\). All results are averaged over 100 independent runs. We can see that the complexity of our solution (the blue line) increases much more slowly than that of the existing solution (the red line) when n or \(\log {1\over \delta }\) increases, indicating that our solution is very efficient. This accords with our analysis, which is explained in detail below.

Recall that Proposition 4 shows that using the Kolmogorov-Smirnov distance \(\varLambda _K\) leads to a complexity bound of \(T_K = O({1\over \varDelta _{K}^2} (\log {2\over \delta }+\log {1\over \varDelta _{K}}))\), where \(\varDelta _K = \varLambda _K(\mathsf{Pert}(n), \mathsf{Unif}) = {1\over 8n^2}\). As for using the total variation distance, Corollary 4 shows that the complexity is upper bounded by \(T_{ TV} = O({1\over \varDelta _{ TV}^2} (\log {2\over \delta }+\log {1\over \varDelta _{ TV}}) + {1\over \varDelta _{ TV}^3})\), where \(\varDelta _{ TV} = \varLambda _{ TV}(\mathsf{Pert}(n), \mathsf{Unif}) = {1\over 4n}\). Therefore, the complexity \(T_K = O(n^4\log {1\over \delta } + n^4\log n)\) and the complexity \(T_{ TV} = O(n^2\log {1\over \delta } + n^2\log n + n^3)\), i.e., \(T_{ TV}\) is much lower than \(T_K\). Besides, by Proposition 1, we know that the complexity lower bound is \(\varOmega ({1\over \mathsf{KL}(\mathsf{Pert}(n), \mathsf{Unif})} \log {1\over \delta })\), where

$$\begin{aligned} \mathsf{KL}(\mathsf{Pert}(n), \mathsf{Unif}) = n\int _{0}^{1\over n}\left( 1-{1\over 2n} + x\right) \log \left( 1-{1\over 2n} + x\right) dx = O\left( {1\over n^2}\right) . \end{aligned}$$

This means that our solution has an asymptotically optimal complexity upper bound, while the existing solution fails to achieve optimality in this problem.

6 Future work

In this paper, we concentrate on the fixed-confidence setting. Therefore, a possible further research topic is the fixed-budget model (with general reward functions). In Audibert et al. (2010), the authors use an elimination-based policy called SR to solve the fixed-budget problem. We believe that it is not hard to develop algorithms for the fixed-budget problem based on their algorithmic framework and the analysis in this paper.

Another interesting topic is to extend the pure exploration problem with general reward functions to the combinatorial setting, i.e., to find a set of arms \(S \subseteq A\) such that the joint distribution of the arms in S has the largest score (under a general reward function H). For example, let \(R(S) = r(\varvec{X},S)\) denote the random reward of a set of arms S, where \(\varvec{X}\) is a random vector that follows a fixed joint distribution, and the goal is to find the set of arms S such that R(S) has the largest \(\tau\)-quantile. This setting is common in online systems with special combinatorial structures, e.g., recommendation websites and search engines. Further research on this topic could be very helpful for algorithm design in these applications.