
1 Introduction

In the classical multi-armed bandit (MAB) problem, the learning agent faces a set \(K\) of stochastic arms, from which it chooses arms sequentially. In each round, the agent observes a random reward that depends on the selected arm. The goal of the agent is to maximize the cumulative reward (in the regret formulation), or to identify the arm with the highest expected reward (in the pure exploration problem). The MAB model has been studied extensively in the statistical and learning literature; see [2] for a comprehensive survey.

In this paper, we consider a quantile-based variant of the pure exploration MAB problem (quantile-MAB). In this variant, for a given \(0<\lambda <1\), the goal is to identify the arm whose \(\lambda \)-quantile is the largest among all arms (here, as usual, the \(\lambda \)-quantile is a value such that the probability of observing a reward at least that large is at least \(\lambda \)). More precisely, considering the PAC framework, the goal is to return, with a probability larger than \(1-\delta \), an arm whose \((\lambda -\epsilon )\)-quantile is not smaller than the largest \(\lambda \)-quantile among all arms; such an algorithm is called \((\epsilon ,\delta )\)-correct. In addition, we wish to minimize the sample complexity, i.e., the expected number of samples observed until the learning algorithm terminates.

For the standard MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [1, 5–8, 10], and lower bounds on the sample complexity were presented in [1, 9, 11].

Similar to the present quantile-MAB problem is the variant of the MAB problem in which the goal is to find the arm from which the largest possible sample can be obtained. This is known as the max k-armed bandit problem, and was first introduced in [3]. For this variant, algorithms that find the best arm in the PAC sense were provided in [4, 13], and a lower bound was presented in [4]. In contrast to the current quantile-MAB problem, in the max k-armed setting, it is necessary to assume a lower bound on the tail probabilities of the arms. When the tail functions of the arms are known, and \(\epsilon =\lambda \), the algorithms for the max k-armed bandit setting can be applied in the present quantile-MAB problem. However, their sample complexity upper bounds are larger than those of the algorithms presented in this paper.

More closely related to the present quantile-MAB problem is the work in [15], which considers a measure of risk called the value-at-risk (see [12]). The value-at-risk of a given random variable (R.V.) \(\varvec{X}\) is in fact the same as the quantile of the R.V. \(\varvec{-X}\). An algorithm whose sample complexity upper bound grows as \(\frac{\lambda |K|}{\epsilon ^2\delta D}\) (where D is the upper bound on the density functions) was provided in [15]; that algorithm is computationally demanding, since at each iteration it solves a non-linear, constrained, integer-valued optimization problem. Recently, the quantile-MAB problem was studied in [14]. That work provided a lower bound for the case in which \(\lambda =3/4\) and an algorithm with an upper bound on the sample complexity of the order of \(\sum _{k\in K}\frac{1}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\ln (\frac{|K|}{\delta \max \left( \epsilon ,\varDelta _{k,\lambda }\right) })\), where \(\varDelta _{k,\lambda }\) is the difference between the \(\lambda \)-quantile of arm k and that of the best arm.

In this paper, for certain arm distributions, we provide a lower bound of the order of \(\sum _{k\in K}\frac{\lambda (1-\lambda )\ln (\frac{1}{\delta })}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\) on the sample complexity of every \((\epsilon ,\delta )\)-correct algorithm. That lower bound improves on the bound in [14] in that it captures the quantile factor \(\lambda (1-\lambda )\), which is significant when \(\lambda \) is close to 0 or to 1. Furthermore, for general distribution functions, we provide two algorithms that attain the lower bound up to the logarithmic terms \(\ln \left( |K|/\epsilon \right) \) and \(\ln \left( |K|\log _2(1/\epsilon )\right) \), respectively. The upper bounds of these algorithms are smaller than the one in [14] by a factor of \(\lambda \), and, for the second algorithm, also by a logarithmic factor in \(\epsilon \).

The paper proceeds as follows. In the next section we present our model. In Sect. 3, a lower bound on the sample complexity of every \((\epsilon ,\delta )\)-correct algorithm is presented. Then, in Sect. 4, we present our \((\epsilon ,\delta )\)-correct algorithms and provide upper bounds on their sample complexity. The second algorithm is based on applying the doubling trick to the first one. In Sect. 5 we provide experiments that illustrate the improved sample complexity of our algorithms compared with the results presented in [14]. In Sect. 6 we close the paper with some concluding remarks.

2 Model Definition

We consider a finite set of arms, denoted by \(K\). At each stage \(t=1,2,\dots \) the learning agent chooses an arm \(k\in K\), and a real valued reward is obtained from that arm. The rewards obtained from each arm k are independent and identically distributed, with a distribution function (CDF) \(F_{k}(x)\), \(x\in \mathbb {R}\). We denote the quantile function of arm \(k\in K\) by \(Q_k:[0,1]\rightarrow \mathbb {R}\), and define it as follows.

Definition 1

For every arm \(k\in K\), the quantile function \(Q_k(\lambda )\) is defined by

$$\begin{aligned} Q_k(\lambda )\triangleq \inf \{x\in \mathbb {R}|1-\lambda <F_k(x)\}. \end{aligned}$$

Note that \(P\left( \varvec{x}_k\ge Q_k(\lambda ) \right) \ge \lambda \), where \(\varvec{x}_k\) stands for a random variable with distribution \(F_k\). Clearly, if \(F_k\) is continuous at the point \(Q_k(\lambda )\), we have equality, namely, \(P\left( \varvec{x}_k\ge Q_k(\lambda ) \right) = \lambda \).
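To illustrate Definition 1, the following snippet computes the empirical counterpart of \(Q_k(\lambda )\) from a finite sample, replacing \(F_k\) by the empirical CDF; the function name and interface are illustrative only and are not part of the paper.

```python
def empirical_quantile(samples, lam):
    """Empirical analogue of Definition 1: the smallest sample value x for which
    the empirical CDF F_hat(x) exceeds 1 - lam."""
    xs = sorted(samples)
    n = len(xs)
    for i, x in enumerate(xs):
        if (i + 1) / n > 1 - lam:  # F_hat(x) = (i + 1) / n at the i-th smallest sample
            return x
    return xs[-1]  # unreachable for 0 < lam < 1; kept as a guard
```

For rewards drawn from the uniform(0,1) distribution, for instance, this value concentrates around \(Q(0.9)=0.1\) when \(\lambda =0.9\), as the sample size grows.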

An algorithm for the quantile-MAB problem samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number T of samples, which is finite with probability 1, and to return an arm \(k'\). An algorithm is said to be \((\epsilon ,\delta )\)-correct if the returned arm is \(\epsilon \)-optimal with a probability larger than \(1-\delta \) (see the precise definition later in this section). The expected number of samples E[T] taken by the algorithm is the sample complexity, which we wish to minimize.

We next provide some definitions and notations which we use later in this paper. A \(\lambda \)-quantile optimal arm is defined as follows.

Definition 2

Arm \(k\in K\) is \(\lambda \)-quantile optimal if

$$\begin{aligned} Q_{k}(\lambda )=x^*_{\lambda }\triangleq \max _{k'\in K}Q_{k'}(\lambda ). \end{aligned}$$

We use the following quantity which represents the distance of an arm from being optimal,

$$\begin{aligned} \varDelta _{k,\lambda }=\sup \{F_k(x)|x<x^*_{\lambda }\}-(1-\lambda ). \end{aligned}$$
(1)

If \(F_k\) is continuous, then \(\varDelta _{k,\lambda }=F_k(x^*_{\lambda })-(1-\lambda )\). Furthermore, note that for every suboptimal arm k, namely, an arm for which \(Q_k(\lambda )<x^*_{\lambda }\), it follows by the monotonicity of CDF functions that \(\varDelta _{k,\lambda }>0\).

Now we are ready to precisely define an \((\epsilon ,\delta )\) -correct algorithm.

Definition 3

For \(\lambda \) and \(\epsilon \) such that \(0<\epsilon<\lambda <1\) and \(\delta >0\), an algorithm is \((\epsilon ,\delta )\) -correct if

$$\begin{aligned} P\left( Q_{k'}(\lambda -\epsilon )\ge x^*_{\lambda }\right) \ge 1-\delta \end{aligned}$$

where \(k'\) stands for the arm returned by the algorithm.
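As a small illustration (with hypothetical names, not part of the paper), given the true quantile functions, the event inside Definition 3 for a single run reduces to the following comparison; an algorithm is \((\epsilon ,\delta )\)-correct if this event holds with probability at least \(1-\delta \) over its runs.

```python
def satisfies_correctness_event(Q, k_returned, lam, eps):
    """Check, for a single run, the event inside Definition 3: the (lam - eps)-quantile
    of the returned arm is not smaller than the best lam-quantile x*_lam.
    `Q` maps each arm index to its quantile function Q_k(.)."""
    x_star = max(Q_k(lam) for Q_k in Q.values())
    return Q[k_returned](lam - eps) >= x_star
```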

3 Lower Bound

Before presenting our algorithms, we provide a lower bound on the sample complexity of any \((\epsilon ,\delta )\) -correct algorithm for certain arm distributions. The lower bound is provided in the following Theorem.

Theorem 1

Assume that \(F_k\) is continuous for every \(k\in K\). Fix some \(\epsilon _0\) such that \(0<\epsilon _0<\frac{1}{4}\). For every \(\lambda \in [2\epsilon _0,1-2\epsilon _0]\), \(\epsilon \in (0,\epsilon _0]\) and \(\delta \le 0.15\), there exists a set of arm distributions \(\{F_k\}_{k\in K}\) such that for every \((\epsilon ,\delta )\)-correct algorithm,

$$\begin{aligned} E[T]\ge \sum _{k\in K\setminus k^*}\frac{\lambda \left( 1-\lambda \right) }{2\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\ln \left( \frac{1}{2.4\delta }\right) \end{aligned}$$
(2)

where \(k^*\) denotes some optimal arm, with \(Q_{k^*}(\lambda )=x^*_{\lambda }\).

The above lower bound refines the one presented in [14] in the sense that here the size of the quantile \(\lambda \) is considered in the bound. To illustrate the lower bound, we provide an example.

Example 1

Let \(\{\mu _k\}_{k\in K}\) be a set of constants, and let \(\mu ^*=\max _{k\in K}\mu _{k}\). Suppose that the rewards of each arm \(k\in K\) are uniformly distributed on the interval \((\mu _k-1,\mu _k)\). Since \(\mu _{k}-x^*_{\lambda }\le 1\) for every arm \(k\in K\), it follows that

$$\begin{aligned} \begin{aligned} \sup \{F_k(x)|x<x^*_{\lambda }\}&=F_k(x^*_{\lambda })= {\left\{ \begin{array}{ll} 1-(\mu _k-x^*_{\lambda }), &{}\quad \mu _k\ge x^*_{\lambda }\\ 1, &{}\quad \mu _k< x^*_{\lambda } \end{array}\right. }. \end{aligned} \end{aligned}$$

As \(x^*_{\lambda }=\mu ^*-\lambda \), Eq. (1) implies that

$$\begin{aligned} \varDelta _{k,\lambda }=\min \left( \lambda ,\mu ^*-\mu _k\right) . \end{aligned}$$

Since \(\epsilon <\lambda \), the denominator term in Eq. (2) can be seen to be

$$\begin{aligned} \begin{aligned} \max \left( \epsilon ,\varDelta _{k,\lambda }\right)&= {\left\{ \begin{array}{ll} \epsilon , &{}\quad \mu ^*-\mu _k<\epsilon \\ \mu ^*-\mu _k, &{}\quad \epsilon \le \mu ^*-\mu _k<\lambda \\ \lambda , &{}\quad \lambda \le \mu ^*-\mu _k \end{array}\right. }. \end{aligned} \end{aligned}$$
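To make the example concrete, the following snippet (illustrative only, not part of the paper) evaluates the right-hand side of Eq. (2) for a given set of means \(\{\mu _k\}\), using \(\varDelta _{k,\lambda }=\min \left( \lambda ,\mu ^*-\mu _k\right) \) as derived above.

```python
import math

def uniform_lower_bound(mus, lam, eps, delta):
    """Evaluate the right-hand side of Eq. (2) for uniform(mu_k - 1, mu_k) arms,
    with Delta_{k,lam} = min(lam, mu* - mu_k) as in Example 1."""
    mu_star = max(mus)
    skipped_optimal = False
    total = 0.0
    for mu in mus:
        if mu == mu_star and not skipped_optimal:
            skipped_optimal = True  # the sum in Eq. (2) runs over K \ {k*}: skip one optimal arm
            continue
        gap = min(lam, mu_star - mu)  # Delta_{k,lam}
        total += lam * (1 - lam) / (2 * max(eps, gap) ** 2)
    return total * math.log(1 / (2.4 * delta))

# e.g. uniform_lower_bound([1.0, 0.95, 0.9, 0.5], lam=0.5, eps=0.05, delta=0.01)
```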

Proof

(Theorem 1 ). First, we assume that the quantile value of the optimal arm, namely \(x^*_{\lambda }\), is known. Moreover, we assume that for every arm \(k\in K\), the conditional distributions of \(\varvec{x}_k\) given \(\varvec{x}_k\ge x^*_{\lambda } \) and given \(\varvec{x}_k< x^*_{\lambda } \) are also known. Therefore, the learning algorithm needs only to estimate the parameters

$$p_k\triangleq P\left( \varvec{x}_k\ge x^*_{\lambda } \right) ,\quad \forall k\in K.$$

Now, by the continuity of the distribution functions it follows that \(\max _{k\in K}p_k=\lambda \). Also, by Eq. (1) it follows that

$$\max _{k\in K}p_k-p_{k'}=\varDelta _{k',\lambda }.$$

Therefore, finding an arm \(k'\) such that \(\varDelta _{k',\lambda }\le \epsilon \) is the same as finding a Bernoulli arm \(k'\), such that its expected value is \(\epsilon \)-optimal, namely, \(\max _{k\in K}p_k-p_{k'}\le \epsilon \). So, our problem is the same as the standard Bernoulli bandit problem with \(\{p_k\}_{k\in K}\) as the Bernoulli parameters.

Then, by Remark 5 in [9], in which a lower bound for the standard MAB problem with Bernoulli arms is provided for \(\delta \le 0.15\), we have

$$\begin{aligned} E[T]\ge \left( \frac{|S_{\epsilon }|-1}{KL\left( \lambda ,\lambda -\epsilon \right) }+\sum _{k\in \{K\setminus S_{\epsilon }\}}\frac{1}{KL\left( p_k,\lambda +\epsilon \right) }\right) \ln \frac{1}{2.4\delta }, \end{aligned}$$
(3)

where \(S_{\epsilon }\triangleq \{k|k\in K,\,p_k\ge \lambda -\epsilon \}\) and \(KL\left( p,q\right) \) stands for the Kullback-Leibler divergence between two Bernoulli distributions with parameters p and q respectively.

We note that \(\ln (1+x)\le x\). Hence,

$$\begin{aligned} KL\left( p,q\right) =p\ln \left( \frac{p}{q}\right) +(1-p)\ln \left( \frac{1-p}{1-q}\right) \le p\frac{p-q}{q}+(1-p)\frac{q-p}{1-q}=\frac{(p-q)^2}{q(1-q)}. \end{aligned}$$
(4)

Therefore, by Eqs. (3) and (4) it follows that

$$\begin{aligned} E[T]\ge \left( \frac{\left( \lambda -\epsilon \right) \left( 1-\lambda +\epsilon \right) \left( |S_{\epsilon }|-1\right) }{\epsilon ^2}+\sum _{k\in \{K\setminus S_{\epsilon }\}}\frac{\left( \lambda +\epsilon \right) \left( 1-\lambda -\epsilon \right) }{\left( \lambda +\epsilon -p_k\right) ^2}\right) \ln \frac{1}{2.4\delta }. \end{aligned}$$

Hence, by the facts that \(2\epsilon \le \lambda \) and \(2\epsilon \le (1-\lambda )\) and since \(\varDelta _{k,\lambda }=\lambda -p_k\), Eq. (2) is obtained. \(\quad \square \)
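As a quick numerical sanity check of the bound in Eq. (4) (an illustration of ours, not part of the argument):

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# spot-check Eq. (4): KL(p, q) <= (p - q)^2 / (q * (1 - q))
for p, q in [(0.5, 0.4), (0.75, 0.8), (0.1, 0.3)]:
    assert kl_bernoulli(p, q) <= (p - q) ** 2 / (q * (1 - q))
```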

4 Algorithms

In this section we provide two related algorithms. The first one is simpler and attains the lower bound in Theorem 1 up to a logarithmic term. The second algorithm is based on applying the doubling trick to the first one, and hence its upper bound attains the lower bound of Theorem 1 up to a double-logarithmic term.

4.1 The Max-Q Algorithm

Here we present our Max-Q algorithm. The algorithm is \((\epsilon ,\delta )\)-correct and based on sampling the arm which has the highest potential \(\lambda \)-quantile value.

The Max-Q algorithm starts by sampling each arm a fixed number of times. Then, for each arm, the algorithm computes a value which, with high probability, is not below that arm's \(\lambda \)-quantile, and chooses the arm for which this value is maximal. If the number of times that arm has been sampled exceeds a certain threshold, the algorithm stops and returns that arm; otherwise, it draws one more sample from the chosen arm.

The fundamental difference between the Max-Q algorithm and the algorithm presented in [14] is the fact that in the latter the entire CDF is estimated, while in this paper, just the value of the quantile is estimated. That difference leads to a bound on the sample complexity of the Max-Q algorithm which is smaller by a factor of \(\lambda \), compared to that in [14].

Algorithm 1. The Max-Q algorithm.
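Since the pseudocode figure is not reproduced here, the following is a minimal sketch of the Max-Q procedure, reconstructed from the description above and from the constants appearing in Theorem 2 below. The initial per-arm sample size \(N_0\) and all names in the code are illustrative assumptions, not the published pseudocode.

```python
import math
from bisect import insort

def max_q_sketch(arms, lam, eps, delta):
    """Illustrative sketch of the Max-Q algorithm (Algorithm 1).

    `arms` is a list of callables, each returning one random reward.
    The constants follow Theorem 2; N0 is an assumed initial sample size.
    """
    K = len(arms)
    # L as in Theorem 2: 6*ln(|K|*(1 + 10*lam*ln(1/delta)/eps^2)) + ln(1/delta)
    L = 6 * math.log(K * (1 + 10 * lam * math.log(1 / delta) / eps ** 2)) + math.log(1 / delta)
    stop_threshold = 10 * lam * L / eps ** 2      # step-5 threshold used in the proof of Theorem 2
    N0 = int(math.ceil(3 * L / lam)) + 1          # assumed: large enough so that m >= 1 below

    # initial phase: sample each arm N0 times, keeping each arm's samples sorted
    samples = [sorted(arms[k]() for _ in range(N0)) for k in range(K)]

    while True:
        # V^k: the m-th largest sample of arm k, with m = floor(lam*N - sqrt(3*lam*N*L)) + 1;
        # with high probability this value is not below the arm's lam-quantile
        V = []
        for s in samples:
            N = len(s)
            m = int(math.floor(lam * N - math.sqrt(3 * lam * N * L))) + 1
            m = min(N, max(1, m))
            V.append(s[-m])
        k = max(range(K), key=lambda i: V[i])     # arm with the highest potential quantile value
        if len(samples[k]) > stop_threshold:      # sampled often enough: stop and return this arm
            return k
        insort(samples[k], arms[k]())             # otherwise, draw one more sample from it

# example usage (assumes `import random`), with the uniform arms of Example 1:
# max_q_sketch([lambda m=m: random.uniform(m - 1, m) for m in (1.0, 0.9, 0.5)],
#              lam=0.5, eps=0.1, delta=0.05)
```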

Theorem 2

For every \(\lambda \in (0,1)\), \(\epsilon \in (0,\lambda )\) and \(\delta \in (0,1)\), Algorithm 1 is \((\epsilon ,\delta )\)-correct with a sample complexity bound of

$$\begin{aligned} E[T]\le \sum _{k\in K}\frac{10\lambda L}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}+|K|+1, \end{aligned}$$
(5)

where \(L=6\ln \left( |K|\left( 1+\frac{-10\lambda \ln (\delta )}{\epsilon ^2}\right) \right) -\ln \left( \delta \right) \) as defined in the algorithm.

It may be observed that for \(\lambda \le \frac{1}{2}\), the upper bound provided in Theorem 2 is of the same order as the lower bound in Theorem 1, up to a logarithmic factor.

To establish Theorem 2, we first bound the probability of the event under which the m-th largest sample of an optimal arm is below the \(\lambda \)-quantile. Then, we bound the number of samples that need to be observed from each suboptimal arm so that the m-th largest value obtained from that arm is below the \((\lambda -\epsilon )\)-quantile. To establish these bounds in a way that retains the multiplicative factor of \(\lambda \), we use Bernstein's inequality to bound the deviation of the empirical mean from the expectation of a Bernoulli R.V. which equals one if the sampled value is above the quantile and zero otherwise.

Proof

(Theorem 2 ). We denote the time step of the algorithm by t and the value of the counter C(k) at time step t by \(C^{t}(k)\), and we use the notation \(L'=L+\ln \left( \delta \right) \), writing \(x^*\) as shorthand for \(x^*_{\lambda }\). Recall that T stands for the random final time step. By the condition in step 5 of the algorithm, for every arm \(k\in K\), it follows that

$$\begin{aligned} C^{T-1}(k)\le \lfloor \frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2}\rfloor +1. \end{aligned}$$
(6)

Note that since \(\frac{d}{dx}6\ln (x)=\frac{6}{x}\le 1\) for \(x\ge 6\), and since \(x_{0}>6\ln (x_{0})+1\) for \(x_{0}=20\), it is obtained that

$$\begin{aligned} \begin{aligned} L''&\triangleq |K|\left( \frac{-10\lambda \ln (\delta )}{\epsilon ^2} +1\right)&>6\ln \left( |K|\left( \frac{-10\lambda \ln (\delta )}{\epsilon ^2} +1\right) \right) +1=L'+1, \end{aligned} \end{aligned}$$

for \(L''\ge 20\). So, by the fact that \(T=\sum _{k\in K}C^{T-1}(k)+1\), for \(L''\ge 20\) it follows that

$$\begin{aligned} \begin{aligned} T\le |K|\left( \frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2} +1\right) +1&< |K|\left( \frac{10\lambda \left( L''-\ln (\delta )\right) }{\epsilon ^2} +1\right) \\&\le L''^{2}=e^{\frac{L'}{3}}. \end{aligned} \end{aligned}$$
(7)

We proceed to establish the \((\epsilon ,\delta )\)-correctness of the algorithm. Let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it N times, and assume w.l.o.g. that \(Q_{1}(\lambda )=x^*_{\lambda }\). Then, for \(N\ge N_0\) and \(m=\lfloor \lambda N-\sqrt{3\lambda NL}\rfloor +1\), as stated in the algorithm, it follows by Lemma 1 below that

$$\begin{aligned} P\left( V^{1}_{N}(m)<x^*\right) \le \delta e^{-L'}. \end{aligned}$$
(8)

Hence, at every time step t, by Eqs. (7) and (8), applying the union bound yields

$$\begin{aligned} \begin{aligned} P\left( V^{t,1}<x^*\right)&\le \sum _{N=N_0}^{\exp \left( \frac{L'}{3}\right) }P\left( V^{1}_{N}(m)<x^*\right)&\le \delta e^{-\frac{2L'}{3}}. \end{aligned} \end{aligned}$$
(9)

where \(V^{t,k}\) stands for the value of \(V^{k}\) at time step t.

Let \(k^*_T\) stand for the arm returned by the algorithm. Also, by Lemma 1, for \(N>\frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2}\), it follows that

$$\begin{aligned} P\left( V_N^k(m)>Q_{k}(\lambda -\epsilon )\right) \le \delta e^{-L'}. \end{aligned}$$
(10)

So, since the condition in step 5 implies that \(C(k^*_T)>\frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2}\), it follows by Eq. (10) and the union bound that

$$\begin{aligned} P\left( V^{T,k^*_T}>Q_{k^*_T}(\lambda -\epsilon )\right) \le \sum _{k\in K}\sum _{t=1}^{\exp \left( \frac{L'}{3}\right) }\sum _{N=1}^{\exp \left( \frac{L'}{3}\right) }\delta e^{-L'}=|K|\delta e^{-\frac{L'}{3}}. \end{aligned}$$
(11)

Also, by Eq. (9) and the union bound it follows that

$$\begin{aligned} P\left( V^{T,1}< x^*\right) \le \sum _{t=1}^{\exp \left( \frac{L'}{3}\right) }P\left( V^{t,1}<x^*\right) \le \delta e^{-\frac{L'}{3}}. \end{aligned}$$
(12)

So, since by step 4 of the algorithm, \(V^{T,k^*_T}\ge V^{T,1}\), it follows by Eqs. (11) and (12) that

$$\begin{aligned} P\left( Q_{k^*_T}(\lambda -\epsilon )< x^*\right) \le P\left( V^{T,k^*_T}>Q_{k^*_T}(\lambda -\epsilon )\right) +P\left( V^{T,1}< x^*\right) < \delta . \end{aligned}$$

It follows that the algorithm returns an \(\epsilon \)-optimal arm with a probability larger than \(1-\delta \). Hence, it is \((\epsilon ,\delta )\)-correct.

To prove the bound on the expected sample complexity of the algorithm, we define the following sets:

$$M(\epsilon )=\{k\in K|\varDelta _{k,\lambda }\le \epsilon \}\quad \text {and} \quad N(\epsilon )=\{k\in K|\varDelta _{k,\lambda }>\epsilon \}.$$

As before, we assume w.l.o.g. that \(Q_1(\lambda )=x^*\). Then, for the case in which

$$\begin{aligned} E_{1}\triangleq \bigcap _{1\le t\le T}\left\{ V^{t,1}\ge x^*\right\} \end{aligned}$$

occurs, for every arm \(k\in K\), a necessary condition for \(C^T(k)>N_k'\), where \(N_k'=\lfloor \frac{10\lambda \left( L'-\ln (\delta )\right) }{\varDelta _{k,\lambda }^2}\rfloor +1\) is

$$\begin{aligned} E_{k}\triangleq \left\{ V^k_{N_k'}(m_k')\ge x^*\right\} , \end{aligned}$$

where \(m_k'=\lfloor \lambda N_k'-\sqrt{3\lambda N_k'\left( L'-\ln (\delta )\right) }\rfloor +1\).

Now, using the bound in Eq. (6) together with the fact that \(\sum _{k\in K}C^{T}(k)=\sum _{k\in K}C^{T-1}(k)+1\) for the arms in the set \(M(\epsilon )\), using \(N_k'\) as a bound for the arms in the set \(N(\epsilon )\), and using the bound in Eq. (7), it is obtained that

$$\begin{aligned} \begin{aligned} E[T]\le \left( 1-P\left( E_{1}\right) \right) e^{\frac{L'}{3}}&+P\left( E_{1}\right) \sum _{k\in N(\epsilon )}\left( \left( 1-P\left( E_k|E_1\right) \right) \varPhi _k(\epsilon )+e^{\frac{L'}{3}}P\left( E_k|E_1\right) \right) \\&+\sum _{k\in M(\epsilon )}\varPhi _k(\epsilon )+1, \end{aligned} \end{aligned}$$
(13)

where \(\varPhi _k(\epsilon )=\lfloor \frac{10\lambda \left( L'-\ln (\delta )\right) }{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\rfloor +1\). But, by Eq. (9) it follows that

$$\begin{aligned} \begin{aligned} P\left( E_1\right) \ge 1-\sum _{t=1}^{\exp \left( \frac{L'}{3}\right) }P\left( V^{t,1}<x^*\right)&\ge 1-\delta e^{\frac{-2L'}{3}}e^{\frac{L'}{3}}&=1-\delta e^{\frac{-L'}{3}}. \end{aligned} \end{aligned}$$
(14)

Also, since \(Q_k' \triangleq Q_k\left( \lambda -\sqrt{\frac{10\lambda \left( L'-\ln (\delta )\right) }{N_k'}}\right) <x^*\) for \(k\in N(\epsilon )\), it follows by Lemma 1 that

$$\begin{aligned} P\left( E_{k}|E_1\right) P\left( E_1\right) \le P\left( E_{k}\right) \le P\left( V^k_{N_k'}(m_k')>Q_k'\right) \le \delta e^{-L'},\quad \forall k\in N(\epsilon ) \end{aligned}$$
(15)

Therefore, by Eqs. (13), (14) and (15) and the definition of \(\varPhi _k(\epsilon )\), the bound on the sample complexity is obtained. \(\quad \square \)

Lemma 1

For every arm \(k\in K\), let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it N times. Then, for any positive integers m and N such that \(m<N\), and every \(\lambda \in [0, 1]\), it follows that:

  1.

    If \(\frac{m}{N}>\lambda \), then

    $$\begin{aligned} P\left( V^k_{N}(m)> Q_k(\lambda )\right) \le f_{0}(m,N,\lambda )\, . \end{aligned}$$
  2.

    If \(\frac{m}{N}<\lambda \), then

    $$\begin{aligned} P\left( V^k_{N}(m)<Q_k(\lambda )\right) \le f_{0}(m,N,\lambda )\, , \end{aligned}$$

where \(f_{0}(m,N,\lambda )=\exp \left( -\frac{|m-N\lambda |^{2}}{2\left( N\lambda +|m-N\lambda |/3\right) }\right) \).

The proof is based on Bernstein’s inequality.
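For completeness, we recall the form of Bernstein's inequality used below (a standard statement, restated here for convenience): if \(\widetilde{X}_{1},\dots ,\widetilde{X}_{N}\) are independent zero-mean random variables with \(|\widetilde{X}_{i}|\le 1\) and \(\mathrm {Var}(\widetilde{X}_{i})\le \sigma ^{2}\), then for every \(t>0\),

$$\begin{aligned} P\left( \sum _{i=1}^{N}\widetilde{X}_{i}\ge t\right) \le \exp \left( -\frac{t^{2}}{2\left( N\sigma ^{2}+t/3\right) }\right) . \end{aligned}$$

Applying this inequality with \(\sigma ^{2}=\lambda (1-\lambda )\) and \(t=m-N\lambda \) gives exactly the bound in Eq. (18) below, and the symmetric application with \(t=N\lambda -m\) gives Eq. (21).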

Proof

In this proof, we omit the arm index k for short. We start with claim (1). Let \(x_i\) stand for the i-th sampled value from the arm, and let \(\left\{ X_{i}(\lambda )\right\} \) and \(\left\{ Y_{i}(\lambda )\right\} \) be random variables for which

$$\begin{aligned} \begin{aligned} X_{i}(\lambda )&= {\left\{ \begin{array}{ll} 1 &{}w.p\quad \lambda \\ 0 &{}w.p\quad 1-\lambda \end{array}\right. } \quad \text {and} \quad Y_{i}(\lambda )&= {\left\{ \begin{array}{ll} 1 &{}x_{i}>Q(\lambda )\\ 0 &{}x_{i}\le Q(\lambda ) \end{array}\right. }\, . \end{aligned} \end{aligned}$$
(16)

Note that the variables \(\left\{ Y_{i}(\lambda )\right\} \) are i.i.d. The variables \(\left\{ X_{i}(\lambda )\right\} \) are i.i.d. as well. Then, since \(P\left( Y_{i}(\lambda )=1\right) \le P\left( X_{i}(\lambda )=1\right) \), after sampling N times,

$$\begin{aligned} \begin{aligned} P\left( V_{N}(m)>Q(\lambda )\right)&=P\left( \frac{1}{N}\sum _{i=1}^{N}Y_{i}(\lambda )\ge \frac{m}{N}\right) \le P\left( \frac{1}{N}\sum _{i=1}^{N}X_{i}(\lambda )\ge \frac{m}{N}\right) \\&=P\left( \frac{1}{N}\sum _{i=1}^{N}\widetilde{X}_{i}(\lambda )\ge \frac{m}{N}-E[X_{1}(\lambda )]\right) \triangleq \Upsilon (\lambda ,m,N)\, , \end{aligned} \end{aligned}$$
(17)

where \(\widetilde{X}_{i}(\lambda )=X_{i}(\lambda )-E[X_{1}(\lambda )]\). So, \(\left\{ \widetilde{X}_{i}(\lambda )\right\} \) satisfies the conditions of Bernstein’s inequality with \(\sigma ^{2}=\lambda \left( 1-\lambda \right) \), and \(E[X_{1}(\lambda )]=\lambda \). Therefore

$$\begin{aligned} \begin{aligned} \Upsilon (\lambda ,m,N)&\le \exp \left( -\frac{\left( m-N\lambda \right) ^{2}}{2\left( N\lambda \left( 1-\lambda \right) +\left( m-N\lambda \right) /3\right) }\right) \\ {}&\le \exp \left( -\frac{\left( m-N\lambda \right) ^{2}}{2\left( N\lambda +\left( m-N\lambda \right) /3\right) }\right) \, . \end{aligned} \end{aligned}$$
(18)

Proceeding to claim (2), let \(\left\{ Z_{i}(\lambda )\right\} \) be random variables for which

$$\begin{aligned} Z_{i}(\lambda )= {\left\{ \begin{array}{ll} 1 &{}x_{i}\ge Q(\lambda )\\ 0 &{}x_{i}<Q(\lambda ) \end{array}\right. }\, . \end{aligned}$$
(19)

Note that \(\left\{ Z_{i}\right\} \) are i.i.d. Then, since \(P\left( Z_{i}(\lambda )=1\right) \ge P\left( X_{i}(\lambda )=1\right) \),

$$\begin{aligned} \begin{aligned} P\left( V_{N}(m)<Q(\lambda )\right)&=P\left( \frac{1}{N}\sum _{i=1}^{N}Z_{i}(\lambda )<\frac{m}{N}\right) \le P\left( \frac{1}{N}\sum _{i=1}^{N}X_{i}(\lambda )\le \frac{m}{N}\right) \\ {}&=P\left( \frac{1}{N}\sum _{i=1}^{N}\widetilde{X}_{i}(\lambda )\le \frac{m}{N}-E[X_{1}(\lambda )]\right) \triangleq \hat{\Upsilon }(\lambda ,m,N) \end{aligned} \end{aligned}$$
(20)

and by symmetry

$$\begin{aligned} \begin{aligned} \hat{\Upsilon }(\lambda ,m,N)&\le \exp \left( -\frac{\left( N\lambda -m\right) ^{2}}{2\left( N\lambda \left( 1-\lambda \right) +\left( N\lambda -m\right) /3\right) }\right) \\ {}&\le \exp \left( -\frac{\left( N\lambda -m\right) ^{2}}{2\left( N\lambda +\left( N\lambda -m\right) /3\right) }\right) \, . \end{aligned} \end{aligned}$$
(21)

\(\quad \square \)

4.2 The Doubled Max-Q Algorithm

Here we improve on the previous algorithm by resorting to the doubling trick. The Doubled Max-Q Algorithm is based on the same principle as the Max-Q Algorithm. However, instead of observing one sample at each time step, here the algorithm doubles the number of samples of the chosen arm. Consequently, the number of times at which the algorithm needs to choose an arm is roughly logarithmic compared to that under the previous algorithm, leading to a tighter bound. Algorithm 2 presents the proposed Doubled Max-Q algorithm.

Algorithm 2. The Doubled Max-Q algorithm.
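As above, since the pseudocode figure is not reproduced here, the following minimal sketch mirrors the Max-Q sketch of Sect. 4.1, changing only the constant (now \(L_D\) from Theorem 3 below) and the sampling rule, which doubles the chosen arm's sample count; it is an illustrative reconstruction, not the published pseudocode.

```python
import math
from bisect import insort

def doubled_max_q_sketch(arms, lam, eps, delta):
    """Illustrative sketch of the Doubled Max-Q algorithm (Algorithm 2)."""
    K = len(arms)
    # L_D as in Theorem 3: 6*ln(|K|*log2(20*lam*ln(1/delta)/eps^2)) + ln(1/delta)
    L_D = 6 * math.log(K * math.log2(20 * lam * math.log(1 / delta) / eps ** 2)) + math.log(1 / delta)
    stop_threshold = 10 * lam * L_D / eps ** 2    # step-5 threshold used in the proof of Theorem 3
    N0 = int(math.ceil(3 * L_D / lam)) + 1        # assumed initial sample size, as in the Max-Q sketch

    samples = [sorted(arms[k]() for _ in range(N0)) for k in range(K)]
    while True:
        V = []
        for s in samples:
            N = len(s)
            m = min(N, max(1, int(math.floor(lam * N - math.sqrt(3 * lam * N * L_D))) + 1))
            V.append(s[-m])                       # m-th largest sample of the arm
        k = max(range(K), key=lambda i: V[i])
        if len(samples[k]) > stop_threshold:
            return k
        for _ in range(len(samples[k])):          # doubling trick: double the chosen arm's sample count
            insort(samples[k], arms[k]())
```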

Theorem 3

For every \(\lambda \in (0,1)\), \(\epsilon \in (0,\lambda )\) and \(\delta \in (0,1)\), Algorithm 2 is \((\epsilon ,\delta )\)-correct with a sample complexity bound of

$$\begin{aligned} E[T]\le \sum _{k\in K}\frac{20\lambda L_D}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}+|K|+1, \end{aligned}$$
(22)

where \(L_D=6\ln \left( |K|\log _2\left( \frac{-20\lambda \ln (\delta )}{\epsilon ^2}\right) \right) -\ln \left( \delta \right) \) as defined in the algorithm.

Here, the upper bound is of the same order as the lower bound in Theorem 1, up to a double-logarithmic order.

The proof of Theorem 3 is established by some adjustments of the proof of Theorem 2.

Proof

As before, we denote the time step of the algorithm by t and the value of the counter C(k) at time step t by \(C^{t}(k)\), and we use the notation \(L_D'=L_D+\ln \left( \delta \right) \), writing \(x^*\) as shorthand for \(x^*_{\lambda }\). We note that here, at each time step, there may be more than a single sample, so T, the sample complexity, may differ from the final time step. Hence, here we denote the (random) final time step by \(T_D\). By the condition in step 5 of the algorithm, for every arm \(k\in K\), it follows that

$$\begin{aligned} C^{T_D-1}(k)\le \frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}. \end{aligned}$$
(23)

Note that since \(\frac{d}{dx}6\ln (x)=\frac{6}{x}\le 1\) for \(x\ge 6\), and since \(x_{0}>6\ln (x_{0})+1\) for \(x_{0}=20\), it is obtained that

$$\begin{aligned} \begin{aligned} L_D''&\triangleq |K|\log _2\left( \frac{-20\lambda \ln (\delta )}{\epsilon ^2}\right)&>6\ln \left( |K|\log _2\left( \frac{-20\lambda \ln (\delta )}{\epsilon ^2}\right) \right) +1=L_D'+1, \end{aligned} \end{aligned}$$

for \(L_D''\ge 20\). So, by the fact that \(T=\sum _{k\in K}\log _2\left( 2C^{T_D-1}(k)\right) \), for \(L_D''\ge 20\) it follows that

$$\begin{aligned} \begin{aligned} T&\le |K|\log _2\left( \frac{20\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}\right) < |K|\log _2\left( \frac{20\lambda \left( L_D''-\ln (\delta )\right) }{\epsilon ^2}\right) \\ {}&\le |K|\log _2\left( L_D''\right) +L_D''\le L_D''\left( \log _2\left( L_D''\right) +1\right) \le \frac{\left( L_D''\right) ^{2}}{2}=\frac{1}{2}e^{\frac{L_D'}{3}}. \end{aligned} \end{aligned}$$
(24)

Recall that \(x^*\) is shorthand for \(x^*_{\lambda }\). We now prove the \((\epsilon ,\delta )\)-correctness property of the algorithm. We let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it N times, and we assume w.l.o.g. that \(Q_{1}(\lambda )=x^*\). Then, for \(N\ge N_0\) and \(m=\lfloor \lambda N-\sqrt{3\lambda NL_D}\rfloor +1\), as stated in the algorithm, it follows by Lemma 1 that

$$\begin{aligned} P\left( V^{1}_{N}(m)<x^*\right) \le \delta e^{-L_D'} \end{aligned}$$
(25)

Hence, at every time step t, by Eqs. (24) and (25), by applying the union bound, for \(N_i=2^iN_0\) it follows that

$$\begin{aligned} \begin{aligned} P\left( V^{t,1}<x^*\right)&\le \sum _{i=0}^{\frac{1}{2}\exp \left( \frac{L_D'}{3}\right) }P\left( V^{1}_{N_i}(m)<x^*\right)&\le \delta e^{-\frac{2L_D'}{3}}. \end{aligned} \end{aligned}$$
(26)

where \(V^{t,k}\) stands for the value of \(V^{k}\) at time step t.

Now, we let \(k^*_{T_D}\) stand for the arm returned by the algorithm. Also, by Lemma 1, for \(N>\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}\), it follows that

$$\begin{aligned} P\left( V_N^k(m)>Q_{k}(\lambda -\epsilon )\right) \le \delta e^{-L_D'}. \end{aligned}$$
(27)

So, since the condition in step 5 implies that \(C(k^*_{T_D})>\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}\), it follows by Eq. (27) and the union bound that

$$\begin{aligned} P\left( V^{{T_D},k^*_{T_D}}>Q_{k^*_{T_D}}(\lambda -\epsilon )\right) \le \sum _{k\in K}\sum _{t=1}^{\frac{1}{2}\exp \left( \frac{L_D'}{3}\right) }\sum _{i=0}^{\frac{1}{2}\exp \left( \frac{L_D'}{3}\right) }\delta e^{-L_D'}=|K|\delta e^{-\frac{L_D'}{3}}. \end{aligned}$$
(28)

Also, by Eq. (26) and applying the union bound it follows that

$$\begin{aligned} P\left( V^{{T_D},1}< x^*\right) \le \sum _{t=1}^{\frac{1}{2}\exp \left( \frac{L_D'}{3}\right) }P\left( V^{t,1}<x^*\right) \le \delta e^{-\frac{L_D'}{3}} \end{aligned}$$
(29)

So, since by step 4 of the algorithm, \(V^{{T_D},k^*_{T_D}}\ge V^{{T_D},1}\), it follows by Eqs. (28) and (29) that

$$\begin{aligned} P\left( Q_{k^*_{T_D}}(\lambda -\epsilon )< x^*\right) \le P\left( V^{{T_D},k^*_{T_D}}>Q_{k^*_{T_D}}(\lambda -\epsilon )\right) +P\left( V^{{T_D},1}< x^*\right) < \delta \end{aligned}$$

Therefore, it follows that the algorithm returns an \(\epsilon \)-optimal arm with a probability larger than \(1-\delta \). So, it is \((\epsilon ,\delta )\)-correct.

For proving the bound on the expected sample complexity of the algorithm we define the following sets:

$$M(\epsilon )=\{k\in K|\varDelta _{k,\lambda }\le \epsilon \}\quad \text {and} \quad N(\epsilon )=\{k\in K|\varDelta _{k,\lambda }>\epsilon \}.$$

As before, we assume w.l.o.g. that \(Q_1(\lambda )=x^*\). For the case in which

$$\begin{aligned} E_{1}\triangleq \bigcap _{1\le t\le T}\left\{ V^{t,1}\ge x^*\right\} , \end{aligned}$$

occurs, for every arm \(k\in K\), a necessary condition for \(C^{T_D}(k)>N_{k,D}'\), where

$$N_{k,D}'\triangleq \min \left\{ N_i|N_i>\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\varDelta _{k,\lambda }^2},i\in \mathbb {N}\right\} $$

is

$$\begin{aligned} E_{k,D}\triangleq \left\{ V^k_{N_{k,D}'}(m_{k,D}')\ge x^*\right\} , \end{aligned}$$

where \(m_{k,D}'=\lfloor \lambda N_{k,D}'-\sqrt{3\lambda N_{k,D}'\left( L_D'-\ln (\delta )\right) }\rfloor +1\).

Then for

$$\varPhi _{k,D}(\epsilon )=\frac{20\lambda \left( L_D'-\ln (\delta )\right) }{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}$$

for \(k\in N(\epsilon )\) it follows that

$$N_{k,D}'\le \varPhi _{k,D}(\epsilon )$$

So, using the bound in Eq. (23) together with the fact that \(\sum _{k\in K}C^{{T_D}}(k)=2\sum _{k\in K}C^{{T_D}-1}(k)\) for the arms in the set \(M(\epsilon )\), using \(N_{k,D}'\) as a bound for the arms in the set \(N(\epsilon )\), and using the bound in Eq. (24), it is obtained that

$$\begin{aligned} \begin{aligned} E[T]\le&\left( 1-P\left( E_{1}\right) \right) e^{\frac{L_D'}{3}} \\ {}&+P\left( E_{1}\right) \sum _{k\in N(\epsilon )}\left( \left( 1-P\left( E_{k,D}|E_1\right) \right) \varPhi _{k,D}(\epsilon )+e^{\frac{L_D'}{3}}P\left( E_{k,D}|E_1\right) \right) \\ {}&+\sum _{k\in M(\epsilon )}\varPhi _{k,D}(\epsilon )+1, \end{aligned} \end{aligned}$$
(30)

But, by Eq. (26) it follows that

$$\begin{aligned} \begin{aligned} P\left( E_1\right) \ge 1-\sum _{t=1}^{\exp \left( \frac{L_D'}{3}\right) }P\left( V^{t,1}<x^*\right)&\ge 1-\delta e^{-\frac{2L_D'}{3}}e^{\frac{L_D'}{3}}&=1-\delta e^{\frac{-L_D'}{3}} \end{aligned} \end{aligned}$$
(31)

Also, since \(Q_k\left( \lambda -\sqrt{\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{N_{k,D}'}}\right) <x^*\) for \(k\in N(\epsilon )\), it follows by Lemma 1 that

$$\begin{aligned} P\left( E_{k,D}|E_1\right) P\left( E_1\right) <\delta e^{-L_D'},\quad \forall k\in N(\epsilon ) \end{aligned}$$
(32)

Therefore, by Eqs. (30), (31) and (32) and the definition of \(\varPhi _{k,D}(\epsilon )\), the bound on the sample complexity is obtained. \(\quad \square \)

5 Experiments

In this section we investigate numerically the Max-Q and the Doubled Max-Q algorithms presented in this paper and compare them with the QPAC algorithm presented in [14].

In Fig. 1, we present the average sample complexity over 10 runs vs. the quantile \(\lambda \) for \(\delta =0.01\) and various values of \(\epsilon \). As shown in Fig. 1, and detailed in Tables 1, 2, 3 and 4, the Max-Q and the Doubled Max-Q algorithms significantly outperform the QPAC algorithm. The arm distributions used here were uniform, each with support of length 1.

Table 1. \(\log _{10}\) of the average sample complexity of the Max-Q, the Doubled Max-Q and the QPAC algorithms. The number of arms was 10 and the averages were computed over 10 runs.
Fig. 1. The average sample complexity of the Max-Q, the Doubled Max-Q and the QPAC algorithms for various parameter settings. The number of arms was 10 and the averages were computed over 10 runs.

Table 2. \(\log _{10}\) of the average sample complexity of the Max-Q, the Doubled Max-Q and the QPAC algorithms. The number of arms was 10 and the averages were computed over 10 runs.
Table 3. \(\log _{10}\) of the average sample complexity of the Max-Q, the Doubled Max-Q and the QPAC algorithms. The number of arms was 10 and the averages were computed over 10 runs.
Table 4. \(\log _{10}\) of the average sample complexity of the Max-Q, the Doubled Max-Q and the QPAC algorithms. The number of arms was 10 and the averages were computed over 10 runs.

6 Conclusion

In this paper we studied the pure exploration problem where the goal is to find the arm with the maximal \(\lambda \)-quantile. Under the PAC framework, we provided a lower bound and algorithms that attain it up to a logarithmic term (for the first algorithm) and a double-logarithmic term (for the second algorithm).

A challenge for future work is closing the logarithmic gap between the lower and upper bounds.