Abstract
We propose a sequential learning policy for ranking and selection problems, where we use a non-parametric procedure for estimating the value of a policy. Our estimation approach aggregates over a set of kernel functions in order to achieve a more consistent estimator. Each element in the kernel estimation set uses a different bandwidth to achieve better aggregation. The final estimate uses a weighting scheme with the inverse mean square errors of the kernel estimators as weights. This weighting scheme is shown to be optimal under independent kernel estimators. For choosing the measurement, we employ the knowledge gradient policy that relies on predictive distributions to calculate the optimal sampling point. Our method allows a setting where the beliefs are expected to be correlated but the correlation structure is unknown beforehand. Moreover, the proposed policy is shown to be asymptotically optimal.
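The inverse-MSE aggregation idea can be sketched numerically. The following is our own minimal illustration, not the paper's exact estimator: each Nadaraya–Watson estimator uses a different bandwidth, its mean square error is approximated by leave-one-out cross-validation, and the final estimate weights each estimator by its inverse MSE. All function and variable names here are ours.

```python
import numpy as np

def gaussian_kernel(u):
    # standard Gaussian kernel (unnormalized; the normalization cancels)
    return np.exp(-0.5 * u ** 2)

def nw_estimate(x0, xs, ys, h):
    # Nadaraya-Watson kernel regression estimate at x0 with bandwidth h
    w = gaussian_kernel((xs - x0) / h)
    return np.sum(w * ys) / np.sum(w)

def loo_mse(xs, ys, h):
    # leave-one-out squared error as a proxy for the estimator's MSE
    errs = [
        (nw_estimate(xs[j], np.delete(xs, j), np.delete(ys, j), h) - ys[j]) ** 2
        for j in range(len(xs))
    ]
    return np.mean(errs)

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, 200)
ys = np.sin(2 * np.pi * xs) + rng.normal(0.0, 0.3, 200)

x0 = 0.5                                # point at which we aggregate
bandwidths = [0.02, 0.05, 0.1, 0.3]     # one estimator per bandwidth
estimates = np.array([nw_estimate(x0, xs, ys, h) for h in bandwidths])
mses = np.array([loo_mse(xs, ys, h) for h in bandwidths])

weights = (1.0 / mses) / np.sum(1.0 / mses)   # inverse-MSE weights
aggregate = np.dot(weights, estimates)        # final aggregated estimate
```

Since the weights are nonnegative and sum to one, the aggregate always lies between the smallest and largest individual kernel estimates.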
References
Agrawal, R.: The continuum-armed bandit problem. SIAM J. Control Optim. 33, 1926–1951 (1995)
Barton, R.R., Meckesheimer, M.: Chapter 18: Metamodel-based simulation optimization. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 535–574. Elsevier (2006)
Billingsley, P.: Probability and Measure, 3rd edn. Wiley-Interscience, New York (1995)
Branin, F.H.: Widely convergent method for finding multiple solutions of simultaneous nonlinear equations. IBM J. Res. Dev. 16, 504–522 (1972)
Bunea, F., Nobel, A.: Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inf. Theory 54, 1725–1735 (2008)
Chehrazi, N., Weber, T.A.: Monotone approximation of decision problems. Oper. Res. 58, 1158–1177 (2010)
Chick, S.E., Gans, N.: Economic analysis of simulation selection problems. Manag. Sci. 55, 421–437 (2009)
Cochran, W.G., Cox, G.M.: Experimental Designs. Wiley, New York (1957)
Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability 66. Chapman & Hall, London (1996)
Frazier, P.I., Powell, W.B., Dayanik, S.: A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47, 2410–2439 (2008)
Frazier, P.I., Powell, W.B., Dayanik, S.: The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21, 599–613 (2009)
Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) Computational Learning Theory. Lecture Notes in Computer Science, vol. 904. Springer, Berlin, Heidelberg (1995)
Fu, M.C.: Chapter 19: Gradient estimation. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 575–616. Elsevier (2006)
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis, 2nd edn. Texts in Statistical Science. Chapman & Hall/CRC, Boca Raton (2003)
George, A., Powell, W.B., Kulkarni, S.R.: Value function approximation using multiple aggregation for multiattribute resource management. J. Mach. Learn. Res. 9, 2079–2111 (2008)
Gibbs, M.: Bayesian Gaussian Processes for Regression and Classification. Ph.D. dissertation, University of Cambridge (1997)
Ginebra, J., Clayton, M.K.: Response surface bandits. J. R. Stat. Soc. Ser. B (Methodological) 57, 771–784 (1995)
Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Gani, J., Sarkadi, K., Vincze, I. (eds.) Progress in Statistics, pp. 241–266. North-Holland, Amsterdam (1974)
Gittins, J.C.: Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B (Methodological) 41, 148–177 (1979)
Gupta, S.S., Miescke, K.J.: Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Stat. Plan. Inference 54, 229–244 (1996). 40 Years of Statistical Selection Theory, Part I
Hardle, W.K.: Applied Nonparametric Regression. Cambridge University Press, Cambridge (1992)
Hardle, W.K., Muller, M., Sperlich, S., Werwatz, A.: Nonparametric and Semiparametric Models. Springer, Berlin (2004)
Huang, D., Allen, T.T., Notz, W.I., Zeng, N.: Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Glob. Optim. 34, 441–466 (2006)
Juditsky, A., Nemirovski, A.: Functional aggregation for nonparametric regression. Ann. Stat. 28, 681–712 (2000)
Kaelbling, L.P.: Learning in Embedded Systems. MIT Press, Cambridge (1993)
Kleinberg, R.: Nearly tight bounds for the continuum-armed bandit problem. In: Advances in Neural Information Processing Systems 17, MIT Press, pp. 697–704 (2005)
Mes, M.R., Powell, W.B., Frazier, P.I.: Hierarchical knowledge gradient for sequential sampling. J. Mach. Learn. Res. 12, 2931–2974 (2011)
Negoescu, D.M., Frazier, P.I., Powell, W.B.: The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23, 346–363 (2011)
Nelson, B.L., Swann, J., Goldsman, D., Song, W.: Simple procedures for selecting the best simulated system when the number of alternatives is large. Oper. Res. 49, 950–963 (2001)
Olafsson, S.: Chapter 21: Metaheuristics. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 633–654. Elsevier (2006)
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. Wiley, Hoboken (2007)
Powell, W.B., Ryzhov, I.: Optimal Learning. Wiley, Hoboken (2012)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Ryzhov, I., Powell, W., Frazier, P.: The knowledge gradient algorithm for a general class of online learning problems (2011)
Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley, New York (2003)
Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)
Villemonteix, J., Vazquez, E., Walter, E.: An informational approach to the global optimization of expensive-to-evaluate functions. J. Glob. Optim. 44, 509–534 (2009)
This research was supported in part by grant AFOSR-FA9550-05-1-0121 from the Air Force Office of Scientific Research.
Proofs
In this section, we provide the proofs of the propositions and lemmas used in the paper. For simplicity, when there is no risk of confusion, we write \(K(x,x^{\prime })\) for \(K_{i}(x,x^{\prime })\).
1.1 Proof of Proposition 1
Proof
Let \(\mathcal C \) be a generic subset of \(\mathcal K \). We first show that, for any such \(\mathcal C \), the posterior of \(\mu _{x}\) given \(\mu _{x}^{i,n}\) for all \(i\in \mathcal C \) is normal, with mean and precision given by,
Then, the proposition follows by letting \(\mathcal C =\mathcal K \).
We proceed by induction. For the base case \(\mathcal C =\emptyset \), the posterior is simply the prior \((\mu _{x}^{0},\beta _{x}^{0})\), and the above equation holds.
Now assume the proposed equations for the posterior distribution hold for all \(\mathcal C \) of size \(m\), and consider \(\mathcal C ^{\prime }=\mathcal C \cup \{j\}\) with \(m+1\) elements. By Bayes’ rule
where \(Y_{x}^{j}\) denotes the observations for kernel \(j\). Using the induction hypothesis,
By the independence assumption,
Combining \(\mathbb P _{C}(Y_{x}^{j}\in dh|\mu _{x}=u)\) and \(\mathbb P _{C}(\mu _{x}\in du)\), we obtain
This gives us the desired result. \(\square \)
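The update established above has the standard precision-weighted form for combining a prior with independent normal estimators: the posterior precision is the sum of the individual precisions, and the posterior mean is the precision-weighted average of the means. A small numerical sketch under the independence assumption (the values and variable names are hypothetical, ours rather than the paper's):

```python
import numpy as np

# Independent normal estimates of the same unknown mu_x:
# the prior (mu_0, beta_0) followed by three kernel estimates (mu_i, beta_i),
# where beta denotes precision (1 / variance).
means = np.array([0.0, 1.2, 0.8, 1.0])
precisions = np.array([0.5, 2.0, 4.0, 1.0])

# Posterior precision is the sum of precisions; posterior mean is the
# precision-weighted average of the means.
post_precision = precisions.sum()
post_mean = np.dot(precisions, means) / post_precision
```

Note that more precise estimators (here the third, with precision 4.0) pull the posterior mean toward their value, which is exactly the behavior the proposition formalizes.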
1.2 Proofs of Lemmas
This section contains the lemmas used for proving Theorem 1.
Lemma 1
For all \(x\in \mathcal X \), \(\limsup _{n}\max _{m\le n}\left| \mu _{x}^{0,m}\right| \) is finite almost surely (a.s.).
Proof
We fix \(x\in \mathcal X \). For each \(\omega \), we let \(N_{x}^{n}\left( \omega \right) \) denote the number of times we measure alternative \(x\) up to time period \(n\),
\(N_{x}^{n}(\omega )\) is an increasing sequence for all \(\omega \), so the limit \(N_{x}^{\infty }(\omega )=\lim _{n\rightarrow \infty }N_{x}^{n}(\omega )\) exists. We bound \(\left| \mu _{x}^{0,n}\right| \) from above by,
The ratio \(\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\) is bounded above by 1, and the first two terms are clearly finite, so we concentrate on the finiteness of the last term. Note that \(\frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\) has a standard normal distribution. Since normal random variables are almost surely finite, we let \(\varOmega _{0}\) be the almost sure event on which \(\left| y_{x}^{j}\right| \ne \infty \) for all \(j\in \mathbb N _{+}\). We further divide \(\varOmega _{0}\) into two sets,
where alternative \(x\) is measured finitely many times, and
where alternative \(x\) is measured infinitely often. We further define the event \(\mathcal H _{x}\) as
We will show that \(\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) =0\) and \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\) to conclude that \(\mathbb P \left( \mathcal H _{x}\right) =\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) +\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\).
For any \(\omega \in \hat{\varOmega }_{0}\cap \mathcal H _{x}\), let \(M_{x}(\omega )\) be the last time that \(x\) is measured, that is for all \(n_{1},n_{2}\ge M_{x}(\omega ),\,N_{x}^{n_{1}}(\omega )=N_{x}^{n_{2}}(\omega )\). Then, we have that
where \(M_{x}\left( \omega \right) <\infty \) by construction. However, this also implies that \(y_{x}^{j+1}=\infty \) or \(y_{x}^{j+1}=-\infty \) for at least one \(j\); therefore \(\omega \notin \hat{\varOmega }_{0}\), a contradiction. Hence, \(\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) =0\).
To show that \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\), we let \(J_{i}:=1_{\{x^{i}=x\}}\frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\) and recall that \(J_{i}\) has a standard normal distribution. We further define a subsequence \(G\left( \omega \right) \subset \mathbb N _{+}\) by,
and we let \(J^{*}:=\left( J_{i}\right) _{i\in G(\omega )}\). By construction, \(G\left( \omega \right) \) has countably infinitely many elements for all \(\omega \in \hat{\varOmega }_{0}^{C}\). Here, we make use of a version of the law of the iterated logarithm [3], which states that,
where \(\bar{Z}_{n}=\sum _{j=1}^{n}z_{j}/n\) and the \(z_{j}\) are i.i.d. random variables with zero mean and unit variance. We let \(\varOmega _{1}\) be the almost sure set on which this law holds for \(\bar{Z}_{n}=J_{n}^{*}\), and the proof follows by noting that \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x} \cap \varOmega _{1}\right) =0\). \(\square \)
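The law of the iterated logarithm invoked here implies in particular that the sample mean of i.i.d. standard normals shrinks at rate \(\sqrt{2\log \log n/n}\). A quick single-path simulation sketch of this envelope (our own illustration, not part of the proof; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
z = rng.standard_normal(n)                       # i.i.d., mean 0, variance 1
running_mean = np.cumsum(z) / np.arange(1, n + 1)

# LIL envelope sqrt(2 log log n / n); well defined for n >= 3, where log log n > 0
ns = np.arange(3, n + 1)
envelope = np.sqrt(2.0 * np.log(np.log(ns)) / ns)
```

On a typical path, the running mean is eventually dominated by the slowly shrinking envelope, which is what bounds the partial sums in the argument above.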
Lemma 2
Assume that we have a prior on each point \(\left( \beta _{x}^{0}>0,\forall x\in \mathcal X \right) \). Then for any \(x,x^{\prime }\in \mathcal X \) and \(k_{i}\in \mathcal K \), the following are finite a.s.: \(\sup _{n}\left| \mu _{x}^{i,n}\right| \), \(\sup _{n}\left| a_{x^{\prime }}^{n}(x)\right| \) and \(\sup _{n}\left| b_{x^{\prime }}^{n}(x)\right| \).
Proof
For any \(x\in \mathcal X \), \(k_{i}\in \mathcal K \) and \(n\in \mathbb N \), let \(p_{x^{\prime }}^{i,n}=\frac{\beta _{x^{\prime }}^{n}K_{i}(x,x^{\prime })}{\sum _{j=1}^{M} \beta _{x_{j}}^{n}K_{i}(x,x_{j})}\). Clearly, for any \(x^{\prime }\in \mathcal X \), all \(p_{x^{\prime }}^{i,n}\ge 0\) and \(\sum _{x^{\prime }\in \mathcal X }p_{x^{\prime }}^{i,n}=1\). That is, for any \(n\), the \(p_{x^{\prime }}^{i,n}\) are convex combination weights over the \(\mu _{x^{\prime }}^{0,n}\). Then,
The last term is finite by Lemma 1.
To show the finiteness of \(\sup _{n}|a_{x^{\prime }}^{n}(x)|\), we note that \(a_{x^{\prime }}^{n}(x)\) is a linear combination of \(\mu _{x}^{i,n}\) and \(\mu _{x^{\prime }}^{i,n}\), where the weights for \(\mu _{x}^{i,n}\) are given by \(\left( 1-\frac{\beta _{x_{n}}^{\varepsilon }K(x,x_{n})}{A_{n+1} ^{i}(x,x_{n})}\right) \) and the weight for \(\mu _{x^{\prime }}^{i,n}\) is \(\sum _{{i}\in \mathcal K }w_{x}^{i,n+1}\frac{\beta _{x_{n}}^{\varepsilon } K(x,x_{n})}{A_{n+1}^{i}(x,x_{n})}\). These weights are between 0 and 1, and the finiteness follows.
To see that \(\sup _{n}|b_{x^{\prime }}^{n}(x)|\) is finite, first note that for any \({i}\in \mathcal K \) and any \(x,x^{\prime }\in \mathcal X \),
is an increasing sequence in \(n\). Trivially, \((\sigma _{x}^{n})^{2}=1/\beta _{x}^{n}\) is a decreasing sequence in \(n\). Then for any \(n\in \mathbb N \),
As \(b_{x^{\prime }}^{n}(x)\) is a convex combination of \(\tilde{\sigma }(x,x^{\prime },i)\) where the weights are given by \(w_{x}^{i,n}\), it follows that \(\sup _{n}|b_{x^{\prime }}^{n}(x)|\) is finite. \(\square \)
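The convex-combination structure used repeatedly in this proof can be sketched as follows. This is a hypothetical illustration with made-up numbers: `beta` plays the role of posterior precisions at the measured points, `K_row` the kernel evaluations \(K_{i}(x,x_{j})\), and `mu` the point estimates being combined.

```python
import numpy as np

def convex_weights(beta, K_row):
    # p_j proportional to beta_j * K(x, x_j), normalized to sum to one
    w = beta * K_row
    return w / w.sum()

beta = np.array([1.0, 2.0, 0.5])    # precisions at measured points (hypothetical)
K_row = np.array([0.9, 0.4, 0.1])   # kernel values K(x, x_j) (hypothetical)
mu = np.array([0.3, -0.2, 1.1])     # point estimates being combined (hypothetical)

p = convex_weights(beta, K_row)
estimate = np.dot(p, mu)            # a convex combination of the entries of mu
```

Because the weights are nonnegative and sum to one, the combined estimate is trapped between the minimum and maximum of the combined quantities, which is exactly why the suprema in Lemma 2 inherit finiteness from Lemma 1.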
Lemma 3
For any \(\omega \in \varOmega \), let \(\mathcal X ^{\prime }(\omega )\) be the random set of alternatives measured infinitely often by the KGNP policy. Fix \(\omega \in \varOmega \). For any \(x\notin \mathcal X ^{\prime }(\omega )\), let \(x^{\prime }\in \mathcal X \) be an alternative such that \(x^{\prime }\ne x\), \(K_{i}(x,x^{\prime })>0\) for at least one \(k_{i}\in \mathcal K \), and \(x^{\prime }\) is measured at least once. Assume also that \(\mu _{x}\ne \mu _{x^{\prime }}\). Then \(\liminf _{n}\left| \mu _{x}^{i,n}-\mu _{x}^{0,n}\right| >0\) a.s. In other words, the estimator using kernel \(k_{i}\) is almost surely biased.
Proof
As \(x\notin \mathcal X ^{\prime }\), there is some \(N<\infty \) such that \(\mu _{x}^{0,n}=\mu _{x}^{0,N}\) for all \(n\ge N\). As \(\mu _{x}^{0,N}=\frac{\mu _{x}^{0}+\sum _{m\le N}\beta _{x}^{\varepsilon }y_{x_{m}}1_{(x_{m}=x)}}{\beta _{x}^{0}+\sum _{m\le N}\beta _{x}^{\varepsilon }1_{(x_{m}=x)}}\) is a linear combination of the normal random variables \(\left( y_{x_{m}}\right) \), it is a continuous random variable.
As \(x^{\prime }\ne x\) is measured at least once and \(K_{i}(x,x^{\prime })>0\), \(\mu _{x}^{i,n}\) contains positively weighted \(\mu _{x^{\prime }}^{0,n}\) terms. Moreover, by the assumption \(\mu _{x^{\prime }}\ne \mu _{x}\), \(\mu _{x^{\prime }}^{0,n}\) is not perfectly correlated with \(\mu _{x}^{0,n}\). Then, as both are continuous random variables, the probability that \(\mu _{x}^{0,n}\) equals any cluster point of \(\mu _{x}^{i,n}\) is zero a.s. That is, \(\liminf _{n}\left| \mu _{x}^{i,n}-\mu _{x}^{0,n}\right| >0\). \(\square \)
Remark
If the \(\mu _{x}\) are generated from a continuously distributed prior (e.g. a normal distribution), then for all \(x\ne x^{\prime }\), \(\mathbb P (\mu _{x}\ne \mu _{x^{\prime }})=1\), and the assumption of the previous lemma holds almost surely.
Lemma 4
For any \(\omega \in \varOmega \), we let \(\mathcal X ^{\prime }(\omega )\) be the random set of alternatives measured infinitely often by the KGNP policy. For all \(x,x^{\prime }\in \mathcal X \), the following holds a.s.:
-
if \(x\in \mathcal X ^{\prime }\), then \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) and \(\lim _{n}b_{x}^{n}(x^{\prime })=0,\)
-
if \(x\notin \mathcal X ^{\prime }\), then \(\liminf _{n}b_{x}^{n}(x)>0.\)
Proof
We start with the first case, \(x\in \mathcal X ^{\prime }\). If \(K_{i}(x,x^{\prime })=0\) for all \({i}\in \mathcal K \), then \(b_{x^{\prime }}^{n}(x)=b_{x}^{n}(x^{\prime })=0\) for all \(n\) by definition, and taking \(n\rightarrow \infty \) gives the result.
If \(K_{i}(x,x^{\prime })>0\) for some \({i}\in \mathcal K \), showing \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) is equivalent to showing that for all \({i}\in \mathcal K \)
As noted previously, \(A_{n}^{i}(x,x^{\prime })\) is an increasing sequence. If \(x\in \mathcal X ^{\prime }\), then we also have that, \(\beta _{x}^{n}\rightarrow \infty \), and
Therefore \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) in this case as well. Showing \(\lim _{n}b_{x}^{n}(x^{\prime })=0\) reduces to showing that,
which also follows from the above.
Now consider the second result, where \(K_{i}(x,x^{\prime })>0\) for some \({i}\in \mathcal K \) and \(x\notin \mathcal X ^{\prime }\). By the definition of \(b_{x}^{n}(x)\),
For a given \(\omega \in \varOmega \), let \(N\) be the last time that alternative \(x\) is observed. Then, for all \(n\ge N\),
Recall that \((\sigma _{x}^{n})^{2}=1/\beta _{x}^{n}\) and \(\lambda _{x}=1/\beta _{x}^{\varepsilon }\), and that these terms are finite for a finitely sampled alternative. For \(\liminf _{n}b_{x}^{n}(x)>0\) to hold, we only need to show that the weight stays bounded away from 0, that is,
Almost sure finiteness of the numerator has been shown above, which means we only need to show that
First, we divide the set of kernels into two pieces. For \(\omega \in \varOmega \), let \(\mathcal K _{1}(\omega ,x)\) be the set of kernels \(k_{i}\) for which there is at least one \(x^{\prime }\in \mathcal X ^{\prime }(\omega )\) with \(K_{i}(x,x^{\prime })>0\). In other words, some infinitely often sampled point \(x^{\prime }\) close to our original point \(x\) influences the prediction. Let \(\mathcal K _{2}(\omega ,x)=\mathcal K \backslash \mathcal K _{1}(\omega ,x)\). Now, as all terms are positive,
For all \(k_{i^{\prime }}\in \mathcal K _{1}\), Lemma 3 gives \(\liminf _{n}\nu _{x}^{{i^{\prime }},n}>0\); hence, even if \(\liminf _{n}(\sigma _{x}^{{i^{\prime }},n})^{2}=0\), the limit supremum of the first term on the right is finite. Finally, for all \({i^{\prime }}\in \mathcal K _{2}\), since none of the points used by kernel \({i^{\prime }}\) to predict \(\mu _{x}\) are sampled infinitely often, letting
where \(N_{x}\) is the last time point \(x\) is sampled, we have \(N_{x}<\infty \). Then \(\beta _{x}^{n}\) is finite for all \(x\notin \mathcal X ^{\prime }(\omega )\) (and bounded above by \(N_{x}\max _{x\notin \mathcal X ^{\prime }}\beta _{x}^{\varepsilon }\)) and
where the last term does not depend on \(n\). Taking the limit supremum over \(n\) on both sides gives the final result. \(\square \)
Barut, E., Powell, W.B. Optimal learning for sequential sampling with non-parametric beliefs. J Glob Optim 58, 517–543 (2014). https://doi.org/10.1007/s10898-013-0050-5