Optimal learning for sequential sampling with non-parametric beliefs

Abstract

We propose a sequential learning policy for ranking and selection problems, where we use a non-parametric procedure for estimating the value of a policy. Our estimation approach aggregates over a set of kernel functions in order to achieve a more consistent estimator. Each element in the kernel estimation set uses a different bandwidth to achieve better aggregation. The final estimate uses a weighting scheme with the inverse mean square errors of the kernel estimators as weights. This weighting scheme is shown to be optimal under independent kernel estimators. For choosing the measurement, we employ the knowledge gradient policy that relies on predictive distributions to calculate the optimal sampling point. Our method allows a setting where the beliefs are expected to be correlated but the correlation structure is unknown beforehand. Moreover, the proposed policy is shown to be asymptotically optimal.
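
The aggregation idea can be illustrated with a short sketch (Python, for illustration only; this is not the paper's estimator). It combines Nadaraya-Watson estimators with several bandwidths and weights each by the inverse of an estimated mean squared error; the leave-one-out residual proxy for the MSE and the Gaussian kernel are assumptions made for the sketch.

import numpy as np

def nw_estimate(x_eval, x_obs, y_obs, h):
    # Nadaraya-Watson estimate at the points x_eval, Gaussian kernel of bandwidth h.
    w = np.exp(-0.5 * ((x_eval[:, None] - x_obs[None, :]) / h) ** 2)
    return (w @ y_obs) / w.sum(axis=1)

rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 1.0, 40)
y_obs = np.sin(2 * np.pi * x_obs) + rng.normal(0.0, 0.3, 40)   # noisy samples
x_grid = np.linspace(0.0, 1.0, 101)

bandwidths = [0.02, 0.05, 0.1, 0.2]
estimates = np.array([nw_estimate(x_grid, x_obs, y_obs, h) for h in bandwidths])

# Proxy for each estimator's MSE: leave-one-out squared residuals at the samples.
mse = []
for h in bandwidths:
    resid = [
        (y_obs[j] - nw_estimate(x_obs[j:j + 1],
                                np.delete(x_obs, j),
                                np.delete(y_obs, j), h)[0]) ** 2
        for j in range(x_obs.size)
    ]
    mse.append(np.mean(resid))

weights = 1.0 / np.array(mse)
weights /= weights.sum()            # inverse-MSE weights, normalized to sum to one
aggregate = weights @ estimates     # weighted combination of the kernel estimators
print(dict(zip(bandwidths, np.round(weights, 3))))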

References

  1. Agrawal, R.: The continuum-armed bandit problem. SIAM J. Control Optim. 33, 1926–1951 (1995)

  2. Barton, R.R., Meckesheimer, M.: Metamodel-based simulation optimization. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 535–574. Elsevier (2006)

  3. Billingsley, P.: Probability and Measure, 3rd edn. Wiley-Interscience, New York (1995)

  4. Branin, F.H.: Widely convergent method for finding multiple solutions of simultaneous nonlinear equations. IBM J. Res. Dev. 16, 504–522 (1972)

  5. Bunea, F., Nobel, A.: Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inf. Theory 54, 1725–1735 (2008)

  6. Chehrazi, N., Weber, T.A.: Monotone approximation of decision problems. Oper. Res. 58, 1158–1177 (2010)

  7. Chick, S.E., Gans, N.: Economic analysis of simulation selection problems. Manag. Sci. 55, 421–437 (2009)

  8. Cochran, W.G., Cox, G.M.: Experimental Designs. Wiley, New York (1957)

  9. Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability 66. Chapman & Hall, London (1996)

  10. Frazier, P.I., Powell, W.B., Dayanik, S.: A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47, 2410–2439 (2008)

  11. Frazier, P.I., Powell, W.B., Dayanik, S.: The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21, 599–613 (2009)

  12. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitanyi, P. (ed.) Computational Learning Theory. Lecture Notes in Computer Science, vol. 904. Springer, Berlin, Heidelberg (1995)

  13. Fu, M.C.: Gradient estimation. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 575–616. Elsevier (2006)

  14. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis, 2nd edn. Texts in Statistical Science. Chapman & Hall/CRC, Boca Raton (2003)

  15. George, A., Powell, W.B., Kulkarni, S.R.: Value function approximation using multiple aggregation for multiattribute resource management. J. Mach. Learn. Res. 9, 2079–2111 (2008)

  16. Gibbs, M.: Bayesian Gaussian Processes for Regression and Classification. Dissertation, University of Cambridge (1997)

  17. Ginebra, J., Clayton, M.K.: Response surface bandits. J. R. Stat. Soc. Ser. B (Methodological) 57, 771–784 (1995)

  18. Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Gani, J., Sarkadi, K., Vincze, I. (eds.) Progress in Statistics, pp. 241–266. North-Holland, Amsterdam (1974)

  19. Gittins, J.C.: Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B (Methodological) 41, 148–177 (1979)

  20. Gupta, S.S., Miescke, K.J.: Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Stat. Plan. Inference 54, 229–244 (1996). 40 Years of Statistical Selection Theory, Part I

  21. Härdle, W.K.: Applied Nonparametric Regression. Cambridge University Press, Cambridge (1992)

  22. Härdle, W.K., Müller, M., Sperlich, S., Werwatz, A.: Nonparametric and Semiparametric Models. Springer, Berlin (2004)

  23. Huang, D., Allen, T.T., Notz, W.I., Zeng, N.: Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Glob. Optim. 34, 441–466 (2006)

  24. Juditsky, A., Nemirovski, A.: Functional aggregation for nonparametric regression. Ann. Stat. 28, 681–712 (2000)

  25. Kaelbling, L.P.: Learning in Embedded Systems. MIT Press, Cambridge (1993)

  26. Kleinberg, R.: Nearly tight bounds for the continuum-armed bandit problem. In: Advances in Neural Information Processing Systems 17, MIT Press, pp. 697–704 (2005)

  27. Mes, M.R., Powell, W.B., Frazier, P.I.: Hierarchical knowledge gradient for sequential sampling. J. Mach. Learn. Res. 12, 2931–2974 (2011)

  28. Negoescu, D.M., Frazier, P.I., Powell, W.B.: The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23, 346–363 (2011)

  29. Nelson, B.L., Swann, J., Goldsman, D., Song, W.: Simple procedures for selecting the best simulated system when the number of alternatives is large. Oper. Res. 49, 950–963 (2001)

  30. Olafsson, S.: Metaheuristics. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation. Handbooks in Operations Research and Management Science, vol. 13, pp. 633–654. Elsevier (2006)

  31. Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. Wiley, Hoboken (2007)

  32. Powell, W.B., Ryzhov, I.: Optimal Learning. Wiley, Philadelphia (2012)

  33. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  34. Ryzhov, I., Powell, W., Frazier, P.: The knowledge gradient algorithm for a general class of online learning problems (2011)

  35. Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley, New York (2003)

  36. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)

  37. Villemonteix, J., Vazquez, E., Walter, E.: An informational approach to the global optimization of expensive-to-evaluate functions. J. Glob. Optim. 44, 509–534 (2009)

Author information

Corresponding author

Correspondence to Emre Barut.

Additional information

This research was supported in part by grant AFOSR-FA9550-05-1-0121 from the Air Force Office of Scientific Research.

Proofs

In this section, we provide the proofs for the propositions and the lemmas used in the paper. For simplicity, when there is no confusion, we use \(K(x,x^{\prime })\) to denote \(K_{i}(x,x^{\prime })\).

1.1 Proof of Proposition 1

Proof

Let \(\mathcal C \) be a generic subset of \(\mathcal K \). We first show that, for any such \(\mathcal C \), the posterior of \(\mu _{x}\) given \(\mu _{x}^{i,n}\) for all \(i\in \mathcal C \) is normal, with mean and precision given by

$$\begin{aligned} \mu _{x}^{C,n}&= \frac{1}{\beta _{x}^{C,n}}\left( \beta _{x}^{0}\mu _{x}^{0}+ \sum _{{i}\in \mathcal C }((\sigma _{x}^{i,n})^{2}+\nu _{x}^{i})^{-1} \mu _{x}^{i,n}\right) ,\\ \beta _{x}^{C,n}&= \beta _{x}^{0}+\sum _{{i}\in \mathcal C } ((\sigma _{x}^{i,n})^{2}+\nu _{x}^{i})^{-1}. \end{aligned}$$

Then, the proposition follows by letting \(\mathcal C =\mathcal K \).

We proceed by induction. For \(\mathcal C =\emptyset \), the posterior clearly coincides with the prior \((\mu _{x}^{0},\beta _{x}^{0})\), and the above equations hold.

Now, assume the proposed equations for the posterior distribution hold for all \(\mathcal C \) of size \(m\), and consider \(\mathcal C ^{\prime }\) with \(m+1\) elements (\(\mathcal C ^{\prime }=\mathcal C \cup \{{j}\})\). By Bayes’ rule

$$\begin{aligned} \mathbb P _{C^{\prime }}(\mu _{x}\in du)=\mathbb P _{C}(\mu _{x} \in du|Y_{x}^{j}=h)\propto \mathbb P _{C}(Y_{x}^{j}\in dh| \mu _{x}=u)\mathbb P _{C}(\mu _{x}\in du). \end{aligned}$$

Here \(Y_{x}^{j}\) stands for the observations for kernel \(j\). Using the induction hypothesis,

$$\begin{aligned} \mathbb P _{C}(\mu _{x}\in du)=\varphi ((u-\mu _{x}^{C,n})/\sigma _{x}^{C,n}). \end{aligned}$$

By the independence assumption,

$$\begin{aligned}&\mathbb P _{C}(Y_{x}^{j}\in dh|\mu _{x}=u) =\mathbb P (Y_{x}^{j}\in dh|\mu _{x}=u)\\&\quad =\int \limits _\mathbb{R }\mathbb P (Y_{x}^{j}\in dh|\mu _{x}^{j}=v)\mathbb P (\mu _{x}^{j}=v|\mu _{x}=u)dv\\&\quad \propto \int \limits _\mathbb{R }\varphi ((\mu _{x}^{j,n}-v)/ \sigma _{x}^{j,n})\varphi ((v-u)/\sqrt{\nu _{x}^{j}})dv\propto \varphi \left( \frac{\mu _{x}^{j,n}-u}{\sqrt{(\sigma _{x}^{j,n})^{2}+\nu _{x}^{j}}}\right) . \end{aligned}$$

Combining \(\mathbb P _{C}(Y_{x}^{j}\in dh|\mu _{x}=u)\) and \(\mathbb P _{C}(\mu _{x}\in du)\), we obtain

$$\begin{aligned} \mathbb P _{C^{\prime }}(\mu _{x}\in du)\propto \varphi \left( \frac{\mu _{x}^{j,n}-u}{\sqrt{(\sigma _{x}^{j,n})^{2} +\nu _{x}^{j}}}\right) \varphi ((u-\mu _{x}^{C,n})/\sigma _{x}^{C,n}) \propto \varphi ((u-\mu _{x}^{C^{\prime },n})/\sigma _{x}^{C^{\prime },n}). \end{aligned}$$
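
The last step uses the standard product-of-Gaussians identity (stated here with generic placeholder symbols \(a,b,s_{1},s_{2}\)): as functions of \(u\),

$$\begin{aligned} \varphi \left( \frac{a-u}{s_{1}}\right) \varphi \left( \frac{u-b}{s_{2}}\right) \propto \varphi \left( \frac{u-c}{s_{3}}\right) ,\qquad \frac{1}{s_{3}^{2}}=\frac{1}{s_{1}^{2}}+\frac{1}{s_{2}^{2}},\qquad c=s_{3}^{2}\left( \frac{a}{s_{1}^{2}}+\frac{b}{s_{2}^{2}}\right) . \end{aligned}$$

With \(a=\mu _{x}^{j,n}\), \(s_{1}^{2}=(\sigma _{x}^{j,n})^{2}+\nu _{x}^{j}\), \(b=\mu _{x}^{C,n}\) and \(s_{2}^{2}=(\sigma _{x}^{C,n})^{2}\), this is precisely the precision-additive update stated at the beginning of the proof.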

This gives us the desired result. \(\square \)
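
As a quick numerical sanity check of the aggregation formula (a minimal sketch; the prior, the kernel estimates \(\mu _{x}^{i,n}\), their variances \((\sigma _{x}^{i,n})^{2}\) and the bias terms \(\nu _{x}^{i}\) below are arbitrary made-up values):

import numpy as np

beta0, mu0 = 1.0, 0.0                  # prior precision and mean at alternative x
mu_i  = np.array([1.2, 0.9, 1.5])      # kernel estimates mu_x^{i,n}
var_i = np.array([0.30, 0.10, 0.50])   # variances (sigma_x^{i,n})^2
nu_i  = np.array([0.05, 0.20, 0.10])   # bias terms nu_x^i

prec_i = 1.0 / (var_i + nu_i)          # ((sigma_x^{i,n})^2 + nu_x^i)^{-1}
beta_C = beta0 + prec_i.sum()          # aggregated precision beta_x^{C,n}
mu_C   = (beta0 * mu0 + (prec_i * mu_i).sum()) / beta_C   # aggregated mean mu_x^{C,n}
print(beta_C, mu_C)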

1.2 Proofs of Lemmas

This section contains the lemmas used for proving Theorem 1.

Lemma 1

For all \(x\in \mathcal X ,\lim \sup _{n}\max _{m\le n}\left| \mu _{x}^{0,m}\right| \) is finite almost surely (a.s.).

Proof

We fix \(x\in \mathcal X \). For each \(\omega \), we let \(N_{x}^{n}\left( \omega \right) \) denote the number of times we measure alternative \(x\) until time period \(n\),

$$\begin{aligned} N_{x}^{n}(\omega )=\sum _{m\le n-1}1_{\{x^{m}=x\}}. \end{aligned}$$

\(N_{x}^{n}(\omega )\) is an increasing sequence for all \(\omega \) and the limit \(N_{x}^{\infty }(\omega )=\lim _{n\rightarrow \infty }N_{x}^{n}(\omega )\) exists. We bound \(\left| \mu _{x}^{0,n}\right| \) above by,

$$\begin{aligned}&\left| \mu _{x}^{0,n}\right| \le \frac{\beta _{x}^{0}}{\beta _{x}^{n}}\left| \mu _{x}^{0,0}\right| +\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\left| \frac{\sum _{j=1}^{n-1}1_{\{x^{j}=x\}}y_{x}^{j+1}}{N_{x}^{n}(\omega )}\right| \\&\quad \le \frac{\beta _{x}^{0}}{\beta _{x}^{n}}\left| \mu _{x}^{0,0}\right| +\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\left| \mu _{x}\right| +\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\left| \frac{\sum _{j=1}^{n-1}1_{\{x^{j}=x\}}y_{x}^{j+1}-N_{x}^{n}(\omega ) \mu _{x}}{N_{x}^{n}(\omega )}\right| \\&\quad =\frac{\beta _{x}^{0}}{\beta _{x}^{n}}\left| \mu _{x}^{0,0}\right| +\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\left| \mu _{x}\right| +\frac{\lambda _{x}\left( \beta _{x}^{n}-\beta _{x}^{0}\right) }{\beta _{x}^{n}} \left| \sum _{j=1}^{n-1}1_{\{x^{j}=x\}}\frac{\left( y_{x}^{j+1}- \mu _{x}\right) }{\lambda _{x}}\right| . \end{aligned}$$

\(\frac{\beta _{x}^{n}-\beta _{x}^{0}}{\beta _{x}^{n}}\) is bounded above by 1, and the first two terms are clearly finite, therefore we only concentrate on the finiteness of the last term. Note that \(\frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\) has a standard normal distribution. As the normal distribution has finite mean, we let \(\varOmega _{0}\) be the almost sure event where \(\left| y_{x}^{j}\right| \ne \infty \) for all \(j\in \mathbb N _{+}\). We further divide \(\varOmega _{0}\) into two sets,

$$\begin{aligned} \hat{\varOmega }_{0}=\left\{ \omega \in \varOmega _{0}:N_{x}^{\infty }\left( \omega \right) <\infty \right\} , \end{aligned}$$

where alternative \(x\) is measured finitely many times, and

$$\begin{aligned} \hat{\varOmega }_{0}^{C}=\varOmega _{0}\backslash \hat{\varOmega }_{0} =\left\{ \omega \in \varOmega _{0}:N_{x}^{\infty }\left( \omega \right) =\infty \right\} \end{aligned}$$

where alternative \(x\) is measured infinitely often. We further define the event \(\mathcal H _{x}\) as

$$\begin{aligned} \mathcal H _{x}=\left\{ \omega \in \varOmega _{0}:\lim \sup _{n}\max _{m\le n}\left| \mu _{x}^{0,m}\right| =\infty \right\} \!. \end{aligned}$$

We will show that \(\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) =0\) and \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\) to conclude that \(\mathbb P \left( \mathcal H _{x}\right) =\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) +\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\).

For any \(\omega \in \hat{\varOmega }_{0}\cap \mathcal H _{x}\), let \(M_{x}(\omega )\) be the last time that \(x\) is measured, that is for all \(n_{1},n_{2}\ge M_{x}(\omega ),\,N_{x}^{n_{1}}(\omega )=N_{x}^{n_{2}}(\omega )\). Then, we have that

$$\begin{aligned} \sum _{j=1}^{M_{x}(\omega )}\lambda _{x}1_{\{x^{j}=x\}}\left| \frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\right|&= \lim \sup _{n}\max _{m\le n}\sum _{j=1}^{M_{x}(\omega )}\lambda _{x}1_{\{x^{j}=x\}}\left| \frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\right| \\&= \lim \sup _{n}\max _{m\le n}\sum _{j=1}^{m}\lambda _{x}1_{\{x^{j}=x\}}\left| \frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\right| \\&\ge \lim \sup _{n}\max _{m\le n}\left| \sum _{j=1}^{m}\lambda _{x}1_{\{x^{j}=x\}}\frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\right| \\&\ge \lim \sup _{n}\max _{m\le n}\left| \mu _{x}^{0,m}\right| =\infty , \end{aligned}$$

where \(M_{x}\left( \omega \right) <\infty \) by construction. However, this also implies that \(y_{x}^{j+1}=\infty \) or \(y_{x}^{j+1}=-\infty \) for at least one \(j\), therefore \(\omega \notin \hat{\varOmega }_{0}\) and we get a contradiction. Then, \(\mathbb P \left( \hat{\varOmega }_{0}\cap \mathcal H _{x}\right) =0\).

To show that \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x}\right) =0\), we let \(J_{j}:=1_{\{x^{j}=x\}}\frac{\left( y_{x}^{j+1}-\mu _{x}\right) }{\lambda _{x}}\) and note that \(J_{j}\) has a standard normal distribution whenever \(x^{j}=x\). We further define a subsequence \(G\left( \omega \right) \subset \mathbb N _{+}\) by,

$$\begin{aligned} G\left( \omega \right) :=\left\{ j\in \mathbb N _{+}:1_{\{x^{j}=x\}}=1\right\} , \end{aligned}$$

and we let \(J^{*}:=\left( J_{j}\right) _{j\in G(\omega )}\). By construction, \(G\left( \omega \right) \) has countably infinitely many elements for all \(\omega \in \hat{\varOmega }_{0}^{C}\). Here, we make use of a version of the law of the iterated logarithm [3], which states that,

$$\begin{aligned} \lim \sup _{n}\max _{m\le n}\left| \bar{Z}_{n}\right| <\infty \,(a.s.), \end{aligned}$$

where \(\bar{Z}_{n}=\sum _{j=1}^{n}z_{j}/n\) and the \(z_{j}\) are i.i.d. random variables with zero mean and variance 1. We let \(\varOmega _{1}\) be the almost sure set where this law holds for the sequence \(z_{j}=J_{j}^{*}\), and the proof follows by noting that \(\mathbb P \left( \hat{\varOmega }_{0}^{C}\cap \mathcal H _{x} \cap \varOmega _{1}\right) =0\). \(\square \)
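
The boundedness of the running maximum of sample means used above can be illustrated numerically (an informal sketch with i.i.d. standard normal draws; it is not part of the argument):

import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
running_mean = np.cumsum(z) / np.arange(1, z.size + 1)        # Z_bar_m for m = 1..n
running_max = np.maximum.accumulate(np.abs(running_mean))     # max_{m <= n} |Z_bar_m|
print(running_max[[99, 999, 9_999, 99_999]])                  # stays bounded as n grows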

Lemma 2

Assume that we have a prior on each point \(\left( \beta _{x}^{0}>0,\ \forall x\in \mathcal X \right) \). Then, for any \(x,x^{\prime }\in \mathcal X \) and \(k_{i}\in \mathcal K \), the following are finite a.s.: \(\sup _{n}\left| \mu _{x}^{i,n}\right| ,\,\sup _{n}\left| a_{x^{\prime }}^{n}(x)\right| \) and \(\sup _{n}\left| b_{x^{\prime }}^{n}(x)\right| \).

Proof

For any \(x\in \mathcal X ,k_{i}\in \mathcal K \) and \(n\in \mathbb N \), let \(p_{x^{\prime }}^{i,n}=\frac{\beta _{x^{\prime }}^{n}K_{i}(x,x^{\prime })}{\sum _{j=1}^{M} \beta _{x_{j}}^{n}K_{i}(x,x_{j})}\). Clearly, for any \(x^{\prime }\in \mathcal X \), \(p_{x^{\prime }}^{i,n}\ge 0\) and \(\sum _{x^{\prime }\in \mathcal X }p_{x^{\prime }}^{i,n}=1\). That is, for any \(x\) and \(n\), the \(p_{x^{\prime }}^{i,n}\) are the weights of a convex combination of the \(\mu _{x^{\prime }}^{0,n}\). Then,

$$\begin{aligned} \sup _{n}|\mu _{x}^{i,n}|=\sup _{n}\left| \frac{\sum _{j=1}^{M} \beta _{x_{j}}^{n}K_{i}(x,x_{j})\mu _{x_{j}}^{0,n}}{\sum _{j=1}^{M} \beta _{x_{j}}^{n}K_{i}(x,x_{j})}\right| =\sup _{n}\left| \sum _{x^{\prime }\in \mathcal X }p_{x^{\prime }}^{i,n}\mu _{x^{\prime }}^{0,n}\right| \le \sup _{n,x^{\prime }}|\mu _{x^{\prime }}^{0,n}|. \end{aligned}$$

The last term is finite by Lemma 1.
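
The convex-combination bound used here can also be checked with a small sketch (the precisions \(\beta _{x_{j}}^{n}\), kernel values \(K_{i}(x,x_{j})\) and point estimates \(\mu _{x_{j}}^{0,n}\) below are made-up illustrative numbers):

import numpy as np

beta = np.array([2.0, 5.0, 1.0, 3.0])   # precisions beta_{x_j}^n (illustrative)
kern = np.array([0.8, 0.4, 0.1, 0.0])   # kernel values K_i(x, x_j) (illustrative)
mu0  = np.array([1.0, -0.5, 2.0, 0.3])  # point estimates mu_{x_j}^{0,n}

p = beta * kern / np.sum(beta * kern)   # convex-combination weights p_{x_j}^{i,n}
mu_i = p @ mu0                          # kernel estimate mu_x^{i,n}
assert p.min() >= 0 and abs(p.sum() - 1) < 1e-12
assert abs(mu_i) <= np.max(np.abs(mu0)) # the bound used in the proof
print(p, mu_i)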

To show the finiteness of \(\sup _{n}|a_{x^{\prime }}^{n}(x)|\), we note that \(a_{x^{\prime }}^{n}(x)\) is a linear combination of \(\mu _{x}^{i,n}\) and \(\mu _{x^{\prime }}^{i,n}\), where the weights for \(\mu _{x}^{i,n}\) are given by \(\left( 1-\frac{\beta _{x_{n}}^{\varepsilon }K(x,x_{n})}{A_{n+1} ^{i}(x,x_{n})}\right) \) and the weight for \(\mu _{x^{\prime }}^{i,n}\) is \(\sum _{{i}\in \mathcal K }w_{x}^{i,n+1}\frac{\beta _{x_{n}}^{\varepsilon } K(x,x_{n})}{A_{n+1}^{i}(x,x_{n})}\). These weights are between 0 and 1, and the finiteness follows.

To see that \(\sup _{n}|b_{x^{\prime }}^{n}(x)|\) is finite, first note that for any \({i}\in \mathcal K \) and any \(x,x^{\prime }\in \mathcal X \),

$$\begin{aligned} A_{n+1}^{i}(x,x^{\prime })=\sum _{\hat{x}\in \mathcal X } \beta _{\hat{x}}^{n}K(x,\hat{x})+\beta _{x^{\prime }}^{\varepsilon }K(x,x^{\prime }), \end{aligned}$$

is an increasing sequence in \(n\). And trivially, \((\sigma _{x}^{n})^{2}=1/\beta _{x}^{n}\) is a decreasing sequence in \(n\). Then for any \(n\in \mathbb N \),

$$\begin{aligned} \tilde{\sigma }(x,x^{\prime },i)_{n}=\sqrt{((\sigma _{x^{\prime }}^{n})^{2}+\lambda _{x^{\prime }})} \frac{\beta _{x^{\prime }}^{\varepsilon }K(x,x^{\prime })}{A_{n}^{i}(x,x^{\prime })} \le \tilde{\sigma }(x,x^{\prime },i)_{0}<\infty . \end{aligned}$$

As \(b_{x^{\prime }}^{n}(x)\) is a convex combination of \(\tilde{\sigma }(x,x^{\prime },i)\) where the weights are given by \(w_{x}^{i,n}\), it follows that \(\sup _{n}|b_{x^{\prime }}^{n}(x)|\) is finite. \(\square \)

Lemma 3

For any \(\omega \in \varOmega \), we let \(\mathcal X ^{\prime }(\omega )\) be the random set of alternatives measured infinitely often by the KGNP policy. Fix \(\omega \in \varOmega \). Then, for any \(x\notin \mathcal X ^{\prime }(\omega )\), let \(x^{\prime }\in \mathcal X \) be an alternative such that \(x^{\prime }\ne x\), \(K_{i}(x,x^{\prime })>0\) for at least one \(k_{i}\in \mathcal K \), and \(x^{\prime }\) is measured at least once. Assume also that \(\mu _{x}\ne \mu _{x^{\prime }}\). Then \(\liminf _{n}\left| \mu _{x}^{i,n}-\mu _{x}^{0,n}\right| >0\) a.s.; in other words, the estimator using kernel \(k_{i}\) is biased almost surely.

Proof

As \(x\notin \mathcal X ^{\prime }\), there is some \(N<\infty \) such that \(\mu _{x}^{0,n}=\mu _{x}^{0,N}\) for all \(n\ge N\). Since \(\mu _{x}^{0,N}=\frac{\beta _{x}^{0}\mu _{x}^{0}+\sum _{m\le N}\beta _{x}^{\varepsilon }y_{x_{m}}1_{(x_{m}=x)}}{\beta _{x}^{0}+\sum _{m\le N}\beta _{x}^{\varepsilon }1_{(x_{m}=x)}}\), it is a linear combination of the normal random variables \(\left( y_{x_{m}}\right) \) and is therefore a continuous random variable.

As \(x^{\prime }\ne x\) is measured at least once and \(K_{i}(x,x^{\prime })>0\), \(\mu _{x}^{i,n}\) contains positively weighted \(\mu _{x^{\prime }}^{0,n}\) terms. Moreover, by the assumption \(\mu _{x^{\prime }}\ne \mu _{x}\), \(\mu _{x^{\prime }}^{0,n}\) is not perfectly correlated with \(\mu _{x}^{0,n}\). Then, as both are continuous random variables, the probability that \(\mu _{x}^{0,n}\) equals any cluster point of \(\mu _{x}^{i,n}\) is zero a.s. That is, \(\liminf _{n}\left| \mu _{x}^{i,n}-\mu _{x}^{0,n}\right| >0\). \(\square \)

Remark

If the \(\mu _{x}\) are generated from a continuously distributed prior (e.g., a normal distribution), then for all \(x\ne x^{\prime }\), \(\mathbb P (\mu _{x}\ne \mu _{x^{\prime }})=1\) and the assumption of the previous lemma holds almost surely.

Lemma 4

For any \(\omega \in \varOmega \), we let \(\mathcal X ^{\prime }(\omega )\) be the random set of alternatives measured infinitely often by the KGNP policy. For all \(x,x^{\prime }\in \mathcal X \), the following holds a.s.:

  • if \(x\in \mathcal X ^{\prime }\), then \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) and \(\lim _{n}b_{x}^{n}(x^{\prime })=0\);

  • if \(x\notin \mathcal X ^{\prime }\), then \(\liminf _{n}b_{x}^{n}(x)>0\).

Proof

We start by considering the first case, \(x\in \mathcal X ^{\prime }\). If \(K_{i}(x,x^{\prime })=0\) for all \({i}\in \mathcal K \), then \(b_{x^{\prime }}^{n}(x)=b_{x}^{n}(x^{\prime })=0\) for all \(n\) by definition, and taking \(n\rightarrow \infty \) gives the result.

If \(K_{i}(x,x^{\prime })>0\) for some \({i}\in \mathcal K \), showing \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) is equivalent to showing that for all \({i}\in \mathcal K \),

$$\begin{aligned} \tilde{\sigma }(x,x^{\prime },i)=\sqrt{((\sigma _{x^{\prime }}^{n})^{2} +\lambda _{x^{\prime }})}\frac{\beta _{x^{\prime }}^{\varepsilon }K(x,x^{\prime })}{A_{n+1}^{i}(x,x^{\prime })}\longrightarrow 0. \end{aligned}$$

As noted previously, \(A_{n}^{i}(x,x^{\prime })\) is an increasing sequence. If \(x\in \mathcal X ^{\prime }\), then we also have that \(\beta _{x}^{n}\rightarrow \infty \), and

$$\begin{aligned} \frac{1}{A_{n+1}^{i}(x,x^{\prime })}\le \frac{1}{\beta _{x}^{n}K(x,x^{\prime })}\longrightarrow 0. \end{aligned}$$

Therefore \(\lim _{n}b_{x^{\prime }}^{n}(x)=0\) in this case as well. Showing \(\lim _{n}b_{x}^{n}(x^{\prime })=0\) reduces to showing that

$$\begin{aligned} \frac{1}{A_{n+1}^{i}(x^{\prime },x)}\longrightarrow 0, \end{aligned}$$

which also follows from the above.

Now consider the second result, where \(K_{i}(x,x^{\prime })>0\) for some \({i}\in \mathcal K \) and \(x\notin \mathcal X ^{\prime }\). By the definition of \(b_{x}^{n}(x)\),

$$\begin{aligned} b_{x}^{n}(x)\ge w_{x}^{0,n+1}\tilde{\sigma }(x,x,0)= w_{x}^{0,n+1}\sqrt{((\sigma _{x}^{n})^{2}+\lambda _{x})} \frac{\beta _{x}^{\varepsilon }}{\beta _{x}^{n}+\beta _{x}^{\varepsilon }K(x,x)}. \end{aligned}$$

For a given \(\omega \in \varOmega \), let \(N\) be the last time that alternative \(x\) is observed. Then, for all \(n\ge N\),

$$\begin{aligned} \beta _{x}^{n}=\beta _{x}^{N}\le \beta _{x}^{0} +N\beta _{x}^{\varepsilon }<\infty . \end{aligned}$$

Recall that \((\sigma _{x}^{n})^{2}=1/\beta _{x}^{n}\) and \(\lambda _{x}=1/\beta _{x}^{\varepsilon }\), and that these terms are finite for a finitely sampled alternative. For \(\liminf _{n}b_{x}^{n}(x)>0\) to hold, we only need to show that the weight stays bounded away from 0, that is,

$$\begin{aligned} \liminf _{n}w_{x}^{0,n}=\liminf _{n}\left( \frac{(( \sigma _{x}^{0,n})^{2})^{-1}}{\sum _{{i^{\prime }}\in \mathcal K } ((\sigma _{x}^{i^{\prime },n})^{2}+\nu _{x}^{i^{\prime },n})^{-1}}\right) >0. \end{aligned}$$

Almost sure finiteness of the numerator has been shown above, and it is bounded below by the prior precision \(\beta _{x}^{0}>0\); hence we only need to show that

$$\begin{aligned} \limsup _{n}\sum _{{i^{\prime }}\in \mathcal K }((\sigma _{x}^{i^{\prime },n})^{2} +\nu _{x}^{i^{\prime },n})^{-1}<\infty . \end{aligned}$$

First we divide the set of kernels into two pieces. For \(\omega \in \varOmega \), let \(\mathcal K _{1}(\omega ,x)\) be the set of kernels for which there is at least one \(x^{\prime }\in \mathcal X ^{\prime }\) such that \(K_{i}(x,x^{\prime })>0\). In other words, there is at least one infinitely often sampled point (\(x^{\prime }\)) close to our original point (\(x\)) that has influence on the prediction. Let \(\mathcal K _{2}(\omega ,x)=\mathcal K \backslash \mathcal K _{1}\). Now, as all terms are positive,

$$\begin{aligned}&\limsup _{n}\sum _{{i^{\prime }}\in \mathcal K }((\sigma _{x}^{i^{\prime },n})^{2} +\nu _{x}^{i^{\prime },n})^{-1}\le \limsup _{n}\sum _{{i^{\prime }}\in \mathcal K _{1}} ((\sigma _{x}^{i^{\prime },n})^{2}+\nu _{x}^{i^{\prime },n})^{-1}\\&+\limsup _{n} \sum _{{i^{\prime }}\in \mathcal K _{2}}((\sigma _{x}^{i^{\prime },n}) ^{2}+\nu _{x}^{i^{\prime },n})^{-1}. \end{aligned}$$

For all \(k_{i^{\prime }}\in \mathcal K _{1}\), Lemma 3 gives \(\liminf _{n}\nu _{x}^{{i^{\prime }},n}>0\); hence, even if \(\liminf _{n}(\sigma _{x}^{{i^{\prime }},n})^{2}=0\), the limit supremum of the first term on the right is finite. Finally, for all \({i^{\prime }}\in \mathcal K _{2}\), none of the points that kernel \(k_{i^{\prime }}\) uses to predict \(\mu _{x}\) are sampled infinitely often. Letting

$$\begin{aligned} N_{X}=\max _{x\notin \mathcal X ^{\prime }}N_{x}, \end{aligned}$$

where \(N_{x}\) is the last time that point \(x\) is sampled, we have \(N_{X}<\infty \). Then, \(\beta _{x}^{n}\) is finite for all \(x\notin \mathcal X ^{\prime }(\omega )\) (and bounded above by \(N_{X}(\max _{x\notin \mathcal X ^{\prime }}\beta _{x}^{\varepsilon })\)) and

$$\begin{aligned} \sum _{{i}\in \mathcal K _{2}}((\sigma _{x}^{i,n})^{2}+\nu _{x}^{i,n})^{-1}&\le \sum _{{i}\in \mathcal K _{2}}((\sigma _{x}^{i,n})^{2})^{-1}\\&\le \sum _{{i}\in \mathcal K _{2}}\frac{(\sum _{x^{\prime }\in \mathcal X } \beta _{x^{\prime }}^{n}K_{i}(x,x^{\prime }))^{2}}{\sum _{x^{\prime }\in \mathcal X }\beta _{x^{\prime }}^{n}K_{i} (x,x^{\prime })^{2}}\\&\le \sum _{{i}\in \mathcal K _{2}}\frac{(\sum _{x^{\prime }\in \mathcal X }N_{X} (\max _{x\notin \mathcal X ^{\prime }}\beta _{x}^{\varepsilon })K_{i}(x,x^{\prime }))^{2}}{\sum _{x^{\prime }\in \mathcal X }N_{X}(\max _{x\notin \mathcal X ^{\prime }} \beta _{x}^{\varepsilon })K_{i}(x,x^{\prime })^{2}}<\infty \end{aligned}$$

where the last term does not depend on \(n\). Taking the limit supremum over \(n\) on both sides gives the final result. \(\square \)

Cite this article

Barut, E., Powell, W.B. Optimal learning for sequential sampling with non-parametric beliefs. J Glob Optim 58, 517–543 (2014). https://doi.org/10.1007/s10898-013-0050-5
