Abstract
This paper develops a strategic model for optimizing click-through rates (CTR) in profitable recommendation systems. Approximating a function from samples is a vital step in data prediction when the ground truth is not directly accessible. Although interpolation algorithms such as regression and non-kernel SVMs are prevalent in modern machine learning, in many cases they are not suitable for fitting arbitrary functions that lack a closed-form expression. The main contribution of this paper is a semi-parametric graphical model, satisfying the properties of the Gaussian Markov random field (GMRF), that approximates general, possibly multivariate, functions. Building on inference in this model, the paper further investigates several policies commonly used in Bayesian optimization to solve the multi-armed bandit (MAB) problem. The primary objective is to locate the global optimum of an unknown function. In the recommendation setting, the proposed algorithm maximizes user clicks under a rescheduled recommendation policy while keeping cost as low as possible. Comparative experiments are conducted across a set of policies, and the empirical evaluation suggests that Thompson sampling is the most suitable policy for the proposed algorithm.
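The abstract names Thompson sampling as the best-performing policy for the MAB problem. As a minimal illustrative sketch (not the paper's GMRF-based algorithm), the snippet below shows Thompson sampling on a Bernoulli bandit where each pull simulates a user click; the `true_ctrs` values, round count, and seed are assumptions made for this example only.

```python
import random

def thompson_sampling(true_ctrs, n_rounds=5000, seed=0):
    """Bernoulli Thompson sampling: each arm keeps a Beta(a, b) posterior
    over its click-through rate; each round we draw one sample from every
    posterior and pull the arm with the highest draw."""
    rng = random.Random(seed)
    k = len(true_ctrs)
    a = [1.0] * k  # posterior alpha (prior successes + 1)
    b = [1.0] * k  # posterior beta (prior failures + 1)
    pulls = [0] * k
    for _ in range(n_rounds):
        # Sample a plausible CTR for each arm from its posterior.
        draws = [rng.betavariate(a[i], b[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        click = rng.random() < true_ctrs[arm]  # simulated user feedback
        a[arm] += 1 if click else 0
        b[arm] += 0 if click else 1
        pulls[arm] += 1
    return pulls

# The arm with the highest true CTR should attract the bulk of the pulls.
pulls = thompson_sampling([0.02, 0.10, 0.04])
```

The posterior sampling step is what balances exploration and exploitation: an under-explored arm has a wide Beta posterior, so it occasionally produces a high draw and gets pulled again.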
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Zhao, C., Watanabe, K., Yang, B., Hirate, Y. (2018). Fast Converging Multi-armed Bandit Optimization Using Probabilistic Graphical Model. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_10
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4