Abstract
We study the contextual bandit problem with linear payoff functions. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where rewards arrive in piles instead of immediately. We show how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naïvely applied under the piled-reward setting, and prove its regret bound. We then extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR matches the regret bound of LinUCB under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.
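Since the full algorithm is not reproduced on this page, the following is a minimal Python/NumPy sketch of how a LinUCB-style learner could operate under the piled-reward setting with a pseudo-reward buffer, in the spirit of the abstract. The class name LinUCBPiledPR, the per-arm ridge estimators, and the specific pseudo-reward rule (feeding back the current estimate before the true reward arrives) are illustrative assumptions, not the paper's reference implementation.

    import numpy as np


    class LinUCBPiledPR:
        """Sketch of LinUCB with a temporary pseudo-reward buffer for piled rewards."""

        def __init__(self, n_arms, dim, alpha=1.0, lam=1.0):
            self.alpha = alpha
            # Committed per-arm statistics, built only from observed (true) rewards.
            self.A = [lam * np.eye(dim) for _ in range(n_arms)]
            self.b = [np.zeros(dim) for _ in range(n_arms)]
            # Temporary statistics accumulated within the current pile using
            # pseudo rewards; discarded once the true rewards arrive.
            self.A_tmp = [np.zeros((dim, dim)) for _ in range(n_arms)]
            self.b_tmp = [np.zeros(dim) for _ in range(n_arms)]

        def _theta_and_Ainv(self, a):
            A = self.A[a] + self.A_tmp[a]
            A_inv = np.linalg.inv(A)
            theta = A_inv @ (self.b[a] + self.b_tmp[a])
            return theta, A_inv

        def choose(self, x):
            """Pick the arm with the highest upper confidence bound for context x."""
            best, best_score = 0, -np.inf
            for a in range(len(self.A)):
                theta, A_inv = self._theta_and_Ainv(a)
                score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
                if score > best_score:
                    best, best_score = a, score
            # Pseudo-reward step (one plausible rule): before the true reward is
            # revealed, temporarily feed back the current estimate so later choices
            # within the same pile already account for this context.
            theta, _ = self._theta_and_Ainv(best)
            self.A_tmp[best] += np.outer(x, x)
            self.b_tmp[best] += (theta @ x) * x
            return best

        def receive_pile(self, records):
            """Commit the true rewards of a pile; records = [(arm, context, reward)]."""
            for a in range(len(self.A)):
                self.A_tmp[a].fill(0.0)
                self.b_tmp[a].fill(0.0)
            for a, x, r in records:
                self.A[a] += np.outer(x, x)
                self.b[a] += r * x

Within a pile, the pseudo updates shrink the confidence width for contexts already chosen, which is the effect the abstract attributes to LinUCBPR; discarding them before committing the true rewards keeps the committed model driven only by observed feedback.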
Acknowledgements
We thank the anonymous reviewers and the members of the NTU CLLab for valuable suggestions. This work is partially supported by the Ministry of Science and Technology of Taiwan (MOST 103-2221-E-002-148-MY3) and the Asian Office of Aerospace Research and Development (AOARD FA2386-15-1-4012).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Huang, K.-H., Lin, H.-T. (2016). Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_12
DOI: https://doi.org/10.1007/978-3-319-31750-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2