Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards

  • Conference paper

Advances in Knowledge Discovery and Data Mining (PAKDD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

We study the contextual bandit problem with linear payoff functions. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received as a pile instead of immediately. We show how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naïvely applied under the piled-reward setting, and prove its regret bound. We then extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR can match LinUCB in the regret bound under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.
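
To make the setting concrete, here is a minimal sketch of LinUCB-style action selection when rewards arrive in piles, together with a pseudo-reward variant in the spirit of LinUCBPR. It assumes disjoint per-arm ridge-regression models and uses the current ridge estimate of the payoff as the pseudo reward; the class name PiledLinUCB, the hyperparameter alpha, and these modelling details are illustrative assumptions rather than the paper's exact algorithm.

import numpy as np

class PiledLinUCB:
    """LinUCB under the piled-reward setting: true rewards for a block of
    rounds arrive together, only after the whole block has been played."""

    def __init__(self, n_arms, d, alpha=1.0, use_pseudo_rewards=False):
        self.alpha = alpha
        self.use_pseudo = use_pseudo_rewards
        # Per-arm ridge statistics: A_a = I + sum x x^T, b_a = sum r x.
        self.A = [np.eye(d) for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]
        self.pending = []  # (arm, context, pseudo_reward) awaiting the pile

    def choose(self, contexts):
        """contexts[a] is the feature vector of arm a in this round.
        Picks the arm maximizing theta_a^T x + alpha * sqrt(x^T A_a^{-1} x)."""
        scores, thetas = [], []
        for a, x in enumerate(contexts):
            x = np.asarray(x, dtype=float)
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
            thetas.append(theta)
        arm = int(np.argmax(scores))
        x = np.asarray(contexts[arm], dtype=float)
        pseudo = 0.0
        if self.use_pseudo:
            # LinUCBPR-style interim step (an assumed variant): digest the
            # chosen context now, crediting it with the current estimated
            # payoff as a pseudo reward until the true reward arrives.
            pseudo = float(thetas[arm] @ x)
            self.A[arm] += np.outer(x, x)
            self.b[arm] += pseudo * x
        self.pending.append((arm, x, pseudo))
        return arm

    def receive_pile(self, rewards):
        """Consume the pile: rewards[i] is the true reward of the i-th
        pending round, in the order the rounds were played."""
        for (arm, x, pseudo), r in zip(self.pending, rewards):
            if self.use_pseudo:
                # A already holds x x^T from the interim update; swap the
                # pseudo reward for the true one inside b.
                self.b[arm] += (r - pseudo) * x
            else:
                # Naive LinUCB under piled rewards: update only now.
                self.A[arm] += np.outer(x, x)
                self.b[arm] += r * x
        self.pending.clear()

# Usage sketch: play a block of rounds, then the pile of rewards arrives.
# agent = PiledLinUCB(n_arms=3, d=5, use_pseudo_rewards=True)
# for contexts in block:              # contexts: one feature vector per arm
#     agent.choose(contexts)
# agent.receive_pile(rewards_block)   # true rewards, in playing order

The sketch highlights the design contrast: the naive variant (use_pseudo_rewards=False) leaves the model frozen between piles, whereas the pseudo-reward variant digests each chosen context immediately and merely corrects b once the true rewards arrive.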

Notes

  1. Available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Acknowledgements

We thank the anonymous reviewers and the members of the NTU CLLab for valuable suggestions. This work is partially supported by the Ministry of Science and Technology of Taiwan (MOST 103-2221-E-002-148-MY3) and the Asian Office of Aerospace Research and Development (AOARD FA2386-15-1-4012).

Author information

Corresponding author

Correspondence to Hsuan-Tien Lin.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Huang, KH., Lin, HT. (2016). Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science (LNAI), vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_12

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2

  • eBook Packages: Computer Science, Computer Science (R0)
