DOI: 10.1145/3394486.3406484
tutorial

Learning by Exploration: New Challenges in Real-World Environments

Published: 20 August 2020

Abstract

Learning is a predominant theme for any intelligent system, human or machine. Moving beyond the classical paradigm of learning from past experience, e.g., offline supervised learning from given labels, a learner needs to actively collect exploratory feedback to learn about the unknown, i.e., to learn through exploration. This tutorial will introduce the learning-by-exploration paradigm, which is the key ingredient in many interactive online learning problems, including multi-armed bandit and, more generally, reinforcement learning problems.
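To make the need for exploratory feedback concrete, the following is a minimal sketch (our own illustration, not part of the tutorial materials) of a two-armed Bernoulli bandit in which a purely greedy learner, observing a reward only for the arm it pulls, can commit to the worse arm after a single unlucky draw; the reward probabilities and the learner are hypothetical.

    import random

    # Illustrative sketch: greedy action selection under bandit feedback.
    # The reward probabilities below are made up for illustration.
    true_means = [0.4, 0.7]              # hidden Bernoulli reward probabilities
    counts = [0, 0]                      # pulls per arm
    estimates = [0.0, 0.0]               # running-mean reward estimates

    def pull(arm):
        # Only the chosen arm's reward is revealed to the learner.
        return 1.0 if random.random() < true_means[arm] else 0.0

    for t in range(1000):
        if t < 2:
            arm = t                                          # try each arm once
        else:
            arm = max(range(2), key=lambda a: estimates[a])  # purely greedy afterwards
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    # If arm 0 happens to pay off on its first pull while arm 1 does not, the
    # greedy learner keeps pulling the worse arm forever; exploration is what
    # breaks this failure mode.
    print("pulls per arm:", counts, "estimates:", estimates)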
In this tutorial, we will first motivate the need for exploration in machine learning algorithms and highlight its importance in many real-world problems that involve online sequential decision making. In such real-world application scenarios, considerable challenges arise, including sample complexity, costly and even outdated feedback, and ethical considerations around exploration (such as fairness and privacy). We will introduce several classical exploration strategies, then highlight the aforementioned three fundamental challenges in the learning-by-exploration paradigm and introduce recent research developments addressing each of them.
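As one example of the classical exploration strategies the tutorial refers to, here is a minimal sketch of the UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002) on a toy three-armed Bernoulli bandit; the arm probabilities and horizon are made up for illustration, and this is not code from the tutorial itself.

    import math
    import random

    # Illustrative sketch of UCB1: pick the arm with the largest optimistic
    # estimate (empirical mean plus a confidence bonus that shrinks with pulls).
    true_means = [0.2, 0.5, 0.7]         # hidden Bernoulli reward probabilities
    counts = [0] * 3
    totals = [0.0] * 3

    def ucb1_pick(t):
        for a in range(3):
            if counts[a] == 0:           # pull every arm once before using the bound
                return a
        return max(range(3),
                   key=lambda a: totals[a] / counts[a]
                   + math.sqrt(2 * math.log(t) / counts[a]))

    for t in range(1, 10001):
        arm = ucb1_pick(t)
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward

    # Pulls concentrate on the best arm (arm 2), while the confidence bonus
    # guarantees every arm keeps being tried occasionally.
    print("pulls per arm:", counts)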

Cited By

  • (2023) Information seeking behavior of medical students in Sri Lanka. Information Development. https://doi.org/10.1177/02666669231179398 (online 12 June 2023)
  • (2021) Interactive Information Retrieval with Bandit Feedback. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2658-2661. https://doi.org/10.1145/3404835.3462810 (online 11 July 2021)

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. learning by exploration
  2. multi-armed bandit
  3. reinforcement learning

Qualifiers

  • Tutorial

Funding Sources

  • National Science Foundation

Conference

KDD '20

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
