ABSTRACT
We formalize the associative bandit framework introduced by Kaelbling as a learning-theory problem. The learning environment is modeled as a k-armed bandit in which arm payoffs are conditioned on an observable input selected on each trial. We show that, if the payoff functions are constrained to a known hypothesis class, learning can be performed efficiently with respect to the VC dimension of this class. We formally reduce the problem of PAC classification to the associative bandit problem, producing an efficient algorithm for any hypothesis class for which efficient classification algorithms are known. We demonstrate the approach empirically on a scalable concept class.
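The setting described above can be sketched as a small simulation. On each trial the learner observes an input, chooses one of k arms, and receives a stochastic payoff whose mean depends jointly on the input and the arm. All names, seeds, and payoff functions below are illustrative assumptions, not taken from the paper:

```python
import random

class AssociativeBandit:
    """Minimal sketch of an associative (input-conditioned) k-armed bandit."""

    def __init__(self, payoff_fns, seed=0):
        # payoff_fns[i](x) is arm i's expected payoff on input x.
        self.payoff_fns = payoff_fns
        self.rng = random.Random(seed)

    def pull(self, x, arm):
        # Bernoulli payoff with mean payoff_fns[arm](x).
        return 1.0 if self.rng.random() < self.payoff_fns[arm](x) else 0.0


# Two arms whose payoffs depend on a single Boolean input feature.
bandit = AssociativeBandit([
    lambda x: 0.9 if x else 0.1,   # arm 0 pays well when x is True
    lambda x: 0.2 if x else 0.8,   # arm 1 pays well when x is False
])

# Compare an input-aware policy with an input-blind one over 1000 trials.
inputs = [bool(random.Random(1).randint(0, 1)) for _ in range(1000)]
adaptive = sum(bandit.pull(x, 0 if x else 1) for x in inputs) / len(inputs)
blind = sum(bandit.pull(x, 0) for x in inputs) / len(inputs)
print(adaptive, blind)  # the input-aware policy earns a higher average payoff
```

The gap between the two averages illustrates why conditioning arm choice on the observable input matters: an input-blind policy can do no better than the best single arm, while the associative learner exploits the payoff structure over inputs.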
REFERENCES
- Abe, N., Biermann, A. W., & Long, P. M. (2003). Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37, 263--293.
- Auer, P. (2000). An improved on-line algorithm for learning linear evaluation functions. Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 118--125).
- Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. London, UK: Chapman and Hall.
- Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973--978).
- Fiechter, C.-N. (1995). PAC associative reinforcement learning. Unpublished manuscript.
- Fiechter, C.-N. (1997). Expected mistake bound model for on-line reinforcement learning. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 116--124).
- Fong, P. W. L. (1995). A quantitative study of hypothesis selection. Proceedings of the Twelfth International Conference on Machine Learning (ICML-95) (pp. 226--234).
- Kaelbling, L. P. (1994). Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15.
- Kearns, M. J., & Schapire, R. E. (1990). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48, 464--497.
- Langford, J., & Zadrozny, B. (2005). Estimating class membership probabilities using classifier learners. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (pp. 198--205).
- Mitchell, T. M. (1997). Machine learning. McGraw Hill.
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134--1142.
- Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. ICDM (p. 435).
Index Terms
- Experience-efficient learning in associative bandit problems