Abstract
A learner’s modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves, and so on. They are of interest in situations where the initial learning algorithm itself can be improved by experience — this is what we call “learning to learn”. How can we force some (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner’s lifetime, SSA is called occasionally, at times computed by the SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call — this evaluates the long-term effects of SMP-modifications that set the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they form the basis for additional SMP-modifications. Solely by self-modifications, our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.
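The backtracking step described above can be sketched in code. The idea: each surviving self-modification sits on a stack with the lifetime and cumulative reward recorded when it was made; at an SSA call, topmost modifications are undone until the reward-per-time earned since each remaining modification strictly exceeds the reward-per-time earned since the one below it (the success-story criterion). This is a minimal illustrative sketch, not the authors' implementation; the names `Checkpoint`, `stack`, and `ssa_call` are assumptions introduced here, and the SMP itself (how modifications are generated and when SSA is invoked) is deliberately left out.

```python
# Illustrative sketch of SSA's backtracking step (names are hypothetical).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Checkpoint:
    time: float                # lifetime at which the self-modification occurred
    reward: float              # cumulative reward at that time
    undo: Callable[[], None]   # restores the policy to its pre-modification state

stack: List[Checkpoint] = []   # surviving self-modifications, oldest first

def ssa_call(now: float, cumulative_reward: float) -> None:
    """Undo topmost modifications until the success-story criterion holds:
    reward per time since each surviving modification must strictly exceed
    reward per time since the modification below it (or since birth).
    Assumes now > every checkpoint time, so the denominators are positive."""
    while stack:
        top = stack[-1]
        rate_top = (cumulative_reward - top.reward) / (now - top.time)
        if len(stack) >= 2:
            prev = stack[-2]
            rate_prev = (cumulative_reward - prev.reward) / (now - prev.time)
        else:
            rate_prev = cumulative_reward / now   # reward rate since birth
        if rate_top > rate_prev:
            break            # criterion holds here, hence below as well
        top.undo()           # roll the policy back
        stack.pop()
```

Because the criterion is checked against the cumulative reward at the time of the SSA call, a modification that looked good early on can still be undone later if it did not lead to a lasting reward acceleration — this is how the long-term effects of earlier modifications on later ones are evaluated.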
Copyright information
© 1998 Springer Science+Business Media New York
Cite this chapter
Schmidhuber, J., Zhao, J., Schraudolph, N.N. (1998). Reinforcement Learning with Self-Modifying Policies. In: Thrun, S., Pratt, L. (eds) Learning to Learn. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5529-2_12
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7527-2
Online ISBN: 978-1-4615-5529-2