
Reinforcement Learning with Self-Modifying Policies


Abstract

A learner’s modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm itself has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves, and so on. They are of interest in situations where the initial learning algorithm itself can be improved by experience; this is what we call “learning to learn”. How can we force a (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner’s lifetime, SSA is occasionally invoked, at times computed by the SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up to the current SSA call; this evaluates the long-term effects of SMP-modifications that set the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they form the basis for additional SMP-modifications. Solely by self-modifications, our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.
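To make the backtracking step concrete, here is a minimal sketch of how an SSA call might be organized, assuming a stack of checkpoints that records, for each surviving self-modification, the lifetime and cumulative reward at the moment it was made. The names (Checkpoint, ssa_call, restore) and the simplified top-of-stack test are illustrative assumptions, not the chapter's implementation; the full success-story criterion requires the entire chain of reward-per-time ratios to be increasing.

```python
# Minimal sketch of SSA backtracking (illustrative assumption, not the chapter's code).
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Checkpoint:
    time: float     # lifetime at which the self-modification was executed
    reward: float   # cumulative lifelong reward at that moment
    undo_info: Any  # data needed to restore the overwritten policy components

def reward_speed(now_t: float, now_r: float, t0: float, r0: float) -> float:
    """Average reward per time unit accumulated since (t0, r0)."""
    return (now_r - r0) / max(now_t - t0, 1e-9)

def ssa_call(stack: List[Checkpoint], now_t: float, now_r: float,
             restore: Callable[[Any], None]) -> None:
    """Undo the most recent self-modifications until each surviving checkpoint
    shows faster reward intake than the checkpoint before it (a simplified
    rendering of the success-story criterion)."""
    while stack:
        top = stack[-1]
        speed_since_top = reward_speed(now_t, now_r, top.time, top.reward)
        if len(stack) > 1:
            prev = stack[-2]
            speed_since_prev = reward_speed(now_t, now_r, prev.time, prev.reward)
        else:
            speed_since_prev = reward_speed(now_t, now_r, 0.0, 0.0)  # since birth
        if speed_since_top > speed_since_prev:
            break  # the most recent surviving modification still pays off
        restore(stack.pop().undo_info)  # undo a modification that did not pay off
```

Modifications still on the stack after the call constitute the lifelong success history; subsequent SMP-generated modifications push new checkpoints on top of it until the next SSA call.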



Copyright information

© 1998 Springer Science+Business Media New York

About this chapter

Cite this chapter

Schmidhuber, J., Zhao, J., Schraudolph, N.N. (1998). Reinforcement Learning with Self-Modifying Policies. In: Thrun, S., Pratt, L. (eds) Learning to Learn. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5529-2_12


  • DOI: https://doi.org/10.1007/978-1-4615-5529-2_12

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-7527-2

  • Online ISBN: 978-1-4615-5529-2

