Abstract
The most challenging open issues in sequential decision making include partial observability of the decision maker's environment, hierarchical and other types of abstract credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in the case of stochastic environments, stochastic policies, and unknown temporal delays between actions and their observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models that combine certain aspects of DS and traditional RL.
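To make the remedy concrete, the following Python sketch illustrates the backtracking idea behind the success-story algorithm: each policy modification is pushed on a stack together with the lifelong time and cumulative reward at which it was made, and at each checkpoint the most recent modifications are undone until every surviving one has been followed by faster average reward intake than its predecessor. This is a minimal illustration under assumed interfaces (the class name, the random modification operator, and the checkpoint schedule are all hypothetical), not the chapter's implementation.

import random


class SSAPolicy:
    """Sketch of direct search with success-story backtracking
    (assumed interfaces; not the original implementation)."""

    def __init__(self, n_params):
        self.policy = [0.0] * n_params  # mutable policy parameters
        self.t = 0                      # lifelong time-step counter
        self.R = 0.0                    # cumulative reward so far
        self.stack = []                 # (t_mod, R_mod, undo) per surviving modification

    def step(self, reward):
        # Called once per interaction with the environment.
        self.t += 1
        self.R += reward

    def modify(self):
        # Apply a random policy modification and remember how to undo it.
        i = random.randrange(len(self.policy))
        old = self.policy[i]
        self.policy[i] = old + random.gauss(0.0, 0.1)
        self.stack.append(
            (self.t, self.R, lambda i=i, v=old: self.policy.__setitem__(i, v))
        )

    def _ssc_holds(self):
        # Success-story criterion: the average reward per time step since
        # each surviving modification must strictly exceed the average
        # since its predecessor (the process start is the base case).
        prev = self.R / max(self.t, 1)
        for t_mod, R_mod, _ in self.stack:
            rate = (self.R - R_mod) / max(self.t - t_mod, 1)
            if rate <= prev:
                return False
            prev = rate
        return True

    def checkpoint(self):
        # Backtrack: pop and undo recent modifications until only those
        # survive that have been empirically worthwhile so far.
        while self.stack and not self._ssc_holds():
            *_, undo = self.stack.pop()
            undo()

A driver would interleave step() with occasional modify() and checkpoint() calls. Because the worth of a modification is judged by reward per time step since it occurred, no a priori model of the delay between actions and their observable effects is required.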
© 2000 Springer-Verlag Berlin Heidelberg
Cite this chapter
Schmidhuber, J. (2000). Sequential Decision Making Based on Direct Search. In: Sun, R., & Giles, C. L. (Eds.), Sequence Learning. Lecture Notes in Computer Science, vol. 1828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44565-X_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41597-8
Online ISBN: 978-3-540-44565-4