Abstract
The most challenging open issues in sequential decision making include partial observability of the decision maker's environment, hierarchical and other types of abstract credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in the case of stochastic environments, stochastic policies, and unknown temporal delays between actions and their observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models that combine certain aspects of DS and traditional RL.
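To make the remedy concrete, the following Python sketch illustrates the backtracking idea behind the success-story algorithm: each policy modification is pushed on a stack together with the lifelong time and cumulative reward at which it was made, and at each checkpoint the most recent modifications are undone until every surviving one has been followed by faster average reward intake than its predecessor. This is a minimal illustration under assumed interfaces (the class name, the random modification operator, and the checkpoint schedule are all hypothetical), not the chapter's implementation.

import random


class SSAPolicy:
    """Sketch of direct search with success-story backtracking
    (assumed interfaces; not the original implementation)."""

    def __init__(self, n_params):
        self.policy = [0.0] * n_params  # mutable policy parameters
        self.t = 0                      # lifelong time-step counter
        self.R = 0.0                    # cumulative reward so far
        self.stack = []                 # (t_mod, R_mod, undo) per surviving modification

    def step(self, reward):
        # Called once per interaction with the environment.
        self.t += 1
        self.R += reward

    def modify(self):
        # Apply a random policy modification and remember how to undo it.
        i = random.randrange(len(self.policy))
        old = self.policy[i]
        self.policy[i] = old + random.gauss(0.0, 0.1)
        self.stack.append(
            (self.t, self.R, lambda i=i, v=old: self.policy.__setitem__(i, v))
        )

    def _ssc_holds(self):
        # Success-story criterion: the average reward per time step since
        # each surviving modification must strictly exceed the average
        # since its predecessor (the process start is the base case).
        prev = self.R / max(self.t, 1)
        for t_mod, R_mod, _ in self.stack:
            rate = (self.R - R_mod) / max(self.t - t_mod, 1)
            if rate <= prev:
                return False
            prev = rate
        return True

    def checkpoint(self):
        # Backtrack: pop and undo recent modifications until only those
        # survive that have been empirically worthwhile so far.
        while self.stack and not self._ssc_holds():
            *_, undo = self.stack.pop()
            undo()

A driver would interleave step() with occasional modify() and checkpoint() calls. Because the worth of a modification is judged by reward per time step since it occurred, no a priori model of the delay between actions and their observable effects is required.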
© 2000 Springer-Verlag Berlin Heidelberg
Cite this chapter
Schmidhuber, J. (2000). Sequential Decision Making Based on Direct Search. In: Sun, R., & Giles, C. L. (Eds.), Sequence Learning. Lecture Notes in Computer Science, vol. 1828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44565-X_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41597-8
Online ISBN: 978-3-540-44565-4