Sequential Decision Making Based on Direct Search

Chapter in: Sequence Learning. Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1828).

Abstract

The most challenging open issues in sequential decision making include partial observability of the decision maker’s environment, hierarchical and other types of abstract credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in the case of stochastic environments, stochastic policies, and unknown temporal delays between actions and observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models combining certain aspects of DS and traditional RL.
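To make the abstract's two ingredients concrete, here is a minimal Python sketch of direct search in policy space guarded by a simplified success-story criterion. Everything specific in it is invented for illustration: the four-parameter policy, the noisy quadratic objective standing in for a stochastic environment, and the single-parameter mutation. The stack-based check only outlines the success-story algorithm in spirit; it is not the chapter's implementation.

```python
import random

def evaluate(policy):
    """One noisy trial of the policy. The quadratic objective and the
    hidden target vector are invented for this sketch; the Gaussian
    noise stands in for a stochastic environment/policy."""
    target = [0.8, 0.2, 0.5, 0.9]
    fitness = -sum((p - t) ** 2 for p, t in zip(policy, target))
    return fitness + random.gauss(0.0, 0.05)

def mutate(policy, step=0.1):
    """Direct search move: perturb one randomly chosen policy parameter."""
    child = list(policy)
    i = random.randrange(len(child))
    child[i] += random.gauss(0.0, step)
    return child

def success_story_search(n_trials=5000, seed=0):
    random.seed(seed)
    policy = [random.random() for _ in range(4)]
    total_reward, t = 0.0, 0
    # Stack of checkpoints: (trial count, cumulative reward, policy
    # before the modification). The bottom entry is a sentinel marking
    # the start of the "success story" and is never popped.
    stack = [(0, 0.0, list(policy))]

    def reward_per_trial(entry):
        t0, r0, _ = entry
        return (total_reward - r0) / max(t - t0, 1e-9)

    for _ in range(n_trials):
        stack.append((t, total_reward, list(policy)))  # checkpoint, then modify
        policy = mutate(policy)
        total_reward += evaluate(policy)
        t += 1
        # Simplified success-story criterion: average reward per trial
        # since each checkpoint must be higher for more recent
        # checkpoints. Pop (and thereby undo) the newest modification
        # while it violates this ordering; time and reward are never
        # rolled back, only the policy changes themselves.
        while len(stack) > 1 and reward_per_trial(stack[-1]) <= reward_per_trial(stack[-2]):
            _, _, saved = stack.pop()
            policy = saved
    return policy

if __name__ == "__main__":
    print(success_story_search())
```

The point the sketch preserves is that a modification is never accepted once and for all: each one remains on trial, and is undone retroactively as soon as the reward per trial since its creation stops beating that of earlier surviving modifications. This retrospective evaluation is what lets the approach cope with noisy evaluations and unknown delays between an action and its observable effect.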





Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

Cite this chapter

Schmidhuber, J. (2000). Sequential Decision Making Based on Direct Search. In: Sun, R., Giles, C.L. (eds) Sequence Learning. Lecture Notes in Computer Science, vol 1828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44565-X_10

  • DOI: https://doi.org/10.1007/3-540-44565-X_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41597-8

  • Online ISBN: 978-3-540-44565-4
