Reinforcement Learning in Continuous State and Action Spaces

  • Chapter

Part of the book series: Adaptation, Learning, and Optimization (ALO, volume 12)

Abstract

Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions, and gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and (natural) actor-critic methods. We discuss the advantages of different approaches and compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy empirically.

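To make the methods named above concrete, here is a minimal sketch that combines a gradient-based TD(0) critic (linear function approximation over radial-basis features of a continuous state) with a Gaussian policy-gradient actor, i.e. a simple actor-critic. It is an illustrative toy only, not the chapter's exact algorithms; the one-dimensional dynamics, reward, features and step sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Radial-basis features over the continuous state s in [-1, 1] (assumed for the example).
centers = np.linspace(-1.0, 1.0, 10)
width = 0.2

def features(s):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

def step(s, a):
    """Toy dynamics: the agent should steer the state toward 0; large actions are penalised."""
    s_next = np.clip(s + 0.1 * a + 0.01 * rng.standard_normal(), -1.0, 1.0)
    reward = -s_next ** 2 - 0.01 * a ** 2
    return s_next, reward

gamma, alpha_v, alpha_pi, sigma = 0.95, 0.1, 0.01, 0.3
w = np.zeros_like(centers)      # critic weights: V(s) ~ w . phi(s)
theta = np.zeros_like(centers)  # actor weights: mean action = theta . phi(s)

s = rng.uniform(-1.0, 1.0)
for t in range(5000):
    phi = features(s)
    a = theta @ phi + sigma * rng.standard_normal()   # Gaussian exploration around the mean action
    s_next, r = step(s, a)

    # Critic: temporal-difference error and semi-gradient update of the value weights.
    delta = r + gamma * (w @ features(s_next)) - w @ phi
    w += alpha_v * delta * phi

    # Actor: gradient of log N(a | theta.phi, sigma^2) w.r.t. theta, scaled by the TD error.
    grad_log_pi = (a - theta @ phi) / sigma ** 2 * phi
    theta += alpha_pi * delta * grad_log_pi

    s = s_next
```

In this structure the temporal-difference error does double duty: it drives the critic's semi-gradient update and serves as the reinforcement signal scaling the actor's policy-gradient step, which is the basic shape of the actor-critic methods discussed in the chapter.
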
Author information

Correspondence to Hado van Hasselt.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

van Hasselt, H. (2012). Reinforcement Learning in Continuous State and Action Spaces. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_7

  • DOI: https://doi.org/10.1007/978-3-642-27645-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3
