Reinforcement Learning in Continuous State and Action Spaces

  • Chapter

Part of the book series: Adaptation, Learning, and Optimization (ALO, volume 12)

Abstract

Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions, and gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and (natural) actor-critic methods. We discuss the advantages of different approaches and compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy empirically.

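To make the methods named above concrete, here is a minimal sketch that combines a gradient-based TD(0) critic (linear function approximation over radial-basis features of a continuous state) with a Gaussian policy-gradient actor, i.e. a simple actor-critic. It is an illustrative toy only, not the chapter's exact algorithms; the one-dimensional dynamics, reward, features and step sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Radial-basis features over the continuous state s in [-1, 1] (assumed for the example).
centers = np.linspace(-1.0, 1.0, 10)
width = 0.2

def features(s):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

def step(s, a):
    """Toy dynamics: the agent should steer the state toward 0; large actions are penalised."""
    s_next = np.clip(s + 0.1 * a + 0.01 * rng.standard_normal(), -1.0, 1.0)
    reward = -s_next ** 2 - 0.01 * a ** 2
    return s_next, reward

gamma, alpha_v, alpha_pi, sigma = 0.95, 0.1, 0.01, 0.3
w = np.zeros_like(centers)      # critic weights: V(s) ~ w . phi(s)
theta = np.zeros_like(centers)  # actor weights: mean action = theta . phi(s)

s = rng.uniform(-1.0, 1.0)
for t in range(5000):
    phi = features(s)
    a = theta @ phi + sigma * rng.standard_normal()   # Gaussian exploration around the mean action
    s_next, r = step(s, a)

    # Critic: temporal-difference error and semi-gradient update of the value weights.
    delta = r + gamma * (w @ features(s_next)) - w @ phi
    w += alpha_v * delta * phi

    # Actor: gradient of log N(a | theta.phi, sigma^2) w.r.t. theta, scaled by the TD error.
    grad_log_pi = (a - theta @ phi) / sigma ** 2 * phi
    theta += alpha_pi * delta * grad_log_pi

    s = s_next
```

In this structure the temporal-difference error does double duty: it drives the critic's semi-gradient update and serves as the reinforcement signal scaling the actor's policy-gradient step, which is the basic shape of the actor-critic methods discussed in the chapter.
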
Author information

Correspondence to Hado van Hasselt.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

van Hasselt, H. (2012). Reinforcement Learning in Continuous State and Action Spaces. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_7

  • DOI: https://doi.org/10.1007/978-3-642-27645-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3
