Abstract
Value iteration is a classical first-order algorithm for approximating the solution of the Bellman equation arising from a Markov decision process (MDP). In the recent literature, by replacing the max operator in the Bellman equation with a smooth approximation, an interesting second-order iterative method was proposed to solve the resulting smoothed Bellman equation. Numerical experiments show that this second-order method becomes computationally expensive even for a modest number of states and actions, and its implementation is complicated by the evaluation of exponential functions with large arguments. In this manuscript, a few first-order iterative schemes are derived from the second-order method to overcome these practical difficulties. All the proposed schemes are globally convergent, and in many cases they converge to the solution of the Bellman equation in less time than the second-order method. The algorithms are efficient and easy to implement. A theoretical comparison of the algorithms is provided, and numerical simulations support the theoretical results.
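The following is a minimal Python sketch, not the authors' algorithms, of the two ingredients the abstract refers to: classical first-order value iteration with the hard max operator, and a smooth log-sum-exp surrogate for the max whose naive evaluation overflows for large arguments unless the standard max-shift is applied. The names P, R, gamma, and beta are hypothetical placeholders for a tabular MDP's transition tensor, reward matrix, discount factor, and smoothing parameter.

```python
import numpy as np

def bellman_q(V, P, R, gamma):
    """Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t].

    P has shape (A, S, S), R has shape (S, A), V has shape (S,).
    """
    return R + gamma * np.einsum('ast,t->sa', P, V)

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Classical first-order value iteration with the hard max operator."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        V_new = bellman_q(V, P, R, gamma).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def smooth_max_naive(Q, beta):
    """Naive log-sum-exp smooth max; np.exp overflows once beta * Q is large."""
    return np.log(np.exp(beta * Q).sum(axis=1)) / beta

def smooth_max_stable(Q, beta):
    """Shifted log-sum-exp: subtracting the row-wise max avoids overflow."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(beta * (Q - m[:, None])).sum(axis=1)) / beta

def smoothed_value_iteration(P, R, gamma, beta, tol=1e-8, max_iter=10_000):
    """A first-order fixed-point iteration on the smoothed Bellman equation
    (a generic sketch, not the specific schemes derived in the paper)."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        V_new = smooth_max_stable(bellman_q(V, P, R, gamma), beta)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

As beta grows, the log-sum-exp operator approaches the hard max, which is why a stable (shifted) evaluation matters in practice.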





Acknowledgements
The authors are grateful to the referees for carefully evaluating the manuscript and for their suggestions and comments, which enhanced its readability and quality. One of the authors, Shreyas S R, acknowledges the Council of Scientific and Industrial Research (India) for financial support in the form of a senior research fellowship (09/1022(0088)2019-EMR-I).
Additional information
Communicated by Nizar Touzi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Antony Vijesh, V., Sumithra Rudresha, S. & Abdulla, M.S. A Note on Generalized Second-Order Value Iteration in Markov Decision Processes. J Optim Theory Appl 199, 1022–1049 (2023). https://doi.org/10.1007/s10957-023-02309-x