A Note on Generalized Second-Order Value Iteration in Markov Decision Processes

Journal of Optimization Theory and Applications

Abstract

Value iteration is a first-order algorithm for approximating the solution of the Bellman equation arising from a Markov decision process (MDP). In the recent literature, the max operator in the Bellman equation has been replaced by a smooth approximation, and an interesting second-order iterative method has been proposed to solve the resulting smoothed Bellman equation. Numerical experiments show that this second-order method becomes computationally expensive even for moderately sized state and action spaces, and its implementation is hampered by the need to evaluate exponential functions at large arguments, which can overflow. In this manuscript, a few first-order iterative schemes are derived from the second-order method to overcome these practical difficulties. All of the proposed schemes are globally convergent, and in many cases they converge to the solution of the Bellman equation in less time than the second-order method. The algorithms are efficient and easy to implement. A theoretical comparison between the algorithms is provided, and numerical simulations support the theoretical results.
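To make the setting concrete, the sketch below shows plain (first-order) value iteration applied to a Bellman equation in which the max over actions is replaced by a log-sum-exp smoothing, a standard smooth approximation in this line of work. It is an illustration only, not the authors' algorithm (their code is linked in the Notes below); the function names, the temperature parameter beta, and the random test MDP are assumptions made for this example. The max-shift inside the log-sum-exp is the usual remedy for the overflow issue mentioned in the abstract: the exponential is only ever evaluated at non-positive arguments.

```python
import numpy as np


def logsumexp_max(q, beta):
    """Smooth approximation of max over the last axis:
    (1/beta) * log(sum_a exp(beta * q[..., a])).
    The max-shift keeps exp() away from large arguments, so it cannot overflow."""
    m = q.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(beta * (q - m)).sum(axis=-1, keepdims=True)) / beta)[..., 0]


def smoothed_value_iteration(P, R, gamma, beta=200.0, tol=1e-10, max_iter=100_000):
    """First-order fixed-point iteration V <- L_beta(R + gamma * P V) for the
    smoothed Bellman equation.
    P: (A, S, S) row-stochastic transition kernels, R: (S, A) expected rewards."""
    V = np.zeros(R.shape[0])
    for it in range(max_iter):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
        V_new = logsumexp_max(Q, beta)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, it + 1
        V = V_new
    return V, max_iter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 20, 4, 0.9
    P = rng.random((A, S, S))
    P /= P.sum(axis=-1, keepdims=True)      # normalise rows into distributions
    R = rng.random((S, A))

    V_smooth, iters = smoothed_value_iteration(P, R, gamma)

    # Classical value iteration (hard max) for comparison.
    V = np.zeros(S)
    for _ in range(100_000):
        V_new = (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=-1)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    print(f"smoothed value iteration converged in {iters} iterations")
    print(f"max |V_smooth - V_hardmax| = {np.max(np.abs(V_smooth - V)):.2e}")
```

As beta increases, the log-sum-exp operator approaches the hard max (its bias is at most log(A)/beta per application), so the smoothed fixed point approaches the classical value function. The second-order method discussed in the paper addresses a smoothed equation of this kind with a Newton-type update rather than the plain fixed-point step above.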


Algorithms 1–5 (pseudocode listings) appear in the full article.


Notes

  1. https://github.com/shreyassr123/A-Note-on-Generalized-Second-order-Value-Iteration-in-MDP


Acknowledgements

The authors are grateful to the referees for carefully evaluating the manuscript and for their suggestions and comments, which enhanced its readability and quality. One of the authors, Shreyas S R, acknowledges the Council of Scientific and Industrial Research (India) for financial support in the form of a senior research fellowship (09/1022(0088)2019-EMR-I).

Author information

Corresponding author

Correspondence to Villavarayan Antony Vijesh.

Additional information

Communicated by Nizar Touzi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Antony Vijesh, V., Sumithra Rudresha, S. & Abdulla, M.S. A Note on Generalized Second-Order Value Iteration in Markov Decision Processes. J Optim Theory Appl 199, 1022–1049 (2023). https://doi.org/10.1007/s10957-023-02309-x

