A Note on Generalized Second-Order Value Iteration in Markov Decision Processes

Journal of Optimization Theory and Applications

Abstract

Value iteration is a first-order algorithm for approximating the solution of the Bellman equation arising from a Markov decision process (MDP). In the recent literature, the max operator in the Bellman equation has been replaced by a smooth approximation, and an interesting second-order iterative method has been proposed to solve the resulting smoothed Bellman equation. Numerical experiments show that this second-order method becomes computationally expensive even for moderately sized state and action spaces, and its implementation is hampered by the need to evaluate exponential functions at large arguments, which can overflow. In this manuscript, a few first-order iterative schemes are derived from the second-order method to overcome these practical difficulties. All of the proposed schemes are globally convergent, and in many cases they converge to the solution of the Bellman equation in less time than the second-order method. The algorithms are efficient and easy to implement. A theoretical comparison between the algorithms is provided, and numerical simulations support the theoretical results.
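To make the setting concrete, the sketch below shows plain (first-order) value iteration applied to a Bellman equation in which the max over actions is replaced by a log-sum-exp smoothing, a standard smooth approximation in this line of work. It is an illustration only, not the authors' algorithm (their code is linked in the Notes below); the function names, the temperature parameter beta, and the random test MDP are assumptions made for this example. The max-shift inside the log-sum-exp is the usual remedy for the overflow issue mentioned in the abstract: the exponential is only ever evaluated at non-positive arguments.

```python
import numpy as np


def logsumexp_max(q, beta):
    """Smooth approximation of max over the last axis:
    (1/beta) * log(sum_a exp(beta * q[..., a])).
    The max-shift keeps exp() away from large arguments, so it cannot overflow."""
    m = q.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(beta * (q - m)).sum(axis=-1, keepdims=True)) / beta)[..., 0]


def smoothed_value_iteration(P, R, gamma, beta=200.0, tol=1e-10, max_iter=100_000):
    """First-order fixed-point iteration V <- L_beta(R + gamma * P V) for the
    smoothed Bellman equation.
    P: (A, S, S) row-stochastic transition kernels, R: (S, A) expected rewards."""
    V = np.zeros(R.shape[0])
    for it in range(max_iter):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
        V_new = logsumexp_max(Q, beta)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, it + 1
        V = V_new
    return V, max_iter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 20, 4, 0.9
    P = rng.random((A, S, S))
    P /= P.sum(axis=-1, keepdims=True)      # normalise rows into distributions
    R = rng.random((S, A))

    V_smooth, iters = smoothed_value_iteration(P, R, gamma)

    # Classical value iteration (hard max) for comparison.
    V = np.zeros(S)
    for _ in range(100_000):
        V_new = (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=-1)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    print(f"smoothed value iteration converged in {iters} iterations")
    print(f"max |V_smooth - V_hardmax| = {np.max(np.abs(V_smooth - V)):.2e}")
```

As beta increases, the log-sum-exp operator approaches the hard max (its bias is at most log(A)/beta per application), so the smoothed fixed point approaches the classical value function. The second-order method discussed in the paper addresses a smoothed equation of this kind with a Newton-type update rather than the plain fixed-point step above.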


Algorithms 1–5 (pseudocode listings) appear in the full article.


Notes

  1. https://github.com/shreyassr123/A-Note-on-Generalized-Second-order-Value-Iteration-in-MDP


Acknowledgements

The authors are grateful to the referees for carefully evaluating the manuscript and for their suggestions and comments, which enhanced its readability and quality. One of the authors, Shreyas S R, acknowledges the Council of Scientific and Industrial Research (India) for financial support in the form of a senior research fellowship (09/1022(0088)2019-EMR-I).

Author information

Corresponding author

Correspondence to Villavarayan Antony Vijesh.

Additional information

Communicated by Nizar Touzi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Antony Vijesh, V., Sumithra Rudresha, S. & Abdulla, M.S. A Note on Generalized Second-Order Value Iteration in Markov Decision Processes. J Optim Theory Appl 199, 1022–1049 (2023). https://doi.org/10.1007/s10957-023-02309-x

