Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity

  • Full Length Paper
  • Series A
  • Published in Mathematical Programming

Abstract

We study a new variant of the policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror-descent-type policy update with an additional diminishing regularization term, and possesses several computational properties that appear to be new in the literature. When instantiated with the Kullback–Leibler divergence, we establish the global linear convergence of HPMD applied to any MDP instance, for both the optimality gap and a weighted distance to the set of optimal policies. We then unveil a phase transition, in which both quantities exhibit local acceleration and converge at a superlinear rate once the optimality gap falls below a certain instance-dependent threshold. Combining local acceleration with diminishing regularization, we establish the first result among policy gradient methods that certifies and characterizes the limiting policy, showing, via a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all of the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover finite-time exact convergence for some commonly used Bregman divergences, implying that HPMD continues to converge to the limiting policy even when the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that when a generative model is available for policy evaluation, for a small enough \(\epsilon _0\) and any target precision \(\epsilon \le \epsilon _0\), an \(\epsilon \)-optimal policy can be learned with \(\widetilde{\mathcal {O}}(|{\mathcal {S}} | |\mathcal {A} | / \epsilon _0^2)\) samples with probability \(1 - \mathcal {O}(\epsilon _0^{1/3})\).
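To make the KL-instantiated update above concrete, the following is a minimal Python sketch of one HPMD-style iteration. It solves, in closed form, a per-state mirror descent subproblem with an extra KL regularization of weight \(\tau _k\) toward a reference policy \(\pi _0\) (e.g., the uniform policy); the function name hpmd_step, the tabular array layout, and the particular choice of schedules for \(\eta _k\) and \(\tau _k\) are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def hpmd_step(pi_k, Q_k, pi_0, eta_k, tau_k):
    # One KL-instantiated, HPMD-style update (illustrative sketch only).
    # pi_k : (S, A) array, current policy; each row is a distribution over actions.
    # Q_k  : (S, A) array, state-action values of pi_k (assumed supplied by an evaluator).
    # pi_0 : (S, A) array, reference policy pulled toward by the diminishing
    #        regularization term, whose weight tau_k shrinks to zero over iterations.
    # eta_k: stepsize of the mirror descent step.
    #
    # Per state s, the update is the closed-form minimizer over the simplex of
    #   eta_k * ( <Q_k(s, .), p> + tau_k * KL(p, pi_0(. | s)) ) + KL(p, pi_k(. | s)),
    # i.e., a geometric mixture of pi_k and pi_0, tilted by exp(-eta_k * Q_k).
    w = 1.0 / (1.0 + eta_k * tau_k)
    logits = w * (np.log(pi_k) - eta_k * Q_k) + (1.0 - w) * np.log(pi_0)
    logits -= logits.max(axis=1, keepdims=True)      # subtract row-wise max for numerical stability
    pi_next = np.exp(logits)
    return pi_next / pi_next.sum(axis=1, keepdims=True)

As \(\tau _k \rightarrow 0\) the step approaches an unregularized policy mirror descent update, while any \(\tau _k > 0\) keeps pulling the iterate toward \(\pi _0\), which is consistent with the maximal-entropy limiting policy described in the abstract.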

Notes

  1. For more environment details, we refer readers to [16], which adopts the same experimental setup.

  2. In Sect. 4, we will propose a generalized version of HPMD that does not require \(\pi _0\) to be the uniform policy.

References

  1. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes. In: Conference on Learning Theory, pp. 64–66. PMLR (2020)

  2. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  3. Ben-Tal, A., Nemirovski, A.: Optimization III: convex analysis, nonlinear programming theory, nonlinear programming algorithms. Lecture notes, vol. 34 (2012)

  4. Bhandari, J., Russo, D.: A note on the linear convergence of policy gradient methods. arXiv preprint arXiv:2007.11120 (2020)

  5. Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70, 2563–2578 (2021)

  6. Derman, E., Geist, M., Mannor, S.: Twice regularized MDPs and the equivalence between robustness and regularization. Adv. Neural Inf. Process. Syst. 34, 22274–22287 (2021)

  7. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)

  8. Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: International Conference on Machine Learning, pp. 1832–1841. PMLR (2018)

  9. Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5(Oct), 1391–1415 (2004)

  10. Hu, Y., Ji, Z., Telgarsky, M.: Actor-critic is implicitly biased towards high entropy optimal policies. arXiv preprint arXiv:2110.11280 (2021)

  11. Ji, Z., Telgarsky, M.: The implicit bias of gradient descent on nonseparable data. In: Conference on Learning Theory, pp. 1772–1798. PMLR (2019)

  12. Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning. Citeseer (2002)

  13. Kakade, S.M.: A natural policy gradient. In: Advances in Neural Information Processing Systems, vol. 14 (2001)

  14. Khodadadian, S., Jhunjhunwala, P.R., Varma, S.M., Maguluri, S.T.: On the linear convergence of natural policy gradient algorithm. arXiv preprint arXiv:2105.01424 (2021)

  15. Lan, G.: Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. (2022). https://doi.org/10.1007/s10107-022-01816-5

  16. Lan, G., Li, Y., Zhao, T.: Block policy mirror descent. arXiv preprint arXiv:2201.05756 (2022)

  17. Li, Y., Ju, C., Fang, E.X., Zhao, T.: Implicit regularization of Bregman proximal point algorithm and mirror descent on separable data. arXiv preprint arXiv:2108.06808 (2021)

  18. Li, Y., Fang, E.X., Xu, H., Zhao, T.: Implicit bias of gradient descent based adversarial training on separable data. In: International Conference on Learning Representations (2020)

  19. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

  20. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural trust region/proximal policy optimization attains globally optimal policy (2019)

  21. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  22. Nemirovskij, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization (1983)

  23. Neu, G., Jonsson, A., Gómez, V.: A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798 (2017)

  24. Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)

  25. Osborne, M.R., Presnell, B., Turlach, B.A.: A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)

  26. Park, M.Y., Hastie, T.: L1-regularization path algorithm for generalized linear models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69(4), 659–677 (2007)

  27. Peters, J., Mulling, K., Altun, Y.: Relative entropy policy search. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (2010)

  28. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, London (2005)

  29. Rosset, S., Zhu, J., Hastie, T.: Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res. 5, 941–973 (2004)

  30. Scherrer, B.: Improved and generalized upper bounds on the complexity of policy iteration. In: Advances in Neural Information Processing Systems, vol. 26 (2013)

  31. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897. PMLR (2015)

  32. Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5668–5675 (2020)

  33. Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N.: The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19(1), 2822–2878 (2018)

  34. Xiao, L.: On the convergence rates of policy gradient methods. arXiv preprint arXiv:2201.07443 (2022)

  35. Xiao, L., Zhang, T.: A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim. 23(2), 1062–1091 (2013)

  36. Ye, Y.: The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4), 593–603 (2011)

  37. Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J.D., Chi, Y.: Policy mirror descent for regularized reinforcement learning: a generalized framework with linear convergence. arXiv preprint arXiv:2105.11066 (2021)

  38. Zhao, P., Yu, B.: Stagewise lasso. J. Mach. Learn. Res. 8, 2701–2726 (2007)

Funding

The research has been partially supported by NSF Grants CCF-1909298 and DMS-2134037.

Author information

Corresponding author

Correspondence to Yan Li.

Ethics declarations

Conflict of interest

We acknowledge the submission policy and declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Y., Lan, G. & Zhao, T. Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Math. Program. 207, 457–513 (2024). https://doi.org/10.1007/s10107-023-02017-4
