Abstract
We study a new variant of policy gradient methods, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. When instantiated with the Kullback–Leibler divergence, we establish the global linear convergence of HPMD applied to any MDP instance, for both the optimality gap and a weighted distance to the set of optimal policies. We then unveil a phase transition: both quantities exhibit local acceleration and converge at a superlinear rate once the optimality gap falls below a certain instance-dependent threshold. Combining local acceleration with diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover finite-time exact convergence for some commonly used Bregman divergences, implying the continued convergence of HPMD to the limiting policy even when the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that when a generative model is available for policy evaluation, with a small enough \(\epsilon_0\), for any target precision \(\epsilon \le \epsilon_0\), an \(\epsilon\)-optimal policy can be learned with \(\widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|/\epsilon_0^2)\) samples with probability \(1 - \mathcal{O}(\epsilon_0^{1/3})\).
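To make the flavor of the update concrete, the following is a minimal, self-contained Python sketch of a KL-instantiated policy-mirror-descent step with a diminishing regularization term, in the spirit of HPMD, run on a small randomly generated tabular MDP with exact policy evaluation. The MDP, the cost-minimization convention, and the stepsize/regularization schedules (\(\eta_k\), \(\tau_k\)) are illustrative assumptions, not the paper's prescribed algorithm or parameter choices.

```python
import numpy as np

# Illustrative sketch only: KL-instantiated policy mirror descent with a
# diminishing regularization weight, on an assumed random tabular MDP.
gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
c = rng.uniform(size=(n_states, n_actions))                       # cost c[s, a]

def q_values(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = c_pi, then form Q."""
    c_pi = np.einsum("sa,sa->s", pi, c)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return c + gamma * np.einsum("sap,p->sa", P, V)

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
for k in range(50):
    eta = 2.0 ** k      # assumed geometrically increasing stepsize
    tau = 2.0 ** (-k)   # assumed geometrically diminishing regularization
    Q = q_values(pi)
    # Closed-form KL mirror-descent step with an entropy regularizer whose
    # weight tau shrinks over iterations:
    #   pi_{k+1}(a|s) ∝ [pi_k(a|s) * exp(-eta * Q(s, a))]^(1 / (1 + eta * tau))
    logits = (np.log(np.maximum(pi, 1e-300)) - eta * Q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions of the final policy:", pi.argmax(axis=1))
```

As the regularization weight vanishes, the sketch reduces to an unregularized mirror-descent (natural policy gradient type) step; the diminishing regularizer is what biases the iterates toward a particular optimal policy, mirroring the maximal-entropy limit described in the abstract.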




Funding
The research has been partially supported by NSF Grants CCF-1909298 and DMS-2134037.
Ethics declarations
Conflict of interest
We acknowledge the submission policy and declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Lan, G. & Zhao, T. Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Math. Program. 207, 457–513 (2024). https://doi.org/10.1007/s10107-023-02017-4
DOI: https://doi.org/10.1007/s10107-023-02017-4