Abstract
We study a new variant of policy gradient methods, named homotopic policy mirror descent (HPMD), for solving discounted, infinite-horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. When instantiated with the Kullback–Leibler divergence, we establish the global linear convergence of HPMD applied to any MDP instance, for both the optimality gap and a weighted distance to the set of optimal policies. We then unveil a phase transition: both quantities exhibit local acceleration and converge at a superlinear rate once the optimality gap falls below a certain instance-dependent threshold. Combining local acceleration with diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover finite-time exact convergence for some commonly used Bregman divergences, implying the continued convergence of HPMD to the limiting policy even when the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that when a generative model is available for policy evaluation, with a small enough \(\epsilon_0\), for any target precision \(\epsilon \le \epsilon_0\), an \(\epsilon\)-optimal policy can be learned with \(\widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|/\epsilon_0^2)\) samples with probability \(1 - \mathcal{O}(\epsilon_0^{1/3})\).
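To make the flavor of the update concrete, the following is a minimal, self-contained Python sketch of a KL-instantiated policy-mirror-descent step with a diminishing regularization term, in the spirit of HPMD, run on a small randomly generated tabular MDP with exact policy evaluation. The MDP, the cost-minimization convention, and the stepsize/regularization schedules (\(\eta_k\), \(\tau_k\)) are illustrative assumptions, not the paper's prescribed algorithm or parameter choices.

```python
import numpy as np

# Illustrative sketch only: KL-instantiated policy mirror descent with a
# diminishing regularization weight, on an assumed random tabular MDP.
gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
c = rng.uniform(size=(n_states, n_actions))                       # cost c[s, a]

def q_values(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = c_pi, then form Q."""
    c_pi = np.einsum("sa,sa->s", pi, c)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return c + gamma * np.einsum("sap,p->sa", P, V)

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
for k in range(50):
    eta = 2.0 ** k      # assumed geometrically increasing stepsize
    tau = 2.0 ** (-k)   # assumed geometrically diminishing regularization
    Q = q_values(pi)
    # Closed-form KL mirror-descent step with an entropy regularizer whose
    # weight tau shrinks over iterations:
    #   pi_{k+1}(a|s) ∝ [pi_k(a|s) * exp(-eta * Q(s, a))]^(1 / (1 + eta * tau))
    logits = (np.log(np.maximum(pi, 1e-300)) - eta * Q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions of the final policy:", pi.argmax(axis=1))
```

As the regularization weight vanishes, the sketch reduces to an unregularized mirror-descent (natural policy gradient type) step; the diminishing regularizer is what biases the iterates toward a particular optimal policy, mirroring the maximal-entropy limit described in the abstract.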




Funding
The research has been partially supported by NSF Grants CCF-1909298 and DMS-2134037.
Ethics declarations
Conflict of interest
We acknowledge the submission policy and declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Lan, G. & Zhao, T. Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Math. Program. 207, 457–513 (2024). https://doi.org/10.1007/s10107-023-02017-4
DOI: https://doi.org/10.1007/s10107-023-02017-4