Abstract
This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints dependent on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which utilizes differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA’s robustness and efficiency in managing resource allocation while maximizing rewards.
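To make the decoupling idea concrete, the following is a minimal sketch (not the authors' implementation) of how a Lagrangian relaxation can split action selection across the sub-MDPs. It assumes per-sub-MDP Q-functions `q_funcs[i](state_i, action_i)` are already learned, each continuous action in [0, a_max] consumes an equal amount of a shared budget, and `lambda_` is the Lagrange multiplier; the budget penalty, step size, and helper names are illustrative only. The DE variant uses SciPy's `differential_evolution`; the greedy variant approximates the Q-value gradient by finite differences.

```python
# Hypothetical sketch of Lagrangian decoupling for weakly coupled MDPs with
# continuous actions; function and parameter names are assumptions, not the
# paper's API.
import numpy as np
from scipy.optimize import differential_evolution


def lagrangian_value(actions, states, q_funcs, lambda_):
    """Sum of decoupled Lagrangian Q-values: Q_i(s_i, a_i) - lambda * c(a_i),
    with the illustrative cost c(a_i) = a_i."""
    return sum(q(s, a) - lambda_ * a for q, s, a in zip(q_funcs, states, actions))


def select_actions_de(states, q_funcs, lambda_, budget, a_max=1.0):
    """LPCA-DE-style global search via differential evolution (soft budget penalty)."""
    n = len(states)

    def neg_objective(actions):
        penalty = 1e3 * max(0.0, actions.sum() - budget)  # discourage budget violation
        return -lagrangian_value(actions, states, q_funcs, lambda_) + penalty

    result = differential_evolution(neg_objective, bounds=[(0.0, a_max)] * n, seed=0)
    return result.x


def select_actions_greedy(states, q_funcs, lambda_, budget, step=0.05, a_max=1.0):
    """LPCA-Greedy-style allocation: repeatedly give one step of action to the
    sub-MDP with the largest marginal Lagrangian gain."""
    actions = np.zeros(len(states))
    remaining = budget
    while remaining >= step:
        gains = [
            (q(s, a + step) - lambda_ * (a + step)) - (q(s, a) - lambda_ * a)
            if a + step <= a_max else -np.inf
            for q, s, a in zip(q_funcs, states, actions)
        ]
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # no sub-MDP benefits from additional resource
        actions[best] += step
        remaining -= step
    return actions
```

In this sketch the coupling constraint appears only through the multiplier and the budget check, so each sub-MDP's Q-function is evaluated independently, which is the essence of the decoupling described in the abstract.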
Acknowledgments
F. Robledo has received funding from the Department of Education of the Basque Government through the Consolidated Research Group MATHMODE (IT1456-22). Research partially supported by the French “Agence Nationale de la Recherche (ANR)” through the project ANR-22-CE25-0013-02 (ANR EPLER) and DST-Inria Cefipra project LION.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Robledo, F., Ayesta, U., Avrachenkov, K. (2025). Deep Reinforcement Learning for Weakly Coupled MDP’s with Continuous Actions. In: Devos, A., Horváth, A., Rossi, S. (eds) Analytical and Stochastic Modelling Techniques and Applications. ASMTA 2024. Lecture Notes in Computer Science, vol 14826. Springer, Cham. https://doi.org/10.1007/978-3-031-70753-7_5
DOI: https://doi.org/10.1007/978-3-031-70753-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70752-0
Online ISBN: 978-3-031-70753-7
eBook Packages: Computer Science, Computer Science (R0)