Abstract
This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints dependent on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which utilizes differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA’s robustness and efficiency in managing resource allocation while maximizing rewards.
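To make the decoupling idea concrete, the following is a minimal sketch (not the authors' implementation) of how a Lagrangian relaxation can split action selection across the sub-MDPs. It assumes per-sub-MDP Q-functions `q_funcs[i](state_i, action_i)` are already learned, each continuous action in [0, a_max] consumes an equal amount of a shared budget, and `lambda_` is the Lagrange multiplier; the budget penalty, step size, and helper names are illustrative only. The DE variant uses SciPy's `differential_evolution`; the greedy variant approximates the Q-value gradient by finite differences.

```python
# Hypothetical sketch of Lagrangian decoupling for weakly coupled MDPs with
# continuous actions; function and parameter names are assumptions, not the
# paper's API.
import numpy as np
from scipy.optimize import differential_evolution


def lagrangian_value(actions, states, q_funcs, lambda_):
    """Sum of decoupled Lagrangian Q-values: Q_i(s_i, a_i) - lambda * c(a_i),
    with the illustrative cost c(a_i) = a_i."""
    return sum(q(s, a) - lambda_ * a for q, s, a in zip(q_funcs, states, actions))


def select_actions_de(states, q_funcs, lambda_, budget, a_max=1.0):
    """LPCA-DE-style global search via differential evolution (soft budget penalty)."""
    n = len(states)

    def neg_objective(actions):
        penalty = 1e3 * max(0.0, actions.sum() - budget)  # discourage budget violation
        return -lagrangian_value(actions, states, q_funcs, lambda_) + penalty

    result = differential_evolution(neg_objective, bounds=[(0.0, a_max)] * n, seed=0)
    return result.x


def select_actions_greedy(states, q_funcs, lambda_, budget, step=0.05, a_max=1.0):
    """LPCA-Greedy-style allocation: repeatedly give one step of action to the
    sub-MDP with the largest marginal Lagrangian gain."""
    actions = np.zeros(len(states))
    remaining = budget
    while remaining >= step:
        gains = [
            (q(s, a + step) - lambda_ * (a + step)) - (q(s, a) - lambda_ * a)
            if a + step <= a_max else -np.inf
            for q, s, a in zip(q_funcs, states, actions)
        ]
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # no sub-MDP benefits from additional resource
        actions[best] += step
        remaining -= step
    return actions
```

In this sketch the coupling constraint appears only through the multiplier and the budget check, so each sub-MDP's Q-function is evaluated independently, which is the essence of the decoupling described in the abstract.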
Acknowledgments
F. Robledo has received funding from the Department of Education of the Basque Government through the Consolidated Research Group MATHMODE (IT1456-22). Research partially supported by the French “Agence Nationale de la Recherche (ANR)” through the project ANR-22-CE25-0013-02 (ANR EPLER) and DST-Inria Cefipra project LION.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Robledo, F., Ayesta, U., Avrachenkov, K. (2025). Deep Reinforcement Learning for Weakly Coupled MDP’s with Continuous Actions. In: Devos, A., Horváth, A., Rossi, S. (eds) Analytical and Stochastic Modelling Techniques and Applications. ASMTA 2024. Lecture Notes in Computer Science, vol 14826. Springer, Cham. https://doi.org/10.1007/978-3-031-70753-7_5
DOI: https://doi.org/10.1007/978-3-031-70753-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70752-0
Online ISBN: 978-3-031-70753-7
eBook Packages: Computer Science, Computer Science (R0)