Deep Reinforcement Learning for Weakly Coupled MDP’s with Continuous Actions

  • Conference paper
Analytical and Stochastic Modelling Techniques and Applications (ASMTA 2024)

Abstract

This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints that depend on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which uses differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA’s robustness and efficiency in managing resource allocation while maximizing rewards.
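
To make the decoupling idea concrete, the following is a minimal, illustrative sketch (not the authors’ implementation) of how a Lagrange relaxation lets per-subproblem Q-networks be evaluated independently, and of a greedy, gradient-based allocation in the spirit of LPCA-Greedy. The network architecture, the single linear budget constraint on the summed actions, and all names (SubproblemQ, lagrangian_value, greedy_allocation, lam, budget, step) are assumptions introduced here for illustration only.

```python
# Illustrative sketch only: a hypothetical Lagrangian decoupling of a weakly
# coupled MDP with continuous actions, not the authors' LPCA code.
import torch
import torch.nn as nn


class SubproblemQ(nn.Module):
    """Q-network for one decoupled subproblem: maps (state, action) -> Q-value."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # state: (state_dim,), action: (1,) continuous resource level
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def lagrangian_value(q_nets, states, actions, lam):
    """Relaxed objective: sum_n Q_n(s_n, a_n) - lam * sum_n a_n.

    The multiplier lam prices resource usage, so each term can be
    evaluated (and trained) independently given lam.
    """
    total = torch.zeros(())
    for q, s, a in zip(q_nets, states, actions):
        total = total + q(s, a) - lam * a.sum()
    return total


def greedy_allocation(q_nets, states, lam, budget, step=0.05):
    """Greedy, gradient-guided allocation sketch: start from zero actions and
    repeatedly give a small increment to the subproblem whose relaxed
    Q-value gradient w.r.t. its action is largest, until the budget is
    spent or no gradient is positive."""
    actions = [torch.zeros(1, requires_grad=True) for _ in q_nets]
    spent = 0.0
    while spent + step <= budget:
        grads = []
        for q, s, a in zip(q_nets, states, actions):
            val = q(s, a) - lam * a.sum()
            (g,) = torch.autograd.grad(val, a)
            grads.append(g.item())
        best = max(range(len(grads)), key=lambda n: grads[n])
        if grads[best] <= 0.0:  # no subproblem benefits from more resource
            break
        with torch.no_grad():
            actions[best] += step
        spent += step
    return [a.detach() for a in actions]


# Hypothetical usage with made-up dimensions:
# q_nets = [SubproblemQ(state_dim=3) for _ in range(4)]
# states = [torch.randn(3) for _ in range(4)]
# acts = greedy_allocation(q_nets, states, lam=0.5, budget=1.0)
```

In this reading, lam acts as a price on the shared resource: each subproblem’s Q-value is evaluated separately, and only the allocation step looks across subproblems. An LPCA-DE-style variant would replace the greedy loop with a differential-evolution search over the joint action vector under the same relaxed objective.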


Acknowledgments

F. Robledo has received funding from the Department of Education of the Basque Government through the Consolidated Research Group MATHMODE (IT1456-22). This research was partially supported by the French “Agence Nationale de la Recherche (ANR)” through the project ANR-22-CE25-0013-02 (ANR EPLER) and the DST-Inria Cefipra project LION.

Author information


Corresponding author

Correspondence to Francisco Robledo.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Robledo, F., Ayesta, U., Avrachenkov, K. (2025). Deep Reinforcement Learning for Weakly Coupled MDP’s with Continuous Actions. In: Devos, A., Horváth, A., Rossi, S. (eds) Analytical and Stochastic Modelling Techniques and Applications. ASMTA 2024. Lecture Notes in Computer Science, vol 14826. Springer, Cham. https://doi.org/10.1007/978-3-031-70753-7_5

  • DOI: https://doi.org/10.1007/978-3-031-70753-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70752-0

  • Online ISBN: 978-3-031-70753-7

  • eBook Packages: Computer Science, Computer Science (R0)
