SCC-rFMQ: a multiagent reinforcement learning method in cooperative Markov games with continuous actions

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

Although many multiagent reinforcement learning (MARL) methods have been proposed for learning optimal solutions in continuous-action domains, multiagent cooperation domains with independent learners (ILs) have received relatively little attention, especially in the traditional RL setting. In this paper, we propose a sample-based independent learning method, named Sample Continuous Coordination with recursive Frequency Maximum Q-Value (SCC-rFMQ), which divides the multiagent cooperative problem with continuous actions into two layers. The first layer samples a finite set of actions from the continuous action space by a re-sampling mechanism with variable exploratory rates, and the second layer evaluates the actions in the sampled action set and updates the policy using a cooperative reinforcement learning method. By constructing cooperative mechanisms at both layers, SCC-rFMQ can effectively handle cooperative Markov games with continuous actions. The effectiveness of SCC-rFMQ is demonstrated experimentally on two well-designed games, i.e., a continuous version of the climbing game and a cooperative version of the boat problem. Experimental results show that SCC-rFMQ outperforms other reinforcement learning algorithms.

Notes

  1. https://github.com/zcchenvy/SCC-rFMQ

  2. https://github.com/marlbenchmark/on-policy.

References

  1. Chevalier-Boisvert M, Willems L, Pal S (2018) Minimalistic gridworld environment for openai gym. GitHub repository, GitHub. https://github.com/maximecb/gym-minigrid

  2. Chu T, Wang J, Codecà L, Li Z (2019) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans Intell Transport Syst 21(3):1086–1095

  3. Foerster JN, Farquhar G, Afouras T, Nardelli N, Whiteson S (2018) Counterfactual multi-agent policy gradients. In: Thirty-Second AAAI Conference on artificial intelligence, vol 32, no. 1. AAAI

  4. Ganapathi Subramanian S, Poupart P, Taylor ME, Hegde N (2020) Multi type mean field reinforcement learning. In: Proceedings of the 19th International Conference on autonomous agents and multiagent systems. AAMAS, pp 411–419

  5. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press

  6. Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on machine learning. PMLR, pp 1856–1865

  7. Hao X, Wang W, Hao J, Yang Y (2019) Independent generative adversarial self-imitation learning in cooperative multiagent systems. In: Proceedings of the 18th International Conference on autonomous agents and multiagent systems. AAMAS, pp 1315–1323

  8. Jong SD, Tuyls K, Verbeeck K (2008) Artificial agents learning human fairness. In: Proceedings of the 7th International Joint Conference on autonomous agents and multiagent systems. AAMAS, pp 863–870

  9. Jouffe L (1998) Fuzzy inference system learning by reinforcement methods. IEEE Trans Syst Man Cybern Part C 28(3):338–355

  10. Konda VR, Tsitsiklis JN (2003) Actor-critic algorithms. In: Advances in neural information processing systems, pp 1008–1014

  11. Lauer M, Riedmiller M (2000) An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: Proceedings of the Seventeenth International Conference on machine learning. Citeseer

  12. Lauer M, Riedmiller MA (2000) An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: Proceedings of the 17th International Conference on machine learning. ICML, pp 535–542

  13. Lazaric A, Restelli M, Bonarini A (2007) Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In: Conference on neural information processing systems. NeurIPS, pp 833–840

  14. Li D, Yang Q, Yu W, An D, Zhang Y, Zhao W (2020) Towards differential privacy-based online double auction for smart grid. IEEE Trans Inf Forensics and Secur 15:971–986. https://doi.org/10.1109/TIFS.2019.2932911

  15. Li H, Wu Y, Chen M (2020) Adaptive fault-tolerant tracking control for discrete-time multiagent systems via reinforcement learning algorithm. IEEE Trans Cybern 51(99):1–12

  16. Liang H, Liu G, Zhang H, Huang T (2020) Neural-network-based event-triggered adaptive control of nonaffine nonlinear multiagent systems with dynamic uncertainties. IEEE Trans Neural Netw Learn Syst 32(5):2239–2250

  17. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971

  18. Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in neural information processing systems, vol 30. Curran Associates, Inc, pp 6379–6390

  19. Matignon L, Laurent GJ, Fort-Piat NL (2007) Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In: IEEE/RSJ International Conference on intelligent robots and systems IROS. IEEE, pp 64–69

  20. Matignon L, Laurent GJ, Le Fort-Piat N (2012) Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowl Eng Rev 27(1):1–31

  21. Meng J, Williams D, Shen C (2015) Channels matter: multimodal connectedness, types of co-players and social capital for multiplayer online battle arena gamers. Comput Hum Behav 52:190–199

  22. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  23. Omidshafiei S, Pazis J, Amato C, How JP, Vian J (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: Proceedings of the 34th International Conference on machine learning, vol 70. JMLR.org, pp 2681–2690

  24. Palmer G, Savani R, Tuyls K (2019) Negative update intervals in deep multi-agent reinforcement learning. In: Proceedings of the 18th International Conference on autonomous agents and multiagent systems, pp 43–51. International Foundation for Autonomous Agents and Multiagent Systems

  25. Palmer G, Tuyls K, Bloembergen D, Savani R (2018) Lenient multi-agent deep reinforcement learning. In: Proceedings of the 17th International Conference on autonomous agents and multiagent systems, pp 443–451. International Foundation for Autonomous Agents and Multiagent Systems

  26. Pan Y, Du P, Xue H, Lam HK (2020) Singularity-free fixed-time fuzzy control for robotic systems with user-defined performance. IEEE Trans Fuzzy Syst. https://doi.org/10.1109/TFUZZ.2020.2999746

  27. Panait L, Sullivan K, Luke S (2006) Lenient learners in cooperative multiagent systems. In: Proceedings of the 5th International Joint Conference on autonomous agents and multiagent systems. AAMAS, pp 801–803

  28. Peters J, Schaal S (2008) Reinforcement learning of motor skills with policy gradients. Neural Netw 21(4):682–697

  29. Rashid T, Samvelyan M, Witt CS, Farquhar G, Foerster J, Whiteson S (2018) Qmix: monotonic value function factorization for deep multi-agent reinforcement learning. In: International Conference on machine learning. PMLR, pp 4295–4304

  30. Riedmiller M, Gabel T, Hafner R, Lange S (2009) Reinforcement learning for robot soccer. Auton Robots 27(1):55–73

  31. Saha Ray S (2016) Numerical analysis with algorithms and programming. CRC Press, Taylor & Francis Group, Boca Raton

  32. Sallans B, Hinton GE (2004) Reinforcement learning with factored states and actions. J Mach Learn Res 5:1063–1088

  33. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. In: International Conference on machine learning. PMLR, pp 1889–1897

  34. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  35. Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International Conference on machine learning. PMLR, pp 5887–5896

  36. Sukhbaatar S, Fergus R et al (2016) Learning multiagent communication with backpropagation. In: Advances in neural information processing systems. NeurIPS, pp 2244–2252

  37. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press

  38. Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th Annual International Conference on Machine Learning. PMLR, pp 993–1000

  39. Tang H, Houthooft R, Foote D, Stooke A, Chen X, Duan Y, Schulman J, De Turck F, Abbeel P (2017) #Exploration: a study of count-based exploration for deep reinforcement learning. In: 31st Conference on neural information processing systems (NIPS), vol 30, pp 1–18

  40. Thathachar ML, Sastry PS (2002) Varieties of learning automata: an overview. IEEE Trans Syst Man Cybern Part B Cybern 32(6):711–722

  41. Wei E, Luke S (2016) Lenient learning in independent-learner stochastic cooperative games. J Mach Learn Res 17(1):2914–2955

  42. Wen C, Yao X, Wang Y, Tan X (2020) Smix (\(\lambda\)): enhancing centralized value functions for cooperative multi-agent reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, pp 7301–7308

  43. Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J (2018) Mean field multi-agent reinforcement learning. In: The 35th International Conference on machine learning. PMLR, pp 5571–5580

  44. Yu C, Velu A, Vinitsky E, Wang Y, Bayen A, Wu Y (2021) The surprising effectiveness of ppo in cooperative multi-agent games. arXiv preprint arXiv:2103.01955

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2020YFB1006102), the National Natural Science Foundation of China (Nos. 61906027 and 61906135), and the China Postdoctoral Science Foundation Funded Project (No. 2019M661080).

Author information

Corresponding author

Correspondence to Chengwei Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Bilinear interpolation

Using the bilinear interpolation technique [31], we construct the continuous game models extended from the CG (PSCG) game. Bilinear interpolation is an extension of linear interpolation for interpolating functions of two variables on a rectilinear 2D grid. The key idea is to perform linear interpolation first in one direction, and then again in the other direction (see Fig. 10).

Fig. 10 The four red dots show the data points and the green dot is the point at which we want to interpolate

Suppose that we want to find the value of the unknown function f at the point \((x, y)\). It is assumed that we know the value of f at the four points \(Q_{11}=(x_1, y_1)\), \(Q_{12}=(x_1,y_2)\), \(Q_{21}=(x_2,y_1)\), and \(Q_{22}=(x_2,y_2)\). We first do linear interpolation in the x-direction:

$$\begin{aligned}\begin{array}{l} f\left( x,{y_1}\right) = \frac{{{x_2} - x}}{{{x_2} - {x_1}}}f\left( {Q_{11}}\right) + \frac{{x - {x_1}}}{{{x_2} - {x_1}}}f\left( {Q_{21}}\right) \\ f\left( x,{y_2}\right) = \frac{{{x_2} - x}}{{{x_2} - {x_1}}}f\left( {Q_{12}}\right) + \frac{{x - {x_1}}}{{{x_2} - {x_1}}}f\left( {Q_{22}}\right) \end{array}\end{aligned}$$

then proceed by interpolating in the y-direction to obtain the desired estimate:

$$\begin{aligned}f(x,y) = \frac{{{y_2} - y}}{{{y_2} - {y_1}}}f\left( x,{y_1}\right) + \frac{{y - {y_1}}}{{{y_2} - {y_1}}}f\left( x,{y_2}\right) \end{aligned}$$
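As a concrete illustration of the two interpolation steps above, the following Python sketch evaluates f at \((x, y)\) from its values at the four corner points. The function name and the example corner values are ours for illustration and are not taken from the paper.

```python
def bilinear_interpolate(x, y, x1, x2, y1, y2, f11, f21, f12, f22):
    """Interpolate f at (x, y) from its values at the four corners
    Q11=(x1, y1), Q21=(x2, y1), Q12=(x1, y2), Q22=(x2, y2)."""
    # Step 1: linear interpolation in the x-direction, at y1 and at y2.
    f_x_y1 = (x2 - x) / (x2 - x1) * f11 + (x - x1) / (x2 - x1) * f21
    f_x_y2 = (x2 - x) / (x2 - x1) * f12 + (x - x1) / (x2 - x1) * f22
    # Step 2: linear interpolation in the y-direction between the two results.
    return (y2 - y) / (y2 - y1) * f_x_y1 + (y - y1) / (y2 - y1) * f_x_y2


# Example: interpolate inside the unit square with illustrative corner values.
print(bilinear_interpolate(0.25, 0.75, 0.0, 1.0, 0.0, 1.0,
                           f11=11.0, f21=-30.0, f12=-30.0, f22=7.0))
```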

Parameter setting

Table 1 lists the parameters used for the results presented in this section. Together with the detailed algorithmic description of SCC-rFMQ, these parameter settings make all of the presented results reproducible. For MAPPO, we used the official code and parameters directly (GitHub, Footnote 2). For the other algorithms, we selected the values with the best performance after extensive simulations. In Table 1, \(\epsilon\) and \(\epsilon _i^{re}(s)\) are strategy variables for SCC-rFMQ and rFMQ, defined as functions that decrease as the number of learning trials t increases. The sets \(A(0)_n\) defined in SCC-rFMQ, SMC, and rFMQ are initial action sets evenly sampled from the action space [0, 1], where n is the sample-set size. For SMC and rFMQ, the common parameters (e.g., \(\alpha _Q\) and \(\gamma\)) are set to the same values as in SCC-rFMQ for a fair comparison. The definitions of \(\sigma\) and \(\tau\) are the same as in [12], and those of \(\lambda\) and \(\sigma _L\) are the same as in [8]. The parameter settings of MADDPG and DDPG are the same as in [18], where policies and critics are parameterized by a two-layer ReLU MLP followed by a fully connected output layer (with a tanh activation for the policy networks). The parameter settings of L-DDQN and H-DDQN are the same as in [25]. It should be noted that, in our experiments, parameter changes did not significantly affect the conclusions. For SMC+rFMQ, we use rFMQ with an \(\epsilon\)-greedy strategy to learn the Q values and use the SMC resampling strategy to update the action set every \(c=200\) episodes. The weight \(w_{i}^{t+1}(s,a)\) used in the SMC resampling strategy is calculated by the Boltzmann exploration strategy, where \(\varDelta {\mathrm{Q}}_i^{t + 1}(s,a) = {\mathrm{Q}}_i^{t + 1}(s,a) - {\mathrm{Q}}_i^t(s,a)\).

Table 1 Parameter setting
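For concreteness, the following Python sketch shows one standard way to compute such Boltzmann (softmax) resampling weights from the \(\varDelta Q\) values of a sampled action set. The function name, the temperature value, and the example inputs are ours for illustration and are not taken from the paper or its released code.

```python
import numpy as np


def boltzmann_resampling_weights(delta_q, tau=0.1):
    """Softmax weights over a sampled action set, assuming w(a) is
    proportional to exp(dQ(s, a) / tau); names here are illustrative."""
    logits = np.asarray(delta_q, dtype=float) / tau
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


# Example: delta-Q values for a sampled action set of size n = 5.
print(boltzmann_resampling_weights([0.2, -0.1, 0.05, 0.0, 0.3], tau=0.1))
```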

About this article

Cite this article

Zhang, C., Han, Z., Liu, B. et al. SCC-rFMQ: a multiagent reinforcement learning method in cooperative Markov games with continuous actions. Int. J. Mach. Learn. & Cyber. 13, 1927–1944 (2022). https://doi.org/10.1007/s13042-021-01497-0
