Laser Learning Environment: A New Environment for Coordination-Critical Multi-agent Tasks

  • Conference paper
  • Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2187)

Abstract

We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment where coordination is key. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and receive no intermediate reward for accomplishing those joint actions (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms on LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritised experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.

Notes

  1. The code is available at https://github.com/yamoling/bnaic-2023-lle.

Acknowledgements

Raphaël Avalos is supported by the FWO (Research Foundation – Flanders) under grant 11F5721N. Tom Lenaerts is supported by an FWO project (grant no. G054919N) and two FRS-FNRS PDR projects (grant numbers 31257234 and 40007793). He is furthermore supported by Service Public de Wallonie Recherche under grant no. 2010235-ariac by digitalwallonia4.ai. Ann Nowé and Tom Lenaerts are also supported by the Flemish Government through the AI Research Program and by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215.

Author information

Corresponding author

Correspondence to Yannick Molinghen.

Appendices

A Hyperparameters

We performed a hyperparameter search with VDN over combinations of batch size (32, 64 and 128 transitions), replay memory size (50k, 100k and 200k transitions) and training interval (1 and 5). We then performed a hyperparameter search for prioritised experience replay over combinations of \(\alpha \) and \(\beta \), each in \(\{0.3, 0.4, 0.5, 0.6, 0.7, 0.8\}\). For random network distillation, we explored update ratios \(p \in \{0, 0.25, 0.5, 0.75 \}\).
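
As a rough illustration of how such a grid search can be organised (this is not the training code from the repository), the sketch below simply enumerates the combinations listed above; the train_vdn function is a hypothetical placeholder for a full VDN training run on LLE.

```python
from itertools import product

# Searched values, taken from the text above.
BATCH_SIZES = [32, 64, 128]                 # transitions per gradient update
MEMORY_SIZES = [50_000, 100_000, 200_000]   # replay buffer capacity, in transitions
TRAIN_INTERVALS = [1, 5]                    # environment steps between updates


def train_vdn(batch_size: int, memory_size: int, train_interval: int) -> float:
    """Hypothetical placeholder: train VDN on LLE with the given
    hyperparameters and return the final evaluation score."""
    raise NotImplementedError


def grid_search() -> tuple[int, int, int]:
    """Return the best (batch_size, memory_size, train_interval) combination."""
    scores = {}
    for batch_size, memory_size, train_interval in product(
        BATCH_SIZES, MEMORY_SIZES, TRAIN_INTERVALS
    ):
        scores[(batch_size, memory_size, train_interval)] = train_vdn(
            batch_size, memory_size, train_interval
        )
    return max(scores, key=scores.get)
```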

Table 2. Hyperparameters used across all the experiments

B Neural Networks Architectures

1.1 B.1 Q-Network

The Q-network consists of two parts joined by an interconnect: a convolutional neural network of three layers, a flattening step, and finally three linear layers. The architecture is detailed in Table 3.
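
Since the contents of Table 3 are not reproduced here, the PyTorch sketch below only mirrors the structure described above (three convolutional layers, a flatten step, three linear layers); the channel counts, kernel sizes and hidden widths are illustrative assumptions rather than the values from the table.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Three conv layers, a flattening interconnect, then three linear layers.
    All layer sizes below are illustrative placeholders for those in Table 3."""

    def __init__(self, in_channels: int, height: int, width: int, n_actions: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat_size = 64 * height * width  # padding=1 preserves the spatial size
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat_size, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, channels, height, width) -> Q-values: (batch, n_actions)
        return self.head(self.cnn(obs))
```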

Table 3. Q-network architecture

1.2 B.2 Random Network Distillation

The random network used for the intrinsic reward computation is convolutional, similar to the Q-network, and is depicted in Table 4. The frozen random network (the target) consists of the first part of the table, with an output of size 512. The optimised network (the predictor) has an additional tail with one ReLU activation and one linear layer.
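
As a rough sketch of this target/predictor split, the code below follows the standard random network distillation recipe: the intrinsic reward is the prediction error of the frozen target. The channel counts, kernel sizes and the width of the predictor's extra linear layer are assumptions; only the 512-dimensional embedding and the ReLU + linear tail come from the description above, and the actual architecture is given in Table 4.

```python
import torch
import torch.nn as nn


def conv_trunk(in_channels: int, flat_size: int, embedding_dim: int = 512) -> nn.Sequential:
    """Convolutional trunk ending in a 512-dimensional embedding (cf. Table 4).
    Channel counts and kernel sizes are illustrative assumptions."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(flat_size, embedding_dim),
    )


class RND(nn.Module):
    def __init__(self, in_channels: int, flat_size: int):
        # flat_size should equal 64 * grid_height * grid_width for this trunk.
        super().__init__()
        # Frozen random target network with a 512-dimensional output.
        self.target = conv_trunk(in_channels, flat_size)
        for param in self.target.parameters():
            param.requires_grad_(False)
        # Predictor: the same trunk plus one ReLU and one linear layer.
        self.predictor = nn.Sequential(
            conv_trunk(in_channels, flat_size),
            nn.ReLU(),
            nn.Linear(512, 512),
        )

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Prediction error of the frozen target acts as a novelty signal.
        with torch.no_grad():
            target_embedding = self.target(obs)
        return (self.predictor(obs) - target_embedding).pow(2).mean(dim=-1)
```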

Table 4. Random network architecture

C Results of n-Steps Returns with VDN

We plot in Fig. 6 the score and exit rate over the course of training on level 6, illustrated in Fig. 1. The agents are trained with VDN and n-step returns, using the hyperparameters shown in Table 2. These results show that higher values of n yield worse results, as discussed in Sect. 5.3.
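
For reference, the textbook n-step return target for Q-learning (in standard notation; the exact bootstrapping used in the implementation may differ) is

\[ G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} \;+\; \gamma^{n} \max_{a} Q\!\left(s_{t+n}, a\right), \]

where \(\gamma\) is the discount factor.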

Fig. 6. Scores and exit rates for different values of n in n-step returns. Results are averaged over 10 different seeds and shown ± one standard deviation.

D Maps Provided by LLE

LLE comes with six predefined levels illustrated in Fig. 7.

Fig. 7. The six standard levels of LLE, designed with increasing degrees of interdependence in mind. Level 6 is the main level studied in this work.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Molinghen, Y., Avalos, R., Van Achter, M., Nowé, A., Lenaerts, T. (2025). Laser Learning Environment: A New Environment for Coordination-Critical Multi-agent Tasks. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_8

  • DOI: https://doi.org/10.1007/978-3-031-74650-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74649-9

  • Online ISBN: 978-3-031-74650-5
