Layered Controller Synthesis for Dynamic Multi-agent Systems

  • Conference paper
  • In: Formal Modeling and Analysis of Timed Systems (FORMATS 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14138)

Abstract

In this paper we present a layered approach to the multi-agent control problem, decomposed into three stages, each building upon the results of the previous one. First, a high-level plan for a coarse abstraction of the system is computed, relying on parametric timed automata augmented with stopwatches, as they allow us to efficiently model the simplified dynamics of such systems; this high-level plan mainly handles the combinatorial aspects of the problem. In the second stage, an SMT formulation refines the high-level plan into a more dynamically accurate solution. These stages are collectively referred to as the SWA-SMT solver. They are correct by construction but lack a crucial feature: they cannot be executed in real time. To overcome this, we use SWA-SMT solutions as the initial training dataset for our last stage, which aims at obtaining a neural network control policy. We use reinforcement learning to train the policy, and show that the initial dataset is crucial for the overall success of the method.

This work was partially funded by ANR project TickTac (ANR-18-CE40-0015).


Notes

  1. The full construction is detailed in Appendix A of the extended version [14].

  2. The full description is presented in Appendix B of the extended version [14].

  3. Generating a successful SWA-SMT trace takes on average about 15 s on an Intel i5-1235U with 16 GB of RAM. Note that there is high variance in this runtime, ranging from under a second to several minutes. A timeout was set to 900 s.

  4. Using the xpag RL library [30], with a single Intel Core i7 CPU, 32 GB of RAM, and a single NVIDIA Quadro P3000 GPU, the training took between 40 and 50 min per million steps.

References

  1. Almeida, L.B.: Multilayer perceptrons. In: Handbook of Neural Computation, pp. C1.2:1–C1.2:30 (1997)

  2. Alur, R., et al.: The algorithmic analysis of hybrid systems. Theoret. Comput. Sci. 138(1), 3–34 (1995). ISSN 0304-3975. https://doi.org/10.1016/0304-3975(94)00202-T. https://www.sciencedirect.com/science/article/pii/030439759400202T. Accessed 03 Oct 2023

  3. Alur, R., Dill, D.L.: A theory of timed automata. Theoret. Comput. Sci. 126(2), 183–235 (1994). ISSN 0304-3975. https://doi.org/10.1016/0304-3975(94)90010-8. https://www.sciencedirect.com/science/article/pii/0304397594900108

  4. André, É.: IMITATOR 3: synthesis of timing parameters beyond decidability. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp. 552–565. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81685-8_26

  5. André, É.: What’s decidable about parametric timed automata? Int. J. Softw. Tools Technol. Transf. 21(2), 203–219 (2019)

  6. André, É., Lime, D., Roux, O.H.: Decision problems for parametric timed automata. In: Ogata, K., Lawford, M., Liu, S. (eds.) ICFEM 2016. LNCS, vol. 10009, pp. 400–416. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47846-3_25

  7. Behrmann, G., et al.: UPPAAL 4.0. In: 3rd International Conference on the Quantitative Evaluation of Systems, QEST 2006, Riverside, California, USA, September 2006, pp. 125–126. IEEE Computer Society (2006). https://doi.org/10.1109/QEST.2006.59

  8. Behrmann, G., Cougnard, A., David, A., Fleury, E., Larsen, K.G., Lime, D.: UPPAAL-Tiga: time for playing games! In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 121–125. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73368-3_14

  9. Bellemare, M., et al.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems. vol. 29 (2016). https://proceedings.neurips.cc/paper_files/paper/2016/file/afda332245e2af431fb7b672a68b659d-Paper.pdf

  10. van den Berg, J.P., Lin, M.C., Manocha, D.: Reciprocal velocity obstacles for real-time multi-agent navigation. In: 2008 IEEE International Conference on Robotics and Automation, ICRA 2008, Pasadena, USA, pp. 1928–1935 (2008). https://doi.org/10.1109/ROBOT.2008.4543489

  11. Bøgh, S., et al.: Distributed fleet management in noisy environments via model-predictive control. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 32, pp. 565–573 (2022)

  12. Brand, D., Zafiropulo, P.: On communicating finite-state machines. J. ACM 30(2), 323–342 (1983). https://doi.org/10.1145/322374.322380

  13. Chen, J., et al.: Scalable and safe multi-agent motion planning with nonlinear dynamics and bounded disturbances. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11237–11245 (2021)

  14. Clement, E., Perrin-Gilbert, N., Schlehuber-Caissier, P.: Layered controller synthesis for dynamic multi-agent systems (2023). arXiv:2307.06758 [cs.AI]

  15. Colombo, A., Del Vecchio, D.: Efficient algorithms for collision avoidance at intersections. In: Dang, T., Mitchell, I.M. (eds.) Hybrid Systems: Computation and Control, HSCC 2012, Beijing, China, pp. 145–154 (2012). https://doi.org/10.1145/2185632.2185656

  16. Dorri, A., Kanhere, S.S., Jurdak, R.: Multi-agent systems: a survey. IEEE Access 6, 28573–28593 (2018)

  17. Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. Int. J. Robot. Res. 17(7), 760–772 (1998). https://doi.org/10.1177/027836499801700706

  18. Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)

  19. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR (2018)

  20. Henzinger, T.A., et al.: What’s decidable about hybrid automata? J. Comput. Syst. Sci. 57(1), 94–124 (1998). https://doi.org/10.1006/jcss.1998.1581

  21. Herbreteau, F., Point, G.: The TChecker tool and libraries. https://github.com/ticktac-project/tchecker

  22. Hilscher, M., Linker, S., Olderog, E.-R.: Proving safety of traffic manoeuvres on country roads. In: Liu, Z., Woodcock, J., Zhu, H. (eds.) Theories of Programming and Formal Methods. LNCS, vol. 8051, pp. 196–212. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39698-4_12

  23. Hilscher, M., Schwammberger, M.: An abstract model for proving safety of autonomous urban traffic. In: Sampaio, A., Wang, F. (eds.) ICTAC 2016. LNCS, vol. 9965, pp. 274–292. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46750-4_16

  24. Hilscher, M., Linker, S., Olderog, E.-R., Ravn, A.P.: An abstract model for proving safety of multi-lane traffic manoeuvres. In: Qin, S., Qiu, Z. (eds.) ICFEM 2011. LNCS, vol. 6991, pp. 404–419. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24559-6_28

  25. Hune, T., et al.: Linear parametric model checking of timed automata. J. Logic Algebraic Program. 52, 183–220 (2002)

  26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015). http://dblp.uni-trier.de/db/conf/iclr/iclr2015.html#KingmaB14

  27. Kress-Gazit, H., Fainekos, G.E., Pappas, G.J.: Where’s Waldo? Sensor-based temporal logic motion planning. In: 2007 IEEE International Conference on Robotics and Automation, ICRA 2007, 10–14 April 2007, Roma, Italy, pp. 3116–3121. IEEE (2007). https://doi.org/10.1109/ROBOT.2007.363946

  28. Li, X., Ma, Y., Belta, C.: A policy search method for temporal logic specified reinforcement learning tasks. In: 2018 Annual American Control Conference (ACC), pp. 240–245. IEEE (2018)

  29. Loos, S.M., Platzer, A.: Safe intersections: at the crossing of hybrid systems and verification. In: 14th International IEEE Conference on Intelligent Transportation Systems, ITSC 2011, Washington, DC, USA, pp. 1181–1186. IEEE (2011). https://doi.org/10.1109/ITSC.2011.6083138

  30. Perrin-Gilbert, N.: xpag: a modular reinforcement learning library with JAX agents (2022). https://github.com/perrin-isir/xpag

  31. Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal difference learning with function approximation. In: ICML, pp. 417–424 (2001)

  32. Queffelec, A.: Connected multi-agent path finding: how robots get away with texting and driving. Ph.D. thesis. University of Rennes, France (2021). https://tel.archives-ouvertes.fr/tel-03517091

  33. Stern, R.: Multi-agent path finding - an overview. In: Artificial Intelligence: 5th RAAI Summer School, Tutorial Lectures, Dolgoprudny, Russia, 4–7 July 2019, pp. 96–115 (2019)

Author information

Correspondence to Emily Clement, Nicolas Perrin-Gilbert or Philipp Schlehuber-Caissier.

Appendices

A Markov Decision Process for the Running Example

The Markov Decision Process is defined by its state space S, action space A, initial state distribution \(p(s_0 \in S)\), reward function \(r(s_t \in S, a_t \in A, s_{t+1} \in S)\) and deterministic transition function \(s_{t+1} = \texttt {step}(s_t, a_t)\).

We describe here all the elements of the MDP defined for our running example:

  • The state space (\(\mathbb {R}^{720}\)). The environment contains 3 paths, which are unions of sections, and as detailed in Sect. 2, we have imposed a maximum of 3 cars per path, so there are at most 9 cars, to which we can attribute unique identifiers (we use \(\{-1, 0, 1\}^2\)). The state is entirely defined by the speed and position of each car. We could thus use vectors of size 18 to represent states, but instead we chose a sparser representation with a better structure. For consistency, we use the same road network as presented in Sect. 2, which is composed of 24 sections (since every car has a dedicated initial and goal node subdividing the sections containing initial and goal positions). On this road network, three different paths are defined, with each section shared by at most 2 paths. At any given time, any section may contain at most 6 cars by construction. For each section, we define a list of 6 tuples, all equal to (0, 0, (0, 0), 0) if no car is currently inside the section. If there are cars in the section, say 2 cars for example, then the first two tuples have this structure:

    $$ (\texttt {position~within~the~section}, \texttt {normalized~velocity}, \texttt {car~identifier}, 1) $$

    We represent states as a concatenation of the values of all these tuples for all 24 sections, which amounts to a vector of size 720. It is a sparse representation, but its advantage is that it makes it easy to find cars close to each other, as they are either in the same section or in neighboring sections.

  • The action space (\(\mathbb {R}^9\)) and transition dynamics. Given an ordering of the 9 cars, an action is simply a vector of 9 accelerations. If \(a_i\) is the acceleration of car i, and if at the current time step its position within its path is \(p_i\) and its speed is \(v_i\), then at the next time step its position will be \(p_i + v_i\) and its speed will be \(v_i + a_i\). This defines the transition dynamics of the MDP. The components of an action corresponding to cars that are not present in the state are simply ignored. Remark: actions can be computed straightforwardly from a sequence of states, as they are equal to the differences between consecutive speeds for each car (see the code sketch after this list).

  • The reward. When all cars have reached their destination, i.e. crossed the end of their path, a reward of 2000 is given and the episode is terminated. Besides, when there is either a collision (a violation of the safety distance between two cars) or two cars facing each other in opposite directions in the same section, a negative reward (−100) is given and the episode is terminated. Finally, at each time step, two positive rewards are given: one proportional to the average velocity of the cars (to encourage cars to go fast), and one proportional to the (clamped) minimum distance between all cars (to encourage cars to stay far from each other). We set the maximum number of time steps per episode to 85, and adjust these rewards so that an episode cannot reach a cumulative reward of 2000 unless it is truly successful and gets the final +2000 reward.

  • The initial state distribution. We define an arbitrary initial state distribution in which each of the 9 cars has an 80% chance of being present. The speed of each car is defined randomly, and positions are also defined randomly (within roughly the first two thirds of each path). Safety distances are ensured, so that the initial states are not in collision; however, speeds may be such that there will be a collision after the first time step, so there is no guarantee of feasibility.
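
The following sketch is an illustration of the MDP elements described above, not the authors' implementation. The layout of the helper structure cars_by_section and the shaping weights w_vel and w_dist are hypothetical placeholders; only the dimensions, the per-car dynamics, and the reward values come from the description above.

```python
import numpy as np

N_SECTIONS = 24          # sections of the road network
SLOTS_PER_SECTION = 6    # at most 6 cars per section by construction
VALUES_PER_SLOT = 5      # position, velocity, 2-dim identifier, presence flag
STATE_DIM = N_SECTIONS * SLOTS_PER_SECTION * VALUES_PER_SLOT  # = 720


def encode_state(cars_by_section):
    """Sparse state encoding: cars_by_section is a list of 24 lists, each
    holding up to 6 tuples (position_within_section, normalized_velocity,
    (id_x, id_y)). Empty slots stay at the all-zero tuple."""
    state = np.zeros((N_SECTIONS, SLOTS_PER_SECTION, VALUES_PER_SLOT))
    for s, cars in enumerate(cars_by_section):
        for slot, (pos, vel, (id_x, id_y)) in enumerate(cars):
            state[s, slot] = (pos, vel, id_x, id_y, 1.0)  # presence flag = 1
    return state.reshape(STATE_DIM)


def step_car(p_i, v_i, a_i):
    """One-step transition of a single car along its path (unit time step):
    new position p_i + v_i, new speed v_i + a_i."""
    return p_i + v_i, v_i + a_i


def actions_from_speeds(speeds):
    """Recover actions from a sequence of per-car speed vectors: they are
    the differences between consecutive speeds."""
    speeds = np.asarray(speeds)
    return speeds[1:] - speeds[:-1]


def reward(all_arrived, collision, mean_velocity, min_distance,
           w_vel=1.0, w_dist=1.0):
    """Reward structure described above; w_vel and w_dist are placeholder
    weights (the paper tunes them so that only a truly successful episode
    can reach a cumulative reward of 2000)."""
    if all_arrived:
        return 2000.0
    if collision:
        return -100.0
    return w_vel * mean_velocity + w_dist * min_distance
```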

B Hyperparameters of the RL Algorithms

For TD3:

  • Actor network architecture: multi-layer perceptron (MLP) [1] with 3 hidden layers of size 256 and rectified linear unit (ReLU) activation functions.

  • Actor optimizer: ADAM [26], actor learning rate: \(10^{-3}\)

  • Critic network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Critic optimizer: ADAM, critic learning rate: \(10^{-3}\)

  • Discount factor: 0.99

  • Soft update coefficient (\(\tau \)): 0.05

For TD3BC:

  • Actor network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Actor optimizer: ADAM, actor learning rate: \(10^{-3}\)

  • Critic network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Critic optimizer: ADAM, critic learning rate: \(10^{-3}\)

  • Discount factor: 0.99

  • Soft update coefficient (\(\tau \)): 0.05

  • \(\alpha \): 2.5
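
As an illustration only, the actor architecture and optimizer listed above could be instantiated with Flax and Optax as follows. This is an assumption for concreteness, not the authors' code (the experiments rely on the xpag library [30]); the 9-dimensional output matches the action space of the running example (Appendix A), and the critic uses the same hidden architecture with a scalar output.

```python
import flax.linen as nn
import optax


class Actor(nn.Module):
    """MLP actor matching the hyperparameters above: 3 hidden layers of
    size 256 with ReLU activations, followed by a linear output layer."""
    action_dim: int = 9  # one acceleration per car in the running example

    @nn.compact
    def __call__(self, x):
        for _ in range(3):
            x = nn.relu(nn.Dense(256)(x))
        return nn.Dense(self.action_dim)(x)


# Adam optimizer with the learning rate listed above (10^-3).
actor_optimizer = optax.adam(learning_rate=1e-3)
```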

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Clement, E., Perrin-Gilbert, N., Schlehuber-Caissier, P. (2023). Layered Controller Synthesis for Dynamic Multi-agent Systems. In: Petrucci, L., Sproston, J. (eds) Formal Modeling and Analysis of Timed Systems. FORMATS 2023. Lecture Notes in Computer Science, vol 14138. Springer, Cham. https://doi.org/10.1007/978-3-031-42626-1_4

  • DOI: https://doi.org/10.1007/978-3-031-42626-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42625-4

  • Online ISBN: 978-3-031-42626-1
