Layered Controller Synthesis for Dynamic Multi-agent Systems

  • Conference paper
  • In: Formal Modeling and Analysis of Timed Systems (FORMATS 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14138)

Abstract

In this paper we present a layered approach to the multi-agent control problem, decomposed into three stages, each building upon the results of the previous one. First, a high-level plan for a coarse abstraction of the system is computed, relying on parametric timed automata augmented with stopwatches, as they allow us to efficiently model the simplified dynamics of such systems; this high-level plan mainly handles the combinatorial aspects of the problem. In the second stage, an SMT formulation refines the high-level plan into a more dynamically accurate solution. These stages are collectively referred to as the SWA-SMT solver. They are correct by construction but lack a crucial feature: they cannot be executed in real time. To overcome this, we use SWA-SMT solutions as the initial training dataset for our last stage, which aims at obtaining a neural network control policy. We use reinforcement learning to train the policy, and show that the initial dataset is crucial for the overall success of the method.

This work was partially funded by ANR project TickTac (ANR-18-CE40-0015).


Notes

  1. The full construction is detailed in Appendix A of the extended version [14].

  2. The full description is presented in Appendix B of the extended version [14].

  3. Generating a successful SWA-SMT trace takes on average about 15 s on an Intel i5-1235U with 16 GB of RAM. Note that there is high variance in this runtime, ranging from under a second to several minutes. A timeout was set to 900 s.

  4. Using the xpag RL library [30], with a single Intel Core i7 CPU, 32 GB of RAM, and a single NVIDIA Quadro P3000 GPU, the training took between 40 and 50 min per million steps.

References

  1. Almeida, L.B.: Multilayer perceptrons. In: Handbook of Neural Computation, pp. C1.2:1–C1.2:30 (1997)

  2. Alur, R., et al.: The algorithmic analysis of hybrid systems. Theoret. Comput. Sci. 138(1), 3–34 (1995). ISSN 0304-3975. https://doi.org/10.1016/0304-3975(94)00202-T. https://www.sciencedirect.com/science/article/pii/030439759400202T. Accessed 03 Oct 2023

  3. Alur, R., Dill, D.L.: A theory of timed automata. Theoret. Comput. Sci. 126(2), 183–235 (1994). ISSN 0304-3975. https://doi.org/10.1016/0304-3975(94)90010-8. https://www.sciencedirect.com/science/article/pii/0304397594900108

  4. André, É.: IMITATOR 3: synthesis of timing parameters beyond decidability. In: Silva, A., Leino, K.R.M. (eds.) CAV 2021. LNCS, vol. 12759, pp. 552–565. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81685-8_26

  5. André, É.: What’s decidable about parametric timed automata? Int. J. Softw. Tools Technol. Transf. 21(2), 203–219 (2019)

  6. André, É., Lime, D., Roux, O.H.: Decision problems for parametric timed automata. In: Ogata, K., Lawford, M., Liu, S. (eds.) ICFEM 2016. LNCS, vol. 10009, pp. 400–416. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47846-3_25

  7. Behrmann, G., et al.: UPPAAL 4.0. In: 3rd International Conference on the Quantitative Evaluation of Systems, QEST 2006, Riverside, California, USA, September 2006, pp. 125–126. IEEE Computer Society (2006). https://doi.org/10.1109/QEST.2006.59

  8. Behrmann, G., Cougnard, A., David, A., Fleury, E., Larsen, K.G., Lime, D.: UPPAAL-Tiga: time for playing games! In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 121–125. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73368-3_14

  9. Bellemare, M., et al.: Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems. vol. 29 (2016). https://proceedings.neurips.cc/paper_files/paper/2016/file/afda332245e2af431fb7b672a68b659d-Paper.pdf

  10. van den Berg, J.P., Lin, M.C., Manocha, D.: Reciprocal velocity obstacles for real-time multi-agent navigation. In: 2008 IEEE International Conference on Robotics and Automation, ICRA 2008, Pasadena, USA, pp. 1928–1935 (2008). https://doi.org/10.1109/ROBOT.2008.4543489

  11. Bøgh, S., et al.: Distributed fleet management in noisy environments via model-predictive control. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 32, pp. 565–573 (2022)

  12. Brand, D., Zafiropulo, P.: On communicating finite-state machines. J. ACM 30(2), 323–342 (1983). https://doi.org/10.1145/322374.322380

  13. Chen, J., et al.: Scalable and safe multi-agent motion planning with nonlinear dynamics and bounded disturbances. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11237–11245 (2021)

  14. Clement, E., Perrin-Gilbert, N., Schlehuber-Caissier, P.: Layered controller synthesis for dynamic multi-agent systems (2023). arXiv:2307.06758 [cs.AI]

  15. Colombo, A., Del Vecchio, D.: Efficient algorithms for collision avoidance at intersections. In: Dang, T., Mitchell, I.M. (eds.) Hybrid Systems: Computation and Control, HSCC 2012, Beijing, China, pp. 145–154 (2012). https://doi.org/10.1145/2185632.2185656

  16. Dorri, A., Kanhere, S.S., Jurdak, R.: Multi-agent systems: a survey. IEEE Access 6, 28573–28593 (2018)

  17. Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. Int. J. Robot. Res. 17(7), 760–772 (1998). https://doi.org/10.1177/027836499801700706

  18. Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)

  19. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596. PMLR (2018)

  20. Henzinger, T.A., et al.: What’s decidable about hybrid automata? J. Comput. Syst. Sci. 57(1), 94–124 (1998). https://doi.org/10.1006/jcss.1998.1581

  21. Herbreteau, F., Point, G.: The TChecker tool and libraries. https://github.com/ticktac-project/tchecker

  22. Hilscher, M., Linker, S., Olderog, E.-R.: Proving safety of traffic manoeuvres on country roads. In: Liu, Z., Woodcock, J., Zhu, H. (eds.) Theories of Programming and Formal Methods. LNCS, vol. 8051, pp. 196–212. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39698-4_12

  23. Hilscher, M., Schwammberger, M.: An abstract model for proving safety of autonomous urban traffic. In: Sampaio, A., Wang, F. (eds.) ICTAC 2016. LNCS, vol. 9965, pp. 274–292. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46750-4_16

  24. Hilscher, M., Linker, S., Olderog, E.-R., Ravn, A.P.: An abstract model for proving safety of multi-lane traffic manoeuvres. In: Qin, S., Qiu, Z. (eds.) ICFEM 2011. LNCS, vol. 6991, pp. 404–419. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24559-6_28

  25. Hune, T., et al.: Linear parametric model checking of timed automata. J. Logic Algebraic Program. 52, 183–220 (2002)

  26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015). http://dblp.uni-trier.de/db/conf/iclr/iclr2015.html#KingmaB14

  27. Kress-Gazit, H., Fainekos, G.E., Pappas, G.J.: Where’s Waldo? Sensor-based temporal logic motion planning. In: 2007 IEEE International Conference on Robotics and Automation, ICRA 2007, 10–14 April 2007, Roma, Italy, pp. 3116–3121. IEEE (2007). https://doi.org/10.1109/ROBOT.2007.363946

  28. Li, X., Ma, Y., Belta, C.: A policy search method for temporal logic specified reinforcement learning tasks. In: 2018 Annual American Control Conference (ACC), pp. 240–245. IEEE (2018)

  29. Loos, S.M., Platzer, A.: Safe intersections: at the crossing of hybrid systems and verification. In: 14th International IEEE Conference on Intelligent Transportation Systems, ITSC 2011, Washington, DC, USA, pp. 1181–1186. IEEE (2011). https://doi.org/10.1109/ITSC.2011.6083138

  30. Perrin-Gilbert, N.: xpag: a modular reinforcement learning library with JAX agents (2022). https://github.com/perrin-isir/xpag

  31. Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal difference learning with function approximation. In: ICML, pp. 417–424 (2001)

  32. Queffelec, A.: Connected multi-agent path finding: how robots get away with texting and driving. Ph.D. thesis. University of Rennes, France (2021). https://tel.archives-ouvertes.fr/tel-03517091

  33. Stern, R.: Multi-agent path finding - an overview. In: Artificial Intelligence: 5th RAAI Summer School, Tutorial Lectures, Dolgoprudny, Russia, 4–7 July 2019, pp. 96–115 (2019)

Author information

Correspondence to Emily Clement, Nicolas Perrin-Gilbert or Philipp Schlehuber-Caissier.

Appendices

A Markov Decision Process for the Running Example

The Markov Decision Process is defined by its state space S, action space A, initial state distribution \(p(s_0 \in S)\), reward function \(r(s_t \in S, a_t \in A, s_{t+1} \in S)\) and deterministic transition function \(s_{t+1} = \texttt {step}(s_t, a_t)\).

We describe here all the elements of the MDP defined for our running example:

  • The state space (\(\mathbb {R}^{720}\)). The environment contains 3 paths, which are unions of sections, and as detailed in Sect. 2, we have imposed a maximum of 3 cars per path, so there are at most 9 cars, to which we can attribute unique identifiers (we use \(\{-1, 0, 1\}^2\)). The state is entirely defined by the speed and position of each car. We could thus use vectors of size 18 to represent states, but instead we chose a sparser representation with a better structure. For consistency, we use the same road network as presented in Sect. 2, which is composed of 24 sections (since every car has a dedicated initial and goal node subdividing the sections containing initial and goal positions). On this road network, three different paths are defined, with each section shared by at most 2 paths. At any given time, any section may contain at most 6 cars by construction. For each section, we define a list of 6 tuples, all equal to (0, 0, (0, 0), 0) if no car is currently inside the section. If there are cars in the section, say 2 cars for example, then the first two tuples have this structure:

    $$ (\texttt {position~within~the~section}, \texttt {normalized~velocity}, \texttt {car~identifier}, 1) $$

    We represent states as a concatenation of the values of all these tuples for all 24 sections, which amounts to a vector of size 720. It is a sparse representation, but its advantage is that it makes it easy to find cars close to each other, as they are either in the same section or in neighboring sections.

  • The action space (\(\mathbb {R}^9\)) and transition dynamics. Given an ordering of the 9 cars, an action is simply a vector of 9 accelerations. If \(a_i\) is the acceleration of car i, and if at the current time step its position within its path is \(p_i\) and its speed is \(v_i\), then at the next time step its position will be \(p_i + v_i\) and its speed will be \(v_i + a_i\). This defines the transition dynamics of the MDP. The components of an action corresponding to cars that are not present in the state are simply ignored. Remark: actions can be computed straightforwardly from a sequence of states, as they are equal to the differences between consecutive speeds for each car (see the code sketch after this list).

  • The reward. When all cars have reached their destination, i.e. crossed the end of their path, a reward of 2000 is given and the episode is terminated. Besides, when there is either a collision (a violation of the safety distance between two cars) or two cars facing each other in opposite directions in the same section, a negative reward (−100) is given and the episode is terminated. Finally, at each time step, two positive rewards are given: one proportional to the average velocity of the cars (to encourage cars to go fast), and one proportional to the (clamped) minimum distance between all cars (to encourage cars to stay far from each other). We set the maximum number of time steps per episode to 85, and adjust these rewards so that an episode cannot reach a cumulative reward of 2000 unless it is truly successful and gets the final +2000 reward.

  • The initial state distribution. We define an arbitrary initial state distribution in which each of the 9 cars has an 80% chance of being present. The speed of each car is defined randomly, and positions are also defined randomly (within roughly the first two thirds of each path). Safety distances are ensured, so that the initial states are not in collision; however, speeds may be such that there will be a collision after the first time step, so there is no guarantee of feasibility.
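
The following sketch is an illustration of the MDP elements described above, not the authors' implementation. The layout of the helper structure cars_by_section and the shaping weights w_vel and w_dist are hypothetical placeholders; only the dimensions, the per-car dynamics, and the reward values come from the description above.

```python
import numpy as np

N_SECTIONS = 24          # sections of the road network
SLOTS_PER_SECTION = 6    # at most 6 cars per section by construction
VALUES_PER_SLOT = 5      # position, velocity, 2-dim identifier, presence flag
STATE_DIM = N_SECTIONS * SLOTS_PER_SECTION * VALUES_PER_SLOT  # = 720


def encode_state(cars_by_section):
    """Sparse state encoding: cars_by_section is a list of 24 lists, each
    holding up to 6 tuples (position_within_section, normalized_velocity,
    (id_x, id_y)). Empty slots stay at the all-zero tuple."""
    state = np.zeros((N_SECTIONS, SLOTS_PER_SECTION, VALUES_PER_SLOT))
    for s, cars in enumerate(cars_by_section):
        for slot, (pos, vel, (id_x, id_y)) in enumerate(cars):
            state[s, slot] = (pos, vel, id_x, id_y, 1.0)  # presence flag = 1
    return state.reshape(STATE_DIM)


def step_car(p_i, v_i, a_i):
    """One-step transition of a single car along its path (unit time step):
    new position p_i + v_i, new speed v_i + a_i."""
    return p_i + v_i, v_i + a_i


def actions_from_speeds(speeds):
    """Recover actions from a sequence of per-car speed vectors: they are
    the differences between consecutive speeds."""
    speeds = np.asarray(speeds)
    return speeds[1:] - speeds[:-1]


def reward(all_arrived, collision, mean_velocity, min_distance,
           w_vel=1.0, w_dist=1.0):
    """Reward structure described above; w_vel and w_dist are placeholder
    weights (the paper tunes them so that only a truly successful episode
    can reach a cumulative reward of 2000)."""
    if all_arrived:
        return 2000.0
    if collision:
        return -100.0
    return w_vel * mean_velocity + w_dist * min_distance
```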

B Hyperparameters of the RL Algorithms

For TD3:

  • Actor network architecture: multi-layer perceptron (MLP) [1] with 3 hidden layers of size 256 and rectified linear unit (ReLU) activation functions.

  • Actor optimizer: ADAM [26], actor learning rate: \(10^{-3}\)

  • Critic network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Critic optimizer: ADAM, critic learning rate: \(10^{-3}\)

  • Discount factor: 0.99

  • Soft update coefficient (\(\tau \)): 0.05

For TD3BC:

  • Actor network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Actor optimizer: ADAM, actor learning rate: \(10^{-3}\)

  • Critic network architecture: MLP with 3 hidden layers of size 256 and ReLU activation functions.

  • Critic optimizer: ADAM, critic learning rate: \(10^{-3}\)

  • Discount factor: 0.99

  • Soft update coefficient (\(\tau \)): 0.05

  • \(\alpha \): 2.5
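
As an illustration only, the actor architecture and optimizer listed above could be instantiated with Flax and Optax as follows. This is an assumption for concreteness, not the authors' code (the experiments rely on the xpag library [30]); the 9-dimensional output matches the action space of the running example (Appendix A), and the critic uses the same hidden architecture with a scalar output.

```python
import flax.linen as nn
import optax


class Actor(nn.Module):
    """MLP actor matching the hyperparameters above: 3 hidden layers of
    size 256 with ReLU activations, followed by a linear output layer."""
    action_dim: int = 9  # one acceleration per car in the running example

    @nn.compact
    def __call__(self, x):
        for _ in range(3):
            x = nn.relu(nn.Dense(256)(x))
        return nn.Dense(self.action_dim)(x)


# Adam optimizer with the learning rate listed above (10^-3).
actor_optimizer = optax.adam(learning_rate=1e-3)
```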

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Clement, E., Perrin-Gilbert, N., Schlehuber-Caissier, P. (2023). Layered Controller Synthesis for Dynamic Multi-agent Systems. In: Petrucci, L., Sproston, J. (eds) Formal Modeling and Analysis of Timed Systems. FORMATS 2023. Lecture Notes in Computer Science, vol 14138. Springer, Cham. https://doi.org/10.1007/978-3-031-42626-1_4

  • DOI: https://doi.org/10.1007/978-3-031-42626-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42625-4

  • Online ISBN: 978-3-031-42626-1
