
Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning


Abstract

We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed 'sub-environments', and characterizes the algorithm's performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulated results show significant improvement over using a standard Monte Carlo tree search in an environment with large state and action spaces.


Notes

  1. Pseudocode of functions denoted with \(\dagger \) is omitted, but the functions themselves are described in detail.

References

  • Agha-mohammadi, A. A., Chakravorty, S., & Amato, N. M. (2011). FIRM: Feedback controller-based information-state roadmap, a framework for motion planning under uncertainty. In IEEE/RSJ international conference on intelligent robots and systems (pp. 4284–4291). San Francisco, CA.

  • Bai, A., Srivastava, S., & Russell, S. (2016). Markovian state and action abstractions for MDPs via hierarchical MCTS. In Proceedings of the twenty-fifth international joint conference on artificial intelligence (pp. 3029–3039). New York, NY: IJCAI.

  • Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34–37.


  • Ben-Tal, A., & Nemirovski, A. (1998). Robust convex optimization. Mathematics of Operations Research, 23, 769–805.


  • Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont: Athena Scientific.


  • Bian, A. A., Buhmann, J. M., Krause, A., & Tschiatschek, S. (2017). Guarantees for greedy maximization of non-submodular functions with applications. In International conference on machine learning Vol. 70 (pp. 498–507). Sydney.

  • Blum, A., Chawla, S., Karger, D. R., Lane, T., Meyerson, A., & Minkoff, M. (2007). Approximation algorithms for orienteering and discounted-reward TSP. SIAM Journal on Computing, 37(2), 653–670.


  • Broz, F., Nourbakhsh, I., & Simmons, R. (2008). Planning for human–robot interaction using time-state aggregated POMDPs. AAAI, 8, 1339–1344.


  • Bullo, F., Cortés, J., & Martínez, S. (2009). Distributed control of robotic networks. Applied mathematics series. Princeton, NJ: Princeton University Press.


  • Campi, M. C., Garatti, S., & Prandini, M. (2009). The scenario approach for systems and control design. Annual Reviews in Control, 32(2), 149–157.


  • Clark, A., Alomair, B., Bushnell, L., & Poovendran, R. (2016). Submodularity in dynamics and control of networked systems. Communications and control engineering. New York: Springer.


  • Cortés, J., & Egerstedt, M. (2017). Coordinated control of multi-robot systems: A survey. SICE Journal of Control, Measurement, and System Integration, 10(6), 495–503.


  • Das, A., & Kempe, D. (2011). Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. CoRR.

  • Das, J., Py, F., Harvey, J. B. J., Ryan, J. P., Gellene, A., Graham, R., et al. (2015). Data-driven robotic sampling for marine ecosystem monitoring. The International Journal of Robotics Research, 34(12), 1435–1452.


  • Dunbabin, M., & Marques, L. (2012). Robots for environmental monitoring: Significant advancements and applications. IEEE Robotics and Automation Magazine, 19(1), 24–39.


  • Gerkey, B. P., & Mataric, M. J. (2004). A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9), 939–954.


  • Ghaoui, L. E., Oustry, F., & Lebret, H. (1998). Robust solutions to uncertain semidefinite programs. SIAM Journal on Optimization, 9(1), 33–52.


  • Goundan, P. R., & Schulz, A. S. (2007). Revisiting the greedy approach to submodular set function maximization. Optimization Online (pp. 1–25).

  • Hansen, E. A., & Feng, Z. (2000). Dynamic programming for POMDPs using a factored state representation. In International conference on artificial intelligence planning systems (pp. 130–139). Breckenridge, CO.

  • Howard, R. (1960). Dynamic programming and Markov processes. Cambridge: M.I.T. Press.


  • Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML Vol. 6 (pp. 282–293). Springer.

  • LaValle, S. M., & Kuffner, J. J. (2000). Rapidly-exploring random trees: Progress and prospects. In Workshop on algorithmic foundations of robotics (pp. 293–308). Dartmouth, NH.

  • Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1), 47–65.


  • Ma, A., Ouimet, M., & Cortés, J. (2017). Dynamic domain reduction for multi-agent planning. In International symposium on multi-robot and multi-agent systems (pp. 142–149). Los Angeles, CA.

  • McCallum, A. K., & Ballard, D. (1996). Reinforcement learning with selective perception and hidden state. Ph.D. dissertation, Department of Computer Science, University of Rochester.

  • Mesbahi, M., & Egerstedt, M. (2010). Graph theoretic methods in multiagent networks. Applied mathematics series. Princeton: Princeton University Press.


  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.


  • Nemhauser, G., Wolsey, L., & Fisher, M. (1978). An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14, 265–294.


  • Oliehoek, F. A., & Amato, C. (2016). A concise introduction to decentralized POMDPs. SpringerBriefs in intelligent systems. New York: Springer.


  • Omidshafiei, S., Agha-mohammadi, A. A., Amato, C., & How, J. P. (2015). Decentralized control of partially observable Markov decision processes using belief space macro-actions. In IEEE international conference on robotics and automation (pp. 5962–5969). Seattle, WA.

  • Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.


  • Parr, R., & Russell, S. (1998). Hierarchical control and learning for Markov decision processes. Berkeley, CA: University of California.


  • Prentice, S., & Roy, N. (2010). The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance. In Robotics research (pp. 293–305). Springer.

  • Puterman, M. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.


  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897). Lille, France.

  • Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.


  • Theocharous, G., & Kaelbling, L. P. (2004). Approximate planning in POMDPs with macro-actions. In Advances in neural information processing systems (pp. 775–782).

  • Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems Vol. 30 (pp. 5285–5294).


Acknowledgements

This work was supported by ONR Award N00014-16-1-2836. The authors would like to thank the organizers of the International Symposium on Multi-Robot and Multi-Agent Systems (MRS 2017), which provided us with the opportunity to obtain valuable feedback on this research, as well as the reviewers.

Author information

Correspondence to Aaron Ma.

Additional information


A preliminary version of this work appeared as Ma et al. (2017) at the International Symposium on Multi-Robot and Multi-Agent Systems.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Multi-Robot and Multi-Agent Systems.

Appendices

Submodularity

We review here concepts of submodularity and monotonicity of set functions following Clark et al. (2016). A power set function \(f:2^\varOmega \rightarrow \mathbb {R}\) is submodular if it satisfies the property of diminishing returns,

$$\begin{aligned} f(X\cup \{x\}) - f(X) \ge f(Y\cup \{x\})-f(Y), \end{aligned}$$
(A.1)

for all \(X\subseteq Y\subseteq \varOmega \) and \(x\in \varOmega \setminus Y\). The set function f is monotone if

$$\begin{aligned} f(X) \le f(Y), \end{aligned}$$
(A.2)

for all \(X\subseteq Y \subseteq \varOmega \). In general, monotonicity of a set function does not imply submodularity, and vice versa. These properties play a key role in determining near-optimal solutions to the cardinality-constrained submodular maximization problem defined by

$$\begin{aligned} \begin{aligned}&\max \,\,f(X)\\&\text {s.t. } |X| \le k. \end{aligned} \end{aligned}$$
(A.3)

In general, this problem is NP-hard. Greedy algorithms seek to find a suboptimal solution to (A.3) by building a set X one element at a time, starting from \(|X|=0\) and stopping at \(|X|=k\). These algorithms proceed by choosing the best next element,

$$\begin{aligned} \underset{x\in \varOmega \setminus X}{\max } f(X\cup \{x\}), \end{aligned}$$

to include in X. The following result (Clark et al. 2016; Nemhauser et al. 1978) provides a lower bound on the performance of greedy algorithms.

Theorem A.1

Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies

$$\begin{aligned} f(X)\ge (1-e^{-1})f(X^*). \end{aligned}$$
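To make the greedy procedure concrete, the following is a minimal sketch (not taken from the paper) of greedy maximization for the cardinality-constrained problem (A.3); the coverage-style objective in the usage example is a hypothetical stand-in for a monotone submodular reward.

```python
# Minimal sketch of the greedy algorithm for problem (A.3): repeatedly add the
# element with the largest marginal gain until the cardinality budget k is reached.
# The coverage objective below is hypothetical and used only for illustration.

def greedy_max(ground_set, f, k):
    """Build a set X with |X| <= k by greedy marginal-gain selection."""
    X = set()
    for _ in range(k):
        remaining = ground_set - X
        if not remaining:
            break
        # Marginal gain f(X ∪ {x}) - f(X) of every remaining element.
        gains = {x: f(X | {x}) - f(X) for x in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # no element improves the objective; stop early
            break
        X.add(best)
    return X

if __name__ == "__main__":
    # Hypothetical monotone submodular objective: coverage of waypoints by regions.
    regions = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 4, 5}}
    f = lambda X: len(set().union(*(regions[x] for x in X))) if X else 0
    X = greedy_max(set(regions), f, k=2)
    print(X, f(X))  # Theorem A.1 guarantees f(X) >= (1 - 1/e) f(X*)
```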

An important extension of this result characterizes the performance of a greedy algorithm where, at each step, one chooses an element x that satisfies

$$\begin{aligned} f(X\cup \{x\})-f(X)\ge \alpha (f(X\cup \{x^*\})-f(X)), \end{aligned}$$

for some \(\alpha \in [0,1]\). That is, the algorithm chooses an element that is at least an \(\alpha \)-fraction of the local optimal element choice, \(x^*\). In this case, the following result (Goundan and Schulz 2007) characterizes the performance.

Theorem A.2

Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by a greedy algorithm that chooses elements of at least \(\alpha \)-fraction of the local optimal element choice satisfies

$$\begin{aligned} f(X)\ge (1-e^{-\alpha })f(X^*). \end{aligned}$$
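When the locally optimal element can only be approximated, for instance by scoring a sampled subset of candidates or by using an estimated value, Theorem A.2 still gives a guarantee. The fragment below is a hypothetical illustration of one such \(\alpha \)-greedy step, reusing the coverage objective from the previous sketch; \(\alpha \) is computed exactly here only because the toy ground set is small.

```python
import random

# Hypothetical illustration of an alpha-greedy step (Theorem A.2): only a random
# subset of candidates is scored, and alpha records the achieved fraction of the
# best available marginal gain (computed exactly here since the ground set is tiny).

def alpha_greedy_step(X, ground_set, f, n_candidates, rng):
    remaining = sorted(ground_set - X)
    candidates = rng.sample(remaining, min(n_candidates, len(remaining)))
    gains = {x: f(X | {x}) - f(X) for x in candidates}
    chosen = max(gains, key=gains.get)
    best_gain = max(f(X | {x}) - f(X) for x in remaining)
    alpha = gains[chosen] / best_gain if best_gain > 0 else 1.0
    return chosen, alpha

regions = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 4, 5}}
f = lambda X: len(set().union(*(regions[x] for x in X))) if X else 0
x, alpha = alpha_greedy_step(set(), set(regions), f, n_candidates=2, rng=random.Random(1))
print(x, alpha)  # if every step achieves fraction alpha, f(X) >= (1 - e^{-alpha}) f(X*)
```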

A generalization of the notion of submodular set function is given by the submodularity ratio (Das and Kempe 2011), which measures how far the function is from being submodular. This ratio is defined as the largest scalar \(\lambda \in [0,1]\) such that

$$\begin{aligned} \lambda \le \frac{\sum \limits _{z\in Z} f(X \cup \{z\}) -f(X)}{f(X\cup Z)-f(X)}, \end{aligned}$$
(A.4)

for all \(X,Z\subset \varOmega \). The function f is called weakly submodular if it has a submodularity ratio in (0, 1]. If a function f is submodular, then its submodularity ratio is 1. The following result (Das and Kempe 2011) generalizes Theorem A.1 to monotone set functions with submodularity ratio \(\lambda \).

Theorem A.3

Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, weakly submodular with submodularity ratio \(\lambda \in (0,1]\), and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies

$$\begin{aligned} f(X)\ge (1-e^{-\lambda })f(X^*). \end{aligned}$$
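For small ground sets, the submodularity ratio in (A.4) can be computed by brute-force enumeration. The sketch below (an illustration under the same hypothetical coverage objective, not part of the paper's method) returns the largest \(\lambda \) satisfying (A.4); for a submodular function it returns 1.

```python
from itertools import combinations

# Brute-force computation of the submodularity ratio (A.4) over a small ground set:
# the smallest ratio of summed single-element gains to the joint gain, capped at 1,
# over all pairs (X, Z) with a strictly positive joint gain.

def submodularity_ratio(ground_set, f):
    elems = list(ground_set)
    subsets = [set(c) for r in range(len(elems) + 1) for c in combinations(elems, r)]
    ratio = 1.0
    for X in subsets:
        for Z in subsets:
            Z = Z - X                      # only elements outside X contribute
            denom = f(X | Z) - f(X)        # joint marginal gain of adding Z
            if not Z or denom <= 0:
                continue
            num = sum(f(X | {z}) - f(X) for z in Z)
            ratio = min(ratio, num / denom)
    return ratio

# Hypothetical coverage objective (same as above); submodular, so the ratio is 1.
regions = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
f = lambda X: len(set().union(*(regions[x] for x in X))) if X else 0
print(submodularity_ratio(set(regions), f))  # prints 1.0
```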

Scenario optimization

Scenario optimization aims to determine robust solutions for practical problems with unknown parameters (Ben-Tal and Nemirovski 1998; Ghaoui et al. 1998) by hedging against uncertainty. Consider the following robust convex optimization problem defined by

$$\begin{aligned} \begin{aligned} \text {RCP: }&\min \limits _{\gamma \in \mathbb {R}^d }{c^T\gamma } \\&\text {subject to: } f_\delta (\gamma )\le 0,\,\, \forall \delta \in \varDelta , \end{aligned} \end{aligned}$$
(B.1)

where \(f_{\delta }\) is a convex function, d is the dimension of the optimization variable, \(\delta \) is an uncertain parameter, and \(\varDelta \) is the set of all possible parameter values. In practice, solving the optimization problem (B.1) can be difficult depending on the cardinality of \(\varDelta \). One approach is to solve (B.1) with constraint parameters sampled from \(\varDelta \). This approach views the uncertainty in the robust convex optimization problem through a probability distribution \(Pr^{\delta }\) over \(\varDelta \), which encodes the likelihood or importance of each constraint parameter occurring. To alleviate the computational load, one selects a finite number \(N_{\text {SCP}}\) of parameter values in \(\varDelta \) sampled according to \(Pr^{\delta }\) and solves the scenario convex program (Campi et al. 2009) defined by

$$\begin{aligned} \begin{aligned} \text {SCP}_N :&\underset{\gamma \in \mathbb {R}^d}{\min }\,\,c^T\gamma \\&\text {s.t. } f_{\delta ^{(i)}}(\gamma )\le 0,\,\, i=1,\ldots ,N_{\text {SCP}}. \end{aligned} \end{aligned}$$
(B.2)

The following result states to what extent the solution of (B.2) solves the original robust optimization problem.

Theorem B.1

Let \(\gamma ^*\) be the optimal solution to the scenario convex program (B.2) when \(N_{\text {SCP}}\) is the number of convex constraints. Given a ‘violation parameter’, \(\varepsilon \), and a ‘confidence parameter’, \(\varpi \), if

$$\begin{aligned} N_{\text {SCP}}\ge \frac{2}{\varepsilon }\left( \text {ln}\frac{1}{\varpi }+d\right) \end{aligned}$$

then, with probability \(1-\varpi \), \(\gamma ^*\) satisfies all but an \(\varepsilon \)-fraction of constraints in \(\varDelta \).
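As a concrete illustration of Theorem B.1, the sketch below (assuming the cvxpy package and a hypothetical family of sampled linear constraints, neither of which comes from the paper) draws \(N_{\text {SCP}}\) parameter samples according to the bound and solves the resulting scenario convex program (B.2).

```python
import math
import numpy as np
import cvxpy as cp  # assumed available; any convex solver interface would do

# Sketch of the scenario approach for (B.2): sample N_SCP constraint parameters,
# impose only those constraints, and solve the resulting convex program.
# The linear constraint family f_delta(gamma) = delta^T gamma - 1 <= 0 is hypothetical.

d = 3                     # dimension of the decision variable gamma
eps, varpi = 0.1, 1e-3    # violation and confidence parameters
n_scp = math.ceil((2.0 / eps) * (math.log(1.0 / varpi) + d))  # bound from Theorem B.1

rng = np.random.default_rng(0)
c = np.ones(d)
deltas = rng.normal(size=(n_scp, d))  # N_SCP samples of the uncertain parameter delta

gamma = cp.Variable(d)
constraints = [deltas[i] @ gamma - 1 <= 0 for i in range(n_scp)]
problem = cp.Problem(cp.Minimize(c @ gamma), constraints)
problem.solve()

# With probability at least 1 - varpi, the returned gamma violates at most an
# eps-fraction of the constraints over the full uncertainty set.
print(n_scp, gamma.value)
```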

List of symbols

\(({\mathbb {Z}}_{\ge 1}){\mathbb {Z}}\): (Positive) integers

\(({\mathbb {R}}_{>0})\mathbb {R}\): (Positive) real numbers

\(|Y|\): Cardinality of set Y

\(s\in S\): State/state space

\(a\in A\): Action/action space

\(Pr^s\): Transition function

\(r, R\): Reward, reward function

\((\pi ^*)\pi \): (Optimal) policy

\(V^{\pi }\): Value of a state given policy \(\pi \)

\(\gamma \): Discount factor

\(\alpha ,\beta \): Agent indices

\({\mathcal {A}}\): Set of agents

\(o\in {\mathcal {O}}^{b}\in {\mathcal {O}}\): Waypoint/objective/objective set

\({\mathcal {E}}\): Environment

\(x\in \varOmega _x\): Region/region set

\({\mathcal {O}}^{b}_{i}\): Set of waypoints of an objective in a region

\(s^{b}_{i}\): Abstracted objective state

\(s_{i}\): Regional state

\(\tau \): Task

\(\varGamma \): Set of feasible tasks

\(\overrightarrow{x}\): Ordered list of regions

\(\xi _{\overrightarrow{x}}\): Repeated region list

\(\phi _t(\cdot ,\cdot )\): Time abstraction function

\((\epsilon _k)\epsilon \): (Partial) sub-environment

\(N_{\epsilon }\): Size of a sub-environment

\(\overrightarrow{\tau }\): Ordered list of tasks

\(\overrightarrow{\text {Pr}^t}\): Ordered list of probability distributions

\((\vartheta ^p_{\beta })\vartheta \): (Partial) task trajectory

\({\mathcal {I}}_{\alpha }\): Interaction set of agent \(\alpha \)

\({\mathcal {N}}\): Maximum size of interaction set

\(\theta \): Claimed regional objective

\(\varTheta _{\alpha }\): Claimed objective set of agent \(\alpha \)

\(\varTheta ^{{\mathcal {A}}}\): Global claimed objective set

\(\varXi \): Interaction matrix

\(Q\): Value for choosing \(\tau \) given \(\epsilon . s\)

\({\hat{Q}}\): Estimated value for choosing \(\tau \) given \(\epsilon . s\)

\(N\): Number of simulations in \(\epsilon . s\)

\(N_{{\mathcal {O}}^{b}}\): Number of simulations of an objective in \(\epsilon . s\)

\(t\): Time

\(T\): Multi-agent expected discounted reward per task

\(\lambda \): Submodularity ratio

\({\hat{\lambda }}\): Approximate submodularity ratio

\(f_{x}\): Sub-environment search value set function

\(X_{x}\subset Y_{x}\subset \varOmega _x\): Finite set of regions

\(f_{\vartheta }\): Sequential multi-agent deployment value set function

\(X_{\vartheta }\subset Y_{\vartheta }\subset \varOmega _{\vartheta }\): Finite set of trajectories

\(\varpi \): Confidence parameter

\(\varepsilon \): Violation parameter


About this article


Cite this article

Ma, A., Ouimet, M. & Cortés, J. Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning. Auton Robot 44, 485–503 (2020). https://doi.org/10.1007/s10514-019-09871-2
