Abstract
We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed ‘sub-environments’, and characterizes the algorithm’s performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulation results show significant improvement over using a standard Monte Carlo tree search in an environment with large state and action spaces.









Notes
Pseudocode of functions denoted with \(\dagger \) is omitted but described in detail.
References
Agha-mohammadi, A. A., Chakravorty, S., & Amato, N. M. (2011). FIRM: Feedback controller-based information-state roadmap-a framework for motion planning under uncertainty. In IEEE/RSJ international conference on intelligent robots and systems (pp. 4284–4291). San Francisco, CA.
Bai, A., Srivastava, S., & Russell, S. (2016). Markovian state and action abstractions for MDPs via hierarchical MCTS. In Proceedings of the twenty-fifth international joint conference on artificial intelligence (pp. 3029–3039). New York, NY: IJCAI.
Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34–37.
Ben-Tal, A., & Nemirovski, A. (1998). Robust convex optimization. Mathematics of Operations Research, 23, 769–805.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont: Athena Scientific.
Bian, A. A., Buhmann, J. M., Krause, A., & Tschiatschek, S. (2017). Guarantees for greedy maximization of non-submodular functions with applications. In International conference on machine learning Vol. 70 (pp. 498–507). Sydney.
Blum, A., Chawla, S., Karger, D. R., Lane, T., Meyerson, A., & Minkoff, M. (2007). Approximation algorithms for orienteering and discounted-reward TSP. SIAM Journal on Computing, 37(2), 653–670.
Broz, F., Nourbakhsh, I., & Simmons, R. (2008). Planning for human–robot interaction using time-state aggregated POMDPs. AAAI, 8, 1339–1344.
Bullo, F., Cortés, J., & Martínez, S. (2009). Distributed control of robotic networks. Applied mathematics series. Princeton, NJ: Princeton University Press.
Campi, M. C., Garatti, S., & Prandini, M. (2009). The scenario approach for systems and control design. Annual Reviews in Control, 32(2), 149–157.
Clark, A., Alomair, B., Bushnell, L., & Poovendran, R. (2016). Submodularity in dynamics and control of networked systems. Communications and control engineering. New York: Springer.
Cortés, J., & Egerstedt, M. (2017). Coordinated control of multi-robot systems: A survey. SICE Journal of Control, Measurement, and System Integration, 10(6), 495–503.
Das, A., & Kempe, D. (2011). Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. CoRR.
Das, J., Py, F., Harvey, J. B. J., Ryan, J. P., Gellene, A., Graham, R., et al. (2015). Data-driven robotic sampling for marine ecosystem monitoring. The International Journal of Robotics Research, 34(12), 1435–1452.
Dunbabin, M., & Marques, L. (2012). Robots for environmental monitoring: Significant advancements and applications. IEEE Robotics and Automation Magazine, 19(1), 24–39.
Gerkey, B. P., & Mataric, M. J. (2004). A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9), 939–954.
Ghaoui, L. E., Oustry, F., & Lebret, H. (1998). Robust solutions to uncertain semidefinite programs. SIAM Journal on Optimization, 9(1), 33–52.
Goundan, P. R., & Schulz, A. S. (2007). Revisiting the greedy approach to submodular set function maximization. Optimization Online (pp. 1–25).
Hansen, E. A., & Feng, Z. (2000). Dynamic programming for POMDPs using a factored state representation. In International conference on artificial intelligence planning systems (pp. 130–139). Breckenridge, CO.
Howard, R. (1960). Dynamic programming and Markov processes. Cambridge: M.I.T. Press.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML Vol. 6 (pp. 282–293). Springer.
LaValle, S. M., & Kuffner, J. J. (2000). Rapidly-exploring random trees: Progress and prospects. In Workshop on algorithmic foundations of robotics (pp. 293–308). Dartmouth, NH.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1), 47–65.
Ma, A., Ouimet, M., & Cortés, J. (2017). Dynamic domain reduction for multi-agent planning. In International symposium on multi-robot and multi-agent systems (pp. 142–149). Los Angeles, CA.
McCallum, A. K., & Ballard, D. (1996). Reinforcement learning with selective perception and hidden state. Ph.D. Dissertation, Department of Computer Science, University of Rochester.
Mesbahi, M., & Egerstedt, M. (2010). Graph theoretic methods in multiagent networks. Applied mathematics series. Princeton: Princeton University Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Nemhauser, G., Wolsey, L., & Fisher, M. (1978). An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14, 265–294.
Oliehoek, F. A., & Amato, C. (2016). A concise introduction to decentralized POMDPs. SpringerBriefs in intelligent systems. New York: Springer.
Omidshafiei, S., Agha-mohammadi, A. A., Amato, C., & How, J. P. (2015). Decentralized control of partially observable Markov decision processes using belief space macro-actions. In IEEE international conference on robotics and automation (pp. 5962–5969). Seattle, WA.
Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.
Parr, R., & Russell, S. (1998). Hierarchical control and learning for Markov decision processes. Berkeley, CA: University of California.
Prentice, S., & Roy, N. (2010). The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance. In Robotics research (pp. 293–305). Springer.
Puterman, M. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897). Lille, France.
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.
Theocharous, G., & Kaelbling, L. P. (2004). Approximate planning in POMDPs with macro-actions. In Advances in neural information processing systems (pp. 775–782).
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems Vol. 30 (pp. 5285–5294).
Acknowledgements
This work was supported by ONR Award N00014-16-1-2836. The authors would like to thank the organizers of the International Symposium on Multi-Robot and Multi-Agent Systems (MRS 2017), which provided us with the opportunity to obtain valuable feedback on this research, and the reviewers.
Additional information
A preliminary version of this work appeared as Ma et al. (2017) at the International Symposium on Multi-Robot and Multi-Agent Systems.
This is one of several papers published in Autonomous Robots comprising the Special Issue on Multi-Robot and Multi-Agent Systems.
Appendices
Submodularity
We review here concepts of submodularity and monotonicity of set functions following Clark et al. (2016). A power set function \(f:2^\varOmega \rightarrow \mathbb {R}\) is submodular if it satisfies the property of diminishing returns,
$$\begin{aligned} f(X\cup \{x\})-f(X) \ge f(Y\cup \{x\})-f(Y), \end{aligned}$$
(A.1)
for all \(X\subseteq Y\subseteq \varOmega \) and \(x\in \varOmega \setminus Y\). The set function f is monotone if
$$\begin{aligned} f(X) \le f(Y), \end{aligned}$$
(A.2)
for all \(X\subseteq Y \subseteq \varOmega \). In general, monotonicity of a set function does not imply submodularity, and vice versa. These properties play a key role in determining near-optimal solutions to the cardinality-constrained submodular maximization problem defined by
$$\begin{aligned} \max _{X\subseteq \varOmega ,\; |X|\le k} f(X). \end{aligned}$$
(A.3)
In general, this problem is NP-hard. Greedy algorithms seek to find a suboptimal solution to (A.3) by building a set X one element at a time, starting from \(|X|=0\) and ending with \(|X|=k\). These algorithms proceed by choosing the best next element,
$$\begin{aligned} x^* \in \mathop {\text {argmax}}\limits _{x\in \varOmega \setminus X} \big ( f(X\cup \{x\})-f(X) \big ), \end{aligned}$$
to include in X. The following result (Clark et al. 2016; Nemhauser et al. 1978) provides a lower bound on the performance of greedy algorithms.
Theorem A.1
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies
$$\begin{aligned} f(X) \ge \Big (1-\frac{1}{e}\Big ) f(X^*). \end{aligned}$$
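For concreteness, the following minimal Python sketch (not taken from the paper; the coverage function and all identifiers are hypothetical) implements the greedy rule above and can be checked against the \(1-1/e\) guarantee.

```python
# Minimal sketch (not from the paper): greedy maximization of a monotone submodular
# set function under a cardinality constraint |X| <= k. The coverage function below
# is a hypothetical example of such an f.

def greedy_max(f, ground, k):
    """Grow X one element at a time, always adding the element of largest marginal gain."""
    X = set()
    for _ in range(min(k, len(ground))):
        # marginal gain of each remaining candidate
        gains = {x: f(X | {x}) - f(X) for x in ground - X}
        X.add(max(gains, key=gains.get))
    return X

# Example: f(X) = number of targets covered by the chosen regions (hypothetical data).
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"e"}}
f = lambda X: len(set().union(*(covers[x] for x in X))) if X else 0

X = greedy_max(f, set(covers), k=2)
print(X, f(X))  # Theorem A.1: f(X) >= (1 - 1/e) * f(X*)
```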
An important extension of this result characterizes the performance of a greedy algorithm where, at each step, one chooses an element x that satisfies
$$\begin{aligned} f(X\cup \{x\})-f(X) \ge \alpha \big ( f(X\cup \{x^*\})-f(X) \big ), \end{aligned}$$
for some \(\alpha \in [0,1]\). That is, the algorithm chooses an element whose marginal gain is at least an \(\alpha \)-fraction of that of the locally optimal choice, \(x^*\). In this case, the following result (Goundan and Schulz 2007) characterizes the performance.
Theorem A.2
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by a greedy algorithm that chooses, at each step, an element achieving at least an \(\alpha \)-fraction of the locally optimal marginal gain satisfies
$$\begin{aligned} f(X) \ge \big (1-e^{-\alpha }\big ) f(X^*). \end{aligned}$$
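In applications, marginal gains are often only available through estimates, for instance from sampled rollouts. The sketch below is a hypothetical illustration, not the paper's procedure: it performs one greedy step using an approximate oracle f_hat and reports the \(\alpha \)-fraction actually achieved with respect to the true function, which is the quantity entering Theorem A.2.

```python
# Hypothetical sketch: one greedy step driven by an approximate value oracle f_hat.
# If every step achieves at least an alpha-fraction of the best true marginal gain,
# Theorem A.2 guarantees f(X) >= (1 - e^{-alpha}) * f(X*).

def approx_greedy_step(f_hat, f, X, ground):
    """Pick the element maximizing the estimated gain; return it with its achieved alpha."""
    est_gains = {x: f_hat(X | {x}) - f_hat(X) for x in ground - X}
    chosen = max(est_gains, key=est_gains.get)

    true_gains = {x: f(X | {x}) - f(X) for x in ground - X}
    best = max(true_gains.values())
    alpha = true_gains[chosen] / best if best > 0 else 1.0  # achieved alpha at this step
    return chosen, alpha
```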
A generalization of the notion of submodular set function is given by the submodularity ratio (Das and Kempe 2011), which measures how far the function is from being submodular. This ratio is defined as the largest scalar \(\lambda \in [0,1]\) such that
$$\begin{aligned} \sum _{z\in Z\setminus X} \big ( f(X\cup \{z\})-f(X) \big ) \ge \lambda \big ( f(X\cup Z)-f(X) \big ), \end{aligned}$$
for all \(X,Z\subset \varOmega \). The function f is called weakly submodular if it has a submodularity ratio in (0, 1]. If a function f is submodular, then its submodularity ratio is 1. The following result (Das and Kempe 2011) generalizes Theorem A.1 to monotone set functions with submodularity ratio \(\lambda \).
Theorem A.3
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, weakly submodular with submodularity ratio \(\lambda \in (0,1]\), and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies
$$\begin{aligned} f(X) \ge \big (1-e^{-\lambda }\big ) f(X^*). \end{aligned}$$
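Because the submodularity ratio is a worst case over pairs of sets, it can be evaluated by brute force on small ground sets. The sketch below is illustrative only (the enumeration is exponential in \(|\varOmega |\)); it returns the largest \(\lambda \) satisfying the inequality above.

```python
# Illustrative sketch: brute-force submodularity ratio of a set function f on a small
# ground set. For every pair (X, Z), compare the sum of singleton gains from Z \ X
# against the joint gain f(X ∪ Z) - f(X); lambda is the worst-case ratio.
from itertools import combinations

def all_subsets(ground):
    for r in range(len(ground) + 1):
        yield from (set(c) for c in combinations(ground, r))

def submodularity_ratio(f, ground):
    lam = 1.0
    for X in all_subsets(ground):
        for Z in all_subsets(ground):
            joint = f(X | Z) - f(X)
            if joint <= 0:          # constraint is vacuous for non-positive joint gain
                continue
            singles = sum(f(X | {z}) - f(X) for z in Z - X)
            lam = min(lam, singles / joint)
    return lam  # equals 1 whenever f is submodular
```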
Scenario optimization
Scenario optimization aims to determine robust solutions for practical problems with unknown parameters (Ben-Tal and Nemirovski 1998; Ghaoui et al. 1998) by hedging against uncertainty. Consider the following robust convex optimization problem defined by
$$\begin{aligned} \min _{\gamma \in \mathbb {R}^{d}} c^{\top }\gamma \quad \text {subject to} \quad f_{\delta }(\gamma )\le 0, \;\; \forall \delta \in \varDelta , \end{aligned}$$
(B.1)
where \(f_{\delta }\) is a convex function, d is the dimension of the optimization variable, \(\delta \) is an uncertain parameter, and \(\varDelta \) is the set of all possible parameter values. In practice, solving (B.1) can be difficult depending on the cardinality of \(\varDelta \). One approach is to solve (B.1) with constraint parameters sampled from \(\varDelta \). This approach views the uncertainty in the robust convex optimization problem through a probability distribution \(Pr^{\delta }\) over \(\varDelta \), which encodes either the likelihood or the importance of the situations described by the constraint parameters. To alleviate the computational load, one selects a finite number \(N_{\text {SCP}}\) of parameter values \(\delta ^{(1)},\dots ,\delta ^{(N_{\text {SCP}})}\) in \(\varDelta \), sampled according to \(Pr^{\delta }\), and solves the scenario convex program (Campi et al. 2009) defined by
$$\begin{aligned} \min _{\gamma \in \mathbb {R}^{d}} c^{\top }\gamma \quad \text {subject to} \quad f_{\delta ^{(i)}}(\gamma )\le 0, \;\; i=1,\dots ,N_{\text {SCP}}. \end{aligned}$$
(B.2)
The following result states to what extent the solution of (B.2) solves the original robust optimization problem.
Theorem B.1
Let \(\gamma ^*\) be the optimal solution to the scenario convex program (B.2) when \(N_{\text {SCP}}\) is the number of convex constraints. Given a ‘violation parameter’, \(\varepsilon \), and a ‘confidence parameter’, \(\varpi \), if
$$\begin{aligned} N_{\text {SCP}} \ge \frac{2}{\varepsilon }\Big ( \ln \frac{1}{\varpi } + d \Big ), \end{aligned}$$
then, with probability \(1-\varpi \), \(\gamma ^*\) satisfies all but an \(\varepsilon \)-fraction of constraints in \(\varDelta \).
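To make the sample-size condition concrete, the following sketch (illustrative only; the linear constraint family, the cost vector, and the uniform distribution over \(\delta \) are hypothetical choices, not taken from the paper) computes \(N_{\text {SCP}}\) from \(\varepsilon \), \(\varpi \), and d, and solves the resulting scenario program as a linear program.

```python
# Illustrative sketch: sample size from Theorem B.1 and a toy scenario program with
# linear constraints f_delta(gamma) = delta^T gamma - 1 <= 0 (hypothetical choice).
import numpy as np
from scipy.optimize import linprog

d = 2                       # dimension of the decision variable gamma
eps, varpi = 0.05, 1e-3     # violation and confidence parameters

# Theorem B.1: this many sampled scenarios suffice for the probabilistic guarantee.
N_scp = int(np.ceil(2.0 / eps * (np.log(1.0 / varpi) + d)))

rng = np.random.default_rng(0)
deltas = rng.uniform(0.5, 1.5, size=(N_scp, d))   # i.i.d. samples of the uncertain parameter

# Scenario convex program: minimize c^T gamma subject to deltas @ gamma <= 1, gamma >= 0.
c = np.array([-1.0, -1.0])                        # i.e., maximize gamma_1 + gamma_2
res = linprog(c, A_ub=deltas, b_ub=np.ones(N_scp), bounds=[(0, None)] * d)
print(N_scp, res.x)   # with prob. 1 - varpi, res.x violates at most an eps-fraction of constraints
```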
List of symbols
- \(({\mathbb {Z}}_{\ge 1}){\mathbb {Z}}\): (Non-negative) integers
- \(({\mathbb {R}}_{>0})\mathbb {R}\): (Positive) real numbers
- |Y|: Cardinality of set Y
- \(s\in S\): State/state space
- \(a\in A\): Action/action space
- \(Pr^s\): Transition function
- r, R: Reward, reward function
- \((\pi ^*)\pi \): (Optimal) policy
- \(V^{\pi }\): Value of a state given policy \(\pi \)
- \(\gamma \): Discount factor
- \(\alpha ,\beta \): Agent indices
- \({\mathcal {A}}\): Set of agents
- \(o\in {\mathcal {O}}^{b}\in {\mathcal {O}}\): Waypoint/objective/objective set
- \({\mathcal {E}}\): Environment
- \(x\in \varOmega _x\): Region/region set
- \({\mathcal {O}}^{b}_{i}\): Set of waypoints of an objective in a region
- \(s^{b}_{i}\): Abstracted objective state
- \(s_{i}\): Regional state
- \(\tau \): Task
- \(\varGamma \): Set of feasible tasks
- \(\overrightarrow{x}\): Ordered list of regions
- \(\xi _{\overrightarrow{x}}\): Repeated region list
- \(\phi _t(\cdot ,\cdot )\): Time abstraction function
- \((\epsilon _k)\epsilon \): (Partial) sub-environment
- \(N_{\epsilon }\): Size of a sub-environment
- \(\overrightarrow{\tau }\): Ordered list of tasks
- \(\overrightarrow{\text {Pr}^t}\): Ordered list of probability distributions
- \((\vartheta ^p_{\beta })\vartheta \): (Partial) task trajectory
- \({\mathcal {I}}_{\alpha }\): Interaction set of agent \(\alpha \)
- \({\mathcal {N}}\): Max size of interaction set
- \(\theta \): Claimed regional objective
- \(\varTheta _{\alpha }\): Claimed objective set of agent \(\alpha \)
- \(\varTheta ^{{\mathcal {A}}}\): Global claimed objective set
- \(\varXi \): Interaction matrix
- Q: Value for choosing \(\tau \) given \(\epsilon .s\)
- \({\hat{Q}}\): Estimated value for choosing \(\tau \) given \(\epsilon .s\)
- N: Number of simulations in \(\epsilon .s\)
- \(N_{{\mathcal {O}}^{b}}\): Number of simulations of an objective in \(\epsilon .s\)
- t: Time
- T: Multi-agent expected discounted reward per task
- \(\lambda \): Submodularity ratio
- \({\hat{\lambda }}\): Approximate submodularity ratio
- \(f_{x}\): Sub-environment search value set function
- \(X_{x}\subset Y_{x}\subset \varOmega _x\): Finite set of regions
- \(f_{\vartheta }\): Sequential multi-agent deployment value set function
- \(X_{\vartheta }\subset Y_{\vartheta }\subset \varOmega _{\vartheta }\): Finite set of trajectories
- \(\varpi \): Confidence parameter
- \(\varepsilon \): Violation parameter
Cite this article
Ma, A., Ouimet, M. & Cortés, J. Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning. Auton Robot 44, 485–503 (2020). https://doi.org/10.1007/s10514-019-09871-2