Abstract
We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environment, termed ‘sub-environments’, and characterizes the algorithm’s performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulation results show significant improvement over using a standard Monte Carlo tree search in an environment with large state and action spaces.









Notes
Pseudocode of functions denoted with \(\dagger \) is omitted but described in detail.
References
Agha-mohammadi, A. A., Chakravorty, S., & Amato, N. M. (2011). FIRM: Feedback controller-based information-state roadmap-a framework for motion planning under uncertainty. In IEEE/RSJ international conference on intelligent robots and systems (pp. 4284–4291). San Francisco, CA.
Bai, A., Srivastava, S., & Russell, S. (2016). Markovian state and action abstractions for MDPs via hierarchical MCTS. In Proceedings of the twenty-fifth international joint conference on artificial intelligence (pp. 3029–3039). New York, NY: IJCAI.
Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34–37.
Ben-Tal, A., & Nemirovski, A. (1998). Robust convex optimization. Mathematics of Operations Research, 23, 769–805.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont: Athena Scientific.
Bian, A. A., Buhmann, J. M., Krause, A., & Tschiatschek, S. (2017). Guarantees for greedy maximization of non-submodular functions with applications. In International conference on machine learning Vol. 70 (pp. 498–507). Sydney.
Blum, A., Chawla, S., Karger, D. R., Lane, T., Meyerson, A., & Minkoff, M. (2007). Approximation algorithms for orienteering and discounted-reward TSP. SIAM Journal on Computing, 37(2), 653–670.
Broz, F., Nourbakhsh, I., & Simmons, R. (2008). Planning for human–robot interaction using time-state aggregated POMDPs. AAAI, 8, 1339–1344.
Bullo, F., Cortés, J., & Martínez, S. (2009). Distributed control of robotic networks. Applied mathematics series. Princeton, NJ: Princeton University Press.
Campi, M. C., Garatti, S., & Prandini, M. (2009). The scenario approach for systems and control design. Annual Reviews in Control, 32(2), 149–157.
Clark, A., Alomair, B., Bushnell, L., & Poovendran, R. (2016). Submodularity in dynamics and control of networked systems. Communications and control engineering. New York: Springer.
Cortés, J., & Egerstedt, M. (2017). Coordinated control of multi-robot systems: A survey. SICE Journal of Control, Measurement, and System Integration, 10(6), 495–503.
Das, A., & Kempe, D. (2011). Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. CoRR.
Das, J., Py, F., Harvey, J. B. J., Ryan, J. P., Gellene, A., Graham, R., et al. (2015). Data-driven robotic sampling for marine ecosystem monitoring. The International Journal of Robotics Research, 34(12), 1435–1452.
Dunbabin, M., & Marques, L. (2012). Robots for environmental monitoring: Significant advancements and applications. IEEE Robotics and Automation Magazine, 19(1), 24–39.
Gerkey, B. P., & Mataric, M. J. (2004). A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9), 939–954.
Ghaoui, L. E., Oustry, F., & Lebret, H. (1998). Robust solutions to uncertain semidefinite programs. SIAM Journal on Optimization, 9(1), 33–52.
Goundan, P. R., & Schulz, A. S. (2007). Revisiting the greedy approach to submodular set function maximization. Optimization Online (pp. 1–25).
Hansen, E. A., & Feng, Z. (2000). Dynamic programming for POMDPs using a factored state representation. In International conference on artificial intelligence planning systems (pp. 130–139). Breckenridge, CO.
Howard, R. (1960). Dynamic programming and Markov processes. Cambridge: M.I.T. Press.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML Vol. 6 (pp. 282–293). Springer.
LaValle, S. M., & Kuffner, J. J. (2000). Rapidly-exploring random trees: Progress and prospects. In Workshop on algorithmic foundations of robotics (pp. 293–308). Dartmouth, NH.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1), 47–65.
Ma, A., Ouimet, M., & Cortés, J. (2017). Dynamic domain reduction for multi-agent planning. In International symposium on multi-robot and multi-agent systems (pp. 142–149). Los Angeles, CA.
McCallum, A. K., & Ballard, D. (1996). Reinforcement learning with selective perception and hidden state. Ph.D. Dissertation, Department of Computer Science, University of Rochester.
Mesbahi, M., & Egerstedt, M. (2010). Graph theoretic methods in multiagent networks. Applied mathematics series. Princeton: Princeton University Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Nemhauser, G., Wolsey, L., & Fisher, M. (1978). An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14, 265–294.
Oliehoek, F. A., & Amato, C. (2016). A concise introduction to decentralized POMDPs. SpringerBriefs in intelligent systems. New York: Springer.
Omidshafiei, S., Agha-mohammadi, A. A., Amato, C., & How, J. P. (2015). Decentralized control of partially observable Markov decision processes using belief space macro-actions. In IEEE international conference on robotics and automation (pp. 5962–5969). Seattle, WA.
Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.
Parr, R., & Russell, S. (1998). Hierarchical control and learning for Markov decision processes. Berkeley, CA: University of California.
Prentice, S., & Roy, N. (2010). The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance. In Robotics research (pp. 293–305). Springer.
Puterman, M. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897). Lille, France.
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.
Theocharous, G., & Kaelbling, L. P. (2004). Approximate planning in POMDPs with macro-actions. In Advances in neural information processing systems (pp. 775–782).
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems Vol. 30 (pp. 5285–5294).
Acknowledgements
This work was supported by ONR Award N00014-16-1-2836. The authors would like to thank the organizers of the International Symposium on Multi-Robot and Multi-Agent Systems (MRS 2017), which provided us with the opportunity to obtain valuable feedback on this research, and the reviewers.
Additional information
A preliminary version of this work appeared as Ma et al. (2017) at the International Symposium on Multi-Robot and Multi-Agent Systems.
This is one of several papers published in Autonomous Robots comprising the Special Issue on Multi-Robot and Multi-Agent Systems.
Appendices
Submodularity
We review here concepts of submodularity and monotonicity of set functions following Clark et al. (2016). A power set function \(f:2^\varOmega \rightarrow \mathbb {R}\) is submodular if it satisfies the property of diminishing returns,
$$\begin{aligned} f(X\cup \{x\})-f(X) \ge f(Y\cup \{x\})-f(Y), \end{aligned}$$
(A.1)
for all \(X\subseteq Y\subseteq \varOmega \) and \(x\in \varOmega \setminus Y\). The set function f is monotone if
$$\begin{aligned} f(X) \le f(Y), \end{aligned}$$
(A.2)
for all \(X\subseteq Y \subseteq \varOmega \). In general, monotonicity of a set function does not imply submodularity, and vice versa. These properties play a key role in determining near-optimal solutions to the cardinality-constrained submodular maximization problem defined by
$$\begin{aligned} \max _{X\subseteq \varOmega ,\; |X|\le k} f(X). \end{aligned}$$
(A.3)
In general, this problem is NP-hard. Greedy algorithms seek to find a suboptimal solution to (A.3) by building a set X one element at a time, starting from \(|X|=0\) and ending with \(|X|=k\). These algorithms proceed by choosing the best next element,
$$\begin{aligned} x^* \in \mathop {\text {argmax}}\limits _{x\in \varOmega \setminus X} \big ( f(X\cup \{x\})-f(X) \big ), \end{aligned}$$
to include in X. The following result (Clark et al. 2016; Nemhauser et al. 1978) provides a lower bound on the performance of greedy algorithms.
Theorem A.1
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies
$$\begin{aligned} f(X) \ge \Big (1-\frac{1}{e}\Big ) f(X^*). \end{aligned}$$
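For concreteness, the following minimal Python sketch (not taken from the paper; the coverage function and all identifiers are hypothetical) implements the greedy rule above and can be checked against the \(1-1/e\) guarantee.

```python
# Minimal sketch (not from the paper): greedy maximization of a monotone submodular
# set function under a cardinality constraint |X| <= k. The coverage function below
# is a hypothetical example of such an f.

def greedy_max(f, ground, k):
    """Grow X one element at a time, always adding the element of largest marginal gain."""
    X = set()
    for _ in range(min(k, len(ground))):
        # marginal gain of each remaining candidate
        gains = {x: f(X | {x}) - f(X) for x in ground - X}
        X.add(max(gains, key=gains.get))
    return X

# Example: f(X) = number of targets covered by the chosen regions (hypothetical data).
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"e"}}
f = lambda X: len(set().union(*(covers[x] for x in X))) if X else 0

X = greedy_max(f, set(covers), k=2)
print(X, f(X))  # Theorem A.1: f(X) >= (1 - 1/e) * f(X*)
```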
An important extension of this result characterizes the performance of a greedy algorithm where, at each step, one chooses an element x that satisfies
$$\begin{aligned} f(X\cup \{x\})-f(X) \ge \alpha \big ( f(X\cup \{x^*\})-f(X) \big ), \end{aligned}$$
for some \(\alpha \in [0,1]\). That is, the algorithm chooses an element whose marginal gain is at least an \(\alpha \)-fraction of that of the locally optimal choice, \(x^*\). In this case, the following result (Goundan and Schulz 2007) characterizes the performance.
Theorem A.2
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, submodular, and satisfies \(f(\emptyset )=0\), then the set X returned by a greedy algorithm that chooses, at each step, an element achieving at least an \(\alpha \)-fraction of the locally optimal marginal gain satisfies
$$\begin{aligned} f(X) \ge \big (1-e^{-\alpha }\big ) f(X^*). \end{aligned}$$
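In applications, marginal gains are often only available through estimates, for instance from sampled rollouts. The sketch below is a hypothetical illustration, not the paper's procedure: it performs one greedy step using an approximate oracle f_hat and reports the \(\alpha \)-fraction actually achieved with respect to the true function, which is the quantity entering Theorem A.2.

```python
# Hypothetical sketch: one greedy step driven by an approximate value oracle f_hat.
# If every step achieves at least an alpha-fraction of the best true marginal gain,
# Theorem A.2 guarantees f(X) >= (1 - e^{-alpha}) * f(X*).

def approx_greedy_step(f_hat, f, X, ground):
    """Pick the element maximizing the estimated gain; return it with its achieved alpha."""
    est_gains = {x: f_hat(X | {x}) - f_hat(X) for x in ground - X}
    chosen = max(est_gains, key=est_gains.get)

    true_gains = {x: f(X | {x}) - f(X) for x in ground - X}
    best = max(true_gains.values())
    alpha = true_gains[chosen] / best if best > 0 else 1.0  # achieved alpha at this step
    return chosen, alpha
```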
A generalization of the notion of submodular set function is given by the submodularity ratio (Das and Kempe 2011), which measures how far the function is from being submodular. This ratio is defined as the largest scalar \(\lambda \in [0,1]\) such that
$$\begin{aligned} \sum _{z\in Z\setminus X} \big ( f(X\cup \{z\})-f(X) \big ) \ge \lambda \big ( f(X\cup Z)-f(X) \big ), \end{aligned}$$
for all \(X,Z\subset \varOmega \). The function f is called weakly submodular if it has a submodularity ratio in (0, 1]. If a function f is submodular, then its submodularity ratio is 1. The following result (Das and Kempe 2011) generalizes Theorem A.1 to monotone set functions with submodularity ratio \(\lambda \).
Theorem A.3
Let \(X^*\) denote the optimal solution of problem (A.3). If f is monotone, weakly submodular with submodularity ratio \(\lambda \in (0,1]\), and satisfies \(f(\emptyset )=0\), then the set X returned by the greedy algorithm satisfies
$$\begin{aligned} f(X) \ge \big (1-e^{-\lambda }\big ) f(X^*). \end{aligned}$$
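Because the submodularity ratio is a worst case over pairs of sets, it can be evaluated by brute force on small ground sets. The sketch below is illustrative only (the enumeration is exponential in \(|\varOmega |\)); it returns the largest \(\lambda \) satisfying the inequality above.

```python
# Illustrative sketch: brute-force submodularity ratio of a set function f on a small
# ground set. For every pair (X, Z), compare the sum of singleton gains from Z \ X
# against the joint gain f(X ∪ Z) - f(X); lambda is the worst-case ratio.
from itertools import combinations

def all_subsets(ground):
    for r in range(len(ground) + 1):
        yield from (set(c) for c in combinations(ground, r))

def submodularity_ratio(f, ground):
    lam = 1.0
    for X in all_subsets(ground):
        for Z in all_subsets(ground):
            joint = f(X | Z) - f(X)
            if joint <= 0:          # constraint is vacuous for non-positive joint gain
                continue
            singles = sum(f(X | {z}) - f(X) for z in Z - X)
            lam = min(lam, singles / joint)
    return lam  # equals 1 whenever f is submodular
```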
Scenario optimization
Scenario optimization aims to determine robust solutions for practical problems with unknown parameters (Ben-Tal and Nemirovski 1998; Ghaoui et al. 1998) by hedging against uncertainty. Consider the following robust convex optimization problem defined by
$$\begin{aligned} \min _{\gamma \in \mathbb {R}^{d}} c^{\top }\gamma \quad \text {subject to} \quad f_{\delta }(\gamma )\le 0, \;\; \forall \delta \in \varDelta , \end{aligned}$$
(B.1)
where \(f_{\delta }\) is a convex function, d is the dimension of the optimization variable, \(\delta \) is an uncertain parameter, and \(\varDelta \) is the set of all possible parameter values. In practice, solving (B.1) can be difficult depending on the cardinality of \(\varDelta \). One approach is to solve (B.1) with constraint parameters sampled from \(\varDelta \). This approach views the uncertainty in the robust convex optimization problem through a probability distribution \(Pr^{\delta }\) over \(\varDelta \), which encodes either the likelihood or the importance of the situations described by the constraint parameters. To alleviate the computational load, one selects a finite number \(N_{\text {SCP}}\) of parameter values \(\delta ^{(1)},\dots ,\delta ^{(N_{\text {SCP}})}\) in \(\varDelta \), sampled according to \(Pr^{\delta }\), and solves the scenario convex program (Campi et al. 2009) defined by
$$\begin{aligned} \min _{\gamma \in \mathbb {R}^{d}} c^{\top }\gamma \quad \text {subject to} \quad f_{\delta ^{(i)}}(\gamma )\le 0, \;\; i=1,\dots ,N_{\text {SCP}}. \end{aligned}$$
(B.2)
The following result states to what extent the solution of (B.2) solves the original robust optimization problem.
Theorem B.1
Let \(\gamma ^*\) be the optimal solution to the scenario convex program (B.2) when \(N_{\text {SCP}}\) is the number of convex constraints. Given a ‘violation parameter’, \(\varepsilon \), and a ‘confidence parameter’, \(\varpi \), if
$$\begin{aligned} N_{\text {SCP}} \ge \frac{2}{\varepsilon }\Big ( \ln \frac{1}{\varpi } + d \Big ), \end{aligned}$$
then, with probability \(1-\varpi \), \(\gamma ^*\) satisfies all but an \(\varepsilon \)-fraction of constraints in \(\varDelta \).
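To make the sample-size condition concrete, the following sketch (illustrative only; the linear constraint family, the cost vector, and the uniform distribution over \(\delta \) are hypothetical choices, not taken from the paper) computes \(N_{\text {SCP}}\) from \(\varepsilon \), \(\varpi \), and d, and solves the resulting scenario program as a linear program.

```python
# Illustrative sketch: sample size from Theorem B.1 and a toy scenario program with
# linear constraints f_delta(gamma) = delta^T gamma - 1 <= 0 (hypothetical choice).
import numpy as np
from scipy.optimize import linprog

d = 2                       # dimension of the decision variable gamma
eps, varpi = 0.05, 1e-3     # violation and confidence parameters

# Theorem B.1: this many sampled scenarios suffice for the probabilistic guarantee.
N_scp = int(np.ceil(2.0 / eps * (np.log(1.0 / varpi) + d)))

rng = np.random.default_rng(0)
deltas = rng.uniform(0.5, 1.5, size=(N_scp, d))   # i.i.d. samples of the uncertain parameter

# Scenario convex program: minimize c^T gamma subject to deltas @ gamma <= 1, gamma >= 0.
c = np.array([-1.0, -1.0])                        # i.e., maximize gamma_1 + gamma_2
res = linprog(c, A_ub=deltas, b_ub=np.ones(N_scp), bounds=[(0, None)] * d)
print(N_scp, res.x)   # with prob. 1 - varpi, res.x violates at most an eps-fraction of constraints
```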
List of symbols
- \(({\mathbb {Z}}_{\ge 1}){\mathbb {Z}}\): (Non-negative) integers
- \(({\mathbb {R}}_{>0})\mathbb {R}\): (Positive) real numbers
- |Y|: Cardinality of set Y
- \(s\in S\): State/state space
- \(a\in A\): Action/action space
- \(Pr^s\): Transition function
- r, R: Reward, reward function
- \((\pi ^*)\pi \): (Optimal) policy
- \(V^{\pi }\): Value of a state given policy \(\pi \)
- \(\gamma \): Discount factor
- \(\alpha ,\beta \): Agent indices
- \({\mathcal {A}}\): Set of agents
- \(o\in {\mathcal {O}}^{b}\in {\mathcal {O}}\): Waypoint/objective/objective set
- \({\mathcal {E}}\): Environment
- \(x\in \varOmega _x\): Region/region set
- \({\mathcal {O}}^{b}_{i}\): Set of waypoints of an objective in a region
- \(s^{b}_{i}\): Abstracted objective state
- \(s_{i}\): Regional state
- \(\tau \): Task
- \(\varGamma \): Set of feasible tasks
- \(\overrightarrow{x}\): Ordered list of regions
- \(\xi _{\overrightarrow{x}}\): Repeated region list
- \(\phi _t(\cdot ,\cdot )\): Time abstraction function
- \((\epsilon _k)\epsilon \): (Partial) sub-environment
- \(N_{\epsilon }\): Size of a sub-environment
- \(\overrightarrow{\tau }\): Ordered list of tasks
- \(\overrightarrow{\text {Pr}^t}\): Ordered list of probability distributions
- \((\vartheta ^p_{\beta })\vartheta \): (Partial) task trajectory
- \({\mathcal {I}}_{\alpha }\): Interaction set of agent \(\alpha \)
- \({\mathcal {N}}\): Max size of interaction set
- \(\theta \): Claimed regional objective
- \(\varTheta _{\alpha }\): Claimed objective set of agent \(\alpha \)
- \(\varTheta ^{{\mathcal {A}}}\): Global claimed objective set
- \(\varXi \): Interaction matrix
- Q: Value for choosing \(\tau \) given \(\epsilon .s\)
- \({\hat{Q}}\): Estimated value for choosing \(\tau \) given \(\epsilon .s\)
- N: Number of simulations in \(\epsilon .s\)
- \(N_{{\mathcal {O}}^{b}}\): Number of simulations of an objective in \(\epsilon .s\)
- t: Time
- T: Multi-agent expected discounted reward per task
- \(\lambda \): Submodularity ratio
- \({\hat{\lambda }}\): Approximate submodularity ratio
- \(f_{x}\): Sub-environment search value set function
- \(X_{x}\subset Y_{x}\subset \varOmega _x\): Finite set of regions
- \(f_{\vartheta }\): Sequential multi-agent deployment value set function
- \(X_{\vartheta }\subset Y_{\vartheta }\subset \varOmega _{\vartheta }\): Finite set of trajectories
- \(\varpi \): Confidence parameter
- \(\varepsilon \): Violation parameter
Cite this article
Ma, A., Ouimet, M. & Cortés, J. Hierarchical reinforcement learning via dynamic subspace search for multi-agent planning. Auton Robot 44, 485–503 (2020). https://doi.org/10.1007/s10514-019-09871-2