A variable neighborhood search based algorithm for finite-horizon Markov Decision Processes
Introduction
The Markov Decision Process (MDP) framework for decision making, planning, and control captures the essence of purposeful activity in a wide variety of situations. It addresses settings where outcomes are partly random and partly under the control of the decision maker, and it underlies a wide range of optimization problems solved via dynamic programming and reinforcement learning. These models and the associated problems arise in many areas, including medical decision making, maintenance planning, robot navigation, and inventory management.
More precisely, a Markov Decision Process is a discrete-time stochastic control process for modelling sequential decision-making problems. At a specified point in time, a decision maker observes the state of a system and chooses an action. The action choice and the state produce two results: the decision maker receives an immediate reward (or incurs an immediate cost), and the system evolves probabilistically to a new state at the next discrete point in time, at which the decision maker faces a similar problem. The goal is to find a policy for choosing actions (dependent on the observed state and the history) that maximizes the rewards (or minimizes the costs) accumulated over a given time horizon (possibly infinite). The states of an MDP possess the Markov property: if the current state of the MDP at time t is known, the transition to a new state at time t + 1 is independent of all previous states.
There exist various classes of MDPs. In terms of the time horizon over which decisions are made, an MDP can be classified as infinite-horizon or finite-horizon. In this paper, we address finite-horizon MDPs (H stages, i.e., from stage 0 to stage H − 1) with known finite state space, finite action space, and known transition probabilities. It is well known that the optimal solution of a finite-horizon MDP can be computed by the standard backward dynamic programming recursion (standard DP algorithm for short), given the known solution at stage H (see, e.g., Puterman, 1994). The optimal policy of a finite-horizon MDP, under discounted or total expected rewards (or costs), is time-dependent: its choice of action depends on the current observation and on the number of steps the process has already performed.
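For reference, in the notation defined in Section 2 below (and assuming a reward of the form R(x, a) with transition probabilities P(x, a)(y)), given terminal values V_H, the backward recursion computes

\[
V_t(x) \;=\; \max_{a \in A(x)} \Bigl\{ R(x,a) \;+\; \sum_{y \in S} P(x,a)(y)\, V_{t+1}(y) \Bigr\}, \qquad t = H-1, \dots, 0,
\]

and the maximizing action at each pair (x, t) yields the optimal time-dependent policy; for cost criteria, the maximum is replaced by a minimum.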
Even though finite-horizon MDPs can be solved in time that grows polynomially in the number of states and actions (Blondel and Tsitsiklis, 2000), many problems of practical interest involve a very large number of states and/or actions, while the problem data are described succinctly in terms of a small number of parameters. As a result, the practical applicability of the standard DP algorithm to finite-horizon MDP problems is somewhat restricted, a phenomenon Bellman termed the "curse of dimensionality" [2]. Considerable research has been devoted to efficient methods for this problem.
Some methods focus on reducing the size of the state space ([1], [18], [23], [24], etc.). For example, Kearns et al. (2002) present an algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line near-optimal planning with a per-state running time that has no dependence on the number of states, but is still exponential in the horizon time (which depends on the discount factor and the desired degree of approximation to the optimal policy). The algorithm is based on the idea of sparse sampling: given any state x, it uses the generative model to draw samples for many state–action pairs, and uses these samples to compute a near-optimal action from x, which is then executed. The key idea throughout is to avoid enumerating the entire state space, an idea that has also proved effective for infinite-horizon MDPs ([3], [4], [12], etc.).
Since approaches that reduce the size of the state space generally still require searching the entire action space, problems with large action spaces may remain computationally challenging, in both the infinite-horizon and finite-horizon settings. The idea of using bounds on the optimal return function to identify actions that are not part of an optimal policy was introduced by MacQueen [19], and such a procedure has since been applied to several standard methods such as policy iteration and value iteration (see Puterman (1994) for a review), where the optimization is still conducted over the entire action space. Chang et al. [9] proposed a novel algorithm called evolutionary policy iteration (EPI) to solve MDPs under an infinite-horizon discounted reward criterion. EPI inherits the spirit of policy iteration but eliminates the need to maximize over the entire action space; it is especially targeted at problems where the state space is small but the action space is extremely large. More recently, Hu et al. (2007) approached this problem in a similar manner, with an evolutionary population-based approach called evolutionary random policy search (ERPS) for solving infinite-horizon discounted-cost MDPs. ERPS treats a given MDP as a sequence of sub-MDP problems, constructed iteratively from information obtained by random sampling of the entire action space and by local search. The elite policy obtained by solving each sub-MDP with a variant of the standard policy-improvement technique is used to construct a new population (a sample of the entire action space) for the next sub-MDP. It is shown that the sequence of elite policies converges to an optimal policy with probability one.
Unlike the above methods, which sample either the states or the actions, the recursive automata sampling algorithm (RASA) presented in Chang et al. (2007b) [11] for the control of finite-horizon (H-stage) MDPs samples both states and actions. In RASA, at each sampled state at a stage, a fixed sampling budget is allocated among the feasible actions according to the current probability estimate for the optimal action. RASA builds a sampled tree recursively to estimate the optimal value at an initial state x0 in a bottom-up fashion, using an adaptive sampling scheme over the action space while building the tree. The running-time complexity of RASA is O((maxi Ki)^H), where Ki is the total number of samples used per sampled state at stage i. In addition to RASA, the authors have developed other simulation-based algorithms for solving large finite-horizon MDPs [8], [10].
In this paper, to avoid the burden of enumerating the entire action space, we present a Variable Neighborhood Search (VNS) based metaheuristic for finding the optimal solution of finite-horizon (H-stage) MDPs. The VNSMDP algorithm is characterized by limiting and systematically changing the neighborhood set of the current solution of state x ∈ S at stage t ∈ H; an action is chosen randomly from this set and serves as the midpoint of an interval over which local search is conducted. The main idea behind VNSMDP is that, by inheriting the results obtained at stage t + 1, the search for the optimal solution of any state x at stage t is conducted within certain subsets of the action space, rather than over the whole action set. A subset is formed by specifying an interval whose midpoint is selected randomly from the neighborhood set of the current solution; the size of the neighborhood set is variable, depending on the performance of the iteration process. A sketch of this subset construction is given below.
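The following is a minimal sketch of the interval-shaped subset construction, not the paper's exact pseudo-code: it assumes a totally ordered action space indexed 0, …, ∣A∣ − 1 (as in the inventory example of Section 5), and all names are hypothetical.

```python
import random

def candidate_actions(current_action, k, radius, num_actions):
    """Illustrative sketch of the action subset searched in one VNSMDP step,
    assuming actions are indexed 0 .. num_actions - 1."""
    # k-th neighborhood of the incumbent action: all actions within distance k
    neighborhood = [a for a in range(num_actions)
                    if 0 < abs(a - current_action) <= k]
    # Shaking: draw the interval midpoint at random from that neighborhood
    midpoint = random.choice(neighborhood)
    # The subset actually searched is an interval of half-width `radius`
    # centered at the sampled midpoint
    lo, hi = max(0, midpoint - radius), min(num_actions - 1, midpoint + radius)
    return range(lo, hi + 1)
```

Growing k enlarges the neighborhood from which the midpoint is drawn, so unsuccessful iterations push the search progressively farther from the incumbent action.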
The VNSMDP algorithm follows the methodology of sampling the actions while enumerating the entire state space during the iteration process, and is thus targeted particularly at problems with a large action space. In contrast to ERPS [17], which is designed specifically for infinite-horizon MDPs, the VNSMDP algorithm is more flexible, varying the neighborhood sets intelligently. Compared with the simulation-based methods presented in [8], [10], [11], in which the number of sampled states is fixed and the number of sampled actions is one, sampling the items (actions, in VNSMDP) in a variable way can intelligently modify the search direction, thus reducing the risk of getting trapped in a local optimum.
The VNSMDP algorithm is, in fact, an application of a metaheuristic for complex optimization problems. Variable Neighborhood Search (VNS) is one of several metaheuristics designed for such problems; it systematically exploits the idea of neighborhood change, both in the descent to local optima and in the escape from the valleys that contain them [21]. The basic steps of the VNS metaheuristic are as follows:
Initialization. Select a set of neighborhood structures Nk, k = 1, …, kmax, and random distributions for the Shaking step that will be used in the search; find an initial solution x; choose a stopping condition.
Repeat the following sequence until the stopping condition is met:
- (1) Set k = 1;
- (2) Repeat the following steps until k > kmax:
  - (a) Shaking: generate a point y at random from the kth neighborhood of x (y ∈ Nk(x));
  - (b) Local search: apply some local search method with y as the initial solution to obtain a local optimum y′;
  - (c) Neighborhood change: if this local optimum is better than the incumbent, move there (x = y′) and continue the search with k = 1; otherwise, set k = k + 1.
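A minimal, self-contained sketch of this generic VNS loop follows; all components (objective, neighborhoods, local search) are placeholders, and maximization with an iteration-budget stopping rule is assumed.

```python
import random

def vns(x0, objective, neighborhoods, local_search, k_max, max_iters=1000):
    """Generic VNS loop (illustrative sketch; arguments are placeholders).

    neighborhoods(x, k) should return a non-empty list of points in the
    k-th neighborhood of x; local_search(y) should return a local optimum
    reachable from y.
    """
    x = x0
    for _ in range(max_iters):                    # stopping condition: iteration budget
        k = 1
        while k <= k_max:
            y = random.choice(neighborhoods(x, k))   # (a) shaking
            y_prime = local_search(y)                 # (b) local search
            if objective(y_prime) > objective(x):     # (c) neighborhood change
                x, k = y_prime, 1   # move to the improvement, reset to N1
            else:
                k += 1              # try a larger neighborhood
    return x
```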
VNS has been applied in many areas, mostly for solving combinatorial optimization problems. For numerous successful applications of VNS, see, e.g., the survey papers [13], [14], [15], [16]; for recent theoretical properties of VNS, see [6]. To our knowledge, however, the idea of applying the VNS algorithm to MDP problems is new. The significance of this paper lies not only in the originality of tackling MDPs with VNS, but also in presenting a framework for solving various kinds of MDPs with the VNS algorithm.
The rest of the paper is organized as follows. In Section 2, the problem addressed is described. The pseudo-code of the VNS algorithm for finite-horizon MDPs is outlined in Section 3. In Section 4, we analyze the complexity and convergence properties of VNSMDP. A computational analysis is conducted in Section 5, where a series of inventory problems is solved by the VNSMDP and standard DP algorithms. Finally, conclusions are given in Section 6.
The problem
Consider a finite-horizon MDP, M = (S, A, P, R, H), with finite state space S, finite action space A with ∣A∣ > 1, nonnegative reward function R, transition function P that maps a state–action pair to a probability distribution over S, and time horizon H (i.e., from 0 to H − 1). We denote by P(x, a)(y) the probability of transitioning to state y ∈ S when taking action a ∈ A in state x ∈ S, which is assumed to be independent of the stage t ∈ H. Any action is admissible only in the states where the limitations
The VNSMDP algorithm
We now present the Variable Neighborhood Search based algorithm for finite-horizon Markov Decision Processes (VNSMDP), which estimates the optimal value for a given initial state x0 ∈ S. In addition to the parameters introduced in Section 2, we list below the other parameters used in the VNSMDP algorithm.
- X: the state set in which the VNS procedure is conducted for each element (state) in turn. At the beginning of any VNS procedure, X = S.
- Bt(x): Bt(x) ⊆ A(x) is the action set for state x at stage t, x ∈ X
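To fix ideas before the detailed pseudo-code, the following is a rough, hypothetical sketch of how one backward stage of VNSMDP might combine these ingredients (reward-maximization form; the exact shaking and neighborhood-update rules of VNSMDP are abstracted into a vns_over_actions routine, and B[t][x] plays the role of Bt(x)):

```python
def vnsmdp_stage(t, X, B, R, P, V_next, vns_over_actions):
    """Hypothetical sketch of one backward stage of VNSMDP.

    For each state x in X, the best action is searched only within the
    restricted set B[t][x], instead of enumerating the whole action
    space as standard DP would.
    """
    V_t, policy_t = {}, {}
    for x in X:  # the entire state space is enumerated
        # One-step Q-value of action a at state x, reusing stage-(t+1) values
        def q(a):
            return R(x, a) + sum(P(x, a)(y) * V_next[y] for y in V_next)
        # VNS (shaking + local search over action intervals) within B[t][x]
        best_a = vns_over_actions(q, B[t][x])
        V_t[x], policy_t[x] = q(best_a), best_a
    return V_t, policy_t
```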
Running-time complexity analysis
If we define a one-time evaluation of the value function for a combination {x, a, y} at stage t (i.e., transitioning to state y at stage t + 1 when taking action a in state x at stage t, with x ∈ S, a ∈ A(x), t ∈ H) as one elementary operation of either the VNSMDP or the DP algorithm, we can draw the following conclusion.

Property 1. The running time of the VNSMDP algorithm does not exceed that of the standard DP algorithm.

Proof. It can be seen from the pseudo-code of the VNSMDP algorithm that, for state x ∈ X at stage t ∈ H, any action
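Under this counting convention, and assuming each distinct pair (x, a) is evaluated at most once per stage, the comparison can be sketched as

\[
T_{\mathrm{DP}} \;=\; \sum_{t=0}^{H-1} \sum_{x \in S} |A(x)|\,|S|,
\qquad
T_{\mathrm{VNSMDP}} \;\le\; \sum_{t=0}^{H-1} \sum_{x \in S} |B_t(x)|\,|S| \;\le\; T_{\mathrm{DP}},
\]

since Bt(x) ⊆ A(x) for every x ∈ S and t ∈ H.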
Numerical examples
To evaluate the VNSMDP algorithm, we conduct computational experiments on the finite-horizon inventory control problem with lost sales. The objective is to find the (non-stationary) policy that minimizes the expected costs, which comprise fixed, holding, and penalty costs. Demand is assumed to be a random variable following a discrete uniform distribution on [d1, d2]. Under the proposed inventory control policy, at any period (stage) t ∈ H, the decision process as well as the variation of
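A lost-sales inventory model of this kind can be sketched as follows; the cost structure matches the description above, but the parameter values and the capacity bound are illustrative, not those used in the paper's experiments.

```python
def one_period_cost(x, a, d, K=5.0, h=1.0, p=10.0, capacity=20):
    """One-period cost of a lost-sales inventory model (parameter values
    illustrative only).

    x: on-hand inventory, a: order quantity (the action), d: realized demand.
    K: fixed ordering cost, h: unit holding cost, p: unit lost-sales penalty.
    """
    fixed = K if a > 0 else 0.0
    y = min(x + a, capacity)        # inventory after the order arrives
    holding = h * max(y - d, 0)     # leftover stock carried to the next stage
    penalty = p * max(d - y, 0)     # unmet demand is lost, not backlogged
    return fixed + holding + penalty

def expected_cost(x, a, d1, d2, **params):
    """Expected one-period cost when demand ~ DiscreteUniform[d1, d2]."""
    demands = range(d1, d2 + 1)
    return sum(one_period_cost(x, a, d, **params) for d in demands) / len(demands)
```

In the corresponding MDP, the state is the on-hand inventory x, the action is the order quantity a, and the next state is max(min(x + a, capacity) − d, 0).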
Conclusions
In this paper, we employ the framework of the Variable Neighborhood Search (VNS) metaheuristic and design a VNS-based solution procedure for finite-horizon (H-stage) Markov Decision Processes (MDPs). Compared with the standard DP algorithm, the proposed VNSMDP algorithm obtains high-quality, often optimal, solutions in less computational time. Its efficiency and robustness are also demonstrated by solving a series of inventory decision problems.
The VNSMDP algorithm is characterized
Acknowledgements
We are grateful to the authors of [7] for providing the original code of the standard DP algorithm. The work for this paper was supported by the National Natural Science Foundation of China under Projects No. 70771001 and No. 70821061. The work was also supported by the New Century Excellent Talents in University program of China under Project No. NCET-07-0049 and by the China Scholarship Council (CSC) under Project No. 2005A03010.
References
- A decomposition algorithm for limiting average Markov decision problems, Operations Research Letters (2003).
- V.D. Blondel, J.N. Tsitsiklis, A survey of computational complexity results in systems and control, Automatica (2000).
- Variable neighborhood search, European Journal of Operational Research (2008).
- N. Mladenović, P. Hansen, Variable neighborhood search, Computers & Operations Research (1997).
- R. Bellman, Dynamic Programming, Princeton University Press (1957).
- D.P. Bertsekas, D.A. Castañón, Adaptive aggregation methods for infinite horizon dynamic programming, IEEE Transactions on Automatic Control (1989).
- J. Brimberg, P. Hansen, N. Mladenović, Attraction probabilities in variable neighborhood search, 4OR: A Quarterly Journal of Operations Research (2010).
- I. Chadès, M.J. Cros, F. Garcia, R. Sabbadin, Markov Decision Process (MDP) Toolbox v2.0 for MATLAB....
- H.S. Chang, M.C. Fu, J. Hu, S.I. Marcus, An adaptive sampling algorithm for solving Markov decision processes, Operations Research (2005).
- H.S. Chang, H.-G. Lee, M.C. Fu, S.I. Marcus, Evolutionary policy iteration for solving Markov decision processes, IEEE Transactions on Automatic Control (2005).
- An asymptotically efficient simulation-based algorithm for finite horizon stochastic dynamic programming, IEEE Transactions on Automatic Control (2007).