Stochastics and Statistics
A policy gradient method for semi-Markov decision processes with application to call admission control

https://doi.org/10.1016/j.ejor.2006.02.023

Abstract

Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which the SMDP performance criterion is optimised with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation. We apply our algorithm to call admission control. Our proposed policy gradient SMDP algorithm and its application to admission control are novel.

Introduction

A semi-Markov decision process (SMDP) can be solved using classical methods such as dynamic programming (DP). However, these classical methods suffer from the curse of dimensionality: the memory required to represent the optimal value function grows with the size of the SMDP state space and becomes prohibitive for problems with large state and action spaces. Classical methods also require knowledge of the transition probabilities of the SMDP or, equivalently, a model. In applications such as call admission control (CAC), the state space is large and the SMDP transition probabilities are not known a priori [13], [14]. The SMDP framework is widely used to solve quality-of-service provisioning problems in communication networks, be they landline or wireless [9], [13], [21]. This includes both routing and CAC. Other applications of the SMDP framework include inventory routing [1], preventive maintenance [10] and robotics [16].

To overcome the memory requirement and the need for a system model, reinforcement learning (RL) with function approximation is used [5]. RL combines DP and stochastic approximation (SA) to learn the optimal value function online from a family of compactly represented parameterised functions. RL for MDPs is well studied [5] and has recently been extended to SMDPs [6], [10], [11]. These works are value-function-based approaches. In [6], a discounted cost SMDP problem is solved using real-time DP that builds a system model. Gosavi [10] solves an average cost SMDP problem using Q-learning, while Marbach et al. [11] approximate the value function and use a "greedy" policy with respect to the approximated value function. Inspired by the work in [2], [15], we present an alternative approach in which the policy of a SMDP is represented by its own function approximator, yielding a family of parameterised policies. The gradient of the SMDP performance criterion with respect to the policy parameters is then estimated online. Using this estimated gradient, a SA algorithm updates the policy parameters to improve performance. This approach, known as the policy gradient method [2], [15], also overcomes the memory requirement and the need for a model. In this paper, the performance criterion used is the average cost, and we prove the convergence of the derived SMDP gradient estimator under mild regularity conditions. We apply our algorithm to CAC and, in simulations, demonstrate convergence of the online algorithm to the optimal policy. Our proposed policy gradient SMDP algorithm and its application to admission control are novel.
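To make the policy gradient idea above concrete, the following is a minimal sketch, not the paper's algorithm, of a score-function gradient step for a softmax-parameterised randomised policy with a stochastic-approximation update. The function names, the discount factor `beta` and the `step_size` value are illustrative assumptions.

```python
import numpy as np

def softmax_probs(theta, features):
    """Action probabilities of a softmax (Boltzmann) policy u(theta, x, .).

    `features` is a (num_actions, K) array of state-action features; this
    parameterisation is an illustrative choice, not the paper's.
    """
    prefs = features @ theta
    prefs = prefs - prefs.max()          # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def policy_gradient_step(theta, features, action, cost, eligibility,
                         beta=0.9, step_size=1e-3):
    """One stochastic-approximation update of the policy parameters.

    `eligibility` accumulates the discounted score (gradient of log u),
    in the spirit of the discounted score method of [2]; `beta` and
    `step_size` are illustrative tuning constants.
    """
    probs = softmax_probs(theta, features)
    score = features[action] - probs @ features     # grad_theta log u(theta, x, a)
    eligibility = beta * eligibility + score
    theta = theta - step_size * cost * eligibility  # descent on the average cost
    return theta, eligibility
```

The discounted eligibility trace here plays the role of the discounted score in [2]; the estimator actually used for the SMDP case is derived in Section 3.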

The advantage of the policy gradient framework is that it can be applied to optimise a SMDP subject to average cost constraints by using recent results on constrained stochastic approximation [19]. This problem is important in CAC for multi-class networks as the constraints are used to impose upper bounds on the blocking probability of the various user classes. Additionally, we consider the general case where we assume no knowledge of the semi-Markov kernel or quantities derived from it (see Section 2.1 for details). Hence, estimating the SMDP gradient does not simplify to estimating the realisation matrix only, as done in [7].

Section 2 presents the average cost SMDP problem and the main algorithms of the paper. The first is an algorithm to estimate the SMDP objective function gradient while the second is an algorithm that optimises it. Both are single sample path based online algorithms. A derivation of the SMDP objective function gradient estimator as well as its convergence is studied in Section 3. The application to CAC with numerical examples is given in Section 4. All proofs appear in Appendix A.

Section snippets

Problem formulation and the main algorithm

Let $\theta = [\theta_1, \ldots, \theta_K]^{\mathrm{T}} \in \mathbb{R}^K$ be the parameter that determines the control policy in effect. For each $\theta$, let $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k \geq 0}$ be a controlled Markov chain, where $x_k^\theta \in X \triangleq \{1, \ldots, n\}$; $X$ being the state space of the embedded chain $\{x_k^\theta\}_{k \geq 0}$. $a_k^\theta \in A$ is the control (or action) applied at decision epoch $k$, where the control space $A$ is finite. $\tau_k^\theta \in \mathbb{R}^+ \triangleq [0, \infty)$ is the state dwell time of the continuous-time semi-Markov process $\{(x^\theta(t), a^\theta(t))\}_{t \in \mathbb{R}^+}$ defined by forming the right-continuous interpolations of $\{x_k^\theta\}_{k \geq 0}$ and $\{a_k^\theta\}_{k \geq 0}$
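For readers of this snippet, the right-continuous interpolation referred to above is, in the usual convention (stated here as an assumption; the paper's precise definition is in the full text):

```latex
% Jump times of the embedded chain and the right-continuous interpolation
T_0^\theta = 0, \qquad T_{k+1}^\theta = T_k^\theta + \tau_k^\theta,
\qquad
x^\theta(t) = x_k^\theta, \;\; a^\theta(t) = a_k^\theta
\quad \text{for } t \in [T_k^\theta, T_{k+1}^\theta).
```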

Convergence of algorithm SMDPBG

By virtue of Assumption 1, $\eta_c$ and $\eta_\tau$ are performance criteria for average cost MDPs with cost functions $c(x_k, a_k)\tau(x_k, a_k)$ and $\tau(x_k, a_k)$, respectively, i.e.,
$$\eta_c(\theta) = \lim_{T \to \infty} T^{-1} E_\theta\!\left\{ \sum_{k=0}^{T-1} c(x_k, a_k)\tau(x_k, a_k) \right\},$$
and similarly for $\eta_\tau(\theta)$. (We have dropped the superscript $\theta$ on the processes in favour of the subscript on the expectation operator $E$ to indicate the randomised policy $\theta$ is in use.) Define the $i$th component of $\bar{c}(\theta) \in \mathbb{R}^{|X|}$ to be
$$\bar{c}(\theta, i) \triangleq E_\theta\{ c(x_k, a_k)\tau(x_k, a_k) \mid x_k = i \} = \sum_{a \in A} c(i, a)\tau(i, a)\,\tilde{u}(\theta, i, a).$$
$\bar{c}$ is the “smoothed”
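The quantities $\eta_c$ and $\eta_\tau$ typically enter the SMDP average-cost criterion as a ratio, and the gradient of that ratio follows from the quotient rule. The display below states this standard relation as context; the paper's exact criterion and estimator appear in the full text.

```latex
% Standard SMDP average-cost-per-unit-time criterion (ratio form) and its gradient
\eta(\theta) \;=\; \frac{\eta_c(\theta)}{\eta_\tau(\theta)},
\qquad
\nabla_\theta \eta(\theta)
\;=\;
\frac{\eta_\tau(\theta)\,\nabla_\theta \eta_c(\theta)
      \;-\; \eta_c(\theta)\,\nabla_\theta \eta_\tau(\theta)}
     {\eta_\tau(\theta)^2}.
```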

Application of OLSMDP to call admission control

Consider a multi-service loss network that serves user classes $\{1, 2, \ldots, K\}$. A class $i$ user arrives according to a Poisson process with intensity $\lambda_i$, has a service time exponentially distributed with mean $1/\mu_i$ and uses $b_i \in \mathbb{Z}^+$ units of the total bandwidth $M$ available. We assume Poisson arrivals and departures but no knowledge of the numerical values of the intensities. This immediately precludes the use of value iteration or linear programming (LP) to solve the CAC problem [18]. An arriving class $i$
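A hedged sketch of the loss-network model described above, with an illustrative logistic admission parameterisation; the numerical values and the parameterisation itself are assumptions, not the paper's.

```python
import numpy as np

# Illustrative model parameters (K classes, total bandwidth M, rates, bandwidths)
K, M = 2, 10
lam = np.array([1.0, 0.5])   # Poisson arrival intensities lambda_i (not used by the policy)
mu  = np.array([1.0, 2.0])   # service rates, mean service time 1/mu_i (not used by the policy)
b   = np.array([1, 2])       # bandwidth b_i used by a class-i call

def admit_prob(theta, n, i):
    """Probability of admitting a class-i arrival when n[j] class-j calls are active.

    A logistic parameterisation in the residual bandwidth is used purely for
    illustration; theta has one weight per class plus a bias on the free bandwidth.
    """
    free = M - n @ b
    if free < b[i]:
        return 0.0                       # hard capacity constraint: must block
    z = theta[i] + theta[-1] * free
    return 1.0 / (1.0 + np.exp(-z))

# Example: admission probability for a class-0 call with 3 class-0 and 2 class-1
# calls in progress, under the all-zero (uninformative) parameter vector.
theta0 = np.zeros(K + 1)
print(admit_prob(theta0, np.array([3, 2]), 0))   # -> 0.5
```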

Conclusion

We have presented a policy gradient method for the online optimisation of a SMDP based on a single sample path. The method proposed requires no knowledge of the semi-Markov transition kernel. The gradient estimator is based on the discounted score method of [2] while the optimisation step is performed using two time-scale SA. The advantage of the policy gradient approach is that it extends easily to the case where the SMDP is also subject to average cost constraints as SA can also be used for
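As an illustration of the two time-scale structure mentioned above, the following is a generic sketch, not the paper's exact recursion: the gradient estimate is averaged on a faster time scale while the policy parameters move on a slower one. The step-size schedules are illustrative assumptions.

```python
def two_timescale_step(theta, grad_est, sample_grad, k,
                       fast=lambda k: 1.0 / (k + 1) ** 0.6,
                       slow=lambda k: 1.0 / (k + 1)):
    """One step of a generic two time-scale stochastic approximation.

    `sample_grad` is a noisy gradient sample (e.g. from an SMDP gradient
    estimator); `fast` and `slow` are illustrative step-size schedules.
    """
    grad_est = grad_est + fast(k) * (sample_grad - grad_est)  # fast averaging
    theta = theta - slow(k) * grad_est                        # slow descent
    return theta, grad_est
```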

References (21)

  • A. Gosavi, Reinforcement learning for long-run average cost, European Journal of Operational Research (2004)
  • R.S. Sutton et al., Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence (1999)
  • D. Adelman, Price-directed replenishment of subsets: Methodology and its application to inventory routing, Manufacturing and Service Operations Management (2003)
  • J. Baxter et al., Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research (2001)
  • D.P. Bertsekas (1995)
  • D.P. Bertsekas et al., Gradient convergence in gradient methods with errors, SIAM Journal on Optimization (2000)
  • D.P. Bertsekas et al., Neuro-dynamic Programming (1996)
  • S.J. Bradtke et al., Reinforcement learning methods for continuous-time Markov decision problems, Advances in Neural Information Processing Systems (1995)
  • X.-R. Cao, Semi-Markov decision problems and performance sensitivity analysis, IEEE Transactions on Automatic Control (2003)
  • E. Cinlar, Introduction to Stochastic Processes (1974)
There are more references available in the full text version of this article.


Part of this paper was presented at the 7th International Conference on Control, Automation, Robotics and Vision, ICARCV, Singapore, December 2002.

