Stochastics and Statistics
A policy gradient method for semi-Markov decision processes with application to call admission control

https://doi.org/10.1016/j.ejor.2006.02.023

Abstract

Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which the SMDP performance criterion is optimised with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it using stochastic approximation. We apply our algorithm to call admission control. Our proposed policy gradient SMDP algorithm and its application to admission control are novel.

Introduction

A semi-Markov decision process (SMDP) can be solved using classical methods such as dynamic programming (DP). However, these classical methods suffer from the curse of dimensionality: the memory required to represent the optimal value function grows with the size of the SMDP state space and becomes prohibitive for problems with large state and action spaces. Classical methods also require knowledge of the transition probabilities of the SMDP or, equivalently, a model. In applications such as call admission control (CAC), the state space is large and the SMDP transition probabilities are not known a priori [13], [14]. The SMDP framework is widely used to solve quality-of-service provisioning problems in communication networks, be they landline or wireless [9], [13], [21]. This includes both routing and CAC. Other applications of the SMDP framework include inventory routing [1], preventive maintenance [10] and robotics [16].

To overcome the memory requirement and the need for a system model, reinforcement learning (RL) with function approximation is used [5]. RL combines DP and stochastic approximation (SA) to learn the optimal value function online from a family of compactly represented parameterised functions. RL for MDPs is well studied [5] and has recently been extended to SMDPs [6], [10], [11]. These works are value-function-based approaches. In [6], a discounted cost SMDP problem is solved using real-time DP that builds a system model. Gosavi [10] solves an average cost SMDP problem using Q-learning, while Marbach et al. [11] approximate the value function and use a "greedy" policy with respect to the approximated value function. Inspired by the work in [2], [15], we present an alternative approach in which the policy of a SMDP is represented by its own function approximator, yielding a family of parameterised policies. The gradient of the SMDP performance criterion with respect to the policy parameters is then estimated online. Using this estimated gradient, a SA algorithm updates the policy parameters to improve performance. This approach, known as the policy gradient method [2], [15], also overcomes the memory requirement and the need for a model. In this paper, the performance criterion used is the average cost, and we prove the convergence of the derived SMDP gradient estimator under mild regularity conditions. We apply our algorithm to CAC and, in simulations, demonstrate convergence of the online algorithm to the optimal policy. Our proposed policy gradient SMDP algorithm and its application to admission control are novel.
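To make the policy gradient idea above concrete, the following is a minimal sketch, not the paper's algorithm, of a score-function gradient step for a softmax-parameterised randomised policy with a stochastic-approximation update. The function names, the discount factor `beta` and the `step_size` value are illustrative assumptions.

```python
import numpy as np

def softmax_probs(theta, features):
    """Action probabilities of a softmax (Boltzmann) policy u(theta, x, .).

    `features` is a (num_actions, K) array of state-action features; this
    parameterisation is an illustrative choice, not the paper's.
    """
    prefs = features @ theta
    prefs = prefs - prefs.max()          # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def policy_gradient_step(theta, features, action, cost, eligibility,
                         beta=0.9, step_size=1e-3):
    """One stochastic-approximation update of the policy parameters.

    `eligibility` accumulates the discounted score (gradient of log u),
    in the spirit of the discounted score method of [2]; `beta` and
    `step_size` are illustrative tuning constants.
    """
    probs = softmax_probs(theta, features)
    score = features[action] - probs @ features     # grad_theta log u(theta, x, a)
    eligibility = beta * eligibility + score
    theta = theta - step_size * cost * eligibility  # descent on the average cost
    return theta, eligibility
```

The discounted eligibility trace here plays the role of the discounted score in [2]; the estimator actually used for the SMDP case is derived in Section 3.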

The advantage of the policy gradient framework is that it can be applied to optimise a SMDP subject to average cost constraints by using recent results on constrained stochastic approximation [19]. This problem is important in CAC for multi-class networks as the constraints are used to impose upper bounds on the blocking probability of the various user classes. Additionally, we consider the general case where we assume no knowledge of the semi-Markov kernel or quantities derived from it (see Section 2.1 for details). Hence, estimating the SMDP gradient does not simplify to estimating the realisation matrix only, as done in [7].

Section 2 presents the average cost SMDP problem and the main algorithms of the paper. The first is an algorithm to estimate the SMDP objective function gradient while the second is an algorithm that optimises it. Both are single sample path based online algorithms. A derivation of the SMDP objective function gradient estimator as well as its convergence is studied in Section 3. The application to CAC with numerical examples is given in Section 4. All proofs appear in Appendix A.

Section snippets

Problem formulation and the main algorithm

Let $\theta = [\theta_1, \ldots, \theta_K]^{\mathrm{T}} \in \mathbb{R}^K$ be the parameter that determines the control policy in effect. For each $\theta$, let $\{(x_k^\theta, \tau_k^\theta, a_k^\theta)\}_{k \geq 0}$ be a controlled Markov chain, where $x_k^\theta \in X \triangleq \{1, \ldots, n\}$; $X$ being the state space of the embedded chain $\{x_k^\theta\}_{k \geq 0}$. $a_k^\theta \in A$ is the control (or action) applied at decision epoch $k$, where the control space $A$ is finite. $\tau_k^\theta \in \mathbb{R}^+ \triangleq [0, \infty)$ is the state dwell time of the continuous-time semi-Markov process $\{(x^\theta(t), a^\theta(t))\}_{t \in \mathbb{R}^+}$ defined by forming the right-continuous interpolations of $\{x_k^\theta\}_{k \geq 0}$ and $\{a_k^\theta\}_{k \geq 0}$
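For readers of this snippet, the right-continuous interpolation referred to above is, in the usual convention (stated here as an assumption; the paper's precise definition is in the full text):

```latex
% Jump times of the embedded chain and the right-continuous interpolation
T_0^\theta = 0, \qquad T_{k+1}^\theta = T_k^\theta + \tau_k^\theta,
\qquad
x^\theta(t) = x_k^\theta, \;\; a^\theta(t) = a_k^\theta
\quad \text{for } t \in [T_k^\theta, T_{k+1}^\theta).
```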

Convergence of algorithm SMDPBG

By virtue of Assumption 1, $\eta_c$ and $\eta_\tau$ are performance criteria for average cost MDPs with cost functions $c(x_k, a_k)\tau(x_k, a_k)$ and $\tau(x_k, a_k)$, respectively, i.e.,
$$\eta_c(\theta) = \lim_{T \to \infty} T^{-1} E_\theta\!\left\{ \sum_{k=0}^{T-1} c(x_k, a_k)\tau(x_k, a_k) \right\},$$
and similarly for $\eta_\tau(\theta)$. (We have dropped the superscript $\theta$ on the processes in favour of the subscript on the expectation operator $E$ to indicate the randomised policy $\theta$ is in use.) Define the $i$th component of $\bar{c}(\theta) \in \mathbb{R}^{|X|}$ to be
$$\bar{c}(\theta, i) \triangleq E_\theta\{ c(x_k, a_k)\tau(x_k, a_k) \mid x_k = i \} = \sum_{a \in A} c(i, a)\tau(i, a)\,\tilde{u}(\theta, i, a).$$
$\bar{c}$ is the “smoothed”
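The quantities $\eta_c$ and $\eta_\tau$ typically enter the SMDP average-cost criterion as a ratio, and the gradient of that ratio follows from the quotient rule. The display below states this standard relation as context; the paper's exact criterion and estimator appear in the full text.

```latex
% Standard SMDP average-cost-per-unit-time criterion (ratio form) and its gradient
\eta(\theta) \;=\; \frac{\eta_c(\theta)}{\eta_\tau(\theta)},
\qquad
\nabla_\theta \eta(\theta)
\;=\;
\frac{\eta_\tau(\theta)\,\nabla_\theta \eta_c(\theta)
      \;-\; \eta_c(\theta)\,\nabla_\theta \eta_\tau(\theta)}
     {\eta_\tau(\theta)^2}.
```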

Application of OLSMDP to call admission control

Consider a multi-service loss network that serves user classes $\{1, 2, \ldots, K\}$. A class $i$ user arrives according to a Poisson process with intensity $\lambda_i$, has a service time exponentially distributed with mean $1/\mu_i$ and uses $b_i \in \mathbb{Z}^+$ units of the total bandwidth $M$ available. We assume Poisson arrivals and departures but no knowledge of the numerical values of the intensities. This immediately precludes the use of value iteration or linear programming (LP) to solve the CAC problem [18]. An arriving class $i$
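A hedged sketch of the loss-network model described above, with an illustrative logistic admission parameterisation; the numerical values and the parameterisation itself are assumptions, not the paper's.

```python
import numpy as np

# Illustrative model parameters (K classes, total bandwidth M, rates, bandwidths)
K, M = 2, 10
lam = np.array([1.0, 0.5])   # Poisson arrival intensities lambda_i (not used by the policy)
mu  = np.array([1.0, 2.0])   # service rates, mean service time 1/mu_i (not used by the policy)
b   = np.array([1, 2])       # bandwidth b_i used by a class-i call

def admit_prob(theta, n, i):
    """Probability of admitting a class-i arrival when n[j] class-j calls are active.

    A logistic parameterisation in the residual bandwidth is used purely for
    illustration; theta has one weight per class plus a bias on the free bandwidth.
    """
    free = M - n @ b
    if free < b[i]:
        return 0.0                       # hard capacity constraint: must block
    z = theta[i] + theta[-1] * free
    return 1.0 / (1.0 + np.exp(-z))

# Example: admission probability for a class-0 call with 3 class-0 and 2 class-1
# calls in progress, under the all-zero (uninformative) parameter vector.
theta0 = np.zeros(K + 1)
print(admit_prob(theta0, np.array([3, 2]), 0))   # -> 0.5
```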

Conclusion

We have presented a policy gradient method for the online optimisation of a SMDP based on a single sample path. The method proposed requires no knowledge of the semi-Markov transition kernel. The gradient estimator is based on the discounted score method of [2] while the optimisation step is performed using two time-scale SA. The advantage of the policy gradient approach is that it extends easily to the case where the SMDP is also subject to average cost constraints as SA can also be used for
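As an illustration of the two time-scale structure mentioned above, the following is a generic sketch, not the paper's exact recursion: the gradient estimate is averaged on a faster time scale while the policy parameters move on a slower one. The step-size schedules are illustrative assumptions.

```python
def two_timescale_step(theta, grad_est, sample_grad, k,
                       fast=lambda k: 1.0 / (k + 1) ** 0.6,
                       slow=lambda k: 1.0 / (k + 1)):
    """One step of a generic two time-scale stochastic approximation.

    `sample_grad` is a noisy gradient sample (e.g. from an SMDP gradient
    estimator); `fast` and `slow` are illustrative step-size schedules.
    """
    grad_est = grad_est + fast(k) * (sample_grad - grad_est)  # fast averaging
    theta = theta - slow(k) * grad_est                        # slow descent
    return theta, grad_est
```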

References (21)

  • A. Gosavi, Reinforcement learning for long-run average cost, European Journal of Operational Research (2004)
  • R.S. Sutton et al., Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence (1999)
  • D. Adelman, Price-directed replenishment of subsets: Methodology and its application to inventory routing, Manufacturing and Service Operations Management (2003)
  • J. Baxter et al., Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research (2001)
  • D.P. Bertsekas (1995)
  • D.P. Bertsekas et al., Gradient convergence in gradient methods with errors, SIAM Journal on Optimization (2000)
  • D.P. Bertsekas et al., Neuro-dynamic Programming (1996)
  • S.J. Bradtke et al., Reinforcement learning methods for continuous-time Markov decision problems, Advances in Neural Information Processing Systems (1995)
  • X.-R. Cao, Semi-Markov decision problems and performance sensitivity analysis, IEEE Transactions on Automatic Control (2003)
  • E. Cinlar, Introduction to Stochastic Processes (1974)
There are more references available in the full text version of this article.


Part of this paper was presented at the 7th International Conference on Control, Automation, Robotics and Vision, ICARCV, Singapore, December 2002.

