
Accelerated modified policy iteration algorithms for Markov decision processes

  • Original Article, Mathematical Methods of Operations Research

Abstract

We propose a new approach to accelerate the convergence of the modified policy iteration method for Markov decision processes with the total expected discounted reward criterion. In the new policy iteration scheme, an additional operator is applied to the iterate generated by the Markov operator, resulting in a larger improvement at each iteration.
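For a concrete picture of the scheme (referred to as GAMPI in the appendix), the following is a minimal Python sketch under stated assumptions, not the implementation used in the paper: it takes a finite MDP as a reward array r[s, a] (assumed nonnegative), a transition array P[a, s, t], and a discount factor lam, and it leaves the extra acceleration operator Z abstract. All identifiers are illustrative.

```python
import numpy as np

def accelerated_mpi(r, P, lam, Z, m=5, tol=1e-8, max_iter=10_000):
    """Sketch of modified policy iteration with an extra acceleration operator.

    r[s, a]    : one-step rewards, assumed nonnegative
    P[a, s, t] : transition probabilities
    Z(v, d)    : acceleration operator applied after the partial evaluation step
    m          : number of partial policy-evaluation sweeps per iteration
    """
    n_states = r.shape[0]
    states = np.arange(n_states)
    v = np.zeros(n_states)                          # v^0 = 0 satisfies T v^0 >= v^0 when r >= 0
    for _ in range(max_iter):
        q = r + lam * np.einsum('ast,t->sa', P, v)  # Q-values at the current iterate
        d = q.argmax(axis=1)                        # greedy decision rule d_n
        u = q.max(axis=1)                           # one Bellman (value-iteration) update
        for _ in range(m):                          # m evaluation sweeps with d_n held fixed
            u = r[states, d] + lam * P[d, states, :] @ u
        u = Z(u, d)                                 # the additional acceleration operator
        if np.max(np.abs(u - v)) < tol:
            return u, d
        v = u
    return v, d
```

Passing Z = lambda v, d: v reduces the sketch to ordinary modified policy iteration; the operators analysed in Theorems 2.2 and 2.3 of the paper are examples of nontrivial choices of Z.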


References

  • Bertsekas DP (1995) Generic rank-one corrections for value iteration in Markovian decision problems. Oper Res Lett 17:111–119

  • Chen R-R, Meyn S (1999) Value iteration and optimization of multiclass queueing networks. Queueing Syst 32(1–3):65–97

  • de Farias D, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper Res 51(6):850–856

  • Filar JA, Tolwinski B (1991) On the algorithm of Pollatschek and Avi-Itzhak. Stoch Games Relat Top 7(3):59–70

  • Herzberg M, Yechiali U (1994) Accelerating procedures of the value iteration algorithm for discounted Markov decision process, based on a one-step lookahead analysis. Oper Res 42(5):940–946

  • Herzberg M, Yechiali U (1996) A k-step look-ahead analysis of value iteration algorithm for Markov decision processes. Eur J Oper Res 88:622–636

  • Kushner HJ, Kleinman AJ (1971) Accelerated procedures for the solution of discrete Markov control problems. IEEE Trans Autom Control 16:147–152

  • MacQueen J (1966) A modified dynamic programming method for Markovian decision problems. J Math Anal Appl 14:38–43

  • Pollatschek MA, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15:399–415

  • Porteus E, Totten J (1978) Accelerated computation of the expected discounted return in a Markov chain. Oper Res 26:350–358

  • Puterman ML, Shin MC (1978) Modified policy iteration algorithms for discounted Markov decision problems. Manag Sci 24:1127–1137

  • Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York

  • Saad Y (1996) Iterative methods for sparse linear systems. PWS Publishing Company, Boston

  • Shlakhter O, Lee C-G, Khmelev D, Jaber N (2010) Acceleration operators in the value iteration algorithms for Markov decision processes. Oper Res 58:193–202


Author information


Correspondence to Oleksandr Shlakhter.

Appendix


Lemma 6.1

For any decision rule \(d_n\), if \(u\ge (>)\,v\), then for any regular-splitting-based operator \((T_{RS})_{d_n}\) we have \((T_{RS})_{d_n} u\ge (>)\,(T_{RS})_{d_n} v\).

Proof of Lemma 6.1

Let \(v\le (<)\,u\). Then for any decision rule \(d_n\), \((T_{RS})_{d_n} v = Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} v \le (<)\; Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} u = (T_{RS})_{d_n} u\), since \(Q^{-1}_{d_n}R_{d_n}\ge 0\) for a regular splitting. \(\square \)
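To make the notion concrete, here is a hedged Python sketch of one regular-splitting-based operator, the Gauss–Seidel splitting \(I-\lambda P_{d_n}=Q_{d_n}-R_{d_n}\) with \(Q_{d_n}=I-\lambda L_{d_n}\) (lower-triangular part of \(\lambda P_{d_n}\), diagonal included) and \(R_{d_n}=\lambda U_{d_n}\) (strictly upper-triangular part). The splitting choice and the array names are assumptions of this illustration, not the paper's definitions.

```python
import numpy as np

def gauss_seidel_step(r_d, P_d, lam, v):
    """One application of the Gauss-Seidel regular-splitting operator
    (T_RS)_d v = Q^{-1} r_d + Q^{-1} R_d v for a fixed decision rule d.

    Illustrative sketch: Q = I - lam * tril(P_d), R = lam * triu(P_d, 1)."""
    n = len(r_d)
    Q = np.eye(n) - lam * np.tril(P_d)       # triangular, diagonal entries 1 - lam * p_ii > 0
    R = lam * np.triu(P_d, k=1)              # nonnegative remainder
    return np.linalg.solve(Q, r_d + R @ v)
```

Here \(Q_{d_n}^{-1}\ge 0\) and \(R_{d_n}\ge 0\), which is exactly the property the monotonicity argument above relies on.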

Lemma 6.2

For any decision rule \(d_n\) the following relations hold

$$\begin{aligned} \widetilde{V}_{d_n}=(\widetilde{V}_J)_{d_n}, \quad (\widetilde{V}_{GS})_{d_n}=(\widetilde{V}_{GSJ})_{d_n}, \quad \text{and}\quad \widetilde{V}_{d_n}\subset {(\widetilde{V}_{GS})_{d_n}}. \end{aligned}$$
(5)

Proof of Lemma 6.2

The proof is similar to that of Lemma 5 in Shlakhter et al. (2010). \(\square \)

Lemma 6.3

For any decision rule \(d_n\), \((T_{GS})_{d_n}((\widetilde{V}_{GS})_{d_n}) \subset \widetilde{V}_{d_n}\) and \((T_{GSJ})_{d_n}((\widetilde{V}_{GSJ})_{d_n})\subset \widetilde{V}_{d_n}\).

Proof of Lemma 6.3

The proof is similar to that of Lemma 6 in Shlakhter et al. (2010). \(\square \)

Proof of Lemma 2.1

Let \(v \in \widetilde{V}_{d_n}\) and \(u=T_{d_n}v\). By definition of \(\widetilde{V}_{d_n},\,u = T_{d_n}v \ge v\). By monotonicity shown in Lemma 6.1, \(T_{d_n}u \ge T_{d_n}v=u\). Thus, \(u \in \widetilde{V}_{d_n}\). \(\square \)

Proof of Theorem 2.1

The sequence of GAMPI iterates \(v^n\) is monotone, because \(v^n \le Tv^n\le v^{n+1}\) for every index \(n\). By this monotonicity and Lemma 6.1, \(v^{n+1} \ge T^n v^0 \rightarrow v^*\). Hence the lower limit satisfies

$$\begin{aligned} \liminf _{n \rightarrow \infty } v^n \ge v^*. \end{aligned}$$

On the other hand, \(v^n \in \widetilde{V}\) for all \(n\), and \(v^* \ge u\) for all \(u \in \widetilde{V}\), so \(v^n \le v^*\) for every \(n\). Therefore \(\lim _{n \rightarrow \infty } v^n = v^*\). \(\square \)

Proof of Theorem 2.2

Condition (A) is satisfied immediately, since \(\alpha ^* u \in \widetilde{V}_{d_n}\) for any \(u\in \widetilde{V}_{d_n}\) by the definition of \(Z_{d_n}\) given in (3). It remains to show that \(Z_{d_n}\) satisfies Condition (B). The value \(\alpha =1\) is feasible for the linear program (3) since \(u\in \widetilde{V}_{d_n}\) (that is, \(T_{d_n}u\ge u\)). Since \(\sum _i u(i)\ge 0\) (a consequence of \(r(i,a)\ge 0\) for all \(i, a\)) and \(\alpha =1\) is feasible, the optimal solution satisfies \(\alpha ^* \ge 1\). Therefore, \(Z_{d_n}u = \alpha ^* u \ge u\). \(\square \)
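To illustrate the operator of Theorem 2.2, the sketch below computes \(\alpha ^*\) for a fixed decision rule under the assumption that (3) asks for the largest \(\alpha \) with \(T_{d_n}(\alpha u)\ge \alpha u\); only the form stated in (3) in the body of the paper is authoritative. Because \(T_{d_n}(\alpha u)=r_{d_n}+\lambda P_{d_n}(\alpha u)\) is affine in \(\alpha \), the maximiser has a closed form and no LP solver is needed.

```python
import numpy as np

def scaling_alpha(r_d, P_d, lam, u):
    """Largest alpha with r_d + lam * P_d @ (alpha * u) >= alpha * u componentwise.

    Assumed form of LP (3) for a fixed decision rule; with T_d u >= u,
    alpha = 1 is feasible, so the result is at least 1."""
    g = u - lam * P_d @ u            # feasibility reads: alpha * g <= r_d componentwise
    pos = g > 0
    if not np.any(pos):
        return 1.0                   # no binding constraint; fall back to alpha = 1 in this sketch
    return max(1.0, float(np.min(r_d[pos] / g[pos])))

# The operator of Theorem 2.2 is then Z_d u = scaling_alpha(r_d, P_d, lam, u) * u.
```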

Proof of Lemma 2.2

Suppose that \(p_{ij}(a)>0, \; \forall i,j,a\). Then if \(u \in \widetilde{V}_{d_n},\,u \ne v^n\), where \(v^n\) is the fixed point of \(T_{d_n}\), and \(v=T_{d_n}u \ge u\), there exists \(k\) such that \(v(k) > u(k)\). As a result,

$$\begin{aligned} T_{d_n}v(i)&= r(i,a)+\lambda \sum _{j\ne k}p_{ij}(a)v(j) +\lambda p_{ik}(a)v(k)\\&> r(i,a)+\lambda \sum _{j\ne k}p_{ij}(a)u(j) +\lambda p_{ik}(a)u(k) = v(i) \quad \text{for all } i\in S. \end{aligned}$$

\(\square \)

Proof of Lemma 2.3

Let \(v \in int(\widetilde{V}_{d_n})\) and \(u=T_{d_n}v\). By definition of \(int(\widetilde{V}_{d_n})\), \(u = T_{d_n}v > v\). By monotonicity shown in Lemma 6.1, \(T_{d_n}u > T_{d_n}v=u\). Thus, \(u \in int(\widetilde{V}_{d_n})\). \(\square \)

Proof of Theorem 2.3

For \(u\in \widetilde{V}_{d_n}\), \(\widetilde{Z}_{d_n}(u)=u+\alpha ^*(T_{d_n}u-u)\), where \(\alpha ^*\) is an optimal solution to (4). Since \(u+\alpha ^*(T_{d_n}u-u)\) is feasible for (4), we have \(T_{d_n}(u+\alpha ^*(T_{d_n}u-u))\ge u+\alpha ^*(T_{d_n}u-u)\), which establishes Condition (A). Since \(T_{d_n}u\in \widetilde{V}_{d_n}\) by Lemma 2.1, \(\alpha =1\) is feasible. Since \(T_{d_n}u\ge u\) and \(\alpha =1\) is feasible, \(\alpha ^*>0\). Hence, \(\widetilde{Z}_{d_n}u \ge u\) and Condition (B) is satisfied. \(\square \)
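Analogously, a hedged sketch for the line-search operator \(\widetilde{Z}_{d_n}u=u+\alpha ^*(T_{d_n}u-u)\), assuming (4) maximises \(\alpha \) subject to \(T_{d_n}(u+\alpha (T_{d_n}u-u))\ge u+\alpha (T_{d_n}u-u)\); as before, the exact form of (4) is given in the body of the paper and this is only an assumed reading.

```python
import numpy as np

def line_search_alpha(r_d, P_d, lam, u):
    """Largest alpha keeping u + alpha * (T_d u - u) inside {v : T_d v >= v}.

    Assumed form of (4) for a fixed decision rule; illustrative only."""
    Tu = r_d + lam * P_d @ u
    w = Tu - u                       # search direction, nonnegative when T_d u >= u
    g = w - lam * P_d @ w            # feasibility reads: alpha * g <= w componentwise
    pos = g > 0
    if not np.any(pos):
        return 1.0                   # no binding constraint; fall back to alpha = 1 in this sketch
    return max(1.0, float(np.min(w[pos] / g[pos])))
```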

Proof of Lemma 2.4

The proof is similar to the proofs of Lemmas 2.1 and 2.3. \(\square \)

Proof of Theorem 2.4

By monotonicity shown in Lemma 6.1, \((T_J)_{d_n}(\widetilde{V}_J)_{d_n} \subset (\widetilde{V}_J)_{d_n}\). Therefore, by \(\widetilde{V}_{d_n}=(\widetilde{V}_J)_{d_n}\) in Lemma 6.2, \((T_J)_{d_n}\widetilde{V}_{d_n} \subset \widetilde{V}_{d_n}\). By Lemmas 6.2 and 6.3,

$$\begin{aligned} (T_{GS})_{d_n}\widetilde{V}_{d_n} \!\subset \! (T_{GS})_{d_n}(\widetilde{V}_{GS})_{d_n} \!\subset \! \widetilde{V}_{d_n} \;\;\text{ and}\;\; (T_{GSJ})_{d_n}\widetilde{V}_{d_n}\!\subset \! (T_{GSJ})_{d_n}(\widetilde{V}_{GSJ})_{d_n}\!\subset \! \widetilde{V}_{d_n}. \end{aligned}$$

\(\square \)

Proof of Theorem 4.1

Let \(v^n\) be the sequence of iterates of MPI and \(w^n\) the sequence of iterates of GAMPI, with \(v^0= w^0\). We need to show that \(\overline{\lim }_{n\rightarrow \infty }\frac{\Vert w^{n+1}-v^* \Vert }{\Vert w^{n}-v^* \Vert } \le \overline{\lim }_{n\rightarrow \infty }\frac{\Vert v^{n+1}-v^* \Vert }{\Vert v^{n}-v^* \Vert }\). By Theorem 2 of Puterman and Shin (1978), \(\Vert v^*-v^{n+1} \Vert \le \left( \frac{\lambda (1-\lambda ^{m_n})}{1-\lambda }\Vert P_{d_n}-P_{d^*} \Vert +\lambda ^{m_n +1}\right) \Vert v^n-v^*\Vert \). The same steps show that the sequence \(w^n\) satisfies the same inequality. Note that if the state space is finite, then the rule “choose \(d_{n+1}= d_{n}\)” ensures that \(P_{d_n}=P_{d^*}\) after a finite number of iterations. The inequality above then gives \(\Vert v^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert v^{n}-v^*\Vert \) and \(\Vert w^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert w^{n}-v^*\Vert \). By Theorem 6.3.3 of Puterman (1994), \(\lambda \) is the rate of convergence of value iteration, and hence \(\lambda ^{m_n+1}\) is an asymptotic rate of convergence of \(v^n\). This proves the first statement of the theorem. Letting \(m_n \rightarrow \infty \), we immediately obtain \(\lim _{n \rightarrow \infty } \frac{\Vert w^{n+1}-v^*\Vert }{\Vert w^{n}-v^*\Vert } = \lim _{n \rightarrow \infty } \frac{\Vert v^{n+1}-v^*\Vert }{\Vert v^{n}-v^*\Vert } =0\). \(\square \)
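As a purely illustrative complement to the rate statement (not part of the proof), the empirical contraction factors of the two iterate sequences can be compared directly; iterates and v_star below are assumed to come from separate runs of MPI and GAMPI and from a high-accuracy reference solution.

```python
import numpy as np

def contraction_factors(iterates, v_star):
    """Empirical ratios ||v^{n+1} - v*|| / ||v^n - v*|| in the sup norm.

    Theorem 4.1 predicts that the limiting factor for GAMPI is no larger than
    that for MPI, and that both are eventually bounded by lambda**(m_n + 1)
    once the greedy decision rule has stabilised."""
    errs = [np.max(np.abs(np.asarray(v) - v_star)) for v in iterates]
    return [e1 / e0 for e0, e1 in zip(errs, errs[1:]) if e0 > 0]
```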


About this article

Cite this article

Shlakhter, O., Lee, CG. Accelerated modified policy iteration algorithms for Markov decision processes. Math Meth Oper Res 78, 61–76 (2013). https://doi.org/10.1007/s00186-013-0432-y
