Abstract
We propose a new approach to accelerate the convergence of the modified policy iteration method for Markov decision processes with the total expected discounted reward. In the new policy iteration scheme, an additional operator is applied to the iterate generated by the Markov operator, resulting in a larger improvement at each iteration.
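To make the idea concrete, the following is a minimal sketch of modified policy iteration with a scaling acceleration operator, assuming the operator takes the form \(Zv = \alpha^* v\) with \(\alpha^*\) the largest \(\alpha\) keeping \(T_{d}(\alpha v) \ge \alpha v\) (as in Shlakhter et al. 2010); the function name and the exact form of the paper's operator are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def accelerated_mpi(P, r, lam, m=5, tol=1e-10, max_iter=1000):
    """Sketch of accelerated modified policy iteration (assumed form).

    P: (A, S, S) transition matrices; r: (A, S) rewards, assumed >= 0;
    lam: discount factor in (0, 1); m: partial-evaluation steps.
    """
    A, S, _ = P.shape
    v = np.zeros(S)  # v0 = 0 lies in the monotone region when r >= 0
    d = np.zeros(S, dtype=int)
    for _ in range(max_iter):
        # policy improvement: greedy decision rule for the current iterate
        q = r + lam * np.einsum('aij,j->ai', P, v)
        d = np.argmax(q, axis=0)
        Pd, rd = P[d, np.arange(S)], r[d, np.arange(S)]
        v_new = q[d, np.arange(S)]            # one application of T
        for _ in range(m - 1):                # partial policy evaluation
            v_new = rd + lam * Pd @ v_new
        # acceleration: T_d(alpha v) >= alpha v componentwise iff
        # alpha * (v - lam * Pd v) <= rd; take the largest such alpha
        g = v_new - lam * Pd @ v_new
        mask = g > 0
        alpha = np.min(rd[mask] / g[mask]) if mask.any() else 1.0
        v_acc = max(alpha, 1.0) * v_new       # never shrink the iterate
        if np.max(np.abs(v_acc - v)) < tol:
            v = v_acc
            break
        v = v_acc
    return v, d
```

Because any feasible scaling keeps the iterate inside the set \(\{v : Tv \ge v\}\), the accelerated sequence remains monotone and bounded above by \(v^*\), so the usual convergence argument still applies.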
References
Bertsekas DP (1995) Generic rank-one corrections for value iteration in Markovian decision problems. Oper Res Lett 17:111–119
Chen R-R, Meyn S (1999) Value iteration and optimization of multiclass queueing networks. Queueing Syst 32(1–3):65–97
de Farias D, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper Res 51(6):850–856
Filar JA, Tolwinski B (1991) On the algorithm of Pollatschek and Avi-Itzhak. Stoch Games Relat Top 7(3):59–70
Herzberg M, Yechiali U (1994) Accelerating procedures of the value iteration algorithm for discounted Markov decision process, based on a one-step lookahead analysis. Oper Res 42(5):940–946
Herzberg M, Yechiali U (1996) A k-step look-ahead analysis of value iteration algorithm for Markov decision processes. Eur J Oper Res 88:622–636
Kushner HJ, Kleinman AJ (1971) Accelerated procedures for the solution of discrete Markov control problems. IEEE Trans Autom Control 16:147–152
MacQueen J (1966) A modified dynamic programming method for Markovian decision problems. J Math Anal Appl 14:38–43
Pollatschek MA, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15:399–415
Porteus E, Totten J (1978) Accelerated computation of the expected discounted return in a Markov chain. Oper Res 26:350–358
Puterman ML, Shin MC (1978) Modified policy iteration algorithms for discounted Markov decision problems. Manag Sci 24:1127–1137
Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Saad Y (1996) Iterative methods for sparse linear systems. PWS Publishing Company, Boston
Shlakhter O, Lee C-G, Khmelev D, Jaber N (2010) Acceleration operators in the value iteration algorithms for Markov decision processes. Oper Res 58:193–202
Appendix
Lemma 6.1
For any decision rule \(d_n\), if \(u \ge (>)\, v\), then for any regular-splitting-based operator \((T_{RS})_{d_n}\), \((T_{RS})_{d_n} u \ge (>)\, (T_{RS})_{d_n} v\).
Proof of Lemma 6.1
Let \(v \le (<)\, u\). Then for any decision rule \(d_n\), \((T_{RS})_{d_n} v = Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} v \le (<)\; Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} u = (T_{RS})_{d_n} u\). \(\square \)
Lemma 6.2
For any decision rule \(d_n\) the following relations hold
Proof of Lemma 6.2
The proof is similar to that of Lemma 5 in Shlakhter et al. (2010). \(\square \)
Lemma 6.3
For any decision rule \(d_n\), \((T_{GS})_{d_n}((\widetilde{V}_{GS})_{d_n}) \subset \widetilde{V}_{d_n}\) and \((T_{GSJ})_{d_n}((\widetilde{V}_{GSJ})_{d_n})\subset \widetilde{V}_{d_n}\).
Proof of Lemma 6.3
The proof is similar to that of Lemma 6 in Shlakhter et al. (2010). \(\square \)
Proof of Lemma 2.1
Let \(v \in \widetilde{V}_{d_n}\) and \(u=T_{d_n}v\). By the definition of \(\widetilde{V}_{d_n}\), \(u = T_{d_n}v \ge v\). By the monotonicity shown in Lemma 6.1, \(T_{d_n}u \ge T_{d_n}v=u\). Thus, \(u \in \widetilde{V}_{d_n}\). \(\square \)
Proof of Theorem 2.1
The sequence of iterates of GAMPI, \(v^n\), is monotone because for any index \(n\), \(v^n \le Tv^n\le v^{n+1}\). By the monotonicity of \(v^n\) and Lemma 6.1, \(v^{n+1} \ge T^n v^0 \rightarrow v^*\). From this we have for the lower limit \(\underline{\lim }_{n \rightarrow \infty } v^n \ge v^*\); but \(v^n \in \widetilde{V}\) for all \(n\), and \(v^* \ge u\) for all \(u \in \widetilde{V}\). Then \(\lim _{n \rightarrow \infty } v^n = v^*\). \(\square \)
Proof of Theorem 2.2
Condition (A) is satisfied trivially, since \(\alpha u \in \widetilde{V}_{d_n}\) for any \(u\in \widetilde{V}_{d_n}\) by the definition of \(Z_{d_n}\) given in (3). We now show that \(Z_{d_n}\) satisfies Condition (B). We know \(\alpha =1\) is feasible for the linear program (3), since \(u\in \widetilde{V}_{d_n}\) (i.e., \(T_{d_n}u\ge u\)). Since \(\sum _i u(i)\ge 0\) due to \(r(i,a)\ge 0\) for all \(i, a\), maximization gives \(\alpha ^* \ge 1\). Therefore, \(Z_{d_n}u = \alpha ^* u \ge u\). \(\square \)
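The linear program (3) is not reproduced in this excerpt. Assuming the operator has the form \(Z_{d_n}u = \alpha^* u\) with \(\alpha^*\) the largest \(\alpha\) for which \(T_{d_n}(\alpha u) \ge \alpha u\) (as in Shlakhter et al. 2010), the constraint is linear in \(\alpha\) because \(T_{d_n}\) is affine:

```latex
T_{d_n}(\alpha u) = r_{d_n} + \alpha \lambda P_{d_n} u \;\ge\; \alpha u
\quad\Longleftrightarrow\quad
\alpha \,\bigl(u - \lambda P_{d_n} u\bigr)(i) \;\le\; r_{d_n}(i) \quad \forall i,
```

so \(\alpha^* = \min \{\, r_{d_n}(i) / (u - \lambda P_{d_n}u)(i) : (u - \lambda P_{d_n}u)(i) > 0 \,\}\). Since \(u \in \widetilde{V}_{d_n}\) means \(r_{d_n} \ge u - \lambda P_{d_n}u\), every such ratio is at least \(1\), consistent with \(Z_{d_n}u \ge u\).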
Proof of Lemma 2.2
Suppose that \(p_{ij}(a)>0\) for all \(i,j,a\). If \(u \in \widetilde{V}_{d_n}\), \(u \ne v^n\), where \(v^n\) is the fixed point of \(T_{d_n}\), and \(v=T_{d_n}u \ge u\), then there exists \(k\) such that \(v(k) > u(k)\). As a result, for every \(i\),
\((T_{d_n}v)(i) - v(i) = (T_{d_n}v)(i) - (T_{d_n}u)(i) = \lambda \sum _j p_{ij}(d_n(i))\,(v(j)-u(j)) \ge \lambda \, p_{ik}(d_n(i))\,(v(k)-u(k)) > 0,\)
so \(T_{d_n}v > v\), i.e., \(v \in int(\widetilde{V}_{d_n})\). \(\square \)
Proof of Lemma 2.3
Let \(v \in int(\widetilde{V}_{d_n})\) and \(u=T_{d_n}v\). By the definition of \(int(\widetilde{V}_{d_n})\), \(u = T_{d_n}v > v\). By the monotonicity shown in Lemma 6.1, \(T_{d_n}u > T_{d_n}v=u\). Thus, \(u \in int(\widetilde{V}_{d_n})\). \(\square \)
Proof of Theorem 2.3
For \(u\in \widetilde{V}_{d_n}\), \(\widetilde{Z}_{d_n}(u)=u+\alpha ^*(T_{d_n}u-u)\), where \(\alpha ^*\) is an optimal solution to (4). Since \(u+\alpha ^*(T_{d_n}u-u)\) is feasible for (4), we have \(T_{d_n}(u+\alpha ^*(T_{d_n}u-u))\ge u+\alpha ^*(T_{d_n}u-u)\). This establishes Condition (A). Since \(T_{d_n}u \in \widetilde{V}_{d_n}\) by Lemma 2.1, \(\alpha =1\) is feasible, and hence \(\alpha ^* \ge 1 > 0\). Since \(T_{d_n}u\ge u\), it follows that \(\widetilde{Z}_{d_n}u = u+\alpha ^*(T_{d_n}u-u) \ge u\), and Condition (B) is satisfied. \(\square \)
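Program (4) is likewise not shown in this excerpt. Assuming it maximizes \(\alpha\) subject to \(T_{d_n}(u+\alpha (T_{d_n}u-u)) \ge u+\alpha (T_{d_n}u-u)\), writing \(g = T_{d_n}u - u \ge 0\) and using the affinity of \(T_{d_n}\) gives

```latex
T_{d_n}(u + \alpha g) - (u + \alpha g)
\;=\; g - \alpha\,\bigl(g - \lambda P_{d_n} g\bigr) \;\ge\; 0
\quad\Longleftrightarrow\quad
\alpha\,\bigl(g - \lambda P_{d_n} g\bigr)(i) \;\le\; g(i) \quad \forall i.
```

Wherever the left-hand coefficient is positive, the bound \(g(i)/(g - \lambda P_{d_n}g)(i)\) is at least \(1\), because \(g \ge 0\) implies \(\lambda P_{d_n} g \ge 0\) and hence \(g - \lambda P_{d_n} g \le g\); so \(\alpha = 1\) is feasible and \(\alpha^* \ge 1\), in line with the proof above.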
Proof of Lemma 2.4
This proof is similar to the proofs of Lemmas 2.1 and 2.3. \(\square \)
Proof of Theorem 2.4
By monotonicity shown in Lemma 6.1, \((T_J)_{d_n}(\widetilde{V}_J)_{d_n} \subset (\widetilde{V}_J)_{d_n}\). Therefore, by \(\widetilde{V}_{d_n}=(\widetilde{V}_J)_{d_n}\) in Lemma 6.2, \((T_J)_{d_n}\widetilde{V}_{d_n} \subset \widetilde{V}_{d_n}\). By Lemmas 6.2 and 6.3,
\(\square \)
Proof of Theorem 4.1
Let \(v^n\) be a sequence of iterates of MPI and \(w^n\) be a sequence of iterates of GAMPI, with \(v^0= w^0\). We need to show that \(\overline{\lim }_{n\rightarrow \infty }\frac{\Vert w^{n+1}-v^* \Vert }{\Vert w^{n}-v^* \Vert } \le \overline{\lim }_{n\rightarrow \infty }\frac{\Vert v^{n+1}-v^* \Vert }{\Vert v^{n}-v^* \Vert }\). By Theorem 2 of Puterman and Shin (1978) we have \(\Vert v^*-v^{n+1} \Vert \le \left( \frac{\lambda (1-\lambda ^k)}{1-\lambda }\Vert P_{d_n}-P_d^* \Vert +\lambda ^{m_n +1}\right) \Vert v^n-v^*\Vert \). Using the same steps one can show that the sequence \(w^n\) satisfies the same inequality. Note that if the state space is finite, then the rule “choose \(d_{n+1}= d_{n}\)” ensures that \(P_{d_n}=P_d^*\) after a finite number of iterations. Using the inequality above we obtain \(\Vert v^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert v^{n}-v^*\Vert \) and \(\Vert w^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert w^{n}-v^*\Vert \). From Theorem 6.3.3 of Puterman (1994) we know that \(\lambda \) is a rate of convergence for value iteration, and hence \(\lambda ^{m_n+1}\) is an asymptotic rate of convergence for \(v^n\). The first statement of the theorem is proved. Taking the limit \(m_n \rightarrow \infty \) we immediately obtain \(\lim _{n \rightarrow \infty } \frac{\Vert w^{n+1}-v^*\Vert }{\Vert w^{n}-v^*\Vert } = \lim _{n \rightarrow \infty } \frac{\Vert v^{n+1}-v^*\Vert }{\Vert v^{n}-v^*\Vert } =0\). \(\square \)
Shlakhter, O., Lee, CG. Accelerated modified policy iteration algorithms for Markov decision processes. Math Meth Oper Res 78, 61–76 (2013). https://doi.org/10.1007/s00186-013-0432-y