Abstract
We propose a new approach to accelerate the convergence of the modified policy iteration method for Markov decision processes with the total expected discounted reward. In the new policy iteration scheme, an additional operator is applied to the iterate generated by the Markov operator, resulting in a larger improvement at each iteration.
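To make the idea concrete, the following is a minimal sketch of modified policy iteration with a scaling acceleration operator, assuming the operator takes the form \(Zv = \alpha^* v\) with \(\alpha^*\) the largest \(\alpha\) keeping \(T_{d}(\alpha v) \ge \alpha v\) (as in Shlakhter et al. 2010); the function name and the exact form of the paper's operator are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def accelerated_mpi(P, r, lam, m=5, tol=1e-10, max_iter=1000):
    """Sketch of accelerated modified policy iteration (assumed form).

    P: (A, S, S) transition matrices; r: (A, S) rewards, assumed >= 0;
    lam: discount factor in (0, 1); m: partial-evaluation steps.
    """
    A, S, _ = P.shape
    v = np.zeros(S)  # v0 = 0 lies in the monotone region when r >= 0
    d = np.zeros(S, dtype=int)
    for _ in range(max_iter):
        # policy improvement: greedy decision rule for the current iterate
        q = r + lam * np.einsum('aij,j->ai', P, v)
        d = np.argmax(q, axis=0)
        Pd, rd = P[d, np.arange(S)], r[d, np.arange(S)]
        v_new = q[d, np.arange(S)]            # one application of T
        for _ in range(m - 1):                # partial policy evaluation
            v_new = rd + lam * Pd @ v_new
        # acceleration: T_d(alpha v) >= alpha v componentwise iff
        # alpha * (v - lam * Pd v) <= rd; take the largest such alpha
        g = v_new - lam * Pd @ v_new
        mask = g > 0
        alpha = np.min(rd[mask] / g[mask]) if mask.any() else 1.0
        v_acc = max(alpha, 1.0) * v_new       # never shrink the iterate
        if np.max(np.abs(v_acc - v)) < tol:
            v = v_acc
            break
        v = v_acc
    return v, d
```

Because any feasible scaling keeps the iterate inside the set \(\{v : Tv \ge v\}\), the accelerated sequence remains monotone and bounded above by \(v^*\), so the usual convergence argument still applies.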
References
Bertsekas DP (1995) Generic rank-one corrections for value iteration in Markovian decision problems. Oper Res Lett 17:111–119
Chen R-R, Meyn S (1999) Value iteration and optimization of multiclass queueing networks. Queueing Syst 32(1–3):65–97
de Farias D, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper Res 51(6):850–856
Filar JA, Tolwinski B (1991) On the algorithm of Pollatschek and Avi-Itzhak. Stoch Games Relat Top 7(3):59–70
Herzberg M, Yechiali U (1994) Accelerating procedures of the value iteration algorithm for discounted Markov decision process, based on a one-step lookahead analysis. Oper Res 42(5):940–946
Herzberg M, Yechiali U (1996) A k-step look-ahead analysis of value iteration algorithm for Markov decision processes. Eur J Oper Res 88:622–636
Kushner HJ, Kleinman AJ (1971) Accelerated procedures for the solution of discrete Markov control problems. IEEE Trans Autom Control 16:147–152
MacQueen J (1966) A modified dynamic programming method for Markovian decision problems. J Math Anal Appl 14:38–43
Pollatschek MA, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15:399–415
Porteus E, Totten J (1978) Accelerated computation of the expected discounted return in a Markov chain. Oper Res 26:350–358
Puterman ML, Shin MC (1978) Modified policy iteration algorithms for discounted Markov decision problems. Manag Sci 24:1127–1137
Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Saad Y (1996) Iterative methods for sparse linear systems. PWS Publishing Company, Boston
Shlakhter O, Lee C-G, Khmelev D, Jaber N (2010) Acceleration operators in the value iteration algorithms for Markov decision processes. Oper Res 58:193–202
Appendix
Lemma 6.1
For any decision rule \(d_n\), if \(u \ge (>)\, v\), then for any regular-splitting-based operator \((T_{RS})_{d_n}\), \((T_{RS})_{d_n} u \ge (>)\, (T_{RS})_{d_n} v\).
Proof of Lemma 6.1
Let \(v \le (<)\, u\). Then for any decision rule \(d_n\), \((T_{RS})_{d_n} v = Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} v \le (<)\; Q^{-1}_{d_n} r_{d_n}+Q^{-1}_{d_n}R_{d_n} u = (T_{RS})_{d_n} u\). \(\square \)
Lemma 6.2
For any decision rule \(d_n\) the following relations hold
Proof of Lemma 6.2
The proof is similar to that of Lemma 5 in Shlakhter et al. (2010). \(\square \)
Lemma 6.3
For any decision rule \(d_n\), \((T_{GS})_{d_n}((\widetilde{V}_{GS})_{d_n}) \subset \widetilde{V}_{d_n}\) and \((T_{GSJ})_{d_n}((\widetilde{V}_{GSJ})_{d_n})\subset \widetilde{V}_{d_n}\).
Proof of Lemma 6.3
The proof is similar to that of Lemma 6 in Shlakhter et al. (2010). \(\square \)
Proof of Lemma 2.1
Let \(v \in \widetilde{V}_{d_n}\) and \(u=T_{d_n}v\). By the definition of \(\widetilde{V}_{d_n}\), \(u = T_{d_n}v \ge v\). By the monotonicity shown in Lemma 6.1, \(T_{d_n}u \ge T_{d_n}v=u\). Thus, \(u \in \widetilde{V}_{d_n}\). \(\square \)
Proof of Theorem 2.1
The sequence of iterates of GAMPI, \(v^n\), is monotone because for any index \(n\), \(v^n \le Tv^n\le v^{n+1}\). By the monotonicity of \(v^n\) and Lemma 6.1, \(v^{n+1} \ge T^n v^0 \rightarrow v^*\). From this we have for the lower limit \(\underline{\lim }_{n \rightarrow \infty } v^n \ge v^*\); but \(v^n \in \widetilde{V}\) for all \(n\), and \(v^* \ge u\) for all \(u \in \widetilde{V}\). Then \(\lim _{n \rightarrow \infty } v^n = v^*\). \(\square \)
Proof of Theorem 2.2
Condition (A) is satisfied trivially, since \(\alpha u \in \widetilde{V}_{d_n}\) for any \(u\in \widetilde{V}_{d_n}\) by the definition of \(Z_{d_n}\) given in (3). We now show that \(Z_{d_n}\) satisfies Condition (B). We know \(\alpha =1\) is feasible for the linear program (3), since \(u\in \widetilde{V}_{d_n}\) (i.e., \(T_{d_n}u\ge u\)). Since \(\sum _i u(i)\ge 0\) due to \(r(i,a)\ge 0\) for all \(i, a\), maximization gives \(\alpha ^* \ge 1\). Therefore, \(Z_{d_n}u = \alpha ^* u \ge u\). \(\square \)
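The linear program (3) is not reproduced in this excerpt. Assuming the operator has the form \(Z_{d_n}u = \alpha^* u\) with \(\alpha^*\) the largest \(\alpha\) for which \(T_{d_n}(\alpha u) \ge \alpha u\) (as in Shlakhter et al. 2010), the constraint is linear in \(\alpha\) because \(T_{d_n}\) is affine:

```latex
T_{d_n}(\alpha u) = r_{d_n} + \alpha \lambda P_{d_n} u \;\ge\; \alpha u
\quad\Longleftrightarrow\quad
\alpha \,\bigl(u - \lambda P_{d_n} u\bigr)(i) \;\le\; r_{d_n}(i) \quad \forall i,
```

so \(\alpha^* = \min \{\, r_{d_n}(i) / (u - \lambda P_{d_n}u)(i) : (u - \lambda P_{d_n}u)(i) > 0 \,\}\). Since \(u \in \widetilde{V}_{d_n}\) means \(r_{d_n} \ge u - \lambda P_{d_n}u\), every such ratio is at least \(1\), consistent with \(Z_{d_n}u \ge u\).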
Proof of Lemma 2.2
Suppose that \(p_{ij}(a)>0\) for all \(i,j,a\). If \(u \in \widetilde{V}_{d_n}\), \(u \ne v^n\), where \(v^n\) is the fixed point of \(T_{d_n}\), and \(v=T_{d_n}u \ge u\), then there exists \(k\) such that \(v(k) > u(k)\). As a result, for every \(i\),
\((T_{d_n}v)(i) - v(i) = (T_{d_n}v)(i) - (T_{d_n}u)(i) = \lambda \sum _j p_{ij}(d_n(i))\,(v(j)-u(j)) \ge \lambda \, p_{ik}(d_n(i))\,(v(k)-u(k)) > 0,\)
so \(T_{d_n}v > v\), i.e., \(v \in int(\widetilde{V}_{d_n})\). \(\square \)
Proof of Lemma 2.3
Let \(v \in int(\widetilde{V}_{d_n})\) and \(u=T_{d_n}v\). By the definition of \(int(\widetilde{V}_{d_n})\), \(u = T_{d_n}v > v\). By the monotonicity shown in Lemma 6.1, \(T_{d_n}u > T_{d_n}v=u\). Thus, \(u \in int(\widetilde{V}_{d_n})\). \(\square \)
Proof of Theorem 2.3
For \(u\in \widetilde{V}_{d_n}\), \(\widetilde{Z}_{d_n}(u)=u+\alpha ^*(T_{d_n}u-u)\), where \(\alpha ^*\) is an optimal solution to (4). Since \(u+\alpha ^*(T_{d_n}u-u)\) is feasible for (4), we have \(T_{d_n}(u+\alpha ^*(T_{d_n}u-u))\ge u+\alpha ^*(T_{d_n}u-u)\). This establishes Condition (A). Since \(T_{d_n}u \in \widetilde{V}_{d_n}\) by Lemma 2.1, \(\alpha =1\) is feasible, and hence \(\alpha ^* \ge 1 > 0\). Since \(T_{d_n}u\ge u\), it follows that \(\widetilde{Z}_{d_n}u = u+\alpha ^*(T_{d_n}u-u) \ge u\), and Condition (B) is satisfied. \(\square \)
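Program (4) is likewise not shown in this excerpt. Assuming it maximizes \(\alpha\) subject to \(T_{d_n}(u+\alpha (T_{d_n}u-u)) \ge u+\alpha (T_{d_n}u-u)\), writing \(g = T_{d_n}u - u \ge 0\) and using the affinity of \(T_{d_n}\) gives

```latex
T_{d_n}(u + \alpha g) - (u + \alpha g)
\;=\; g - \alpha\,\bigl(g - \lambda P_{d_n} g\bigr) \;\ge\; 0
\quad\Longleftrightarrow\quad
\alpha\,\bigl(g - \lambda P_{d_n} g\bigr)(i) \;\le\; g(i) \quad \forall i.
```

Wherever the left-hand coefficient is positive, the bound \(g(i)/(g - \lambda P_{d_n}g)(i)\) is at least \(1\), because \(g \ge 0\) implies \(\lambda P_{d_n} g \ge 0\) and hence \(g - \lambda P_{d_n} g \le g\); so \(\alpha = 1\) is feasible and \(\alpha^* \ge 1\), in line with the proof above.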
Proof of Lemma 2.4
This proof is similar to the proofs of Lemmas 2.1 and 2.3. \(\square \)
Proof of Theorem 2.4
By monotonicity shown in Lemma 6.1, \((T_J)_{d_n}(\widetilde{V}_J)_{d_n} \subset (\widetilde{V}_J)_{d_n}\). Therefore, by \(\widetilde{V}_{d_n}=(\widetilde{V}_J)_{d_n}\) in Lemma 6.2, \((T_J)_{d_n}\widetilde{V}_{d_n} \subset \widetilde{V}_{d_n}\). By Lemmas 6.2 and 6.3,
\(\square \)
Proof of Theorem 4.1
Let \(v^n\) be a sequence of iterates of MPI and \(w^n\) be a sequence of iterates of GAMPI, with \(v^0= w^0\). We need to show that \(\overline{\lim }_{n\rightarrow \infty }\frac{\Vert w^{n+1}-v^* \Vert }{\Vert w^{n}-v^* \Vert } \le \overline{\lim }_{n\rightarrow \infty }\frac{\Vert v^{n+1}-v^* \Vert }{\Vert v^{n}-v^* \Vert }\). By Theorem 2 of Puterman and Shin (1978) we have \(\Vert v^*-v^{n+1} \Vert \le \left( \frac{\lambda (1-\lambda ^k)}{1-\lambda }\Vert P_{d_n}-P_d^* \Vert +\lambda ^{m_n +1}\right) \Vert v^n-v^*\Vert \). Using the same steps one can show that the sequence \(w^n\) satisfies the same inequality. Note that if the state space is finite, then the rule “choose \(d_{n+1}= d_{n}\)” ensures that \(P_{d_n}=P_d^*\) after a finite number of iterations. Using the inequality above we obtain \(\Vert v^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert v^{n}-v^*\Vert \) and \(\Vert w^{n+1}-v^*\Vert \le \lambda ^{m_n+1}\Vert w^{n}-v^*\Vert \). From Theorem 6.3.3 of Puterman (1994) we know that \(\lambda \) is a rate of convergence for value iteration, and hence \(\lambda ^{m_n+1}\) is an asymptotic rate of convergence for \(v^n\). The first statement of the theorem is proved. Taking the limit \(m_n \rightarrow \infty \) we immediately obtain \(\lim _{n \rightarrow \infty } \frac{\Vert w^{n+1}-v^*\Vert }{\Vert w^{n}-v^*\Vert } = \lim _{n \rightarrow \infty } \frac{\Vert v^{n+1}-v^*\Vert }{\Vert v^{n}-v^*\Vert } =0\). \(\square \)
Shlakhter, O., Lee, CG. Accelerated modified policy iteration algorithms for Markov decision processes. Math Meth Oper Res 78, 61–76 (2013). https://doi.org/10.1007/s00186-013-0432-y