Abstract
This paper considers the problem of computing an optimal policy for a Markov decision process in the absence of complete a priori knowledge of (1) the branching probability distributions that determine the evolution of the process state upon the execution of the different actions, and (2) the probability distributions characterizing the immediate rewards returned by the environment as a result of executing these actions at the different states of the process. In addition, the underlying process is assumed to evolve in a repetitive, episodic manner, with each episode starting from a well-defined initial state and evolving over an acyclic state space. A novel efficient algorithm for this problem is proposed, and its convergence properties and computational complexity are rigorously characterized in the formal framework of computational learning theory. Furthermore, in the process of deriving these results, the presented work generalizes Bechhofer’s “indifference-zone” approach to the ranking & selection problem of statistical inference theory, so that it applies to populations with bounded general distributions.
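The indifference-zone idea for bounded distributions can be illustrated with a small sketch. This is not the paper’s algorithm; it is a standard Hoeffding-bound construction under illustrative assumptions (rewards in [0, 1], the function names `iz_sample_size` and `select_best` are invented for this example): draw enough samples from each of k candidate actions so that, with probability at least 1 − δ, the action with the highest empirical mean is within ε of the true best.

```python
import math
import random

def iz_sample_size(k, epsilon, delta, value_range=1.0):
    """Samples per candidate so that, with prob. >= 1 - delta, the
    empirically best of k candidates is within epsilon of the true best.
    Derived from Hoeffding's inequality plus a union bound over k
    candidates, each mean estimated to within epsilon/2."""
    return math.ceil((2.0 * value_range ** 2 / epsilon ** 2)
                     * math.log(2.0 * k / delta))

def select_best(sample_fns, epsilon, delta, rng=random):
    """Indifference-zone-style selection over bounded reward sources.

    sample_fns: list of callables, each taking an RNG and returning one
    bounded reward sample from the corresponding candidate."""
    k = len(sample_fns)
    n = iz_sample_size(k, epsilon, delta)
    means = [sum(f(rng) for _ in range(n)) / n for f in sample_fns]
    return max(range(k), key=means.__getitem__)
```

For example, with two Bernoulli reward sources of means 0.9 and 0.5, ε = 0.1, and δ = 0.05, the procedure draws 877 samples per candidate and returns the index of the better source with high probability. Classical indifference-zone procedures assume normal populations with known variances; the Hoeffding bound is what lets the same guarantee cover arbitrary bounded distributions.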
References
Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. Ann Math Stat 25:16–39
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. In: Proceedings of COLT’02. ACM, New York, pp 255–270
Feller W (1971) An introduction to probability theory and its applications, vol. II, 2nd edn. Wiley, New York
Fiechter CN (1994) Efficient reinforcement learning. In: Proceedings of COLT’94. ACM, New York, pp 88–97
Fiechter CN (1997) Expected mistake bound model for on-line reinforcement learning. In: Proceedings of ICML’97. AAAI, Menlo Park, pp 116–124
Heizer J, Render B (2004) Operations management, 7th edn. Pearson/Prentice Hall, Upper Saddle River
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13–30
Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Neural Inf Process Syst 11:996–1002
Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Mach Learn 49:209–232
Kearns MJ, Vazirani UV (1994) An introduction to computational learning theory. MIT Press, Cambridge
Kim S-H, Nelson BL (2004) Selecting the best system. Technical report, School of Industrial & Systems Eng., Georgia Tech
Mitchell TM (1997) Machine learning. McGraw Hill, London
Reveliotis SA (2003) Uncertainty management in optimal disassembly planning through learning-based strategies. In: Proceedings of the NSF–IEEE–ORSI international workshop on IT-enabled manufacturing, logistics and supply chain management. NSF/IEEE/ORSI, Piscataway, pp 135–141
Reveliotis SA (2004) Modelling and controlling uncertainty in optimal disassembly planning through reinforcement learning. In: IEEE international conference on robotics & automation. IEEE, Piscataway, pp 2625–2632
Reveliotis SA (2007) Uncertainty management in optimal disassembly planning through learning-based strategies. IIE Trans 39:645–658
Reveliotis SA, Bountourelis T (2006) Efficient learning algorithms for episodic tasks with acyclic state spaces. In: Proceedings of the 2006 IEEE international conference on automation science and engineering. IEEE, Piscataway, pp 421–428
Sutton RS, Barto AG (2000) Reinforcement learning: an introduction. MIT Press, Cambridge
Thrun S, Burgard W, Fox D (2005) Probabilistic robotics. MIT Press, Cambridge
Watkins CJCH (1989) Learning from delayed rewards. Ph.D. thesis, Cambridge University
Cite this article
Reveliotis, S., Bountourelis, T. Efficient PAC Learning for Episodic Tasks with Acyclic State Spaces. Discrete Event Dyn Syst 17, 307–327 (2007). https://doi.org/10.1007/s10626-007-0014-3