A formal proof of the 𝜖-optimality of discretized pursuit algorithms


Abstract

Learning Automata (LA) can be reckoned to be the founding algorithms on which the field of Reinforcement Learning has been built. Among the families of LA, Estimator Algorithms (EAs) are certainly the fastest, and of these, the discretized algorithms have been proven to converge even faster than their continuous counterparts. However, it has recently been reported that the proofs of 𝜖-optimality given for all of these algorithms over the past three decades are flawed. We applaud the researchers who discovered this flaw, and who further proceeded to rectify the proof for the Continuous Pursuit Algorithm (CPA). The latter proof examines the monotonicity property of the probability of selecting the optimal action, and requires the learning parameter to be continuously changing. In this paper, we provide a new method to prove the 𝜖-optimality of the Discretized Pursuit Algorithm (DPA) that does not require this constraint, by virtue of the fact that the DPA has, in and of itself, absorbing barriers to which the LA can jump in a discretized manner. Unlike the proof given (Zhang et al., Appl Intell 41:974–985, [3]) for an absorbing version of the CPA, which utilizes the single-action Hoeffding’s inequality, the current proof invokes what we shall refer to as the “multi-action” version of Hoeffding’s inequality. We believe that our proof is both unique and pioneering. It can also form the basis for formally establishing the 𝜖-optimality of the other EAs that possess absorbing states.
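
To make the scheme being analysed concrete, the following is a minimal Python sketch of a reward-inaction Discretized Pursuit Automaton in the spirit of the DPA of Oommen and Lanctôt [10]. It is an illustration only: the resolution parameter N, the initial sampling phase, the horizon, and all variable names are our own choices rather than the paper's notation.

import random

def discretized_pursuit(env_reward_probs, N=100, horizon=100_000, init_pulls=10):
    """Illustrative reward-inaction Discretized Pursuit Automaton (DPA).

    env_reward_probs : reward probabilities d_1..d_r of a stationary Environment
                       (unknown to the automaton).
    N                : resolution parameter; the discretization step is 1/(r*N).
    Returns the action-probability vector once an absorbing barrier is reached
    (some p_i equals 1) or the horizon is exhausted.
    """
    r = len(env_reward_probs)
    step = 1.0 / (r * N)            # smallest jump of the discretized random walk
    p = [1.0 / r] * r               # action probabilities p_1..p_r
    wins = [0] * r                  # reward counts, used for the estimates d_hat_i
    pulls = [0] * r                 # selection counts

    def sample(i):
        """Apply action alpha_i once and update its reward estimate."""
        rewarded = random.random() < env_reward_probs[i]
        wins[i] += rewarded
        pulls[i] += 1
        return rewarded

    for i in range(r):              # initialize the estimates by sampling
        for _ in range(init_pulls): # every action a few times
            sample(i)

    for _ in range(horizon):
        if max(p) == 1.0:           # absorbing barrier: the automaton has converged
            break
        i = random.choices(range(r), weights=p)[0]
        if sample(i):               # reward-inaction: update p only on a reward
            best = max(range(r), key=lambda j: wins[j] / pulls[j])
            for j in range(r):      # move probability mass, in discrete steps,
                if j != best:       # towards the currently best-estimated action
                    p[j] = max(p[j] - step, 0.0)
            p[best] = 1.0 - sum(p[j] for j in range(r) if j != best)
    return p

# Example: a four-action Environment in which the first action is optimal.
print(discretized_pursuit([0.8, 0.6, 0.4, 0.2]))

In a typical run the automaton is absorbed into the optimal action, i.e., the returned vector is [1.0, 0.0, 0.0, 0.0]; the 𝜖-optimality result proven in the paper is the formal statement that, for a sufficiently fine resolution, this occurs with arbitrarily high probability.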


Notes

  1. PAs have also been extended to operate under the Reward-Penalty paradigms [13]. We do not consider these here.

  2. In addition, like the proofs of asymptotic convergence, the finite-time analyses of both the CPA and the DPA were also presented in [2]. Unfortunately, these analyses are also flawed, inasmuch as they too rely on the above-mentioned “monotonicity” assumption on the probability of selecting the optimal action.

  3. We state the conditions and parameters for the CPA, while the analogous conditions and parameters for the DPA are stated in parentheses to avoid repetition. Strictly speaking, we did not have to mention the conditions and parameters for the CPA at all. However, we have opted to include them because the proof given in [25], which demonstrated the flaw, is based on the CPA, and we believe that this improves the readability of the present paper.

  4. In the interest of simplicity, at this juncture we have assumed in (4) that the \(\hat {d}_{j}\)’s are independent of each other. We believe that this assumption can be easily relaxed by considering only the individual \(d_{j}\)’s and not all of them together.

  5. If one is interested in pursuing the general r-action scenario in greater detail without invoking the 2-action results, the arguments involved are almost identical, except that the algebra is a little more cumbersome. Without going into the detailed algebraic manipulations, we can submit the arguments as follows. For the specific Environment, we define:

    $$\begin{array}{@{}rcl@{}} H_{j} &=& d_{m}-d_{j}, j \neq m,\\ \hat{H}_{j}(t) &=& \hat{d}_{m}(t)-\hat{d}_{j}(t), j \neq m. \end{array} $$

    Then, given any \(\delta \in (0, 1)\), if we denote \(\delta^{\star} = 1-\sqrt[r-1]{1-\delta}\), we can show that there exists a time instant \(t_{0}\), such that within the time defined by \(t_{0}\), \(\alpha_{m}\) has been selected more than \(\left\lceil \frac{-\ln{\delta^{\star}}}{(\min\{H_{j}\})^{2}} \right\rceil\) times, and each \(\alpha_{j}\ (j \neq m)\) has been selected more than \(\left\lceil \frac{-\ln{\delta^{\star}}}{H_{j}^{2}} \right\rceil\) times. Consequently, for all \(t > t_{0}\), \(q_{j}(t) > 1-\delta^{\star}\) and \(q(t) \geq \prod\limits_{j=1 \ldots r,\, j \neq m}{q_{j}(t)} > 1-\delta\). A small numerical sketch of these quantities is given below.
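
As a concrete illustration, the following Python fragment evaluates \(\delta^{\star}\) and the two selection counts of Note 5 for a hypothetical four-action Environment; the reward probabilities [0.8, 0.6, 0.4, 0.2] and the choice \(\delta = 0.05\) are ours and purely illustrative.

import math

def selection_bounds(d, delta):
    """Evaluate delta_star and the selection counts appearing in Note 5."""
    r = len(d)
    m = max(range(r), key=lambda j: d[j])                # index of the optimal action alpha_m
    delta_star = 1.0 - (1.0 - delta) ** (1.0 / (r - 1))  # delta* = 1 - (1 - delta)^(1/(r-1))
    H = {j: d[m] - d[j] for j in range(r) if j != m}     # H_j = d_m - d_j
    n_j = {j: math.ceil(-math.log(delta_star) / H[j] ** 2) for j in H}
    n_m = math.ceil(-math.log(delta_star) / min(H.values()) ** 2)
    return delta_star, n_m, n_j

delta_star, n_m, n_j = selection_bounds([0.8, 0.6, 0.4, 0.2], delta=0.05)
print(delta_star)  # about 0.017; by construction (1 - delta_star)**(r - 1) = 1 - delta
print(n_m)         # selections required of the optimal action alpha_m (102 here)
print(n_j)         # selections required of each alpha_j, j != m ({1: 102, 2: 26, 3: 12} here)

Once these selection counts are exceeded, each \(q_{j}(t)\) exceeds \(1-\delta^{\star}\), and hence their product exceeds \(1-\delta\), which is the claim above.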

References

  1. Zhang X, Oommen BJ, Granmo O-C, Jiao L (2014) Using the theory of regular functions to formally prove the 𝜖-optimality of discretized pursuit learning algorithms. In: Proceedings of IEA-AIE 2014. Springer, Kaohsiung, Taiwan, pp 379–388

  2. Rajaraman K, Sastry PS (1996) Finite time analysis of the pursuit algorithm for learning automata. IEEE Trans Syst Man Cybern B: Cybern 26:590–598


  3. Zhang X, Granmo O-C, Oommen BJ, Jiao L (2014) A formal proof of the 𝜖-optimality of absorbing continuous pursuit algorithms using the theory of regular functions. Appl Intell 41:974–985


  4. Narendra KS, Thathachar MAL (1989) Learning automata: an introduction. Prentice Hall

  5. Oommen BJ (1986) Absorbing and ergodic discretized two-action learning automata. IEEE Trans Syst Man Cybern 16:282–296


  6. Thathachar MAL, Sastry PS (1986) Estimator algorithms for learning automata. In: Proceedings of the Platinum Jubilee Conference on Systems and Signal Processing, Bangalore, India, pp 29–32

  7. Agache M, Oommen BJ (2002) Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Trans Syst Man Cybern B: Cybern 32(6):738–749


  8. Zhang X, Granmo O-C, Oommen BJ (2011) The Bayesian pursuit algorithm: A new family of estimator learning automata. In: Proceedings of IEA-AIE 2011. Springer, New York, USA, pp 608–620

  9. Zhang X, Granmo O-C, Oommen BJ (2013) On incorporating the paradigms of discretization and Bayesian estimation to create a new family of pursuit learning automata. Appl Intell 39:782–792


  10. Oommen BJ, Lanctôt JK (1990) Discretized pursuit learning automata. IEEE Trans Syst Man Cybern 20:931–938


  11. Lanctôt JK, Oommen BJ (1991) On discretizing estimator-based learning algorithms. IEEE Trans Syst Man Cybern B: Cybern 2:1417–1422


  12. Lanctôt JK, Oommen BJ (1992) Discretized estimator learning automata. IEEE Trans Syst Man Cybern B: Cybern 22(6):1473–1483


  13. Oommen BJ, Agache M (2001) Continuous and discretized pursuit learning schemes: various algorithms and their comparison. IEEE Trans Syst Man Cybern B: Cybern 31(3):277–287


  14. Zhang X, Granmo O-C, Oommen BJ (2012) Discretized Bayesian pursuit - a new scheme for reinforcement learning. In: Proceedings of IEA-AIE 2012, Dalian, China, pp 784–793

  15. Oommen BJ, Granmo O-C, Pedersen A (2007) Using stochastic AI techniques to achieve unbounded resolution in finite player Goore Games and its applications. In: Proceedings of IEEE Symposium on Computational Intelligence and Games, Honolulu, HI, pp 161–167

  16. Beigy H, Meybodi MR (2000) Adaptation of parameters of BP algorithm using learning automata. In: Proceedings of Sixth Brazilian Symposium on Neural Networks, Rio de Janeiro, Brazil, pp 24–31

  17. Granmo O-C, Oommen BJ, Myrer S-A, Olsen MG (2007) Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation. IEEE Trans Syst Man Cybern B 37(1):166–175


  18. Unsal C, Kachroo P, Bay JS (1999) Multiple stochastic learning automata for vehicle path control in an automated highway system. IEEE Trans Syst Man Cybern A 29:120–128


  19. Oommen BJ, Roberts TD (2000) Continuous learning automata solutions to the capacity assignment problem. IEEE Trans Comput 49:608–620


  20. Granmo O-C (2010) Solving stochastic nonlinear resource allocation problems using a hierarchy of twofold resource allocation automata. IEEE Trans Comput 59(4):545–560


  21. Oommen BJ, Croix TDS (1997) String taxonomy using learning automata. IEEE Trans Syst Man Cybern 27:354–365


  22. Oommen BJ, de St. Croix EV (1996) Graph partitioning using learning automata. IEEE Trans Comput 45:195–208


  23. Dean T, Angluin D, Basye K, Engelson S, Kaelbling L, Maron O (1995) Inferring finite automata with stochastic output functions and an application to map learning. Mach Learn 18:81–108


  24. Song Y, Fang Y, Zhang Y (2007) Stochastic channel selection in cognitive radio networks. In: Proceedings of IEEE Global Telecommunications Conference, Washington DC, USA, pp 4878–4882

  25. Martin R, Tilak O (2012) On 𝜖-optimality of the pursuit learning algorithm. J Appl Probab 49(3):795–805


  26. Zhang X, Granmo O-C, Oommen BJ, Jiao L (2013) On using the theory of regular functions to prove the 𝜖-optimality of the continuous pursuit learning automaton. In: Proceedings of IEA-AIE 2013. Springer, Amsterdam, Holland, pp 262–271

  27. Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13–30



Author information

Corresponding author

Correspondence to Xuan Zhang.

Additional information

This work was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of some of the results of this paper was presented at IEA-AIE 2014, the 27th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kaohsiung, Taiwan, in June 2014 [1]. That preliminary version won the Best Paper Award at the conference.

Cite this article

Zhang, X., Oommen, B.J., Granmo, OC. et al. A formal proof of the 𝜖-optimality of discretized pursuit algorithms. Appl Intell 44, 282–294 (2016). https://doi.org/10.1007/s10489-015-0670-1

