Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies


Abstract

This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The architecture consists of three modules: a controller, a fast-tracked learning module and an actor-critic. Strategies are represented by a probability distribution \(c_{ik}\). The controller balances exploration and exploitation, using the Kullback–Leibler divergence to decide whether newly proposed strategies improve on the currently employed strategy. Exploitation relies on a fast-tracked learning algorithm that employs a fixed strategy and a priori knowledge; it is only required to compute estimates of the transition matrices and utilities. Exploration relies on an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic decides whether to accept the proposed strategies. We prove the convergence of the algorithms that implement the architecture, and an application example related to inventory control shows its effectiveness.
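The abstract describes the decision mechanism only at a high level. The following is a minimal Python sketch, under assumptions, of how a KL-divergence-based controller decision and the fast-tracked estimation of transition matrices could look; the class CEEController, the helper estimate_transition_matrices, and the parameter kl_threshold are illustrative, not the paper's actual implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two strategy distributions c[i, k],
    treated here as flattened over all state-action pairs."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def estimate_transition_matrices(transitions, n_states, n_actions):
    """Empirical estimate of P(j | i, k) from observed (i, k, j) triples,
    standing in for the fast-tracked (model-estimating) learning step."""
    counts = np.zeros((n_states, n_actions, n_states))
    for i, k, j in transitions:
        counts[i, k, j] += 1
    totals = counts.sum(axis=2, keepdims=True)
    totals[totals == 0] = 1.0  # unseen state-action pairs stay at zero
    return counts / totals

class CEEController:
    """Sketch of the controller module: it keeps the current strategy
    c[i, k] (probability of action k in state i) and uses a KL-divergence
    test to decide whether to keep exploiting the current strategy or to
    accept a new strategy proposed by the actor-critic (explore)."""

    def __init__(self, n_states, n_actions, kl_threshold=0.05):
        # Start from the uniform strategy distribution.
        self.c = np.full((n_states, n_actions), 1.0 / n_actions)
        self.kl_threshold = kl_threshold

    def decide(self, c_candidate, value_current, value_candidate):
        """Accept the candidate strategy only if it differs sufficiently
        (KL divergence above threshold) and its estimated value improves
        on the current one."""
        divergence = kl_divergence(c_candidate, self.c)
        if divergence > self.kl_threshold and value_candidate > value_current:
            self.c = c_candidate.copy()
            return "explore"  # switch to the actor-critic proposal
        return "exploit"      # keep the fast-tracked fixed strategy
```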


Notes

  1. In contrast, the main problem of an actor-critic architecture is its premature convergence to suboptimal policies.


Author information


Corresponding author

Correspondence to Julio B. Clempner.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by A. Di Nola.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Asiain, E., Clempner, J.B. & Poznyak, A.S. Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies. Soft Comput 23, 3591–3604 (2019). https://doi.org/10.1007/s00500-018-3225-7
