Abstract
This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The architecture consists of three modules: a controller, a fast-tracked learning module, and an actor-critic module. Strategies are represented by a probability distribution \(c_{ik}\). The controller balances exploration and exploitation, using the Kullback–Leibler divergence to decide whether newly proposed strategies improve on the currently employed strategy. The exploitation module uses a fast-tracked learning algorithm that follows a fixed strategy and exploits a priori knowledge; it is only required to estimate the transition matrices and utilities. The exploration module employs an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic decides whether to accept the proposed strategies. We prove the convergence of the algorithms that implement the architecture, and an application example related to inventory demonstrates its effectiveness.
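To make the interplay of the three modules concrete, the following Python sketch implements one plausible reading of the controller's KL-divergence test, the exploitation module's transition estimates, and the actor's policy gradient update. All names (kl_divergence, controller_accepts, estimate_transitions, actor_step), the threshold delta, and the exact update rule are illustrative assumptions; the abstract does not specify the authors' formulas.

```python
import numpy as np

def kl_divergence(c_new, c_cur, eps=1e-12):
    # D(c_new || c_cur), summed over all state-action pairs (i, k).
    # The strategies c_{ik} are row-stochastic: each row i is a
    # probability distribution over actions k.
    p = np.clip(c_new, eps, None)
    q = np.clip(c_cur, eps, None)
    return float(np.sum(p * np.log(p / q)))

def controller_accepts(c_new, c_cur, delta=0.05):
    # Controller module: switch to the proposed strategy only when it
    # differs enough (in KL divergence) from the one currently employed.
    # The threshold delta is an assumed tuning parameter.
    return kl_divergence(c_new, c_cur) > delta

def estimate_transitions(counts):
    # Exploitation module (fast-tracked learning): maximum-likelihood
    # estimates of the transition matrices from visit counts, where
    # counts[i, k, j] = number of observed transitions i --k--> j.
    totals = counts.sum(axis=2, keepdims=True)
    uniform = np.full(counts.shape, 1.0 / counts.shape[2])
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def actor_step(c, i, k, td_error, lr=0.1):
    # Exploration module (actor): shift probability mass toward action k
    # in state i when the critic's temporal-difference error is positive,
    # then renormalize the row so it remains a distribution.
    c = c.copy()
    c[i, k] = max(c[i, k] + lr * td_error, 1e-6)
    c[i] /= c[i].sum()
    return c

if __name__ == "__main__":
    c_cur = np.full((2, 2), 0.5)  # uniform initial strategy c_{ik}
    c_new = actor_step(c_cur, i=0, k=1, td_error=0.8)
    print("KL(c_new || c_cur) =", kl_divergence(c_new, c_cur))
    print("controller accepts:", controller_accepts(c_new, c_cur))
```

In the full architecture the critic would supply the temporal-difference error and an accepted strategy would replace c_cur; this demo only exercises a single proposal/decision cycle to show how the KL test gates the switch between exploitation and exploration.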
Notes
In contrast, the main problem of an actor-critic architecture is its premature convergence to suboptimal policies.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by A. Di Nola.
About this article
Cite this article
Asiain, E., Clempner, J.B. & Poznyak, A.S. Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies. Soft Comput 23, 3591–3604 (2019). https://doi.org/10.1007/s00500-018-3225-7