Abstract
This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The architecture consists of three modules: a controller, a fast-tracked learning module, and an actor-critic module. Strategies are represented by a probability distribution \(c_{ik}\). The controller balances exploration and exploitation, using the Kullback–Leibler divergence to decide whether newly proposed strategies improve on the currently employed strategy. The exploitation module uses a fast-tracked learning algorithm that follows a fixed strategy and exploits a priori knowledge; it is only required to estimate the transition matrices and utilities. The exploration module employs an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic decides whether to accept the proposed strategies. We prove the convergence of the algorithms that implement the architecture, and an application example related to inventory demonstrates its effectiveness.
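To make the interplay of the three modules concrete, the following Python sketch implements one plausible reading of the controller's KL-divergence test, the exploitation module's transition estimates, and the actor's policy gradient update. All names (kl_divergence, controller_accepts, estimate_transitions, actor_step), the threshold delta, and the exact update rule are illustrative assumptions; the abstract does not specify the authors' formulas.

```python
import numpy as np

def kl_divergence(c_new, c_cur, eps=1e-12):
    # D(c_new || c_cur), summed over all state-action pairs (i, k).
    # The strategies c_{ik} are row-stochastic: each row i is a
    # probability distribution over actions k.
    p = np.clip(c_new, eps, None)
    q = np.clip(c_cur, eps, None)
    return float(np.sum(p * np.log(p / q)))

def controller_accepts(c_new, c_cur, delta=0.05):
    # Controller module: switch to the proposed strategy only when it
    # differs enough (in KL divergence) from the one currently employed.
    # The threshold delta is an assumed tuning parameter.
    return kl_divergence(c_new, c_cur) > delta

def estimate_transitions(counts):
    # Exploitation module (fast-tracked learning): maximum-likelihood
    # estimates of the transition matrices from visit counts, where
    # counts[i, k, j] = number of observed transitions i --k--> j.
    totals = counts.sum(axis=2, keepdims=True)
    uniform = np.full(counts.shape, 1.0 / counts.shape[2])
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def actor_step(c, i, k, td_error, lr=0.1):
    # Exploration module (actor): shift probability mass toward action k
    # in state i when the critic's temporal-difference error is positive,
    # then renormalize the row so it remains a distribution.
    c = c.copy()
    c[i, k] = max(c[i, k] + lr * td_error, 1e-6)
    c[i] /= c[i].sum()
    return c

if __name__ == "__main__":
    c_cur = np.full((2, 2), 0.5)  # uniform initial strategy c_{ik}
    c_new = actor_step(c_cur, i=0, k=1, td_error=0.8)
    print("KL(c_new || c_cur) =", kl_divergence(c_new, c_cur))
    print("controller accepts:", controller_accepts(c_new, c_cur))
```

In the full architecture the critic would supply the temporal-difference error and an accepted strategy would replace c_cur; this demo only exercises a single proposal/decision cycle to show how the KL test gates the switch between exploitation and exploration.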
Notes
In contrast, the main problem of an actor-critic architecture is its premature convergence to suboptimal policies.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by A. Di Nola.
About this article
Cite this article
Asiain, E., Clempner, J.B. & Poznyak, A.S. Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies. Soft Comput 23, 3591–3604 (2019). https://doi.org/10.1007/s00500-018-3225-7