Abstract
In deep reinforcement learning, the excessive randomness of the ε-greedy method can degrade an agent's training performance. This paper proposes a novel action decision method that replaces the ε-greedy method and avoids this excessive randomness. First, a confidence bound span fitting model based on a deep neural network is proposed to address the fundamental limitation that UCB cannot estimate the confidence bound span of each action in a high-dimensional state space. Then, a confidence bound span balance model based on target values in reverse order is proposed: after each action decision, the parameters of the U network are updated via the backpropagation of the neural network to balance the confidence bound span. Finally, an exploration-exploitation dynamic balance factor \(\alpha\) is introduced to balance exploration and exploitation during training. Experiments are conducted with the Nature DQN and Double DQN algorithms, and the results demonstrate that the proposed method outperforms the ε-greedy method under the basic algorithms and experimental environments used in this paper. The method presented here is significant for applying confidence bounds to complex reinforcement learning problems.
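The decision rule summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the U network is stood in for by a hand-rolled linear model, and the names `UNetwork`, `select_action`, `q_values`, and `alpha`, as well as the span-shrinking target, are hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

class UNetwork:
    """Toy stand-in (an assumption, not the paper's architecture) for the
    U network: it maps a state to a confidence bound span per action."""

    def __init__(self, state_dim, n_actions, lr=0.1):
        # Non-negative weights so spans start positive for non-negative states.
        self.W = np.abs(rng.normal(size=(n_actions, state_dim)))
        self.lr = lr

    def predict(self, state):
        return self.W @ state

    def update(self, state, action, target_span):
        # One gradient step on a squared error, playing the role of the
        # backpropagation update that balances the confidence bound span.
        err = self.predict(state)[action] - target_span
        self.W[action] -= self.lr * err * state


def select_action(q_values, u_net, state, alpha):
    """UCB-style decision: exploit Q, explore via the predicted span,
    with alpha as the exploration-exploitation balance factor."""
    return int(np.argmax(q_values + alpha * u_net.predict(state)))


# Illustrative run with a fixed toy state and hand-set Q-values.
state = rng.uniform(size=4)            # non-negative toy state
q = np.array([1.0, 1.2, 0.9])
u_net = UNetwork(state_dim=4, n_actions=3)

a = select_action(q, u_net, state, alpha=1.0)
before = u_net.predict(state)[a]
# After acting, pull the chosen action's span toward a smaller target,
# so repeatedly chosen actions gradually lose their exploration bonus.
u_net.update(state, a, target_span=0.5 * before)
```

With `alpha=0` the rule reduces to pure exploitation (greedy over Q), while larger `alpha` weights the learned confidence bound span more heavily; in the paper this factor is scheduled dynamically over training rather than held fixed.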
Data availability
The datasets analysed during the current study are available in the OpenAI Gym library, https://www.gymlibrary.dev/.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62073245), the Natural Science Foundation of Shanghai (20ZR1440500), and Pudong New Area Science & Technology Development Fund (PKX2021-R07).
Author information
Contributions
Study design: Wenhao Zhang, Yaqing Song, Xiangpeng Liu, Qianqian Shangguan and Kang An; Conduct of the study: Wenhao Zhang; Writing—original draft: Wenhao Zhang and Yaqing Song; Supervision: Kang An, Xiangpeng Liu and Qianqian Shangguan; Writing—review and editing: Yaqing Song, Wenhao Zhang and Kang An. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Zhang, W., Song, Y., Liu, X. et al. A novel action decision method of deep reinforcement learning based on a neural network and confidence bound. Appl Intell 53, 21299–21311 (2023). https://doi.org/10.1007/s10489-023-04695-1