Improving actor-critic structure by relatively optimal historical information for discrete system

  • Original Article
  • Neural Computing and Applications

Abstract

Neural networks based on the actor-critic structure have recently been widely used in many reinforcement learning tasks. The structure consists of two main parts: (i) an actor module, which outputs a probability distribution over actions, and (ii) a critic module, which outputs a predicted value based on the current environment. Actor-critic networks usually need expert demonstrations to provide appropriate pre-training for the actor module, but demonstration data are often hard or even impossible to obtain. Most of these networks, such as those used in maze and robot control tasks, suffer from a lack of proper pre-training and from unstable error propagation from the critic module to the actor module, which results in poor and unstable performance. To address this, we propose a specially designed module called relatively optimal historical information learning (ROHI). The ROHI module records the information explored in the past and derives the relatively optimal information from it through a customized merging algorithm. This relatively optimal historical information is then used to assist in training the actor module during the main learning process. We evaluate the module in two complex experimental environments: a complex maze problem and a flipping game. The experimental results show that models extended with ROHI significantly improve the success rate of the original actor-critic models and slightly decrease the number of iterations required for value-based networks to reach the stable phase.
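The abstract does not spell out the merging algorithm or how the recorded information enters the actor's training, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that "relatively optimal" means, for each state, the action taken in the highest-return episode seen so far, and that this action is used as an auxiliary cross-entropy target for the actor alongside an ordinary advantage-weighted policy-gradient term. The names `HistoryBuffer`, `merge`, `target_action`, `actor_loss`, and `hint_weight` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Optional, Tuple

State = Hashable
Trajectory = List[Tuple[State, int]]  # (state, discrete action) pairs of one episode


def softmax(logits: List[float]) -> List[float]:
    """Turn actor logits into a probability distribution over discrete actions."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class HistoryBuffer:
    """Hypothetical stand-in for a ROHI-style record of explored episodes."""

    def __init__(self) -> None:
        # state -> (best episode return seen so far, action taken in that episode)
        self.best: Dict[State, Tuple[float, int]] = {}

    def merge(self, trajectory: Trajectory, episode_return: float) -> None:
        # Assumed merge rule: per state, keep the action from the best episode so far.
        for state, action in trajectory:
            known = self.best.get(state)
            if known is None or episode_return > known[0]:
                self.best[state] = (episode_return, action)

    def target_action(self, state: State) -> Optional[int]:
        entry = self.best.get(state)
        return None if entry is None else entry[1]


def actor_loss(probs: List[float], action: int, advantage: float,
               hint: Optional[int] = None, hint_weight: float = 0.5) -> float:
    """Policy-gradient term plus an optional cross-entropy pull toward the hint."""
    loss = -advantage * math.log(probs[action])
    if hint is not None:
        loss += -hint_weight * math.log(probs[hint])
    return loss


# Two explored episodes; the second has the higher return and overrides state "s0".
buffer = HistoryBuffer()
buffer.merge([("s0", 1), ("s1", 0)], episode_return=-5.0)
buffer.merge([("s0", 2), ("s2", 1)], episode_return=3.0)

probs = softmax([0.0, 0.0, 0.0])  # an untrained actor: uniform over three actions
loss = actor_loss(probs, action=1, advantage=1.0, hint=buffer.target_action("s0"))
```

In this sketch the critic is represented only through the `advantage` argument; in a full model it would come from the critic's value prediction. The auxiliary term gives the actor a training signal that does not depend on the critic, which is one plausible way to read the abstract's concern about unstable error propagation.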

Acknowledgements

The authors would like to thank the editor, the associate editor and the anonymous reviewers for their constructive comments, which helped improve our work. This work was supported by the NSFC Project under Grant Nos. 62176069 and 61933013, the Excellent Youth Scientific Research Project of Hunan Education Department under Grant Nos. 21B0582 and 21B0565, the Natural Science Foundation of Guangdong Province under Grant No. 2019A1515011076, and the Natural Science Foundation of Henan Province under Grant Nos. 202300410092 and 202300410093.

Author information

Corresponding author

Correspondence to Xiao-Yuan Jing.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, X., Li, W., Zhu, X. et al. Improving actor-critic structure by relatively optimal historical information for discrete system. Neural Comput & Applic 34, 10023–10037 (2022). https://doi.org/10.1007/s00521-022-06988-x

  • DOI: https://doi.org/10.1007/s00521-022-06988-x
