A prioritized objective actor-critic method for deep reinforcement learning

  • Original Article
  • Published in Neural Computing and Applications

Abstract

An increasing number of complex problems pose significant challenges in decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals that inherently lead to poor exploration when the agent seeks a specific goal. In extreme cases, the agent gets stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, namely Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behaviors to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weight over the objectives. SCMP, in contrast, generates a mixed policy based on a set of priority weights, i.e., the generated policy uses the knowledge of different policies (each corresponding to one priority weight) to dynamically prioritize objectives in real time. We implement both methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, whose multithreading mechanism dynamically balances the training intensity of different policies within a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean total reward in two complex problems: Food Collector and Seaquest.
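For concreteness, the following is a minimal Python sketch of the weighted-sum scalarization idea used by both methods: a vector of per-objective rewards is collapsed into a single scalar signal using a priority weight. The function name and the numerical values are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def scalarize_rewards(reward_vector, priority_weight):
        # Weighted-sum scalarization: collapse per-objective rewards into one
        # scalar training signal using the given priority weight.
        reward_vector = np.asarray(reward_vector, dtype=np.float32)
        priority_weight = np.asarray(priority_weight, dtype=np.float32)
        assert reward_vector.shape == priority_weight.shape
        return float(np.dot(priority_weight, reward_vector))

    # Example: two conflicting objectives (e.g., collecting food vs. staying
    # alive) with the equal priority weight (0.5, 0.5) used in Appendix A.
    print(scalarize_rewards([1.0, -0.2], [0.5, 0.5]))  # approximately 0.4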

Author information

Corresponding authors

Correspondence to Ngoc Duy Nguyen or Thanh Thi Nguyen.

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Agent behavior analysis

In this section, we analyze agent behaviors under MCSP and SCMP in Food Collector. This analysis clarifies the underlying mechanism that MCSP and SCMP use to tackle the LiS problem. We use a priority weight of (0.5, 0.5) for both methods. Figure 11 shows the differences between the two methods with respect to the agent's behaviors in Food Collector. The agent using MCSP eats chicken even when its energy is full and only collects rice to accumulate rewards; it therefore follows a mechanical strategy aimed at a long-term solution. In contrast, the agent using SCMP is greedier, as it collects both chicken and rice to accumulate rewards (even when its energy is low) and only eats the chicken when the situation becomes dangerous. Specifically, the agent prioritizes collecting the food over eating the chicken in the early stage because the energy decreases slowly. Once the agent achieves a high score, the energy decreases rapidly, so the agent prioritizes eating the food over collecting it to avoid energy depletion. In other words, the SCMP agent is "smarter" than the MCSP agent, as it dynamically modifies the priority of the different objectives in real time to adapt to the environment. This strategy is similar to a human strategy and hence enables the agent to finish the task efficiently in a short period. To measure the efficiency of the agents in each method, we compare the mean total number of steps per episode for each agent during training. Figure 11c shows the performance of two agents: one trained with MCSP using the priority weight (0.5, 0.5), and another trained with SCMP using the set \(P_4\) with an input weight of (0.5, 0.5) for evaluation. The SCMP agent is more efficient than the MCSP agent with respect to the mean total number of steps per episode.
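As a purely illustrative sketch of the dynamic prioritization just described, the snippet below shifts the priority weight toward the energy-restoring objective once a normalized energy level drops below a threshold. The function name, the threshold, and the weight values are hypothetical; SCMP learns this behavior from its trained policies rather than hard-coding such a rule.

    def dynamic_priority_weight(energy, low_threshold=0.3):
        # Hypothetical rule-of-thumb version of the observed SCMP behavior:
        # when energy (assumed normalized to [0, 1]) runs low, prioritize the
        # energy-restoring objective; otherwise prioritize reward collection.
        if energy < low_threshold:
            return (0.2, 0.8)  # dangerous situation: favor eating the chicken
        return (0.8, 0.2)      # safe situation: favor collecting the food

    # Early in an episode the energy decreases slowly, so reward collection
    # stays dominant; once energy runs low the priority flips.
    for e in (0.9, 0.5, 0.2):
        print(e, dynamic_priority_weight(e))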

Appendix B. SCMP with different input weights

In this appendix, we examine the performance of SCMP by varying the input weight used for evaluation in Food Collector and Seaquest. We use the same training weight sets for SCMP as in Sect. 4. Figures 12, 13, 14 and 15 present the performance of SCMP in Food Collector with different input weights for evaluation. The policy produced by SCMP does not depend on the input weight with respect to the mean total reward of collecting food (the primary reward signal). This makes SCMP convenient to use, as we do not need to search for an optimal input weight; the input weight serves only to adjust agent behaviors by prioritizing objectives. In Food Collector, for example, the weight (0, 1) maximizes the mean total reward of eating food, while the weight (1, 0) restricts the agent from eating food. Figures 16, 17, 18, 19 and 20 show the performance of SCMP using different input weights in Seaquest, and we draw the same conclusion as in Food Collector.
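The following is a minimal sketch of the evaluation protocol in this appendix: a trained SCMP agent is evaluated under a grid of input weights, and the mean total reward of the primary signal is compared across the grid. The DummyAgent class and the evaluate stub are placeholders so the sketch is self-contained; they are not the paper's code.

    class DummyAgent:
        # Stand-in for a trained SCMP agent.
        def act(self, observation, input_weight):
            return 0  # a real agent would query its mixed policy here

    def evaluate(agent, input_weight, episodes=100):
        # Placeholder: would run `episodes` episodes with the given input
        # weight and return the mean total reward of the primary signal.
        return 0.0  # dummy statistic so the sketch runs as-is

    agent = DummyAgent()

    # Per this appendix, the primary-signal reward should stay roughly
    # constant across the sweep; only the behavioral priorities change.
    for w in [(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]:
        print(f"input weight {w}: mean primary reward = {evaluate(agent, w):.1f}")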

About this article

Cite this article

Nguyen, N.D., Nguyen, T.T., Vamplew, P. et al. A prioritized objective actor-critic method for deep reinforcement learning. Neural Comput & Applic 33, 10335–10349 (2021). https://doi.org/10.1007/s00521-021-05795-0
