Abstract
Deep reinforcement learning (RL) has become one of the most popular topics in artificial intelligence research. It has been widely applied in various fields, such as end-to-end control, robotic control, recommendation systems, and natural language dialogue systems. In this survey, we systematically categorize deep RL algorithms and applications, and provide a detailed review of existing deep RL algorithms, dividing them into model-based methods, model-free methods, and advanced RL methods. We thoroughly analyze advances in the field, including exploration, inverse RL, and transfer RL. Finally, we outline current representative applications and discuss four open problems for future research.
Contributions
Hao-nan WANG designed the research. Ning LIU and Yi-yun ZHANG collected the literature. Hao-nan WANG drafted the manuscript. Ning LIU, Da-wei FENG, and Feng HUANG helped organize the manuscript. Hao-nan WANG and Ning LIU revised the manuscript. Hao-nan WANG finalized the paper under the guidance of Dong-sheng LI and Yi-ming ZHANG.
Compliance with ethics guidelines
Hao-nan WANG, Ning LIU, Yi-yun ZHANG, Da-wei FENG, Feng HUANG, Dong-sheng LI, and Yi-ming ZHANG declare that they have no conflict of interest.
Project supported by the National Natural Science Foundation of China (Nos. 61772541, 61872376, and 61932001)