ABSTRACT
Deep Reinforcement Learning (Deep RL) is applied to many areas in which an agent learns to interact with an environment to achieve a goal, such as playing video games and controlling robots. Deep RL exploits a deep neural network (DNN) to eliminate the need for handcrafted feature engineering that requires prior domain knowledge. Asynchronous Advantage Actor-Critic (A3C) is one of the state-of-the-art Deep RL methods. In this paper, we present an FPGA-based A3C Deep RL platform, called FA3C. Traditionally, FPGA-based DNN accelerators have focused mainly on inference, exploiting fixed-point arithmetic. Our platform targets both inference and training using single-precision floating-point arithmetic. We demonstrate the performance and energy efficiency of FA3C using multiple A3C agents that learn the control policies of six Atari 2600 games. FA3C achieves 27.9% better performance than a state-of-the-art implementation on a high-end GPU (NVIDIA Tesla P100), along with 1.62x better energy efficiency.
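To make the A3C method named in the abstract concrete, the sketch below shows its core computation in NumPy: each agent rolls out a few steps, computes bootstrapped n-step returns, and combines an advantage-weighted actor loss with a squared-error critic loss and an entropy bonus. This is a minimal illustration of the standard A3C objective, not code from the FA3C platform; function names and coefficient values are illustrative.

```python
import numpy as np

def nstep_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns R_t = r_t + gamma * R_{t+1},
    bootstrapped from the critic's value estimate of the final state."""
    R = bootstrap_value
    returns = np.empty(len(rewards), dtype=np.float64)
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    return returns

def a3c_loss(returns, values, log_probs, entropies,
             value_coef=0.5, entropy_coef=0.01):
    """Combined A3C objective over one rollout.

    returns   -- n-step returns from nstep_returns()
    values    -- critic estimates V(s_t)
    log_probs -- log pi(a_t | s_t) of the actions taken
    entropies -- policy entropy at each step (encourages exploration)
    """
    advantages = returns - values
    policy_loss = -np.mean(log_probs * advantages)     # actor term
    value_loss = value_coef * np.mean(advantages ** 2)  # critic term
    entropy_bonus = entropy_coef * np.mean(entropies)
    return policy_loss + value_loss - entropy_bonus
```

In the asynchronous scheme, several such agents compute gradients of this loss in parallel and apply them to shared network parameters without locking; FA3C maps the forward and backward passes of this computation onto FPGA logic in single-precision floating point.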
FA3C: FPGA-Accelerated Deep Reinforcement Learning