DOI: 10.1145/3400302.3415636
research-article

A many-core accelerator design for on-chip deep reinforcement learning

Published: 17 December 2020

Abstract

Deep Reinforcement Learning (DRL) is resource-intensive and requires large-scale distributed computing nodes to learn complicated tasks such as playing video games and Go. This work attempts to down-scale a distributed DRL system into a specialized many-core chip and achieve energy-efficient on-chip DRL. Using a customized Network-on-Chip that handles the communication of on-chip data and control signals, we propose a Synchronous Asynchronous RL Architecture (SARLA) and a corresponding many-core chip that completely avoids the unnecessary data duplication and synchronization activities of multi-node RL systems. In evaluation, the SARLA system achieves a considerable energy-efficiency boost over GPU-based implementations for typical DRL workloads built with OpenAI Gym.

      Information & Contributors

      Information

      Published In

      ICCAD '20: Proceedings of the 39th International Conference on Computer-Aided Design
      November 2020
      1396 pages
      ISBN:9781450380263
      DOI:10.1145/3400302
      • General Chair: Yuan Xie

      Sponsors

      In-Cooperation

      • IEEE CAS
      • IEEE CEDA
      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. distributed learning
      2. many-core chip
      3. network-on-chip
      4. reinforcement learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Youth Innovation Promotion Association, CAS
      • State Key Laboratory of Computer Architecture

      Conference

      ICCAD '20

      Acceptance Rates

      Overall Acceptance Rate 457 of 1,762 submissions, 26%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 65
      • Downloads (Last 6 weeks): 1
      Reflects downloads up to 15 Feb 2025

      Citations

      Cited By

      • (2024) HDRLPIM: A Simulator for Hyper-Dimensional Reinforcement Learning Based on Processing In-Memory. ACM Journal on Emerging Technologies in Computing Systems 20:4 (1-17). DOI: 10.1145/3695875. Online publication date: 28-Nov-2024
      • (2024) A FPGA Accelerator of Distributed A3C Algorithm with Optimal Resource Deployment. IET Computers & Digital Techniques 2024 (1-13). DOI: 10.1049/2024/7855250. Online publication date: 27-May-2024
      • (2023) Hardware-Optimized Hyperdimensional Computing for Real-Time Learning. 2023 IEEE 66th International Midwest Symposium on Circuits and Systems (MWSCAS) (811-815). DOI: 10.1109/MWSCAS57524.2023.10406123. Online publication date: 6-Aug-2023
      • (2023) A Deep Q Network Hardware Accelerator Based on Heterogeneous Computing. 2023 IEEE 15th International Conference on ASIC (ASICON) (1-4). DOI: 10.1109/ASICON58565.2023.10396321. Online publication date: 24-Oct-2023
      • (2023) FARANE-Q: Fast Parallel and Pipeline Q-Learning Accelerator for Configurable Reinforcement Learning SoC. IEEE Access 11 (144-161). DOI: 10.1109/ACCESS.2022.3232853. Online publication date: 2023
      • (2022) Deep Reinforcement Learning Acceleration for Real-Time Edge Computing Mixed Integer Programming Problems. IEEE Access 10 (18526-18543). DOI: 10.1109/ACCESS.2022.3147674. Online publication date: 2022
      • (2022) Data streaming and traffic gathering in mesh-based NoC for deep neural network acceleration. Journal of Systems Architecture 126 (102466). DOI: 10.1016/j.sysarc.2022.102466. Online publication date: May-2022
