ABSTRACT
Deep Reinforcement Learning (DRL) is highly resource-intensive: learning complicated tasks such as video-game playing and Go typically requires large-scale distributed computing nodes. This work scales a distributed DRL system down to a specialized many-core chip to achieve energy-efficient on-chip DRL. Building on a customized Network-on-Chip that handles the communication of on-chip data and control signals, we propose a Synchronous-Asynchronous RL Architecture (SARLA) and a corresponding many-core chip that avoids the unnecessary data duplication and synchronization of multi-node RL systems. In our evaluation, the SARLA system achieves a considerable energy-efficiency improvement over GPU-based implementations on typical DRL workloads built with OpenAI Gym.
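To make the overhead concrete, the sketch below is a toy asynchronous multi-worker RL update loop in the style of distributed A3C-like systems: each worker duplicates the shared parameters locally and must synchronize on a lock to publish its update. These are precisely the data-duplication and synchronization costs that an on-chip design like SARLA aims to eliminate. All names and the gradient stand-in are illustrative assumptions, not taken from the paper.

```python
import threading
import random

# Shared parameter vector that all asynchronous workers update.
shared_params = [0.0] * 4
lock = threading.Lock()

def worker(steps, lr=0.01, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        with lock:                       # synchronization overhead
            local = list(shared_params)  # per-worker data duplication
        # Stand-in for a policy-gradient estimate from environment rollouts.
        grad = [rng.uniform(-1.0, 1.0) for _ in local]
        with lock:                       # synchronize again to publish
            for i, g in enumerate(grad):
                shared_params[i] += lr * g

threads = [threading.Thread(target=worker, args=(100, 0.01, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a multi-node deployment these copy-and-lock steps become network transfers and barrier waits; on a single many-core chip with a dedicated NoC, the same exchange can be done without replicating the parameter state per node.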
Index Terms
- A many-core accelerator design for on-chip deep reinforcement learning