DOI: 10.1145/2966986.2967068
Research article

Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices

Published: 07 November 2016

Abstract

The rapid development of deep learning is enabling a wealth of novel applications, such as image and speech recognition, for embedded systems, robotics, and smart wearable devices. However, typical deep learning models such as deep convolutional neural networks (CNNs) consume so much on-chip storage and high-throughput compute resource that they cannot easily be handled by mobile or embedded devices with thrifty silicon and power budgets. To enable large CNN models on mobile and other cutting-edge devices for IoT or cyber-physical applications, we propose an efficient on-chip memory architecture for CNN inference acceleration and demonstrate its application to our in-house general-purpose deep learning accelerator. The redesigned on-chip memory subsystem, Memsqueezer, includes an active weight-buffer set and a data-buffer set that employ specialized compression methods to reduce the footprints of the CNN weights and data, respectively. The Memsqueezer buffers compress the data and weight sets according to their distinct features, and they also include a built-in redundancy-detection mechanism that actively scans the working set of a CNN to boost inference performance by eliminating data redundancy. Our experiments show that CNN accelerators with Memsqueezer buffers achieve more than 2× performance improvement and reduce energy consumption by 80% on average over conventional buffer designs with the same area budget.
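The abstract does not specify Memsqueezer's compression format. As a minimal illustrative sketch only — assuming a simple sparse (index, value) encoding, which is one common way to exploit the many zeros that ReLU layers leave in CNN activations — the basic idea of shrinking a buffer's footprint by storing only non-zero entries looks like this:

```python
# Hypothetical sketch, NOT the paper's actual Memsqueezer logic: sparse
# (index, value) compression of a post-ReLU activation work-set. Storing
# only non-zero values plus their positions shrinks the buffer footprint
# whenever the activation vector is mostly zeros.

def compress(activations):
    """Keep (index, value) pairs for non-zero entries only."""
    return [(i, v) for i, v in enumerate(activations) if v != 0]

def decompress(pairs, length):
    """Rebuild the dense activation vector from the sparse pairs."""
    dense = [0] * length
    for i, v in pairs:
        dense[i] = v
    return dense

acts = [0, 3, 0, 0, 7, 0, 1, 0]          # post-ReLU activations, mostly zero
packed = compress(acts)                   # 3 pairs instead of 8 words
assert decompress(packed, len(acts)) == acts
```

A hardware buffer would implement this with offset fields and alignment logic rather than Python lists; the sketch only shows why redundancy elimination translates directly into on-chip storage savings.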




        Published In

        2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
        Nov 2016
        946 pages

        Publisher

        IEEE Press



Cited By

• (2024) "Visual feature extraction and tracking method based on corner flow detection" (基于拐角流量检测的视觉特征提取与跟踪方法), 智能机器人 (Intelligent Robots), 10.52810/JIR.2024.001, 1:1 (1-10). Online publication date: 2-Mar-2024
• (2024) "REC: REtime Convolutional Layers to Fully Exploit Harvested Energy for ReRAM-based CNN Accelerators", ACM Transactions on Embedded Computing Systems, 10.1145/3652593, 23:6 (1-25). Online publication date: 11-Sep-2024
• (2024) "A lightweight distillation recurrent convolution network on FPGA for real-time video super-resolution", Multimedia Systems, 10.1007/s00530-024-01528-0, 30:6. Online publication date: 15-Oct-2024
• (2023) "Topological Dependencies in Deep Learning for Mobile Edge: Distributed and Collaborative High-Speed Inference", 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), 10.1109/ICEARS56392.2023.10084935 (1165-1171). Online publication date: 2-Mar-2023
• (2022) "On Minimizing the Read Latency of Flash Memory to Preserve Inter-Tree Locality in Random Forest", Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 10.1145/3508352.3549365 (1-9). Online publication date: 30-Oct-2022
• (2022) "Distributed and Collaborative High-Speed Inference Deep Learning for Mobile Edge with Topological Dependencies", IEEE Transactions on Cloud Computing, 10.1109/TCC.2020.2978846, 10:2 (821-834). Online publication date: 1-Apr-2022
• (2022) "Compression of Deep Neural Networks based on quantized tensor decomposition to implement on reconfigurable hardware platforms", Neural Networks, 10.1016/j.neunet.2022.02.024, 150:C (350-363). Online publication date: 18-May-2022
• (2022) "Aerial Robotics for Precision Agriculture: Weeds Detection Through UAV and Machine Vision", Optoelectronic Devices in Robotic Systems, 10.1007/978-3-031-09791-1_2 (23-51). Online publication date: 15-Jun-2022
• (2021) "Edge computing tied in artificial neural network classifiers", 10.20334/2021-021-M. Online publication date: 2021
• (2021) "On designing the adaptive computation framework of distributed deep learning models for Internet-of-Things applications", The Journal of Supercomputing, 10.1007/s11227-021-03795-4. Online publication date: 21-Apr-2021
