Research Article
DOI: 10.1145/2897937.2897995

C-brain: a deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization

Published: 05 June 2016

Abstract

Convolutional neural network (CNN) accelerators have been proposed as an efficient hardware solution for deep-learning applications, which are known to be both compute- and memory-intensive. Although the most advanced CNN accelerators can deliver high computational throughput, their performance is highly unstable: once the accelerator is adapted to a new network with different parameters, such as layer count or kernel size, the fixed hardware structure may no longer match the data flows well. Consequently, the accelerator fails to deliver high performance because either logic resources or memory bandwidth are underutilized. To overcome this problem, we propose a novel deep-learning accelerator that offers three types of data-level parallelism: inter-kernel, intra-kernel, and hybrid. Our design can adaptively switch among the three types of parallelism, and the corresponding data-tiling schemes, to dynamically match different networks or even different layers of a single network. Regardless of the hardware configuration or network type, the proposed network-mapping strategy ensures optimal performance and energy efficiency. Compared with previous state-of-the-art NN accelerators, our design achieves a speedup of 4.0x-8.3x on some layers of well-known large-scale CNNs. Over the whole network forward-propagation phase, it saves 28.04% of PE energy and 90.3% of on-chip memory energy on average.
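For intuition, the adaptive scheme the abstract describes — switching between inter-kernel, intra-kernel, and hybrid data-level parallelism per layer — can be sketched as a utilization comparison over the processing-element (PE) array. The cost model and the names `pe_utilization` and `choose_parallelism` below are illustrative assumptions for this sketch, not the paper's actual mapping algorithm.

```python
# A minimal sketch of adaptive data-level-parallelism selection, in the
# spirit of the abstract. The utilization model is a hypothetical
# illustration, not the paper's cost model.
import math


def pe_utilization(work_units: int, num_pes: int) -> float:
    """Fraction of PE cycles doing useful work when `work_units`
    independent operations are spread across `num_pes` PEs."""
    passes = math.ceil(work_units / num_pes)
    return work_units / (passes * num_pes)


def choose_parallelism(num_kernels: int, kernel_size: int, num_pes: int) -> str:
    """Pick a parallelization scheme for one convolutional layer.

    inter-kernel: each PE computes a different output feature map
                  (parallel work = num_kernels)
    intra-kernel: PEs share the multiply-accumulates inside one kernel
                  window (parallel work = kernel_size ** 2)
    hybrid:       the PE array is factored across both dimensions
    """
    inter = pe_utilization(num_kernels, num_pes)
    intra = pe_utilization(kernel_size ** 2, num_pes)
    best_hybrid = 0.0
    for p in range(1, num_pes + 1):      # p PEs across kernels,
        if num_pes % p:                  # num_pes // p within one kernel
            continue
        u = (pe_utilization(num_kernels, p)
             * pe_utilization(kernel_size ** 2, num_pes // p))
        best_hybrid = max(best_hybrid, u)
    if best_hybrid > max(inter, intra):
        return "hybrid"
    return "inter-kernel" if inter >= intra else "intra-kernel"
```

Under this toy model, a layer with many output kernels and a small window (e.g. 256 kernels of 3x3 on a 64-PE array) favors inter-kernel parallelism, a layer with few kernels and a large window favors intra-kernel, and intermediate shapes land on hybrid — mirroring why a fixed scheme underutilizes some layers of a diverse network.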



Published In

DAC '16: Proceedings of the 53rd Annual Design Automation Conference
June 2016
1048 pages
ISBN:9781450342360
DOI:10.1145/2897937

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

DAC '16

Acceptance Rates

Overall Acceptance Rate 1,317 of 3,929 submissions, 34%
