Research article
DOI: 10.1145/2541940.2541967

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

Published: 24 February 2014

Abstract

Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope.
Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy.
We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster and reduces the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
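To make the operation mix concrete, here is a minimal sketch in C (not taken from the paper; the function names and the ReLU non-linearity are illustrative assumptions) of the multiply-accumulate kernel of a fully connected neural-network layer. These synaptic weight multiplications and neuron output additions are the operations counted in the 452 GOP/s figure.

    #include <stddef.h>

    /* Illustrative sketch only: the multiply-accumulate kernel of a fully
     * connected NN layer. Each output neuron accumulates the products of its
     * synaptic weights and the input neuron values, then applies an assumed
     * non-linearity (ReLU here). */
    static float relu(float x) { return x > 0.0f ? x : 0.0f; }

    void fc_layer(const float *in, size_t n_in,
                  const float *weights,        /* n_out rows of n_in weights */
                  float *out, size_t n_out)
    {
        for (size_t o = 0; o < n_out; o++) {
            float acc = 0.0f;
            for (size_t i = 0; i < n_in; i++)
                acc += weights[o * n_in + i] * in[i];  /* 1 multiply + 1 add */
            out[o] = relu(acc);
        }
    }

At the reported 452 GOP/s, a layer with n_in = n_out = 1000 (about two million multiply and add operations) corresponds to only a few microseconds of raw accelerator throughput.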



    Published In

    cover image ACM Conferences
    ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
    February 2014
    780 pages
    ISBN:9781450323055
    DOI:10.1145/2541940
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 February 2014


    Author Tags

    1. accelerator
    2. memory
    3. neural networks

    Qualifiers

    • Research-article

    Conference

    ASPLOS '14

    Acceptance Rates

    ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 1,366
    • Downloads (last 6 weeks): 216
    Reflects downloads up to 02 Mar 2025

    Citations

    Cited By

    • (2025) Flips: A Flexible Partitioning Strategy Near Memory Processing Architecture for Recommendation System. IEEE Transactions on Parallel and Distributed Systems, 36(4):745-758. DOI: 10.1109/TPDS.2025.3539534. Online publication date: Apr-2025.
    • (2025) Secure Machine Learning Hardware: Challenges and Progress [Feature]. IEEE Circuits and Systems Magazine, 25(1):8-34. DOI: 10.1109/MCAS.2024.3509376. Online publication date: Sep-2026.
    • (2024) YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 212-226. DOI: 10.1145/3640537.3641566. Online publication date: 17-Feb-2024.
    • (2024) Neural Architecture Search as Program Transformation Exploration. Communications of the ACM. DOI: 10.1145/3624775. Online publication date: 25-Sep-2024.
    • (2024) In-Storage Domain-Specific Acceleration for Serverless Computing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 530-548. DOI: 10.1145/3620665.3640413. Online publication date: 27-Apr-2024.
    • (2024) A High-Performance and Energy-Efficient Photonic Architecture for Multi-DNN Acceleration. IEEE Transactions on Parallel and Distributed Systems, 35(1):46-58. DOI: 10.1109/TPDS.2023.3327535. Online publication date: Jan-2024.
    • (2024) Energy-Performance Assessment of Oscillatory Neural Networks Based on VO2 Devices for Future Edge AI Computing. IEEE Transactions on Neural Networks and Learning Systems, 35(7):10045-10058. DOI: 10.1109/TNNLS.2023.3238473. Online publication date: Jul-2024.
    • (2024) Accelerating Neural ODEs Using Model Order Reduction. IEEE Transactions on Neural Networks and Learning Systems, 35(1):519-531. DOI: 10.1109/TNNLS.2022.3175757. Online publication date: Jan-2024.
    • (2024) EDCompress: Energy-Aware Model Compression for Dataflows. IEEE Transactions on Neural Networks and Learning Systems, 35(1):208-220. DOI: 10.1109/TNNLS.2022.3172941. Online publication date: Jan-2024.
    • (2024) A Broad-Spectrum and High-Throughput Compression Engine for Neural Network Processors. IEEE Transactions on Circuits and Systems II: Express Briefs, 71(7):3528-3532. DOI: 10.1109/TCSII.2024.3364708. Online publication date: Jul-2024.
