DOI: 10.1145/2694344.2694358

PuDianNao: A Polyvalent Machine Learning Accelerator

Published: 14 March 2015

Abstract

Machine Learning (ML) techniques are pervasive in emerging commercial applications, but they require powerful computer systems to process very large datasets. Although general-purpose CPUs and GPUs offer straightforward solutions, their energy efficiency is limited by the overhead of supporting full programmability. Hardware accelerators can achieve better energy efficiency, but each accelerator typically supports only a single ML technique (or technique family). According to the well-known No-Free-Lunch theorem in the ML domain, however, an ML technique that performs well on one dataset may perform poorly on another, so such an accelerator may sometimes yield poor learning accuracy. Even setting accuracy aside, the accelerator can become inapplicable simply because the ML task changes or the user chooses a different ML technique.
In this study, we present an ML accelerator called PuDianNao, which accommodates seven representative ML techniques: k-means, k-nearest neighbors, naive Bayes, support vector machine, linear regression, classification tree, and deep neural network. Thanks to a thorough analysis of the computational primitives and locality properties of these techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm^2 while consuming only 596 mW. Compared with the NVIDIA K20M GPU (28nm process), PuDianNao (65nm process) is 1.20x faster and reduces energy consumption by 128.41x.
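
To make the abstract's key enabler concrete, that the seven techniques share a small set of computational primitives, here is a minimal illustrative sketch (ours, not the paper's implementation; all function names are hypothetical). Dot products dominate SVM, linear regression, and DNN layers, while distance computations dominate k-means and k-NN, so one set of hardware functional units can plausibly serve several technique families.

# Illustrative sketch only (hypothetical names, not PuDianNao's code):
# several of the seven supported ML techniques reduce to a few shared
# primitives, which is what makes one polyvalent datapath plausible.

def dot(x, w):
    """Dot product: dominant primitive of SVM, linear regression, DNN layers."""
    return sum(xi * wi for xi, wi in zip(x, w))

def squared_distance(x, c):
    """Squared Euclidean distance: dominant primitive of k-means and k-NN."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def linear_regression_predict(x, w, b):
    """Linear regression inference, built on the dot-product primitive."""
    return dot(x, w) + b

def kmeans_assign(x, centroids):
    """One k-means assignment step, built on the distance primitive."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(x, centroids[j]))

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(linear_regression_predict(x, w=[0.5, -0.2, 0.1], b=0.3))  # 0.7
    print(kmeans_assign(x, [[0.0, 0.0, 0.0], [1.0, 2.0, 2.5]]))     # 1

For scale, the reported peak of 1056 GOP/s at 596 mW works out to roughly 1056 / 0.596 ≈ 1772 GOP/s per watt.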

    Published In

    ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2015, 720 pages
    ISBN: 9781450328357
    DOI: 10.1145/2694344
    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. accelerator
    2. computer architecture
    3. machine learning

    Qualifiers

    • Research-article

    Conference

    ASPLOS '15

    Acceptance Rates

    ASPLOS '15 paper acceptance rate: 48 of 287 submissions (17%)
    Overall acceptance rate: 535 of 2,713 submissions (20%)

    Cited By

    • (2025) Flips: A Flexible Partitioning Strategy Near Memory Processing Architecture for Recommendation System. IEEE Transactions on Parallel and Distributed Systems, 36(4):745-758. DOI: 10.1109/TPDS.2025.3539534
    • (2024) Efficient and Fast High-performance Library Generation for Deep Learning Accelerators. IEEE Transactions on Computers, 1-14. DOI: 10.1109/TC.2024.3475575
    • (2024) RVVe: A Minimal RISC-V Vector Processor for Embedded AI Acceleration. 2024 IEEE 37th International System-on-Chip Conference (SOCC), 1-6. DOI: 10.1109/SOCC62300.2024.10737723
    • (2024) Deep neural networks accelerators with focus on tensor processors. Microprocessors & Microsystems, 105. DOI: 10.1016/j.micpro.2023.105005
    • (2023) Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 314-328. DOI: 10.1145/3582016.3582061
    • (2023) OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-15. DOI: 10.1145/3579371.3589038
    • (2023) TCX: A RISC Style Tensor Computing Extension and a Programmable Tensor Processor. ACM Transactions on Embedded Computing Systems, 22(3):1-27. DOI: 10.1145/3568310
    • (2023) SE-CNN: Convolution Neural Network Acceleration via Symbolic Value Prediction. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 13(1):73-85. DOI: 10.1109/JETCAS.2023.3244767
    • (2023) Reconfigurable FET Approximate Computing-based Accelerator for Deep Learning Applications. 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS46773.2023.10181758
    • (2023) MERCURY: Accelerating DNN Training By Exploiting Input Similarity. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 638-650. DOI: 10.1109/HPCA56546.2023.10071051
