DOI: 10.1145/3174243.3174258

A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study

Published: 15 February 2018

Abstract

General Matrix-Matrix Multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and, more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single precision floating point and reduced precision workloads. Our framework handles arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software, and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching, interleaving and block size control, and fused deep learning specific operations. The framework currently supports single precision floating point (FP32); 16-, 8-, 4- and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary and BinaryxBinary.
We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to a 50x improvement in execution time compared to single precision floating point, and show that the runtime configuration options can improve the efficiency of certain layers in AlexNet by up to 4x, yielding an overall 1.3x improvement across the entire network.
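The abstract does not reproduce the framework's actual API, so the following is only a hypothetical C++ sketch of how the compile-time and runtime options described above (precision switching, block size and interleaving control, fused operations) might be exposed to host code. All names here (GemmConfig, Precision, fpga_gemm) are illustrative assumptions, not the framework's real interface, and the function body is a plain host-side reference loop rather than the FPGA datapath, kept only so the sketch is self-contained and compilable.

// Hypothetical sketch only: every identifier below is an assumption about what
// configuring such a GEMM hardware template from host code could look like.
#include <cstddef>
#include <vector>

// Data types the framework is described as supporting.
enum class Precision { FP32, INT16, INT8, INT4, INT2, INT16xTernary, INT8xTernary, Binary };

struct GemmConfig {
    Precision precision = Precision::FP32;  // runtime precision switching
    std::size_t block_m = 64;                // block size along M
    std::size_t block_n = 64;                // block size along N
    std::size_t interleave = 2;              // interleaving factor for the PE array
    bool fuse_relu = false;                  // fused deep-learning-specific operation
};

// Host-side reference for C = A(MxK) * B(KxN). On HARPv2 a call like this would
// instead hand shared-memory buffers to the FPGA datapath selected by cfg.
void fpga_gemm(const GemmConfig& cfg,
               const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            if (cfg.fuse_relu && acc < 0.0f) acc = 0.0f;  // fused activation
            C[i * N + j] = acc;
        }
}

int main() {
    const std::size_t M = 256, N = 256, K = 256;
    std::vector<float> A(M * K, 1.0f), B(K * N, 0.5f), C(M * N, 0.0f);

    GemmConfig cfg;
    cfg.precision = Precision::INT8;  // e.g. a reduced-precision convolution layer
    cfg.block_m = 128;                // runtime block size control
    cfg.fuse_relu = true;             // fuse the activation into the GEMM

    fpga_gemm(cfg, A, B, C, M, N, K);
    return 0;
}

In the real framework these options would select and parameterize the hardware template; here they merely gate a naive loop for illustration.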



Published In

FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2018
310 pages
ISBN:9781450356145
DOI:10.1145/3174243
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 February 2018


Author Tags

  1. deep learning
  2. fpga
  3. heterogeneous architectures
  4. neural networks
  5. reduced precision

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council Linkage Projects

Conference

FPGA '18

Acceptance Rates

FPGA '18 Paper Acceptance Rate: 10 of 116 submissions, 9%
Overall Acceptance Rate: 125 of 627 submissions, 20%



