DOI: 10.1145/3174243.3174258

A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study

Published: 15 February 2018

Abstract

General Matrix-Matrix Multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and, more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single precision floating point and reduced precision workloads. Our framework handles arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software, and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching, interleaving and block size control, and fused deep learning specific operations. The framework currently supports single precision floating point (FP32); 16-, 8-, 4- and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary and BinaryxBinary.
We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to a 50x improvement in execution time compared to single precision floating point, and show that the runtime configuration options can improve the efficiency of certain layers in AlexNet by up to 4x, yielding an overall 1.3x improvement across the entire network.
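The abstract does not reproduce the framework's actual API, so the following is only a hypothetical C++ sketch of how the compile-time and runtime options described above (precision switching, block size and interleaving control, fused operations) might be exposed to host code. All names here (GemmConfig, Precision, fpga_gemm) are illustrative assumptions, not the framework's real interface, and the function body is a plain host-side reference loop rather than the FPGA datapath, kept only so the sketch is self-contained and compilable.

// Hypothetical sketch only: every identifier below is an assumption about what
// configuring such a GEMM hardware template from host code could look like.
#include <cstddef>
#include <vector>

// Data types the framework is described as supporting.
enum class Precision { FP32, INT16, INT8, INT4, INT2, INT16xTernary, INT8xTernary, Binary };

struct GemmConfig {
    Precision precision = Precision::FP32;  // runtime precision switching
    std::size_t block_m = 64;                // block size along M
    std::size_t block_n = 64;                // block size along N
    std::size_t interleave = 2;              // interleaving factor for the PE array
    bool fuse_relu = false;                  // fused deep-learning-specific operation
};

// Host-side reference for C = A(MxK) * B(KxN). On HARPv2 a call like this would
// instead hand shared-memory buffers to the FPGA datapath selected by cfg.
void fpga_gemm(const GemmConfig& cfg,
               const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            if (cfg.fuse_relu && acc < 0.0f) acc = 0.0f;  // fused activation
            C[i * N + j] = acc;
        }
}

int main() {
    const std::size_t M = 256, N = 256, K = 256;
    std::vector<float> A(M * K, 1.0f), B(K * N, 0.5f), C(M * N, 0.0f);

    GemmConfig cfg;
    cfg.precision = Precision::INT8;  // e.g. a reduced-precision convolution layer
    cfg.block_m = 128;                // runtime block size control
    cfg.fuse_relu = true;             // fuse the activation into the GEMM

    fpga_gemm(cfg, A, B, C, M, N, K);
    return 0;
}

In the real framework these options would select and parameterize the hardware template; here they merely gate a naive loop for illustration.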



Published In

FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2018
310 pages
ISBN:9781450356145
DOI:10.1145/3174243
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 February 2018


Author Tags

  1. deep learning
  2. fpga
  3. heterogeneous architectures
  4. neural networks
  5. reduced precision

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council Linkage Projects

Conference

FPGA '18

Acceptance Rates

FPGA '18 Paper Acceptance Rate: 10 of 116 submissions, 9%
Overall Acceptance Rate: 125 of 627 submissions, 20%



