Abstract
The advent of AI has driven the exploration of high-density, low-precision arithmetic on FPGAs. This has resulted in new methods of mapping both arithmetic functions and dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this article, we introduce the Stratix 10 NX device, an FPGA variant specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the matrix-matrix and vector-matrix multiplications common in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for Block FP16 and Block FP12 numerics. All additions and accumulations can be performed in INT32 or IEEE-754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal-processing requirements.
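The shared-exponent numerics can be illustrated in software. The sketch below is a minimal NumPy model of a Block FP16-style dot product, assuming an illustrative block size of 10 values and simple round-to-nearest quantization (not the hardware's exact rounding or datapath): each block carries INT8 mantissas with one common exponent, integer partial sums are formed within a block, and the results are combined in FP32, mirroring the INT32/FP32 accumulation options described above.

```python
import numpy as np

def block_fp16_quantize(x, mantissa_bits=8, block_size=10):
    """Quantize a vector into shared-exponent blocks (Block FP16-style).
    Block size and rounding are illustrative choices, not the hardware's
    exact parameters."""
    mantissas, exponents = [], []
    limit = 2 ** (mantissa_bits - 1) - 1               # e.g., 127 for INT8
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Shared exponent chosen so the largest magnitude fits in the mantissa.
        exp = int(np.ceil(np.log2(np.max(np.abs(block)) / limit + 1e-30)))
        m = np.clip(np.round(block / 2.0 ** exp), -limit - 1, limit).astype(np.int32)
        mantissas.append(m)
        exponents.append(exp)
    return mantissas, exponents

def block_fp_dot(a, b):
    """Dot product of two Block FP vectors: INT8 x INT8 products are
    accumulated as integers within each block, then scaled by the two
    shared exponents and summed in FP32."""
    am, ae = block_fp16_quantize(a)
    bm, be = block_fp16_quantize(b)
    acc = np.float32(0.0)
    for ma, ea, mb, eb in zip(am, ae, bm, be):
        int_acc = np.sum(ma * mb, dtype=np.int64)      # integer partial sum
        acc += np.float32(int_acc) * np.float32(2.0 ** (ea + eb))
    return acc

a = np.random.randn(40).astype(np.float32)
b = np.random.randn(40).astype(np.float32)
print(block_fp_dot(a, b), float(np.dot(a, b)))  # close, small quantization error
```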
In the AI market, the FPGA must compete directly with other types of devices rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements such as logic, memory, and DSP. We show that the feed-forward datapath structures needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed-grade device, even when all of the Tensor Blocks on the device are used. We also show a full-chip NPU processor implementation that outperforms GPUs at the same process node for a variety of AI inferencing workloads, even though it runs at a lower operating frequency of 365 MHz.
In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8 TOPS / Block FP16 TFLOPS or 286 INT4 TOPS / Block FP12 TFLOPS. Depending on the configuration, power efficiency is in the range of 1–4 TOPS/W or TFLOPS/W.
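Peak throughput of this kind follows from multipliers per block, number of blocks, two operations per multiplier (multiply and accumulate), and clock frequency. The back-of-envelope check below uses publicly reported figures for the largest Stratix 10 NX device (roughly 3,960 Tensor Blocks, 30 INT8 multipliers per block, and a ~600 MHz clock); these parameters are assumptions not stated in the text above, and exact counts depend on device and speed grade.

```python
# Assumed, publicly reported figures for the largest Stratix 10 NX device.
tensor_blocks = 3960      # AI Tensor Blocks on the device (assumption)
int8_mults    = 30        # INT8 multipliers per Tensor Block (assumption)
fmax_ghz      = 0.600     # assumed operating frequency in GHz

ops_per_cycle  = tensor_blocks * int8_mults * 2       # multiply + accumulate
peak_int8_tops = ops_per_cycle * fmax_ghz / 1e3       # Gop/s -> Top/s
peak_int4_tops = 2 * peak_int8_tops                   # INT4 packs 2x per multiplier

print(f"INT8: {peak_int8_tops:.0f} TOPS, INT4: {peak_int4_tops:.0f} TOPS")
# -> approximately 143 INT8 TOPS and 285 INT4 TOPS, matching the quoted
#    143 / 286 figures within rounding.
```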
Index Terms
- Stratix 10 NX Architecture