
Stratix 10 NX Architecture

Published: 08 August 2022

Abstract

The advent of AI has driven the exploration of high-density low-precision arithmetic on FPGAs. This has resulted in new methods of mapping both arithmetic functions and dataflows onto the fabric, as well as changes to the embedded DSP Blocks. Technologies outside of the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this article, we introduce the Stratix 10 NX device, an FPGA variant specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the matrix-matrix and vector-matrix multiplications common in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support that enables block FP16 and block FP12 numerics. All additions/accumulations can be performed in INT32 or IEEE-754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal processing requirements.
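
To make the shared-exponent idea concrete, the following is a minimal software sketch of a block FP16-style dot product: every element of a block shares one exponent, so the multipliers and the accumulator operate purely on integers (mirroring the Tensor Block's INT8 multiplier array with INT32 accumulation), and a single scale is applied at the end. This is an illustrative numerical model under our own assumptions (function names, 8-bit mantissas, the 10-element block size), not the hardware encoding.

    import numpy as np

    def quantize_block_fp(values, mantissa_bits=8):
        """Quantize a 1-D block of floats to a shared-exponent (block FP) form.

        One exponent is chosen per block from its largest magnitude; each
        element keeps only a signed integer mantissa. Illustrative model,
        not the Stratix 10 NX hardware spec.
        """
        max_abs = np.max(np.abs(values))
        if max_abs == 0.0:
            return np.zeros_like(values, dtype=np.int32), 0
        # Shared exponent: scale the largest element to fill the mantissa range.
        shared_exp = int(np.floor(np.log2(max_abs))) - (mantissa_bits - 2)
        mantissas = np.clip(np.round(values / 2.0**shared_exp),
                            -(2**(mantissa_bits - 1)),
                            2**(mantissa_bits - 1) - 1)
        return mantissas.astype(np.int32), shared_exp

    def block_fp_dot(a, b, mantissa_bits=8):
        """Dot product in block FP: integer MACs, one final exponent scale."""
        ma, ea = quantize_block_fp(np.asarray(a, dtype=np.float64), mantissa_bits)
        mb, eb = quantize_block_fp(np.asarray(b, dtype=np.float64), mantissa_bits)
        acc = int(np.dot(ma, mb))        # integer (INT32-style) accumulation
        return acc * 2.0**(ea + eb)      # exponents applied once, at the end

    # Example: compare against the exact double-precision dot product.
    rng = np.random.default_rng(0)
    x, w = rng.normal(size=10), rng.normal(size=10)
    print(block_fp_dot(x, w), float(np.dot(x, w)))

The design point this models is why shared exponents are cheap in hardware: the per-element alignment and normalization of true floating point disappear from the inner loop, leaving only fixed-point multipliers and adders, with one exponent addition and scale per block.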

In the AI market, the FPGA must compete directly with other types of devices rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements such as logic, memory, and DSP. We show that the feed-forward datapath structures needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed-grade device, even when all of the Tensor Blocks on the device are used. We also show a full-chip NPU processor implementation that outperforms GPUs at the same process node for a variety of AI inference workloads, even though it has a lower operating frequency of 365 MHz.

In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8/block FP16 TOPS/TFLOPS or 286 INT4/block FP12 TOPS/TFLOPS. Depending on the configuration, power efficiency is in the range of 1–4 TOPS/W or TFLOPS/W.
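
As a sanity check, the headline INT8 figure follows from the multiplier count and clock rate. Assuming the commonly cited largest-device configuration of 3,960 Tensor Blocks, each performing 30 INT8 multiply-accumulates per cycle at 600 MHz (figures not stated in this abstract, so treat them as our assumption):

    3,960 blocks × 30 MACs/block × 2 ops/MAC × 600 MHz ≈ 142.6 × 10^12 ops/s ≈ 143 TOPS

Halving the precision to INT4 doubles the effective multiplier density, which yields the 286 TOPS figure.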



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
December 2022, 476 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3540252
Editor: Deming Chen


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 September 2021
• Revised: 1 January 2022
• Accepted: 1 February 2022
• Online AM: 14 March 2022
• Published: 8 August 2022
