Abstract
The advent of AI has driven the exploration of high-density, low-precision arithmetic on FPGAs. This has resulted in new methods of mapping both arithmetic functions and dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this article, we introduce the Stratix 10 NX device, an FPGA variant specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the matrix-matrix and vector-matrix multiplications common in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for Block FP16 and Block FP12 numerics. All additions and accumulations can be performed in INT32 or IEEE-754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal-processing requirements.
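The shared-exponent numerics can be illustrated in software. The sketch below is a minimal NumPy model of a Block FP16-style dot product, assuming an illustrative block size of 10 values and simple round-to-nearest quantization (not the hardware's exact rounding or datapath): each block carries INT8 mantissas with one common exponent, integer partial sums are formed within a block, and the results are combined in FP32, mirroring the INT32/FP32 accumulation options described above.

```python
import numpy as np

def block_fp16_quantize(x, mantissa_bits=8, block_size=10):
    """Quantize a vector into shared-exponent blocks (Block FP16-style).
    Block size and rounding are illustrative choices, not the hardware's
    exact parameters."""
    mantissas, exponents = [], []
    limit = 2 ** (mantissa_bits - 1) - 1               # e.g., 127 for INT8
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Shared exponent chosen so the largest magnitude fits in the mantissa.
        exp = int(np.ceil(np.log2(np.max(np.abs(block)) / limit + 1e-30)))
        m = np.clip(np.round(block / 2.0 ** exp), -limit - 1, limit).astype(np.int32)
        mantissas.append(m)
        exponents.append(exp)
    return mantissas, exponents

def block_fp_dot(a, b):
    """Dot product of two Block FP vectors: INT8 x INT8 products are
    accumulated as integers within each block, then scaled by the two
    shared exponents and summed in FP32."""
    am, ae = block_fp16_quantize(a)
    bm, be = block_fp16_quantize(b)
    acc = np.float32(0.0)
    for ma, ea, mb, eb in zip(am, ae, bm, be):
        int_acc = np.sum(ma * mb, dtype=np.int64)      # integer partial sum
        acc += np.float32(int_acc) * np.float32(2.0 ** (ea + eb))
    return acc

a = np.random.randn(40).astype(np.float32)
b = np.random.randn(40).astype(np.float32)
print(block_fp_dot(a, b), float(np.dot(a, b)))  # close, small quantization error
```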
In the AI market, the FPGA must compete directly with other types of devices rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements such as logic, memory, and DSP. We show that the feed-forward datapath structures needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed-grade device, even when all of the Tensor Blocks on the device are used. We also show a full-chip NPU processor implementation that outperforms GPUs at the same process node for a variety of AI inferencing workloads, even though it runs at a lower operating frequency of 365 MHz.
In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8 TOPS / Block FP16 TFLOPS or 286 INT4 TOPS / Block FP12 TFLOPS. Depending on the configuration, power efficiency is in the range of 1–4 TOPS/W or TFLOPS/W.
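Peak throughput of this kind follows from multipliers per block, number of blocks, two operations per multiplier (multiply and accumulate), and clock frequency. The back-of-envelope check below uses publicly reported figures for the largest Stratix 10 NX device (roughly 3,960 Tensor Blocks, 30 INT8 multipliers per block, and a ~600 MHz clock); these parameters are assumptions not stated in the text above, and exact counts depend on device and speed grade.

```python
# Assumed, publicly reported figures for the largest Stratix 10 NX device.
tensor_blocks = 3960      # AI Tensor Blocks on the device (assumption)
int8_mults    = 30        # INT8 multipliers per Tensor Block (assumption)
fmax_ghz      = 0.600     # assumed operating frequency in GHz

ops_per_cycle  = tensor_blocks * int8_mults * 2       # multiply + accumulate
peak_int8_tops = ops_per_cycle * fmax_ghz / 1e3       # Gop/s -> Top/s
peak_int4_tops = 2 * peak_int8_tops                   # INT4 packs 2x per multiplier

print(f"INT8: {peak_int8_tops:.0f} TOPS, INT4: {peak_int4_tops:.0f} TOPS")
# -> approximately 143 INT8 TOPS and 285 INT4 TOPS, matching the quoted
#    143 / 286 figures within rounding.
```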
Index Terms
- Stratix 10 NX Architecture