DOI: 10.1145/3240765.3240856
Research article

TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference

Published: 05 November 2018

Abstract

FPGAs have been increasingly adopted in recent years as reconfigurable hardware accelerators for applications that leverage convolutional neural networks (CNNs). Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization because the tensor shapes of different layers vary widely. Consequently, designs equipped with heterogeneous accelerators tailored to different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput: they overlap the execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design that supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency, and a partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% higher performance than homogeneous designs and up to a 3x latency reduction over state-of-the-art designs.
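The latency argument in the abstract can be illustrated with a toy analytical model. The sketch below is not the paper's method, only a back-of-the-envelope comparison under simplifying assumptions (synchronous tile handoff, no inter-tile data dependencies or halo overlap, hypothetical per-tile stage times): if each heterogeneous accelerator must finish all tiles of an image before passing it on, single-image latency is the sum of full per-stage times, whereas pipelining at tile granularity hides most of that behind the slowest stage.

```python
# Toy latency model: tile-grained vs. image-grained pipelining across
# heterogeneous accelerators. All numbers are hypothetical.

def image_grained_latency(stage_times, num_tiles):
    """Each accelerator processes ALL tiles of the image before the next
    accelerator starts (pipelining happens only across different images,
    so a single image sees the full work of every stage serially)."""
    return sum(t * num_tiles for t in stage_times)

def tile_grained_latency(stage_times, num_tiles):
    """Tiles of one image stream through the accelerator pipeline; after
    the first tile fills the pipeline, the steady-state rate is set by
    the slowest stage (synchronous handoff assumed)."""
    fill = sum(stage_times)                      # first tile traverses all stages
    steady = (num_tiles - 1) * max(stage_times)  # remaining tiles drain behind it
    return fill + steady

stages = [40, 55, 30, 25]  # hypothetical per-tile cycles on 4 accelerators
tiles = 16
print(image_grained_latency(stages, tiles))  # 2400
print(tile_grained_latency(stages, tiles))   # 150 + 15*55 = 975
```

Even this crude model shows why pipelining tiles of one image (rather than whole images) reduces single-image latency: the gap widens as the per-stage times become more balanced, which is exactly what a heterogeneous, per-layer-tailored design aims for.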




Published In

2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
November 2018, 939 pages
Publisher: IEEE Press


Cited By

  • (2024) ECHO: Energy-Efficient Computation Harnessing Online Arithmetic—An MSDF-Based Accelerator for DNN Inference. Electronics 13(10), 1893. DOI: 10.3390/electronics13101893. 11 May 2024.
  • (2024) Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 58-67. DOI: 10.1145/3677333.3678153. 12 Aug 2024.
  • (2024) HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 215-230. DOI: 10.1145/3617232.3624850. 27 Apr 2024.
  • (2024) Unleashing Network/Accelerator Co-Exploration Potential on FPGAs: A Deeper Joint Search. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(10), 3041-3054. DOI: 10.1109/TCAD.2024.3391688. Oct 2024.
  • (2024) DeepShield: Lightweight Privacy-Preserving Inference for Real-Time IoT Botnet Detection. 2024 IEEE 37th International System-on-Chip Conference (SOCC), 1-6. DOI: 10.1109/SOCC62300.2024.10737827. 16 Sep 2024.
  • (2024) FPGA Codec System of Learned Image Compression With Algorithm-Architecture Co-Optimization. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 14(2), 334-347. DOI: 10.1109/JETCAS.2024.3386328. Jun 2024.
  • (2024) DNNMapper: An Elastic Framework for Mapping DNNs to Multi-die FPGAs. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS58744.2024.10558120. 19 May 2024.
  • (2024) PipeFuser: Building Flexible Pipeline Architecture for DNN Accelerators via Layer Fusion. 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 921-926. DOI: 10.1109/ASP-DAC58780.2024.10473790. 22 Jan 2024.
  • (2023) FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 295-310. DOI: 10.1145/3575693.3575747. 27 Jan 2023.
  • (2023) FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA. ACM Transactions on Reconfigurable Technology and Systems 16(2), 1-32. DOI: 10.1145/3570928. 11 Mar 2023.
