DOI: 10.1145/3240765.3240856
Research article

TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference

Published: 05 November 2018

Abstract

FPGAs have been increasingly adopted in recent years as reconfigurable hardware accelerators for applications that leverage convolutional neural networks (CNNs). Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization because the tensor shapes of different layers vary widely. Consequently, designs equipped with heterogeneous accelerators tailored to different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput: they overlap the execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design that supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency, and a partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% higher performance than homogeneous designs and up to a 3x latency reduction over state-of-the-art designs.
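The latency argument in the abstract can be illustrated with a toy analytical model. The sketch below is not the paper's method, only a back-of-the-envelope comparison under simplifying assumptions (synchronous tile handoff, no inter-tile data dependencies or halo overlap, hypothetical per-tile stage times): if each heterogeneous accelerator must finish all tiles of an image before passing it on, single-image latency is the sum of full per-stage times, whereas pipelining at tile granularity hides most of that behind the slowest stage.

```python
# Toy latency model: tile-grained vs. image-grained pipelining across
# heterogeneous accelerators. All numbers are hypothetical.

def image_grained_latency(stage_times, num_tiles):
    """Each accelerator processes ALL tiles of the image before the next
    accelerator starts (pipelining happens only across different images,
    so a single image sees the full work of every stage serially)."""
    return sum(t * num_tiles for t in stage_times)

def tile_grained_latency(stage_times, num_tiles):
    """Tiles of one image stream through the accelerator pipeline; after
    the first tile fills the pipeline, the steady-state rate is set by
    the slowest stage (synchronous handoff assumed)."""
    fill = sum(stage_times)                      # first tile traverses all stages
    steady = (num_tiles - 1) * max(stage_times)  # remaining tiles drain behind it
    return fill + steady

stages = [40, 55, 30, 25]  # hypothetical per-tile cycles on 4 accelerators
tiles = 16
print(image_grained_latency(stages, tiles))  # 2400
print(tile_grained_latency(stages, tiles))   # 150 + 15*55 = 975
```

Even this crude model shows why pipelining tiles of one image (rather than whole images) reduces single-image latency: the gap widens as the per-stage times become more balanced, which is exactly what a heterogeneous, per-layer-tailored design aims for.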




Published In

2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
November 2018, 939 pages
Publisher: IEEE Press


Cited By

  • (2024) ECHO: Energy-Efficient Computation Harnessing Online Arithmetic—An MSDF-Based Accelerator for DNN Inference. Electronics 13(10), 1893. DOI: 10.3390/electronics13101893. 11 May 2024.
  • (2024) Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 58-67. DOI: 10.1145/3677333.3678153. 12 Aug 2024.
  • (2024) HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 215-230. DOI: 10.1145/3617232.3624850. 27 Apr 2024.
  • (2024) Unleashing Network/Accelerator Co-Exploration Potential on FPGAs: A Deeper Joint Search. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(10), 3041-3054. DOI: 10.1109/TCAD.2024.3391688. Oct 2024.
  • (2024) DeepShield: Lightweight Privacy-Preserving Inference for Real-Time IoT Botnet Detection. 2024 IEEE 37th International System-on-Chip Conference (SOCC), 1-6. DOI: 10.1109/SOCC62300.2024.10737827. 16 Sep 2024.
  • (2024) FPGA Codec System of Learned Image Compression With Algorithm-Architecture Co-Optimization. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 14(2), 334-347. DOI: 10.1109/JETCAS.2024.3386328. Jun 2024.
  • (2024) DNNMapper: An Elastic Framework for Mapping DNNs to Multi-die FPGAs. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS58744.2024.10558120. 19 May 2024.
  • (2024) PipeFuser: Building Flexible Pipeline Architecture for DNN Accelerators via Layer Fusion. 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 921-926. DOI: 10.1109/ASP-DAC58780.2024.10473790. 22 Jan 2024.
  • (2023) FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 295-310. DOI: 10.1145/3575693.3575747. 27 Jan 2023.
  • (2023) FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA. ACM Transactions on Reconfigurable Technology and Systems 16(2), 1-32. DOI: 10.1145/3570928. 11 Mar 2023.
