Abstract
With the advent of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has a natural advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which lets loop iterations flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to resolve iteration dependencies. Together, these two techniques increase the average number of instructions ready to execute per cycle, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.10x on average, respectively, while the hardware cost of the two techniques remains acceptable.
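The efficiency gain of pipelining iterations through a PE array can be illustrated with a toy cycle-count model. This is a conceptual sketch, not the paper's implementation: the stage counts, cycle formulas, and function names below are illustrative assumptions. It contrasts a serial model, where each iteration drains the whole PE chain before the next one starts, with a pipelined model, where a new iteration enters the array every cycle once the pipeline is full.

```python
def serial_cycles(n_iters: int, n_stages: int) -> int:
    # Serial model: an iteration occupies the full PE chain,
    # so total time is iterations x pipeline depth.
    return n_iters * n_stages


def pipelined_cycles(n_iters: int, n_stages: int) -> int:
    # Pipelined model: after a fill phase of n_stages cycles,
    # one iteration completes per cycle.
    return n_stages + n_iters - 1


def fpu_efficiency(n_iters: int, total_cycles: int) -> float:
    # Fraction of cycles in which the final-stage FPU retires useful work.
    return n_iters / total_cycles


n, stages = 100, 4
serial = serial_cycles(n, stages)      # 100 * 4 = 400 cycles
piped = pipelined_cycles(n, stages)    # 4 + 100 - 1 = 103 cycles
print(fpu_efficiency(n, piped) / fpu_efficiency(n, serial))
```

With these toy numbers the pipelined model keeps the FPU busy on almost every cycle (100/103), while the serial model uses it only one cycle in four, which is the kind of utilization gap PLO targets.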
Tan, X., Ye, XC., Shen, XW. et al. A Pipelining Loop Optimization Method for Dataflow Architecture. J. Comput. Sci. Technol. 33, 116–130 (2018). https://doi.org/10.1007/s11390-017-1748-5