ABSTRACT
Iterative Stencil Loops (ISLs) are a class of algorithms of great importance because they appear in many industrial and scientific computing applications, such as numerical methods for solving partial differential equations (e.g. reverse time migration and heat distribution simulation) and cellular automata (used, for instance, for random number generation and error correction). In this work, we propose a hardware acceleration methodology based on the polyhedral model and implement the related framework to automatically accelerate ISLs on a multi-FPGA system. The experimental evaluation shows that the throughput obtained by our solution scales linearly with the amount of resources used on the FPGAs, that the power efficiency increases proportionally to the amount of instantiated computation, and that our solution outperforms the power efficiency of state-of-the-art ISL implementations running on an Intel Xeon CPU by up to 10×. A key aspect of this approach is that no knowledge of the underlying architecture is required of the application designer, as no code refactoring is needed to make the application suitable for processing by our framework.
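For readers unfamiliar with the term, the following is a minimal sketch of what an ISL looks like in software, using a one-dimensional heat distribution update as the example. It is not taken from the paper or its framework: the function name `heat_stencil_1d`, the three-point stencil, the fixed boundary values, and the coefficient `alpha` are all assumptions chosen for illustration.

```python
def heat_stencil_1d(u, steps, alpha=0.25):
    """Run `steps` sweeps of a 3-point heat stencil over the list `u`.

    Illustrative only: the name, the stencil shape, and `alpha` are
    assumptions, not the kernel used in the paper.
    """
    cur = list(u)
    for _ in range(steps):              # outer "iterative" time loop
        nxt = cur[:]                    # boundary cells are copied unchanged
        for i in range(1, len(cur) - 1):
            # each new value depends only on a fixed neighbourhood
            # (the "stencil") of values from the previous sweep
            nxt[i] = cur[i] + alpha * (cur[i - 1] - 2 * cur[i] + cur[i + 1])
        cur = nxt
    return cur


if __name__ == "__main__":
    print(heat_stencil_1d([0.0, 0.0, 4.0, 0.0, 0.0], 1))
```

The structure to note is the outer time loop wrapped around a sweep in which every point is updated from a fixed pattern of neighbours. This regular, statically known access pattern is the kind of loop nest that the polyhedral model can analyse.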
Index Terms
- A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops