ABSTRACT
To leverage the vast parallelism of loops, embedded loop accelerators often take the form of processor arrays with many simple processing elements. Each processing element executes a subset of a loop's iterations, exploiting instruction- and data-level parallelism by tightly scheduling iterations via software pipelining and packing instructions into compact, individual programs. However, loop bounds are often unknown until runtime, which complicates the static generation of programs because the bounds influence each program's control flow.
Existing solutions, like generating and storing all possible programs or full just-in-time compilation, are prohibitively expensive, especially in embedded systems. As a remedy, we propose a hybrid approach introducing a tree-like program representation, whose generation front-loads all intractable sub-problems to compile time, and from which all concrete program variants can efficiently be stitched together at runtime. The tree consists of so-called polyhedral fragments that represent concrete program parts and are annotated with iteration-dependent conditions.
We show that this representation is both space- and time-efficient: it requires polynomial space to store---whereas storing all possibly generated programs is non-polynomial---and polynomial time to evaluate---whereas just-in-time compilation requires solving NP-hard problems. In a case study, we show for a representative loop program that using a tree of polyhedral fragments saves 98.88 % of space compared to storing all program variants.
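The runtime stitching described above can be illustrated with a minimal sketch. The following Python snippet is purely hypothetical and not the paper's implementation: each tree node carries a concrete program fragment and an iteration-dependent condition on the loop bounds; at runtime, the tree is walked once, and only the fragments whose conditions hold for the now-known bounds are concatenated into the final program.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of a tree of "polyhedral fragments": a node holds a
# concrete program fragment (here, a list of instruction strings) and a
# condition over the loop bounds that are only known at runtime.
@dataclass
class FragmentNode:
    condition: Callable[[Dict[str, int]], bool] = lambda bounds: True
    fragment: List[str] = field(default_factory=list)
    children: List["FragmentNode"] = field(default_factory=list)

def stitch(node: FragmentNode, bounds: Dict[str, int]) -> List[str]:
    """Stitch a concrete program variant by walking the tree and keeping
    only the fragments whose conditions hold for the given loop bounds."""
    if not node.condition(bounds):
        return []
    program = list(node.fragment)
    for child in node.children:
        program.extend(stitch(child, bounds))
    return program

# Example tree for a software-pipelined loop: the prologue and epilogue are
# always emitted, the steady-state kernel only if the trip count N suffices.
tree = FragmentNode(fragment=["prologue"], children=[
    FragmentNode(condition=lambda b: b["N"] >= 4, fragment=["kernel"]),
    FragmentNode(fragment=["epilogue"]),
])
print(stitch(tree, {"N": 8}))  # ['prologue', 'kernel', 'epilogue']
print(stitch(tree, {"N": 2}))  # ['prologue', 'epilogue']
```

Because evaluating each condition and concatenating fragments takes time linear in the tree size, this stitching step stays polynomial, consistent with the complexity claim above.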