Abstract
In this article, we present a new technique for optimizing loops that contain kernels mapped onto a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which relocates the function calls in the loop body so that, in every iteration of the transformed loop, software functions (running on the general-purpose processor, GPP) execute in parallel with multiple instances of the kernel (running on the FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation: unrolling combined with shifting, or either of the two alone. The method is based on profiling information about the kernel’s execution times on the GPP and the FPGA, its memory transfers, and its area utilization. In the experimental part, we apply the method to several kernels from loop nests extracted from real-life applications (DCT and SAD from the MPEG2 encoder, Quantizer from JPEG, and Sobel’s Convolution). We analyze the results, compare them with the theoretical maximum speedup given by Amdahl’s Law, and show when and how our transformations are beneficial.
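The combination of unrolling and shifting described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a loop body that calls a kernel and then a software function that consumes the kernel's result, with no dependencies between iterations. The names `kernel`, `sw`, `original_loop`, and `unrolled_shifted_loop` are hypothetical. The sketch runs sequentially and only shows that the transformed loop computes the same result; it does not model execution time, memory transfers, or area, and it does not choose the unroll factor.

```python
def original_loop(n, kernel, sw):
    """Reference loop: each iteration runs the kernel, then the software function."""
    out = []
    for i in range(n):
        k = kernel(i)          # would run on the FPGA
        out.append(sw(i, k))   # would run on the GPP
    return out


def unrolled_shifted_loop(n, u, kernel, sw):
    """Same computation, unrolled by u and shifted by one unrolled iteration.

    In each steady-state iteration, the u kernel calls for the next block do
    not depend on the u software calls for the previous block, so on a machine
    with a GPP and an FPGA the two groups could overlap. Here they simply run
    one after the other (assumption: iterations are independent).
    """
    if u <= 0 or n % u != 0:
        raise ValueError("this sketch requires u > 0 and n divisible by u")
    out = []
    if n == 0:
        return out

    # Prologue: kernel instances for the first block of u iterations.
    pending = [kernel(i) for i in range(u)]

    # Steady state: kernels of block b+1 alongside software of block b.
    for base in range(u, n, u):
        nxt = [kernel(i) for i in range(base, base + u)]   # FPGA side
        for i, k in zip(range(base - u, base), pending):   # GPP side
            out.append(sw(i, k))
        pending = nxt

    # Epilogue: software functions for the last block.
    for i, k in zip(range(n - u, n), pending):
        out.append(sw(i, k))
    return out
```

Calling `unrolled_shifted_loop(8, 2, kernel, sw)` with any side-effect-free `kernel` and `sw` should return the same list as `original_loop(8, kernel, sw)`. The prologue and epilogue are the cost of shifting: the first block's kernels and the last block's software calls have nothing to overlap with.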
Index Terms
Optimal Loop Unrolling and Shifting for Reconfigurable Architectures