Abstract
Sliding-window applications, an important class of the digital-signal processing domain, are highly amenable to pipeline parallelism on field-programmable gate arrays (FPGAs). Although memory bandwidth restricts parallelism for many applications, sliding-window applications can leverage custom buffers, referred to as sliding-window generators, that provide input bandwidth far exceeding the capabilities of external memory. Previous work has introduced a variety of sliding-window generators, but those approaches typically generate at most one window per cycle, which significantly restricts parallelism. In this article, we address this limitation with a parallel sliding-window generator that can generate a configurable number of windows every cycle. Although in practice the number of parallel windows is limited by memory bandwidth, we show that even with common bandwidth limitations, the presented generator enables near-linear speedups of up to 16x over previous FPGA studies that generate a single window per cycle, which in some cases were already faster than graphics-processing units and microprocessors.
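To make the idea of generating several windows per cycle concrete, the following is a minimal software sketch in Python, assuming a one-dimensional input and a stride of one sample. It is a behavioral model only, not the hardware architecture presented in the article; the names `parallel_windows`, `cycles_needed`, `window_size`, and `windows_per_cycle` are hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Sequence


def parallel_windows(samples: Sequence[int], window_size: int,
                     windows_per_cycle: int) -> Iterator[List[List[int]]]:
    """Yield, per modeled cycle, up to `windows_per_cycle` overlapping windows.

    Assumption: 1D input, stride 1. This is a software model, not an HDL design.
    """
    total = len(samples) - window_size + 1  # number of stride-1 windows
    for start in range(0, max(total, 0), windows_per_cycle):
        count = min(windows_per_cycle, total - start)
        # The `count` windows emitted here overlap, so together they touch
        # only window_size + count - 1 distinct samples.
        yield [list(samples[start + i:start + i + window_size])
               for i in range(count)]


def cycles_needed(num_samples: int, window_size: int,
                  windows_per_cycle: int) -> int:
    """Modeled cycles to emit every window (ceiling division)."""
    total = max(num_samples - window_size + 1, 0)
    return -(-total // windows_per_cycle)


if __name__ == "__main__":
    data = list(range(10))
    for cycle, windows in enumerate(parallel_windows(data, 3, 4)):
        print(cycle, windows)
    # One window per cycle versus four windows per cycle.
    print(cycles_needed(10, 3, 1), cycles_needed(10, 3, 4))  # prints "8 2"
```

In this model, emitting four windows of three samples per cycle consumes only four new samples per cycle, because consecutive windows share most of their elements. That reuse is why a buffer of this kind can supply far more window bandwidth than the external memory feeding it, and why the number of windows per cycle is ultimately bounded by how many new samples memory can deliver each cycle.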
Index Terms
- A Parallel Sliding-Window Generator for High-Performance Digital-Signal Processing on FPGAs