A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores

Journal of Signal Processing Systems

Abstract

FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.
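The headline figures can be turned into a rough power and area budget. The sketch below is back-of-the-envelope arithmetic only: it assumes that both efficiency figures are defined as the sustained 388 GFLOPS divided by total power and total area at the same operating point, which the abstract does not state explicitly. The function name `implied_budget` is hypothetical.

```python
# Figures quoted in the abstract for the 16-core, 12 MiB configuration.
PERF_GFLOPS = 388.0   # sustained nominal double-precision performance
POWER_EFF = 30.0      # GFLOPS per watt
AREA_EFF = 2.66       # GFLOPS per square millimetre
CORES = 16


def implied_budget(perf, power_eff, area_eff, cores):
    """Derive power, area and per-core throughput from the quoted ratios.

    Assumption (not stated in the abstract): each efficiency is simply
    performance divided by the total power or total area of the accelerator.
    """
    power_w = perf / power_eff      # watts
    area_mm2 = perf / area_eff      # square millimetres
    per_core = perf / cores         # GFLOPS per core
    return power_w, area_mm2, per_core


power_w, area_mm2, per_core = implied_budget(PERF_GFLOPS, POWER_EFF, AREA_EFF, CORES)
print(f"implied power: {power_w:.1f} W")
print(f"implied area:  {area_mm2:.1f} mm^2")
print(f"per core:      {per_core:.2f} GFLOPS")
```

Under that assumption the quoted efficiencies correspond to roughly 13 W and about 146 mm² for the whole accelerator, or about 24 GFLOPS per core.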

Notes

  1. To avoid confusion in this paper, we distinguish between “local twiddle factors” used in the radix-4 operator and “global twiddle factors” used in the four-step method.
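The role of the global twiddle factors can be illustrated with a small software sketch of the four-step method. This is not the paper's hardware algorithm: the naive `dft` helper stands in for whatever small transform a core would run (for example radix-4 butterflies, which would use their own local twiddle factors), and the names `dft` and `four_step_fft` and the index layout are assumptions chosen for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, standing in for the small per-core transform."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the four-step decomposition.

    Input index  j = j1 + n1*j2, output index k = n2*k1 + k2,
    with j1, k1 in [0, n1) and j2, k2 in [0, n2).
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: n1 independent DFTs of length n2 over strided input.
    a = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]

    # Step 2: multiply by the global twiddle factors exp(-2*pi*i*j1*k2/n).
    for j1 in range(n1):
        for k2 in range(n2):
            a[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Step 3: n2 independent DFTs of length n1 across the first index.
    # Step 4: write results back in natural order (the transpose step).
    out = [0j] * n
    for k2 in range(n2):
        col = dft([a[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = col[k1]
    return out


# Check against the direct DFT on a 16-point example split as 4 x 4.
x = [complex(i % 5, -(i % 3)) for i in range(16)]
err = max(abs(p - q) for p, q in zip(dft(x), four_step_fft(x, 4, 4)))
print(err < 1e-9)
```

Only step 2 touches the global twiddle factors; any twiddle factors inside the small transforms of steps 1 and 3 are local to those transforms, which is the distinction the note draws.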

Acknowledgments

The authors wish to thank John Brunhaver for providing synthesis results for the raw components of the Transposer.

Author information

Corresponding author

Correspondence to Ardavan Pedram.

Additional information

This research was partially sponsored by NSF grants CCF-1218483 (Pedram and Gerstlauer), CCF-1240652 and ACI-1134872 (McCalpin) and also NASA grant NNX08AD58G (Pedram). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF) or the National Aeronautics and Space Administration (NASA).

About this article

Cite this article

Pedram, A., McCalpin, J.D. & Gerstlauer, A. A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores. J Sign Process Syst 77, 169–190 (2014). https://doi.org/10.1007/s11265-014-0896-x

  • DOI: https://doi.org/10.1007/s11265-014-0896-x
