A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores

Pedram, Ardavan; McCalpin, John D.; Gerstlauer, Andreas

doi:10.1007/s11265-014-0896-x

Published: 26 June 2014

Journal of Signal Processing Systems, Volume 77, pages 169–190 (2014)

Ardavan Pedram¹,
John D. McCalpin² &
Andreas Gerstlauer¹

Abstract

FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.
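
As a rough reading of these headline figures, the implied power and area budgets can be estimated by simple division. This is only a back-of-envelope sketch: it assumes that both efficiency values are quoted against the same 388 GFLOPS nominal rate, which the abstract does not state explicitly, and the variable names are illustrative.

```python
# Back-of-envelope estimate from the abstract's headline numbers.
# Assumption (not stated in the abstract): both efficiencies are
# measured against the same 388 GFLOPS nominal rate.
NOMINAL_GFLOPS = 388.0   # sustained nominal double-precision rate
GFLOPS_PER_WATT = 30.0   # reported power efficiency
GFLOPS_PER_MM2 = 2.66    # reported area efficiency
NUM_CORES = 16           # hybrid cores in the evaluated configuration

implied_power_w = NOMINAL_GFLOPS / GFLOPS_PER_WATT
implied_area_mm2 = NOMINAL_GFLOPS / GFLOPS_PER_MM2
gflops_per_core = NOMINAL_GFLOPS / NUM_CORES

print(f"implied power: {implied_power_w:.1f} W")       # 12.9 W
print(f"implied area:  {implied_area_mm2:.1f} mm^2")   # 145.9 mm^2
print(f"per-core rate: {gflops_per_core:.2f} GFLOPS")  # 24.25 GFLOPS
```

Under that assumption the accelerator would draw about 13 W and occupy roughly 146 mm². Because the published efficiencies are rounded, these values are approximate.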

Notes

To avoid confusion in this paper, we distinguish between “local twiddle factors” used in the radix-4 operator and “global twiddle factors” used in the four-step method.
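
To make the role of the global twiddle factors concrete, the following is a minimal sketch of the textbook four-step decomposition of a length-N transform with N = n1 · n2. It is not the paper's hardware mapping: a naive O(n²) DFT stands in for the radix-4 kernel (where the local twiddle factors would appear), and the function names `dft` and `four_step_fft` are hypothetical.

```python
import cmath

def dft(x):
    """Naive DFT, used here as a stand-in for the small-FFT kernel."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the four-step method.

    Input index is split as n2*a + b and output index as c + n1*d,
    with a, c in [0, n1) and b, d in [0, n2).
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 independent DFTs of length n1 (over a, for each b).
    cols = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]
    # Step 2: multiply by the global twiddle factors exp(-2*pi*i*b*c/n).
    for b in range(n2):
        for c in range(n1):
            cols[b][c] *= cmath.exp(-2j * cmath.pi * b * c / n)
    # Step 3: transpose so that each row holds one value of c.
    rows = [[cols[b][c] for b in range(n2)] for c in range(n1)]
    # Step 4: n1 independent DFTs of length n2 (over b, for each c).
    out = [0j] * n
    for c in range(n1):
        y = dft(rows[c])
        for d in range(n2):
            out[c + n1 * d] = y[d]
    return out

if __name__ == "__main__":
    data = [complex(i % 5, -(i % 3)) for i in range(16)]
    err = max(abs(p - q) for p, q in zip(dft(data), four_step_fft(data, 4, 4)))
    print(f"max deviation from direct DFT: {err:.2e}")
```

The two batches of short transforms in steps 1 and 4 are independent of one another, which is what makes this decomposition attractive for distributing work across cores. The twiddle multiplication and the transpose between them are the only steps that couple the batches.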

Acknowledgments

The authors wish to thank John Brunhaver for providing synthesis results for the raw components of the Transposer.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
Ardavan Pedram & Andreas Gerstlauer
Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX, USA
John D. McCalpin

Corresponding author

Correspondence to Ardavan Pedram.

Additional information

This research was partially sponsored by NSF grants CCF-1218483 (Pedram and Gerstlauer), CCF-1240652 and ACI-1134872 (McCalpin) and also NASA grant NNX08AD58G (Pedram). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF) or the National Aeronautics and Space Administration (NASA).

About this article

Cite this article

Pedram, A., McCalpin, J.D. & Gerstlauer, A. A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores. J Sign Process Syst 77, 169–190 (2014). https://doi.org/10.1007/s11265-014-0896-x

Received: 16 September 2013
Revised: 14 March 2014
Accepted: 10 April 2014
Published: 26 June 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s11265-014-0896-x
