FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Journal of Signal Processing Systems

Abstract

Fast Fourier transform (FFT) algorithms perform poorly on large data sets across a range of platforms because of inefficient strided memory access patterns. These access patterns must be reshaped to achieve high-performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs for a generic machine model with a two-level memory hierarchy that requires block data transfers, and we derive algorithms with efficient memory access patterns using custom block data layouts. These algorithms must be carefully mapped to the target platform’s architecture, particularly its memory subsystem, to fully exploit its performance and energy efficiency potential. Using the Kronecker product formalism, we integrate our optimizations into the Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and the design space of their hardware implementations via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space spanning various problem parameters (single- and double-precision 1D, 2D and 3D FFTs) and hardware platform parameters (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.). We show that Spiral-generated Pareto-optimal designs can achieve close to the theoretical peak performance of the target platform, offering 6x and 6.5x improvements in system performance and power efficiency, respectively, over conventional row-column FFT algorithms.
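To illustrate why a block data layout can help the strided column stage of a row-column 2D FFT, the following minimal sketch counts how many memory blocks are touched when reading a panel of k adjacent columns from an n x n matrix. It is not the paper's algorithm or its exact layout: the square k x k tiling, the block size of k*k elements, and all function names are assumptions chosen for illustration.

```python
# Hypothetical sketch: row-major vs. tiled (block) layout for an n x n matrix.
# Memory is assumed to be transferred in blocks of k*k contiguous elements.

def row_major_addr(r, c, n):
    """Linear element address of (r, c) in a row-major n x n matrix."""
    return r * n + c

def tiled_addr(r, c, n, k):
    """Linear element address of (r, c) when the matrix is stored as
    k x k tiles, each tile contiguous, tiles ordered row by row."""
    tiles_per_row = n // k
    tile_id = (r // k) * tiles_per_row + (c // k)
    return tile_id * k * k + (r % k) * k + (c % k)

def blocks_touched(addrs, block_elems):
    """Number of distinct memory blocks covering the given addresses."""
    return len({a // block_elems for a in addrs})

def column_panel(n, k, c0=0):
    """All (row, col) coordinates in a panel of k adjacent columns."""
    return [(r, c) for r in range(n) for c in range(c0, c0 + k)]

n, k = 64, 8                 # assumed matrix and tile sizes (k divides n)
block_elems = k * k          # assumed transfer block size in elements
panel = column_panel(n, k)

row_major_blocks = blocks_touched(
    [row_major_addr(r, c, n) for r, c in panel], block_elems)
tiled_blocks = blocks_touched(
    [tiled_addr(r, c, n, k) for r, c in panel], block_elems)

print("row-major blocks touched:", row_major_blocks)
print("tiled blocks touched:    ", tiled_blocks)
```

Under these assumptions the panel holds n*k = 512 elements, so at least 8 blocks of 64 elements must be read. The tiled layout reaches that minimum because every fetched block is fully used, whereas the row-major layout fetches one block per matrix row and uses only k of its elements.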


Notes

  1. Calculated as 5n log2(n)/t for a DFT of size n, where t is the total runtime.
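The formula in the note above can be evaluated with a small helper. This is a minimal sketch: the function name, the choice of seconds for t, and the scaling to Gflop/s are assumptions for illustration, since the note itself does not state units.

```python
import math

def pseudo_gflops(n, runtime_s):
    """Evaluate 5 * n * log2(n) / t and scale the result to Gflop/s.

    n         -- DFT size (number of points)
    runtime_s -- total runtime t, assumed here to be in seconds
    """
    if n < 2 or runtime_s <= 0:
        raise ValueError("need n >= 2 and a positive runtime")
    return 5 * n * math.log2(n) / runtime_s / 1e9

# Example: a 1024-point DFT completing in one microsecond.
rate = pseudo_gflops(1024, 1e-6)
print(f"{rate:.1f} Gflop/s")
```

For n = 1024 the operation count is 5 * 1024 * 10 = 51200, so a runtime of one microsecond corresponds to roughly 51.2 Gflop/s under this convention.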

Acknowledgments

This work was sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement no. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.

Author information

Corresponding author

Correspondence to Berkin Akin.

About this article

Cite this article

Akin, B., Franchetti, F. & Hoe, J.C. FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation. J Sign Process Syst 85, 67–82 (2016). https://doi.org/10.1007/s11265-015-1018-0

  • DOI: https://doi.org/10.1007/s11265-015-1018-0
