Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Equipped with 512-bit wide SIMD instructions and a large number of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to many non-unit-stride memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with the transform along one of the remaining dimensions. The difference in the number of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system to achieve a high degree of parallelism and better data reuse in the local caches. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance performance. We evaluate the algorithm on the Intel® Xeon Phi™ coprocessor 7110P and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which outperforms the vendor-specific Intel® MKL library by a factor of up to 2.22x.
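The dimension-splitting idea can be illustrated with a minimal NumPy sketch. It is not the implementation evaluated in the paper: it shows only the index arithmetic of a Cooley-Tukey split of one axis, and omits the cache blocking, memory padding, vectorization and threading described above. The function name `two_pass_fft3d`, the parameter `n1`, and the choice to split the middle axis are assumptions made for illustration.

```python
import numpy as np


def two_pass_fft3d(x, n1):
    """Illustrative two-pass 3D FFT (hypothetical helper, not the paper's code).

    The middle axis of length b is split as b = n1 * n2. Pass 1 combines the
    transform along axis 0 with the length-n1 sub-transform; pass 2 combines
    the length-n2 sub-transform with the transform along axis 2.
    """
    a, b, c = x.shape
    if b % n1 != 0:
        raise ValueError("n1 must divide the length of the split axis")
    n2 = b // n1

    # View the middle axis as (n1, n2): original index = n2 * i1 + i2.
    x4 = np.ascontiguousarray(x).reshape(a, n1, n2, c)

    # Pass 1: transform along axis 0 and along the first sub-dimension (i1 -> k1).
    y = np.fft.fftn(x4, axes=(0, 1))

    # Twiddle factors exp(-2*pi*i * k1 * i2 / b) couple the two sub-transforms.
    k1 = np.arange(n1).reshape(1, n1, 1, 1)
    i2 = np.arange(n2).reshape(1, 1, n2, 1)
    y = y * np.exp(-2j * np.pi * k1 * i2 / b)

    # Pass 2: transform along the second sub-dimension (i2 -> k2) and axis 2.
    z = np.fft.fftn(y, axes=(2, 3))

    # The output index along the split axis is k1 + n1 * k2, so k2 must become
    # the slower-varying index before the two sub-dimensions are merged again.
    return z.transpose(0, 2, 1, 3).reshape(a, b, c)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 5)) + 1j * rng.standard_normal((4, 6, 5))
matches_reference = np.allclose(two_pass_fft3d(x, n1=2), np.fft.fftn(x))
print(matches_reference)
```

In this sketch each call to `np.fft.fftn` stands in for one pass over the array. Which dimension is best to split depends on the memory layout and on TLB behaviour, which is the question the paper analyzes; the sketch makes no claim about that choice.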

Author information

Corresponding author

Correspondence to Yi-Qun Liu.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.

About this article

Cite this article

Liu, YQ., Li, Y., Zhang, YQ. et al. Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor. J. Comput. Sci. Technol. 29, 989–1002 (2014). https://doi.org/10.1007/s11390-014-1484-z

  • DOI: https://doi.org/10.1007/s11390-014-1484-z
