Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor

Liu, Yi-Qun; Li, Yan; Zhang, Yun-Quan; Zhang, Xian-Yi

doi:10.1007/s11390-014-1484-z

Regular Paper
Published: 17 November 2014

Journal of Computer Science and Technology, Volume 29, pages 989–1002 (2014)

Yi-Qun Liu^1,2, Yan Li^3, Yun-Quan Zhang^4 & Xian-Yi Zhang^1,2

Abstract

Equipped with 512-bit wide SIMD instructions and a large number of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to many non-unit-stride memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which aims mainly to reduce the amount of explicit data transfer between memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions. The difference in the number of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse in the local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance performance. We evaluate the algorithm on the Intel® Xeon Phi™ coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which outperforms the vendor-specific Intel® MKL library by a factor of up to 2.22x.
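
The splitting idea described in the abstract can be illustrated with a small, unoptimized sketch. It is not the authors' implementation: the abstract does not say which dimension is split or how the sub-dimensions are paired, so the choices below (splitting the first dimension A = P × Q, pairing the length-P sub-transform with the last dimension and the length-Q sub-transform with the middle one) are assumptions, and the function names `dft`, `dft3d_reference` and `dft3d_two_pass` are hypothetical. A naive O(n²) DFT stands in for an optimized 1D FFT kernel, and the standard Cooley-Tukey twiddle factors link the two passes. The sketch only shows that two sweeps over the array can reproduce the result of the conventional three sweeps; it says nothing about cache, TLB or SIMD behavior.

```python
import cmath

def dft(v):
    """Naive O(n^2) 1D DFT; a stand-in for an optimized 1D FFT kernel."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def dft3d_reference(x):
    """Conventional 3D DFT: three sweeps over the array, one per dimension."""
    A, B, C = len(x), len(x[0]), len(x[0][0])
    y = [[list(x[a][b]) for b in range(B)] for a in range(A)]
    for a in range(A):                       # sweep 1: along the last dimension
        for b in range(B):
            y[a][b] = dft(y[a][b])
    for a in range(A):                       # sweep 2: along the middle dimension
        for c in range(C):
            col = dft([y[a][b][c] for b in range(B)])
            for b in range(B):
                y[a][b][c] = col[b]
    for b in range(B):                       # sweep 3: along the first dimension
        for c in range(C):
            col = dft([y[a][b][c] for a in range(A)])
            for a in range(A):
                y[a][b][c] = col[a]
    return y

def dft3d_two_pass(x, P, Q):
    """Two-sweep 3D DFT with the first dimension split as A = P * Q (assumed choice).

    Input index a = a1*Q + a2, output index ka = k1 + P*k2.
    """
    A, B, C = len(x), len(x[0]), len(x[0][0])
    assert A == P * Q, "first dimension must factor as P * Q"
    # Intermediate array u[k1][a2][b][kc].
    u = [[[[0j] * C for _ in range(B)] for _ in range(Q)] for _ in range(P)]
    # Pass 1: for each (a2, b), a 2D transform over (a1, c), then twiddle.
    for a2 in range(Q):
        for b in range(B):
            rows = [dft(x[a1 * Q + a2][b]) for a1 in range(P)]   # along c
            for kc in range(C):
                col = dft([rows[a1][kc] for a1 in range(P)])      # along a1
                for k1 in range(P):
                    twiddle = cmath.exp(-2j * cmath.pi * a2 * k1 / A)
                    u[k1][a2][b][kc] = col[k1] * twiddle
    # Pass 2: for each (k1, kc), a 2D transform over (a2, b).
    y = [[[0j] * C for _ in range(B)] for _ in range(A)]
    for k1 in range(P):
        for kc in range(C):
            rows = [dft([u[k1][a2][b][kc] for b in range(B)])     # along b
                    for a2 in range(Q)]
            for kb in range(B):
                col = dft([rows[a2][kb] for a2 in range(Q)])      # along a2
                for k2 in range(Q):
                    y[k1 + P * k2][kb][kc] = col[k2]
    return y
```

For a 6 × 2 × 3 input, `dft3d_two_pass(x, 2, 3)` should agree with `dft3d_reference(x)` up to floating-point rounding. In a tuned implementation each per-pass 2D block would presumably be sized to stay in on-chip cache, which is where the reduction in memory traffic would come from.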

Author information

Authors and Affiliations

Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Yi-Qun Liu & Xian-Yi Zhang
University of Chinese Academy of Sciences, Beijing, 100049, China
Yi-Qun Liu & Xian-Yi Zhang
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Yan Li
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Yun-Quan Zhang

Corresponding author

Correspondence to Yi-Qun Liu.

Additional information

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 108 kb)

About this article

Cite this article

Liu, YQ., Li, Y., Zhang, YQ. et al. Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor. J. Comput. Sci. Technol. 29, 989–1002 (2014). https://doi.org/10.1007/s11390-014-1484-z

Received: 27 December 2013
Revised: 25 July 2014
Published: 17 November 2014
Issue Date: November 2014
DOI: https://doi.org/10.1007/s11390-014-1484-z
