DOI: 10.1145/1810085.1810128
research-article

Large-scale FFT on GPU clusters

Published: 02 June 2010

Abstract

A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g., matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g., finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, because the bottleneck often lies in the PCI bus between main memory and GPU device memory or in the communication network between workstation nodes. Consequently, optimizing FFT performance on a single GPU device alone will not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, the use of GPU devices improves the sustained memory bandwidth for processing large data; second, GPU device memory allows larger subtasks to be processed whole and hence reduces repeated data transfers between memory and processors; and finally, some costly main-memory operations such as matrix transposition can be significantly sped up by GPUs if the necessary data adjustment is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors (together with the improved communication library in our implementation) contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for 4096 3D single-precision FFT on a 16-node cluster with 32 GPUs. A speedup of around 5x with respect to both standard libraries is achieved for double precision.
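The idea of adjusting array dimensions while data is in transit can be illustrated with a small sketch. The code below is not the paper's implementation: it is a plain-Python model in which a list named `host` stands in for main memory and a list named `device` stands in for GPU device memory, and the function names and the `chunk_rows` parameter are hypothetical. It shows that writing each transferred chunk with a stride yields the transposed array directly, so no separate transposition pass over main memory is needed.

```python
def transfer_with_transpose(host, rows, cols, chunk_rows):
    """Copy a row-major rows x cols array from `host` to a new `device`
    buffer chunk by chunk, placing elements so that the result is the
    row-major cols x rows transpose."""
    device = [None] * (rows * cols)
    for start in range(0, rows, chunk_rows):
        stop = min(start + chunk_rows, rows)
        # One modelled "transfer": a contiguous block of host rows ...
        for r in range(start, stop):
            for c in range(cols):
                # ... written with stride `rows` into the destination.
                device[c * rows + r] = host[r * cols + c]
    return device


def copy_then_transpose(host, rows, cols):
    """Baseline for comparison: a plain copy followed by a separate
    transposition pass over the copied data."""
    staged = list(host)                 # first pass: transfer as-is
    out = [None] * (rows * cols)
    for r in range(rows):               # second pass: transpose
        for c in range(cols):
            out[c * rows + r] = staged[r * cols + c]
    return out


if __name__ == "__main__":
    rows, cols = 2, 3
    host = list(range(rows * cols))     # [[0, 1, 2], [3, 4, 5]] in row-major order
    fused = transfer_with_transpose(host, rows, cols, chunk_rows=1)
    baseline = copy_then_transpose(host, rows, cols)
    print(fused == baseline)            # both routes give the same layout
```

Both functions produce the same layout; the fused version touches each element once instead of twice, which is the kind of saving the abstract attributes to performing the data adjustment during transfer. On real hardware the strided placement would be done by the transfer mechanism or a GPU kernel rather than by nested Python loops.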


Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
June 2010
365 pages
ISBN: 9781450300186
DOI: 10.1145/1810085
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. FFT
  2. GPU clusters
  3. array dimensions

Qualifiers

  • Research-article

Conference

ICS'10
ICS'10: International Conference on Supercomputing
June 2 - 4, 2010
Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
