Large-scale FFT on GPU clusters

Authors:

Hong Mei

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

Pages 315 - 324

https://doi.org/10.1145/1810085.1810128

Published: 02 June 2010

Abstract

A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g. matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g. finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, because the bottleneck often lies in the PCI bus between main memory and GPU device memory or in the communication network between workstation nodes. Optimizing FFT performance on a single GPU device therefore does not by itself improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance. First, the use of GPU devices improves the sustained memory bandwidth for processing large data sets. Second, GPU device memory allows larger subtasks to be processed whole, which reduces repeated data transfers between memory and processors. Third, some costly main-memory operations such as matrix transposition can be sped up significantly by GPUs if the necessary data adjustment is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors, together with the improved communication library in our implementation, contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for 4096 3D single-precision FFT on a 16-node cluster with 32 GPUs. For double precision, a speedup of around 5x with respect to both standard libraries is achieved.
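The idea of adjusting array dimensions during data transfer can be illustrated with a small, CPU-only sketch. It is not the paper's implementation: plain Python lists stand in for host and device memory, the example is 2D rather than 3D for brevity, and all function names (`fft_rows_with_transposed_writeback`, `fft2d`, `dft2d_naive`) are hypothetical. Each chunk of rows plays the role of one host-to-device transfer, and the write-back uses a stride so that the result lands already transposed, which removes the need for a separate transposition pass in main memory.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_rows_with_transposed_writeback(host, rows, cols, chunk_rows):
    """FFT every row of a rows x cols row-major array, chunk by chunk.

    Each chunk stands in for one host-to-device transfer (an assumption of
    this sketch). Results are written back with a stride, so the output is
    the cols x rows transpose of the row-transformed array.
    """
    out = [0j] * (rows * cols)
    for r0 in range(0, rows, chunk_rows):
        r1 = min(r0 + chunk_rows, rows)
        # "Upload": copy a contiguous block of rows into a device-side buffer.
        device = [host[r * cols:(r + 1) * cols] for r in range(r0, r1)]
        # "Compute": 1D FFTs along the contiguous dimension.
        device = [fft(row) for row in device]
        # "Download": the strided write places element (r, c) at (c, r).
        for i, row in enumerate(device):
            r = r0 + i
            for c, v in enumerate(row):
                out[c * rows + r] = v
    return out

def fft2d(host, rows, cols, chunk_rows=2):
    """2D FFT as two row-FFT passes; each pass transposes on write-back."""
    tmp = fft_rows_with_transposed_writeback(host, rows, cols, chunk_rows)
    # tmp is cols x rows, so its rows are the original columns.
    return fft_rows_with_transposed_writeback(tmp, cols, rows, chunk_rows)

def dft2d_naive(host, rows, cols):
    """Direct 2D DFT, used only as a correctness reference."""
    out = [0j] * (rows * cols)
    for u in range(rows):
        for v in range(cols):
            s = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    s += host[r * cols + c] * cmath.exp(1j * angle)
            out[u * cols + v] = s
    return out
```

Calling `fft2d(data, rows, cols)` on a row-major list of complex values returns the transform in the original `rows x cols` layout, because the two transposing write-backs cancel. On real hardware the strided write would correspond to a strided device-to-host copy or a transposing kernel; which of these the authors use is not stated in the abstract.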

Index Terms

Large-scale FFT on GPU clusters
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

June 2010

365 pages

ISBN:9781450300186

DOI:10.1145/1810085

General Chair: Taisuke Boku (University of Tsukuba)

Program Chairs: Hiroshi Nakashima (Kyoto University), Avi Mendelson (Microsoft)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministry of Science and Technology of the People's Republic of China
China HGJ Significant Project

Conference

ICS'10

Sponsor:

SIGARCH

ICS'10: International Conference on Supercomputing

June 2 - 4, 2010

Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
