An Implementation of Parallel 3-D FFT Using Short Vector SIMD Instructions on Clusters of PCs

Takahashi, Daisuke; Boku, Taisuke; Sato, Mitsuhisa

doi:10.1007/11558958_139

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3732)

Included in the following conference series:

International Workshop on Applied Parallel Computing

Abstract

In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized FFT kernels using Intel’s Streaming SIMD Extensions 2 (SSE2) instructions. We show that a combination of the vectorization and block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported. We successfully achieved performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
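
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea that a three-dimensional DFT can be computed as successive one-dimensional transforms along each axis. It is written in plain Python for readability; the function names (`dft_1d`, `dft_3d_by_axes`, `dft_3d_direct`) are hypothetical, the 1-D kernel is a naive O(n²) DFT standing in for an optimized FFT kernel, and none of this is the authors' code. The blocking and SSE2 vectorization the abstract refers to are not reproduced here.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) 1-D DFT; a stand-in for an optimized FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_3d_by_axes(a):
    """3-D DFT of a[i][j][k] computed as 1-D transforms along each axis in turn."""
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])
    # Work on a complex copy so the input is left untouched.
    out = [[[complex(v) for v in row] for row in plane] for plane in a]

    # Pass 1: transform along the last (contiguous) axis.
    for i in range(n1):
        for j in range(n2):
            out[i][j] = dft_1d(out[i][j])

    # Pass 2: transform along the middle axis (strided access).
    for i in range(n1):
        for k in range(n3):
            col = dft_1d([out[i][j][k] for j in range(n2)])
            for j in range(n2):
                out[i][j][k] = col[j]

    # Pass 3: transform along the first axis (largest stride).
    for j in range(n2):
        for k in range(n3):
            col = dft_1d([out[i][j][k] for i in range(n1)])
            for i in range(n1):
                out[i][j][k] = col[i]

    return out


def dft_3d_direct(a):
    """Direct triple-sum definition of the 3-D DFT, used only as a reference."""
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])

    def w(p, q, n):
        return cmath.exp(-2j * cmath.pi * p * q / n)

    return [
        [
            [
                sum(
                    a[i][j][k] * w(i, u, n1) * w(j, v, n2) * w(k, t, n3)
                    for i in range(n1)
                    for j in range(n2)
                    for k in range(n3)
                )
                for t in range(n3)
            ]
            for v in range(n2)
        ]
        for u in range(n1)
    ]
```

The second and third passes gather elements at increasing strides, which is the kind of access pattern that cache blocking is generally meant to improve. In a distributed-memory setting, one of these passes would typically also require exchanging data between nodes; that step is outside this sketch.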

Author information

Authors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan
Daisuke Takahashi, Taisuke Boku & Mitsuhisa Sato

Editor information

Editors and Affiliations

Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra
Department of Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800, Lyngby, Denmark
Kaj Madsen
Informatics & Mathematical Modeling, Technical University of Denmark, DK-2800, Lyngby, Denmark
Jerzy Waśniewski

Cite this paper

Takahashi, D., Boku, T., Sato, M. (2006). An Implementation of Parallel 3-D FFT Using Short Vector SIMD Instructions on Clusters of PCs. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds) Applied Parallel Computing. State of the Art in Scientific Computing. PARA 2004. Lecture Notes in Computer Science, vol 3732. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11558958_139

DOI: https://doi.org/10.1007/11558958_139
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29067-4
Online ISBN: 978-3-540-33498-9
eBook Packages: Computer Science, Computer Science (R0)
