Abstract
In this paper, we propose an implementation of parallel fast Fourier transforms (FFTs) with automatic performance tuning on distributed-memory parallel computers. A blocking algorithm for parallel FFTs uses cache memory effectively. Because the optimal block size may depend on the problem size, we propose a method for determining the block size that minimizes the number of cache misses. Parallel FFTs also require intensive all-to-all communication, which affects their performance, so automatic tuning of the all-to-all communication is implemented as well. Performance results demonstrate that the proposed automatic performance tuning improves the performance of parallel FFTs.
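To make the idea of block-size tuning concrete, here is a minimal sketch of an empirical search, not the chapter's actual method, which the abstract does not describe. It assumes that measured run time of a blocked kernel can stand in for the cache-miss count, and it uses a tiled matrix transpose (a typical data-movement step in blocked FFTs) as the kernel. The names `blocked_transpose`, `time_block_size`, and `tune_block_size` are hypothetical and chosen only for illustration.

```python
import time


def blocked_transpose(src, n, nb):
    """Transpose an n-by-n row-major matrix (flat list) in nb-by-nb tiles."""
    dst = [0] * (n * n)
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            # Work on one tile at a time so its data stays cache-resident.
            for i in range(ii, min(ii + nb, n)):
                for j in range(jj, min(jj + nb, n)):
                    dst[j * n + i] = src[i * n + j]
    return dst


def time_block_size(n, nb, repeats=3):
    """Best-of-`repeats` wall-clock time for one blocked transpose."""
    src = list(range(n * n))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        blocked_transpose(src, n, nb)
        best = min(best, time.perf_counter() - t0)
    return best


def tune_block_size(candidates, cost):
    """Return the candidate block size with the smallest cost.

    `cost` is any callable mapping a block size to a number; here it is
    assumed to be a timing, used as a proxy for cache misses.
    """
    return min(candidates, key=cost)


if __name__ == "__main__":
    n = 256  # problem size; the best block size may change with n
    nb = tune_block_size([4, 8, 16, 32, 64], lambda b: time_block_size(n, b))
    print("selected block size:", nb)
```

In pure Python the interpreter overhead hides most cache effects, so the timings mainly illustrate the structure of the search. A compiled kernel would be needed to observe real cache behavior, and the search would be repeated for each problem size of interest.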
Copyright information
© 2011 Springer New York
About this chapter
Cite this chapter
Takahashi, D. (2011). Automatic Tuning for Parallel FFTs. In: Naono, K., Teranishi, K., Cavazos, J., Suda, R. (eds) Software Automatic Tuning. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-6935-4_4
DOI: https://doi.org/10.1007/978-1-4419-6935-4_4
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-6934-7
Online ISBN: 978-1-4419-6935-4
eBook Packages: Engineering, Engineering (R0)