Abstract
This paper introduces a formal framework for automatically generating performance-optimized implementations of the discrete Fourier transform (DFT) for distributed-memory computers. The framework is implemented as part of the program generation and optimization system Spiral. DFT algorithms are represented as mathematical formulas in Spiral’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend Spiral to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. A novel feature of these methods is that the redistribution steps are merged with the communication steps of the algorithm, so that they incur no additional communication overhead. Among the possible alternative algorithms, Spiral’s search mechanism now determines the fastest one for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by Spiral show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with Fftw-MPI 2.1.5.
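As a rough illustration of what it means to represent a DFT algorithm as a formula, the sketch below numerically checks the Cooley-Tukey factorization DFT_mn = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L in plain Python, where L is a stride permutation and T a diagonal matrix of twiddle factors. It is not SPL and not code produced by Spiral. The helper names (`stride_perm`, `twiddle`, `i_tensor_dft`, `dft_tensor_i`, `cooley_tukey`) and the sign convention exp(-2πi/n) are assumptions made for this example only.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT computed directly from the definition."""
    n = len(x)
    return [
        sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
        for k in range(n)
    ]


def stride_perm(x, m):
    """Stride permutation L: output position i*n + j reads input j*m + i."""
    n = len(x) // m
    return [x[j * m + i] for i in range(m) for j in range(n)]


def i_tensor_dft(x, m):
    """I_m (x) DFT_n: apply DFT_n to each of the m contiguous blocks."""
    n = len(x) // m
    out = []
    for i in range(m):
        out.extend(dft(x[i * n:(i + 1) * n]))
    return out


def twiddle(x, m):
    """Diagonal twiddle matrix T: scale entry i*n + j by w_{mn}^(i*j)."""
    n = len(x) // m
    size = m * n
    return [
        x[i * n + j] * cmath.exp(-2j * cmath.pi * i * j / size)
        for i in range(m)
        for j in range(n)
    ]


def dft_tensor_i(x, m):
    """DFT_m (x) I_n: apply DFT_m to each of the n stride-n subsequences."""
    n = len(x) // m
    out = [0j] * len(x)
    for j in range(n):
        col = dft(x[j::n])  # m elements: x[j], x[j + n], ...
        for k in range(m):
            out[k * n + j] = col[k]
    return out


def cooley_tukey(x, m):
    """Evaluate the factored formula right to left on the input x."""
    return dft_tensor_i(twiddle(i_tensor_dft(stride_perm(x, m), m), m), m)


def max_abs_diff(a, b):
    """Largest entrywise deviation between two complex sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    signal = [complex(k, -0.5 * k) for k in range(12)]
    print(max_abs_diff(cooley_tukey(signal, 3), dft(signal)))
```

The m blocks processed by `i_tensor_dft` are independent of one another, which is the kind of structure a parallelizing rewrite can assign to separate processors, while the stride permutation is where data would have to move between them.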
This work was supported by the Special Research Program SFB F011 “AURORA” and the Erwin Schrödinger Fellowship of the Austrian Science Fund FWF, and in part by DARPA through the Department of Interior grant NBCH1050009 and by NSF through awards 0234293 and 0325687.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bonelli, A., Franchetti, F., Lorenz, J., Püschel, M., Ueberhuber, C.W. (2006). Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_74
DOI: https://doi.org/10.1007/11946441_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68067-3
Online ISBN: 978-3-540-68070-3
eBook Packages: Computer Science, Computer Science (R0)