Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers

Bonelli, Andreas; Franchetti, Franz; Lorenz, Juergen; Püschel, Markus; Ueberhuber, Christoph W.

doi:10.1007/11946441_74

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4330)

Included in the following conference series:

International Symposium on Parallel and Distributed Processing and Applications

Abstract

This paper introduces a formal framework for automatically generating performance-optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system Spiral. DFT algorithms are represented as mathematical formulas in Spiral’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend Spiral to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. A novel feature of these methods is that the redistribution steps are merged with the communication steps of the algorithm, so they add no extra communication overhead. Among the possible alternative algorithms, Spiral’s search mechanism then determines the fastest for a given platform, generating platform-adapted code without human intervention. Experiments with DFT MPI programs generated by Spiral show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with Fftw-MPI 2.1.5.
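To make "DFT algorithms as formulas" concrete, the following is a minimal sketch of the standard Cooley-Tukey factorization in Kronecker-product notation, DFT_mn = (DFT_m ⊗ I_n) · T · (I_m ⊗ DFT_n) · L, where L is a stride permutation and T a diagonal matrix of twiddle factors. It is not SPL syntax and not code generated by Spiral; the function names are hypothetical and chosen only for illustration. Each factor is applied as one step on a plain Python list. In a distributed setting, a factor of the form I_m ⊗ DFT_n typically maps to independent local work on contiguous blocks, while the stride permutation is the step that typically requires data exchange between processors.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT by definition; serves as reference and as small kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def stride_perm(x, m, n):
    """Stride permutation L: output block i (length n) holds x[i], x[i+m], ..."""
    return [x[j * m + i] for i in range(m) for j in range(n)]


def i_tensor_dft(x, m, n):
    """(I_m ⊗ DFT_n): apply DFT_n to each of the m contiguous blocks of length n."""
    out = []
    for i in range(m):
        out.extend(dft(x[i * n:(i + 1) * n]))
    return out


def twiddle(x, m, n):
    """Diagonal T: scale the entry at position i*n + k by w^(i*k), w = exp(-2*pi*i/(m*n))."""
    size = m * n
    return [x[i * n + k] * cmath.exp(-2j * cmath.pi * i * k / size)
            for i in range(m) for k in range(n)]


def dft_tensor_i(x, m, n):
    """(DFT_m ⊗ I_n): apply DFT_m to each of the n subvectors taken at stride n."""
    out = [0j] * (m * n)
    for k in range(n):
        col = dft(x[k::n])          # m elements: x[k], x[k+n], ...
        for i in range(m):
            out[i * n + k] = col[i]
    return out


def cooley_tukey(x, m, n):
    """Evaluate the factored formula right to left on a vector of length m*n."""
    assert len(x) == m * n
    y = stride_perm(x, m, n)
    y = i_tensor_dft(y, m, n)
    y = twiddle(y, m, n)
    return dft_tensor_i(y, m, n)


def max_error(x, m, n):
    """Largest absolute difference between the factored and the direct DFT."""
    return max(abs(a - b) for a, b in zip(cooley_tukey(x, m, n), dft(x)))
```

For a length-12 input, `max_error(x, 3, 4)` and `max_error(x, 4, 3)` should both be at rounding-error level, which illustrates that different factorizations compute the same transform and that choosing among them is purely a performance question.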

This work was supported by the Special Research Program SFB F011 “AURORA” and the Erwin Schrödinger Fellowship of the Austrian Science Fund FWF, and in part by DARPA through the Department of Interior grant NBCH1050009 and by NSF through awards 0234293 and 0325687.

Author information

Authors and Affiliations

Institute for Analysis and Scientific Computing, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040, Wien, Austria
Andreas Bonelli, Juergen Lorenz & Christoph W. Ueberhuber
Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213, USA
Franz Franchetti & Markus Püschel

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Minyi Guo
Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
Laurence T. Yang
Dipartimento di Ingegneria dell’Informazione, Second University of Naples, Real Casa dell’Annunziata, via Roma 29, 81031 Aversa (CE), Italy
Beniamino Di Martino
Institute of Scientific Computing, University of Vienna, Nordbergstr. 15/C/3, A-1090, Vienna, Austria/JPL, Caltech, USA
Hans P. Zima
Computer Science Department, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Grid Computing Center, Shanghai Jiao Tong University, 800 Dongchuan Road, 200240, Shanghai, China
Feilong Tang

Cite this paper

Bonelli, A., Franchetti, F., Lorenz, J., Püschel, M., Ueberhuber, C.W. (2006). Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_74

DOI: https://doi.org/10.1007/11946441_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68067-3
Online ISBN: 978-3-540-68070-3
eBook Packages: Computer Science, Computer Science (R0)
