DOI: 10.1145/1188455.1188575
Article

FFT program generation for shared memory: SMP and multicore

Published: 11 November 2006

Abstract

Chip makers are responding to the approaching end of CPU frequency scaling with multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and a backend that generates multithreaded code. Applied to the DFT, it produces a novel class of algorithms suitable for multicore systems, as validated by experimental results: we demonstrate a parallelization speed-up even for sizes that fit into L1 cache, and the generated code compares favorably to other DFT libraries across all small and midsize DFTs on all considered platforms.
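
To make the two goals named in the abstract concrete, the following is a minimal sketch of one Cooley-Tukey step for a DFT of size m·n split across p workers. It is not Spiral's rewriting system or its generated code; the function names, the parameter `mu` (an assumed cache-line length measured in complex elements), and the divisibility conditions are illustrative assumptions. Each worker gets the same number of sub-transforms (load balancing), and every block a worker writes is contiguous, a multiple of `mu` elements long, and starts at a multiple of `mu` (the layout property that avoids false sharing in a language with flat arrays).

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(x):
    """Naive O(n^2) DFT, standing in for an optimized sequential kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def parallel_dft(x, m, n, p, mu):
    """DFT of size m*n via one Cooley-Tukey step, split over p workers.

    mu is a hypothetical cache-line length in complex elements.
    """
    N = m * n
    assert len(x) == N
    # Assumed conditions: equal work per worker, mu-aligned write blocks.
    assert m % p == 0 and n % (p * mu) == 0
    t = [0j] * N  # intermediate buffer
    y = [0j] * N  # output buffer

    def stage1(w):
        # Worker w computes m/p DFTs of size n on stride-m inputs and writes
        # one contiguous chunk t[w*N/p : (w+1)*N/p], scaled by twiddle factors.
        for i in range(w * m // p, (w + 1) * m // p):
            row = dft(x[i::m])
            for k in range(n):
                t[i * n + k] = row[k] * cmath.exp(-2j * cmath.pi * i * k / N)

    def stage2(w):
        # Worker w computes n/p DFTs of size m on stride-n inputs. Its writes
        # fall into blocks of n/p consecutive elements, each a multiple of mu.
        for k in range(w * n // p, (w + 1) * n // p):
            col = dft(t[k::n])
            for l in range(m):
                y[k + n * l] = col[l]

    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(stage1, range(p)))  # consuming the results acts as a barrier
        list(pool.map(stage2, range(p)))
    return y
```

CPython threads will not show a speed-up here because of the global interpreter lock, and Python lists are not flat arrays of complex values; the sketch only illustrates how the index space is partitioned. In C with a real threading backend, the same partitioning keeps two threads from writing to the same cache line.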

Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
November 2006
746 pages
ISBN: 0769527000
DOI: 10.1145/1188455
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automatic parallelization
  2. chip multiprocessor
  3. fast Fourier transform
  4. multicore
  5. shared memory

Qualifiers

  • Article

Conference

SC '06

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 24
  • Downloads (last 6 weeks): 4
Reflects downloads up to 17 Jan 2025

Cited By

  • (2022) Verification of Vectorization of Signal Transforms. Languages and Compilers for Parallel Computing, 215-231. DOI: 10.1007/978-3-030-95953-1_15. Online publication date: 16-Feb-2022
  • (2019) Big Prime Field FFT on Multi-core Processors. Proceedings of the 2019 International Symposium on Symbolic and Algebraic Computation, 106-113. DOI: 10.1145/3326229.3326273. Online publication date: 8-Jul-2019
  • (2017) Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code. Electronic Proceedings in Theoretical Computer Science 237, 52-67. DOI: 10.4204/EPTCS.237.4. Online publication date: 11-Jan-2017
  • (2017) Towards a Semantics-Aware Code Transformation Toolchain for Heterogeneous Systems. Electronic Proceedings in Theoretical Computer Science 237, 34-51. DOI: 10.4204/EPTCS.237.3. Online publication date: 11-Jan-2017
  • (2017) A Haskell compiler for signal transforms. ACM SIGPLAN Notices 52:12, 219-232. DOI: 10.1145/3170492.3136056. Online publication date: 23-Oct-2017
  • (2017) A Haskell compiler for signal transforms. Proceedings of the 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 219-232. DOI: 10.1145/3136040.3136056. Online publication date: 23-Oct-2017
  • (2017) Data Flow Algorithms for Processors with Vector Extensions. Journal of Signal Processing Systems 87:1, 21-31. DOI: 10.1007/s11265-015-1045-x. Online publication date: 1-Apr-2017
  • (2015) High performance implementation of the inverse TFT. Proceedings of the 2015 International Workshop on Parallel Symbolic Computation, 87-94. DOI: 10.1145/2790282.2790292. Online publication date: 10-Jul-2015
  • (2014) High performance implementation of the TFT. Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, 328-334. DOI: 10.1145/2608628.2608661. Online publication date: 23-Jul-2014
  • (2014) Loop scheduling with memory access reduction subject to register constraints for DSP applications. Software—Practice & Experience 44:8, 999-1026. DOI: 10.1002/spe.2186. Online publication date: 1-Aug-2014
