FFT program generation for shared memory: SMP and multicore

Authors: Franz Franchetti, Yevgen Voronenko, Markus Püschel

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Pages 115 - es

https://doi.org/10.1145/1188455.1188575

Published: 11 November 2006

Abstract

Chip makers' response to the approaching end of CPU frequency scaling is multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and of a backend that generates multithreaded code. Applied to the DFT, it produces a novel class of algorithms suitable for multicore systems, as validated by experimental results: we demonstrate a parallelization speed-up even for sizes that fit into L1 cache, and the generated code compares favorably to other DFT libraries across all small and midsize DFTs and all considered platforms.
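To make the load-balancing idea concrete, the following is a minimal sketch, not the algorithm generated by Spiral. It applies one textbook Cooley-Tukey step to a DFT of size n = n1 * n2 and gives each of p workers an equally sized, contiguous block of independent sub-DFTs. The names `dft`, `blocks`, and `parallel_ct_dft` are hypothetical and chosen for illustration only. Python threads share one interpreter lock, so the sketch shows the structure of the computation rather than an actual speed-up.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(x):
    """Naive O(n^2) DFT, used as the sequential kernel and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * i / n) for j in range(n))
            for i in range(n)]


def blocks(count, p):
    """Split range(count) into p contiguous blocks of equal size (load balance)."""
    if count % p:
        raise ValueError("task count must be divisible by the worker count")
    size = count // p
    return [range(w * size, (w + 1) * size) for w in range(p)]


def parallel_ct_dft(x, n1, n2, p):
    """DFT of size n1*n2 via one Cooley-Tukey step, split across p workers."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    a = [None] * n2   # a[j2]: twiddled DFT of the stride-n2 subsequence x[j2::n2]
    out = [0j] * n

    def stage1(block):
        # n2 independent DFTs of size n1, followed by the twiddle scaling.
        for j2 in block:
            col = dft(x[j2::n2])
            a[j2] = [col[i1] * cmath.exp(-2j * cmath.pi * j2 * i1 / n)
                     for i1 in range(n1)]

    def stage2(block):
        # n1 independent DFTs of size n2. Each worker writes contiguous runs
        # of n1/p outputs; in a compiled implementation one would want that
        # run length to be a multiple of the cache line length so that no two
        # workers write to the same line (assumption of this sketch).
        for i1 in block:
            row = dft([a[j2][i1] for j2 in range(n2)])
            for i2 in range(n2):
                out[i1 + n1 * i2] = row[i2]

    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(stage1, blocks(n2, p)))   # acts as a barrier between stages
        list(pool.map(stage2, blocks(n1, p)))
    return out
```

Calling `parallel_ct_dft(x, 4, 4, 2)` on a 16-element input should agree with `dft(x)` up to rounding error. Requiring p to divide both n1 and n2 is what keeps every worker equally loaded in both stages.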

Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

November 2006

746 pages

ISBN:0769527000

DOI:10.1145/1188455

Conference Chair:
Barbara Horner-Miller
Arctic Region Supercomputing Center

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '06: International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 17, 2006

Tampa, Florida

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

Brinich P, Johnson J (2022). Verification of Vectorization of Signal Transforms. Languages and Compilers for Parallel Computing, 215-231. Online publication date: 16-Feb-2022.
https://doi.org/10.1007/978-3-030-95953-1_15

Covanov S, Mohajerani D, Moreno Maza M, Wang L, Davenport J, Wang D, Kauers M, Bradford R (2019). Big Prime Field FFT on Multi-core Processors. Proceedings of the 2019 International Symposium on Symbolic and Algebraic Computation, 106-113. Online publication date: 8-Jul-2019.
https://dl.acm.org/doi/10.1145/3326229.3326273

Vigueras G, Carro M, Tamarit S, Mariño J (2017). Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code. Electronic Proceedings in Theoretical Computer Science 237, 52-67. Online publication date: 11-Jan-2017.
https://doi.org/10.4204/EPTCS.237.4

Tamarit S, Mariño J, Vigueras G, Carro M (2017). Towards a Semantics-Aware Code Transformation Toolchain for Heterogeneous Systems. Electronic Proceedings in Theoretical Computer Science 237, 34-51. Online publication date: 11-Jan-2017.
https://doi.org/10.4204/EPTCS.237.3

Mainland G, Johnson J (2017). A Haskell compiler for signal transforms. ACM SIGPLAN Notices 52:12, 219-232. Online publication date: 23-Oct-2017.
https://dl.acm.org/doi/10.1145/3170492.3136056

Mainland G, Johnson J, Flatt M, Erdweg S (2017). A Haskell compiler for signal transforms. Proceedings of the 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 219-232. Online publication date: 23-Oct-2017.
https://dl.acm.org/doi/10.1145/3136040.3136056

Barford L, Bhattacharyya S, Liu Y (2017). Data Flow Algorithms for Processors with Vector Extensions. Journal of Signal Processing Systems 87:1, 21-31. Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.1007/s11265-015-1045-x

Meng L, Johnson J, Pernet C (2015). High performance implementation of the inverse TFT. Proceedings of the 2015 International Workshop on Parallel Symbolic Computation, 87-94. Online publication date: 10-Jul-2015.
https://dl.acm.org/doi/10.1145/2790282.2790292

Meng L, Johnson J, Nagasaka K, Winkler F, Szanto A (2014). High performance implementation of the TFT. Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, 328-334. Online publication date: 23-Jul-2014.
https://dl.acm.org/doi/10.1145/2608628.2608661

Wang Y, Jia Z, Chen R, Wang M, Liu D, Shao Z (2014). Loop scheduling with memory access reduction subject to register constraints for DSP applications. Software—Practice & Experience 44:8, 999-1026. Online publication date: 1-Aug-2014.
https://dl.acm.org/doi/10.1002/spe.2186
