Four Easy Ways to a Faster FFT

Mullin, Lenore R.; Small, Sharon G.

doi:10.1023/A:1020590506372

Published: September 2002

Volume 1, pages 193–214, (2002)

Journal of Mathematical Modelling and Algorithms

Lenore R. Mullin¹ &
Sharon G. Small²


Abstract

The Fast Fourier Transform (FFT) was named one of the Top Ten algorithms of the 20th century, and continues to be a focus of current research. A problem with currently used FFT packages is that they require large, finely tuned, machine-specific libraries, produced by highly skilled software developers. Therefore, these packages fail to perform well across a variety of architectures. Furthermore, many packages need to run repeated experiments in order to 're-program' their code for optimal performance on a given machine's underlying hardware. Finally, it is difficult to know which radix to use for a particular vector size and machine configuration. We propose the use of monolithic array analysis as a way to remove the constraints imposed on performance by a machine's underlying hardware, by pre-optimizing array access patterns. In doing this we arrive at a single optimized program. We have achieved up to a 99.6% increase in performance, and the ability to run vectors up to 8 388 608 elements larger, on our experimental platforms. Preliminary experiments indicate that different radices perform better relative to a machine's underlying architecture.
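
The abstract itself contains no code. As a rough illustration of what "radix" and "array access pattern" mean in this setting, the sketch below is a textbook iterative radix-2 Cooley-Tukey FFT. It is not the authors' monolithic array derivation, and the name `fft_radix2` is hypothetical. The point to notice is that every stage reads and writes the same array with a different stride (`half`), which is the kind of access pattern the abstract proposes to pre-optimize.

```python
import cmath


def fft_radix2(x):
    """Textbook iterative radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Bit-reversal permutation: reorder the input so that the
    # butterflies below can work in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # log2(n) butterfly stages; the stride between paired
    # elements (half) doubles at every stage.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a
```

A higher radix (4, 8, ...) would combine more elements per butterfly and use fewer stages, changing both the arithmetic per stage and the strides touched; which trade-off wins on a given machine is the question the abstract raises.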


References

OpenMP: Simple, portable, scalable SMP programming, 2000.
Agarwal, R. C., Gustavson, F. G. and Zubair, M.: A high performance parallel algorithm for 1-D FFT, In: Proc., Supercomputing '94, IEEE Computer Society Press, Washington, DC, 1994, pp. 34–40.
Bilmes, J., Asanovic, K., Chin, C.-W. and Demmel, J.: Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology, In: Proc. 1997 International Conference on Supercomputing, Vienna, Austria, July 1997, pp. 340–347.
Center, C. M. H.: Top ten algorithms of the 20th century, Computing Science and Engineering Magazine, 1999.
Chamberlain, B. L., Choi, S.-E., Lewis, C., Snyder, L., Weathersby, W. D. and Lin, C.: The case for high-level parallel programming in ZPL, IEEE Comput. Sci. Engrg. 5(3) (1998), 76–86.
Chamberlain, B. L., Choi, S.-E., Lewis, E. C., Lin, C., Snyder, L. and Weathersby, W. D.: Factor-join: A unique approach to compiling array languages for parallel machines, In: D. Padua, A. Nicolau, D. Gelernter, U. Banerjee and D. Sehr (eds), Proc. Ninth International Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Comput. Sci. 1239, Springer-Verlag, New York, 1996, pp. 481–500.
Chamberlain, B. L., Choi, S.-E. and Snyder, L.: A compiler abstraction for machine independent parallel communication generation, In: Z. Li, P. C. Yew, S. Chatterjee, C. H. Huang, P. Sadayappan and D. Sehr (eds), Languages and Compilers for Parallel Computing, Lecture Notes in Comput. Sci. 1366, Springer-Verlag, New York, 1998, pp. 261–276.
Cormen, T.: Everything you always wanted to know about out-of-core FFTs but were afraid to ask, COMPASS Colloquia Series, U Albany, SUNY, 2000.
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R. and von Eicken, T.: LogP: Toward a realistic model of parallel computation, In: Proc. Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993, pp. 1–12.
Dai, D. L., Gupta, S. K. S., Kaushik, S. D. and Lu, J. H.: EXTENT: A portable programming environment for designing and implementing high-performance block-recursive algorithms, In: Proc., Supercomputing '94, IEEE Computer Society Press, Washington, DC, 1994, pp. 49–58.
Dooling, D. and Mullin, L.: Indexing and distributing a general partitioned sparse array, Proc. Workshop on Solving Irregular Problems on Distributed Memory Machines, 1995.
Elliott, D. F. and Rao, K. R.: Fast Transforms: Algorithms, Analyses, Applications, Academic Press, New York, 1982.
Frigo, M. and Johnson, S.: FFTW online documentation, Nov. 1999.
Granata, J., Conner, M. and Tolimieri, R.: Recursive fast algorithms and the role of the tensor product, IEEE Trans. Signal Process. 40(12) (1992), 2921–2930.
Gupta, A. and Kumar, V.: The scalability of FFT on parallel computers, IEEE Trans. Parallel and Distributed Systems 4(8) (1993), 922–932.
Gupta, S., Huang, C.-H., Sadayappan, P. and Johnson, R.: On the synthesis of parallel programs from tensor product formulas for block recursive algorithms, In: U. Banerjee, D. Gelernter, A. Nicolau and D. Padua (eds), Proc. 5th International Workshop on Languages and Compilers for Parallel Computing (New Haven, Connecticut), Lecture Notes in Comput. Sci. 757, Springer-Verlag, New York, 1992, pp. 264–280.
Gupta, S. K. S., Huang, C.-H., Sadayappan, P. and Johnson, R. W.: Implementing fast Fourier transforms on distributed-memory multiprocessors using data redistributions, Parallel Processing Lett. 4(4) (1994), 477–488.
Gupta, S. K. S., Huang, C.-H., Sadayappan, P. and Johnson, R. W.: A framework for generating distributed-memory parallel programs for block recursive algorithms, J. Parallel Distributed Comput. 34(2) (1996), 137–153.
Hennessy, J. and Patterson, D.: Computer Architecture: A Quantitative Approach, Morgan Kaufmann, California, 1996.
High Performance Fortran Forum. High Performance Fortran language specification, Scientific Programming 2(1-2) (1993), 1–170.
Humphrey, W., Karmesin, S., Bassetti, F. and Reynders, J.: Optimization of data-parallel field expressions in the POOMA framework, In: Y. Ishikawa, R. R. Oldehoeft, J. Reynders and M. Tholburn (eds), Proc. First International Conference on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE '97) (Marina del Rey, CA), Lecture Notes in Comput. Sci. 1343, Springer-Verlag, New York, 1997, pp. 185–194.
Hunt, H., Mullin, L. and Rosenkrantz, D.: A feasibility study on the high level design of both sequential and parallel algorithms applied to the FFT, Paper in progress, Department of CS SUNY, Albany, 2001.
Karmesin, S., Crotinger, J., Cummings, J., Haney, S., Humphrey, W., Reynders, J., Smith, S. and Williams, T.: Array design and expression evaluation in POOMA II, In: D. Caromel, R. R. Oldehoeft and M. Tholburn (eds), Proc. Second International Symposium on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE '98) (Santa Fe, NM), Lecture Notes in Comput. Sci. 1505, Springer-Verlag, New York, 1998, pp.
Li, J. and Skjellum, A.: A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies, Mississippi State Univ., 1995.
Lin, C. and Snyder, L.: ZPL: An array sublanguage, In: U. Banerjee, D. Gelernter, A. Nicolau and D. Padua (eds), Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing (Portland, OR), Lecture Notes in Comput. Sci. 768, Springer-Verlag, New York, 1993, pp. 96–114.
Lumsdaine, A.: The matrix template library: A generic programming approach to high performance numerical linear algebra, In: Proceedings of International Symposium on Computing in Object-Oriented Parallel Environments, 1998.
Lumsdaine, A. and McCandless, B.: Parallel extensions to the matrix template library, In: Proc. 8th SIAM Conference on Parallel Processing for Scientific Computing, SIAM Press, Philadelphia, 1997.
Miles, D.: Compute intensity and the FFT, In: Proc., Supercomputing '93 (Portland, OR), IEEE Computer Society Press, 1993, pp. 676–684.
Mullin, L.: The Psi compiler project, In: Workshop on Compilers for Parallel Computers, TU Delft, Holland, 1993.
Mullin, L.: On the monolithic analysis of a general radix Cooley-Tukey FFT: Design, development, and performance, Invited talk, Lincoln Labs, MIT, 2000.
Mullin, L., Dooling, D., Sandberg, E. and Thibault, S.: Formal methods for portable, scalable, scheduling, routing, and communication protocol, Technical Report CSC 94-04, Dept. of CS, Univ. Missouri-Rolla, 1994.
Mullin, L., Kluge, W. and Scholtz, S.: On programming scientific applications in SAC - a functional language extended by a subsystem for high level array operations, In: Proc. 8th International Workshop on Implementation of Functional Languages, Bonn/Germany, 1996.
Mullin, L. and Small, S.: Three easy steps to a faster FFT (no, we don't need a plan), Proc. 2001 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, SPECTS 2001.
Mullin, L. and Small, S.: Three easy steps to a faster FFT (the story continues...), Proc. International Conference on Imaging Science, Systems, and Technology, CISST 2001.
Mullin, L. M. R.: A mathematics of arrays, PhD thesis, Syracuse Univ., Dec. 1988.
Mullin, L. R., Dooling, D., Sandberg, E. and Thibault, S.: Formal methods for scheduling, routing and communication protocol, In: Proc. Second International Symposium on High Performance Distributed Computing (HPDC-2), IEEE Computer Society, 1993.
Mullin, L. R., Eggleston, D., Woodrum, L. J. and Rennie, W.: The PGI-PSI project: Preprocessing optimizations for existing and new F90 intrinsics in HPF using compositional symmetric indexing of the Psi calculus, In: M. Gerndt (ed.), Proc. 6th Workshop on Compilers for Parallel Computers (Aachen, Germany), Forschungszentrum Jülich GmbH, 1996, pp. 345–355.
Rosenkrantz, D., Mullin, L. and Hunt, H. B., III: On materializations of array-valued temporaries, In: Proc. 13th International Workshop on Languages and Compilers for Parallel Computing 2000 (LCPC'00) (Yorktown Heights, NY), Springer-Verlag, New York, to be published.
Skjellum, A., Doss, N. and Bangalore, P.: Driving issues in scalable libraries: Poly-algorithms, data distribution independence, redistribution, local storage schemes, In: Proc. Seventh SIAM Conference on Parallel Processing for Scientific Computing, SIAM Press, Philadelphia, 1996.
Thibault, S. and Mullin, L.: A pipeline implementation of LU-decomposition on a hypercube, Technical Report, Univ. Missouri-Rolla, 1994, TR 95-03.
Tolimieri, R., An, M. and Lu, C.: Algorithms for Discrete Fourier Transform and Convolution, Springer-Verlag, New York, 1989.
Tolimieri, R., An, M. and Lu, C.: Mathematics of Multidimensional Fourier Transform Algorithms, Springer-Verlag, New York, 1993.
Van Loan, C.: Computational Frameworks for the Fast Fourier Transform, Frontiers in Applied Mathematics, SIAM, Philadelphia, 1992.
Veldhuizen, T.: Using C++ template metaprograms, C++ Report 7(4) (1995), 36–43. Reprinted in C++ Gems (ed. Stanley Lippman).
Veldhuizen, T. L.: Expression templates, C++ Report 7(5) (1995), 26–31. Reprinted in C++ Gems (ed. Stanley Lippman).
Veldhuizen, T. L.: Arrays in Blitz++, In: D. Caromel, R. R. Oldehoeft and M. Tholburn (eds), Proc. Second International Symposium on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE '98) (Santa Fe, NM), Lecture Notes in Comput. Sci. 1505, Springer-Verlag, New York, 1998.
Whaley, R. C. and Dongarra, J. J.: Automatically tuned linear algebra software, Technical Report UT-CS-97-366, Department of Computer Science, Univ. Tennessee, Dec. 1997.

Author information

Authors and Affiliations

Computer Science Department, University at Albany, SUNY, Albany, NY, 12222, U.S.A.
Lenore R. Mullin
NY 12222, U.S.A.
Sharon G. Small

About this article

Cite this article

Mullin, L.R., Small, S.G. Four Easy Ways to a Faster FFT. Journal of Mathematical Modelling and Algorithms 1, 193–214 (2002). https://doi.org/10.1023/A:1020590506372

Issue Date: September 2002
DOI: https://doi.org/10.1023/A:1020590506372
