Abstract
The Fast Fourier Transform (FFT) was named one of the Top Ten algorithms of the 20th century, and continues to be a focus of current research. A problem with currently used FFT packages is that they require large, finely tuned, machine-specific libraries produced by highly skilled software developers. As a result, these packages fail to perform well across a variety of architectures. Furthermore, many packages must run repeated experiments in order to ‘re-program’ their code for optimal performance on a given machine's underlying hardware. Finally, it is difficult to know which radix to use for a particular vector size and machine configuration. We propose monolithic array analysis as a way to remove the constraints that a machine's underlying hardware imposes on performance, by pre-optimizing array access patterns. In doing so we arrive at a single optimized program. We have achieved up to a 99.6% increase in performance, and the ability to run vectors up to 8 388 608 elements larger, on our experimental platforms. Preliminary experiments indicate that the best-performing radix depends on a machine's underlying architecture.
Cite this article
Mullin, L.R., Small, S.G. Four Easy Ways to a Faster FFT. Journal of Mathematical Modelling and Algorithms 1, 193–214 (2002). https://doi.org/10.1023/A:1020590506372
DOI: https://doi.org/10.1023/A:1020590506372