Abstract
The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operation count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a set of general techniques for improving the performance of numerical code, focusing on optimizations for the computer’s memory hierarchy. Further, program generators are discussed as a way to reduce the implementation and optimization effort. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.
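To make the memory-hierarchy point concrete for the first running example, the following is a minimal sketch of loop blocking (tiling) for matrix-matrix multiplication. It is an illustration under stated assumptions, not code taken from the chapter: the function names `mmm_naive` and `mmm_blocked`, the row-major layout, and the block-size parameter `nb` are all hypothetical choices made here. Both functions perform the same arithmetic; the blocked version only reorders the loops so that each tile of the operands is reused while it is likely still in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward triple loop: C = C + A*B for n x n row-major matrices. */
void mmm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked variant (illustrative): same operation count, but the work is
 * split into nb x nb tiles so that a tile of A, B, and C can stay in
 * cache while it is being used. nb is a tuning parameter (assumption). */
void mmm_blocked(size_t n, size_t nb,
                 const double *A, const double *B, double *C)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t jj = 0; jj < n; jj += nb)
            for (size_t kk = 0; kk < n; kk += nb) {
                /* Clamp tile bounds so n need not be a multiple of nb. */
                size_t i_end = min_sz(ii + nb, n);
                size_t j_end = min_sz(jj + nb, n);
                size_t k_end = min_sz(kk + nb, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++) {
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < k_end; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
            }
}
```

A suitable value of `nb` depends on the cache sizes of the target machine and would normally be found by measurement or search; the sketch leaves that choice to the caller.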
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chellappa, S., Franchetti, F., Püschel, M. (2008). How to Write Fast Numerical Code: A Small Introduction. In: Lämmel, R., Visser, J., Saraiva, J. (eds) Generative and Transformational Techniques in Software Engineering II. GTTSE 2007. Lecture Notes in Computer Science, vol 5235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88643-3_5
DOI: https://doi.org/10.1007/978-3-540-88643-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88642-6
Online ISBN: 978-3-540-88643-3
eBook Packages: Computer Science, Computer Science (R0)