Abstract

The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer’s memory hierarchy. Further, program generators are discussed as a way to reduce the implementation and optimization effort. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.
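
As a rough illustration of the kind of memory-hierarchy optimization the abstract refers to, the sketch below contrasts a straightforward triple-loop matrix-matrix multiplication with a blocked (tiled) variant that works on small sub-blocks so that operands can be reused while they are still in cache. This is not taken from the chapter: the function names and the block size `nb` are hypothetical, and Python is used only to show the loop structure. Any actual speedup would have to be measured in a compiled language such as C, with the block size tuned to the target machine.

```python
def matmul_naive(a, b, n):
    """C = A * B for n x n matrices stored as row-major flat lists."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i * n + k] * b[k * n + j]
            c[i * n + j] = s
    return c


def matmul_blocked(a, b, n, nb=4):
    """Same product computed tile by tile; nb is a hypothetical block size."""
    c = [0.0] * (n * n)
    # The three outer loops walk over nb x nb tiles of C, A, and B.
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                # The three inner loops multiply one pair of tiles and
                # accumulate into the matching tile of C. min() handles
                # the case where n is not a multiple of nb.
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        s = c[i * n + j]
                        for k in range(k0, min(k0 + nb, n)):
                            s += a[i * n + k] * b[k * n + j]
                        c[i * n + j] = s
    return c
```

Both functions perform the same multiplications and additions; only the order in which the operands are visited changes, which is why blocking affects memory behavior rather than the operations count.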

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chellappa, S., Franchetti, F., Püschel, M. (2008). How to Write Fast Numerical Code: A Small Introduction. In: Lämmel, R., Visser, J., Saraiva, J. (eds) Generative and Transformational Techniques in Software Engineering II. GTTSE 2007. Lecture Notes in Computer Science, vol 5235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88643-3_5

  • DOI: https://doi.org/10.1007/978-3-540-88643-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88642-6

  • Online ISBN: 978-3-540-88643-3

  • eBook Packages: Computer Science (R0)
