How to Write Fast Numerical Code: A Small Introduction

Chellappa, Srinivas; Franchetti, Franz; Püschel, Markus

doi:10.1007/978-3-540-88643-3_5


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5235)

Included in the following conference series:

International Summer School on Generative and Transformational Techniques in Software Engineering


Abstract

The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer’s memory hierarchy. Further, program generators are discussed as a way to reduce the implementation and optimization effort. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.
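As a rough illustration of the kind of memory-hierarchy optimization the abstract refers to, the sketch below contrasts a straightforward triple-loop matrix-matrix multiplication with a blocked (tiled) variant. It is a minimal sketch, not the chapter's own code: the function names `mmm_naive` and `mmm_blocked`, the row-major layout, and the block-size parameter `nb` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward triple loop: C = C + A*B for n x n row-major matrices.
 * (Hypothetical name; not taken from the chapter.) */
void mmm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked variant: performs the same arithmetic, but walks the matrices in
 * nb x nb tiles so that the data touched by the three inner loops is more
 * likely to stay in cache. The block size nb is an assumed tuning parameter;
 * a good value depends on the cache sizes of the target machine. */
void mmm_blocked(size_t n, size_t nb,
                 const double *A, const double *B, double *C)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t jj = 0; jj < n; jj += nb)
            for (size_t kk = 0; kk < n; kk += nb) {
                /* Clamp tile bounds so n need not be a multiple of nb. */
                size_t i_end = min_sz(ii + nb, n);
                size_t j_end = min_sz(jj + nb, n);
                size_t k_end = min_sz(kk + nb, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++)
                        for (size_t k = kk; k < k_end; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
}
```

Both functions perform the same number of floating-point operations; only the order in which memory is traversed differs. Whether the blocked version is actually faster, and for which `nb`, has to be measured on the platform at hand.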



Author information

Authors and Affiliations

Electrical and Computer Engineering, Carnegie Mellon University, USA
Srinivas Chellappa, Franz Franchetti & Markus Püschel


Editor information

Editors and Affiliations

Fachbereich 4, Institut für Informatik, Universität Koblenz-Landau, B127, Universitätsstraße 1, 56070, Koblenz, Germany
Ralf Lämmel
Software Improvement Group, A.J. Ernststraat 595-H, 1082 LD, Amsterdam, The Netherlands
Joost Visser
Universidade do Minho, Campus de Gualtar, Braga, Portugal
João Saraiva


About this chapter

Cite this chapter

Chellappa, S., Franchetti, F., Püschel, M. (2008). How to Write Fast Numerical Code: A Small Introduction. In: Lämmel, R., Visser, J., Saraiva, J. (eds) Generative and Transformational Techniques in Software Engineering II. GTTSE 2007. Lecture Notes in Computer Science, vol 5235. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88643-3_5


DOI: https://doi.org/10.1007/978-3-540-88643-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88642-6
Online ISBN: 978-3-540-88643-3
eBook Packages: Computer Science, Computer Science (R0)
