ABSTRACT
The evolution of computing platforms is at a historic inflection point. CPU frequency scaling stalled around 2004 at about 3 GHz, which means future performance gains are achievable only through increasing parallelism in the form of multiple cores and vector instruction sets. The impact on developers of high performance libraries implementing important mathematical functionality, such as matrix multiplication, linear transforms, and many others, is profound. Traditionally, an algorithm developer ensures correctness and minimizes the operations count. A software engineer then performs the actual implementation (in a compilable language like C) and the performance optimization. On modern platforms, however, two implementations with the exact same operations count may differ by 10x, 100x, or even 1000x in runtime: the structure of an algorithm becomes a major factor, determining how well it can be parallelized, vectorized, and matched to the memory hierarchy. Ideally, a compiler would perform all these tasks, but the current state of knowledge suggests that this may be inherently impossible for many types of code. The reason may be twofold. First, many transformations, in particular those for parallelism, require domain knowledge that the compiler simply does not possess. Second, there are often simply too many choices of transformations, which the compiler cannot or does not know how to explore.
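The effect of algorithm structure on runtime can be seen even in a toy setting. The sketch below contrasts two loop orders for matrix multiplication with identical operation counts but different memory access patterns; the function names, the problem size, and the pure-Python setting are illustrative and not taken from Spiral.

```python
import random
import time

def matmul_ijk(A, B, n):
    """Textbook loop order: the inner loop strides down the columns of B."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_ikj(A, B, n):
    """Reordered loops: the inner loop walks rows of B and C contiguously,
    a layout that cache hierarchies and vector units favor."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            Ci, Bk = C[i], B[k]
            for j in range(n):
                Ci[j] += a * Bk[j]
    return C

if __name__ == "__main__":
    n = 120
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    for f in (matmul_ijk, matmul_ikj):
        t0 = time.perf_counter()
        f(A, B, n)
        print(f.__name__, round(time.perf_counter() - t0, 3), "s")
```

Both functions perform exactly n^3 multiply-adds; in a compiled language with larger n, the gap between such variants (and further blocked or vectorized versions) is where the 10x-1000x differences mentioned above come from.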
As a consequence, the development of high performance libraries for mathematical functions becomes extraordinarily difficult, since the developer needs a good understanding of the available algorithms, the target microarchitecture, and implementation techniques such as threading and vector instruction sets such as Intel's SSE. To make things worse, optimal code is often platform specific; that is, code that runs very fast on one platform can be suboptimal on another. This means that if the highest performance is desired, library developers are constantly forced to reimplement and reoptimize the same functionality. Commercial examples following this model are Intel's IPP and MKL libraries, which provide a very broad set of mathematical functions needed in scientific computing, signal and image processing, communication, and security applications.
An attractive solution would be to automate library development: let the computer write the code and rewrite it for every new platform. There are several challenges involved in this proposal. First, for a given desired function (such as multiplying matrices or computing a discrete Fourier transform), the existing algorithm knowledge has to be encoded into a form or language suitable for computer representation. Second, the structural algorithm transformations for parallelism or locality that are typically performed by the programmer also have to be encoded into this form. Third, the available choices have to be explored systematically and efficiently. As we will show for a specific domain, techniques from symbolic computation provide the answers.
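To see why the third challenge (systematic exploration) is nontrivial, consider the classical Cooley-Tukey FFT breakdown rule, which splits a DFT of size km into smaller DFTs of sizes k and m. Counting the recursion trees this single rule generates already shows a combinatorial explosion of algorithmic choices. The sketch below is a standard counting argument, not code from Spiral; the function name is mine.

```python
def num_cooley_tukey_trees(n):
    """Count the distinct recursion trees that the Cooley-Tukey
    breakdown rule generates for a DFT of size n, treating prime
    sizes as terminals computed by a single base-case algorithm."""
    divisors = [k for k in range(2, n) if n % k == 0]
    if not divisors:          # prime size: exactly one base case
        return 1
    # each factorization n = k * (n // k) yields independent choices
    # for the two recursive subproblems
    return sum(num_cooley_tukey_trees(k) * num_cooley_tukey_trees(n // k)
               for k in divisors)

# For two-power sizes this recurrence yields the Catalan numbers:
# sizes 4, 8, 16, 32, 64 admit 1, 2, 5, 14, 42 alternative
# recursion trees, all with (nearly) identical operation counts.
```

For realistically large sizes and with additional breakdown rules, the number of alternatives grows far beyond what can be enumerated by hand, which motivates the systematic, search-based exploration described above.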
In this talk we present Spiral [6, 1], a domain-specific program generation system for important mathematical functionality such as linear transforms, filters, Viterbi decoders, and basic linear algebra routines. Spiral completely replaces the human programmer. For a desired function, Spiral generates alternative algorithms, optimizes them, compiles them into programs, and "intelligently" searches for the best match to the computing platform. The main idea behind Spiral is a mathematical, symbolic, declarative, domain-specific language to represent algorithms and the use of rewriting systems to generate and structurally optimize algorithms at a high level of abstraction. Optimization includes parallelization, vectorization, and locality improvement for the memory hierarchy [3, 4, 5, 7, 2]. Experimental results show that the code generated by Spiral competes with, and sometimes outperforms, the best available human-written code.
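The flavor of rewriting at the formula level can be illustrated with the tensor-product identity A (tensor) B = (A tensor I) * (I tensor B), a standard matrix identity of the kind such rewriting systems exploit: the right-hand side exposes two separate loop nests that can be mapped to threads or vector instructions. The pure-Python representation below is a toy verification of the identity, not Spiral's actual formula language.

```python
def kron(A, B):
    """Kronecker (tensor) product of two dense matrices (lists of lists)."""
    return [[a * b for a in rowA for b in rowB]
            for rowA in A for rowB in B]

def matmul(A, B):
    """Dense matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def eye(n):
    """Identity matrix I_n."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Rewrite rule, read as an identity on formulas:
#   A_n tensor B_m  ->  (A_n tensor I_m) . (I_n tensor B_m)
A = [[0.0, 1.0], [1.0, 0.0]]          # small stand-in for a transform
B = [[1.0, 2.0], [3.0, 4.0]]
lhs = kron(A, B)
rhs = matmul(kron(A, eye(2)), kron(eye(2), B))
assert lhs == rhs                     # both sides compute the same matrix
```

A rewriting system applies many such identities symbolically, choosing among the resulting factorizations the one that best matches the target platform.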
- Spiral web site, 2006. www.spiral.net.
- F. Franchetti, F. de Mesmay, D. McFarlin, and M. Püschel. Operator language: A program generation framework for fast kernels. In IFIP Working Conference on Domain Specific Languages (DSL WC), 2009.
- F. Franchetti, Y. Voronenko, and M. Püschel. Loop merging for signal transforms. In Proc. PLDI, pages 315--326, 2005.
- F. Franchetti, Y. Voronenko, and M. Püschel. FFT program generation for shared memory: SMP and multicore. In Supercomputing, 2006.
- F. Franchetti, Y. Voronenko, and M. Püschel. A rewriting system for the vectorization of signal transforms. In High Performance Computing for Computational Science (VECPAR), volume 4395 of Lecture Notes in Computer Science, pages 363--377. Springer, 2006.
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232--275, 2005. Special issue on "Program Generation, Optimization, and Adaptation".
- Y. Voronenko, F. de Mesmay, and M. Püschel. Computer generation of general size linear transform libraries. In International Symposium on Code Generation and Optimization (CGO), pages 102--113, 2009.