ABSTRACT

The applications that seem most likely to benefit from major advances in computational power, and so to drive future processor development, are increasingly throughput oriented, with products optimized more for data or task parallelism depending on their market focus (e.g., HPC vs. transactional vs. multimedia). Examples include the simulation of large physical systems, data mining, and ray tracing. Designing for throughput-oriented workloads emphasizes many small cores, because small cores eliminate most of the hardware needed to speed up the performance of an individual thread. These simple cores are then multithreaded, so that when any one thread stalls, other threads can run and every core continues to do useful work, maximizing the application's overall throughput. Multithreading in turn relaxes the requirement for high performance on any individual thread. Small, simple cores therefore provide greater throughput per unit of chip area and greater throughput within a given power or cooling constraint. The high throughput provided by such "manycore" organizations has been recognized by most major processor vendors.

To understand the implications of rapidly increasing parallelism for both hardware and software design, we believe it is most productive to look at the design of modern GPUs (graphics processing units). A decade ago, GPUs were fixed-function hardware devices designed specifically to accelerate graphics APIs such as OpenGL and Direct3D. Today's GPUs, in contrast, are fully programmable microprocessors with general-purpose architectures. Having evolved in response to the needs of computer graphics, an application domain with tremendous inherent parallelism but an increasing need for general-purpose programmability, the GPU is already a general-purpose manycore processor with greater peak performance than any other commodity processor. GPUs simply include some additional hardware that typical general-purpose CPUs do not: mainly units such as rasterizers, which accelerate the rendering of 3D polygons, and texture units, which accelerate the filtering and blending of images. Most of these units are not needed when using the GPU as a general-purpose manycore processor, although some can be useful, such as texture caches and GPU instruction-set support for certain transcendental functions. Because GPUs are general-purpose manycore processors, they are typically programmed in a fashion similar to traditional parallel programming models: a single-program, multiple-data (SPMD) model launches a large number of concurrent threads, which share a unified memory and coordinate through standard synchronization mechanisms. High-end GPUs cost just hundreds of dollars yet provide teraflop performance while creating, executing, and retiring literally billions of parallel threads per second, a scale of parallelism that is orders of magnitude higher than that of other platforms and that truly embodies the manycore paradigm. GPUs are now used in a wide range of computational science and engineering applications, and they are supported by several major libraries and commercial software products.
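The SPMD launch pattern described above can be sketched in plain Python as an illustration. This is a hypothetical stand-in, not GPU code: on a real GPU the runtime would launch one hardware thread per index and supply each with its own thread ID, whereas here an explicit loop emulates the grid; the `saxpy_kernel` and `launch` names are illustrative inventions.

```python
# SPMD sketch: every logical thread runs the *same* program,
# distinguished only by its thread ID, on its own slice of the data.

def saxpy_kernel(thread_id, a, x, y, out):
    # Each logical thread computes exactly one output element.
    out[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, n_threads, *args):
    # Emulates a grid launch: conceptually all n_threads execute
    # concurrently; here we simply iterate over the thread IDs.
    for tid in range(n_threads):
        kernel(tid, *args)

n = 8
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

# "Launch" n logical threads of the same kernel over the data.
launch(saxpy_kernel, n, 2.0, x, y, out)
print(out)
```

On a GPU, `n` would typically be in the millions, and the hardware would schedule the logical threads across many multithreaded cores so that stalled threads are hidden behind runnable ones.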