ABSTRACT
In throughput computing, data elements can be processed independently by a large number of threads running similar programs, referred to as kernels, or as shaders for graphics-specific workloads. A throughput computing device such as a GPU requires task latency tolerance, to hold the contexts of outstanding threads, and data latency tolerance, to hold buffer space for memory requests issued by those threads. Threads are grouped into thread groups. The register file, and hence the number of outstanding thread groups it can support, should be sized according to the ratio of compute resources to load/store units; that ratio should in turn reflect the balance between ALU and load/store instructions in the target workload.
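The sizing rule above can be illustrated with a back-of-the-envelope sketch. All numbers below (memory latency, ALU-to-load ratio, thread-group width, registers per thread) are assumed values for illustration, not figures from the paper; the sketch only shows how the ratio of ALU work to load/store instructions determines how many thread groups, and hence how large a register file, are needed to hide memory latency.

```python
# Hedged sizing sketch with assumed parameters (not from the source).
# Idea: to hide memory latency, enough thread groups must be resident that
# the ALUs stay busy while some groups wait on outstanding loads.

def thread_groups_needed(mem_latency_cycles, alu_ops_per_load,
                         issue_cycles_per_op=1):
    """Minimum resident thread groups so ALU work covers one load's latency.

    While one group waits mem_latency_cycles on a load, each other group
    can issue alu_ops_per_load ALU ops (issue_cycles_per_op cycles each)
    before it, too, blocks on its next load.
    """
    cycles_of_work_per_group = alu_ops_per_load * issue_cycles_per_op
    # ceiling division, plus one for the group that is itself waiting
    return 1 + -(-mem_latency_cycles // cycles_of_work_per_group)

def register_file_size(groups, threads_per_group, regs_per_thread):
    """Registers needed to hold the contexts of all outstanding groups."""
    return groups * threads_per_group * regs_per_thread

# Example: 400-cycle memory latency, 20 ALU ops per load on average.
groups = thread_groups_needed(mem_latency_cycles=400, alu_ops_per_load=20)
print(groups)                              # 21 resident thread groups
print(register_file_size(groups, 32, 16))  # 10752 registers
```

Note how a more ALU-heavy workload (larger `alu_ops_per_load`) needs fewer resident groups, so the same register file tolerates more latency; a load-heavy workload pushes the sizing the other way.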
Index Terms
- Latency tolerance for throughput computing