Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units
General Purpose Graphics Processing Units (GPGPUs) are finding applications in high performance computing domains owing to their massively parallel architecture. However, execution of such applications requires huge amounts of data. Therefore, memory sub-systems of GPGPUs need to be
able to serve massive amounts of data to processing cores without long access delays. For this reason, the architecture of GPGPUs has evolved to include low-latency memory units such as caches and shared memory. The popularity of GPGPUs in high performance applications has pushed manufacturers
to keep increasing the number of cores with every generation. A larger number of cores further increases the amount of data that the underlying memory units must serve. To cope with this demand, cache sizes have been growing in newer generations of GPGPUs. However,
increased cache sizes exacerbate the problem of power dissipation that is already a major design constraint in processors. Our work proposes two optimization techniques to reduce power consumption in L1 caches (data, texture, constant, and instruction), shared memory and L2 cache. The two
optimization techniques target static and dynamic power respectively. Analysis of the cache access patterns of several GPGPU applications reveals that consecutive accesses to the same cache block are separated in time by hundreds of clock cycles. These long inter-access intervals present the unique
opportunity of reducing static power by putting cache cells in drowsy mode. The advantage of reducing leakage power using drowsy mode comes at the cost of increased access time, since the voltage of a drowsy cache cell must be raised before it can be accessed. Our novel technique of coarse-grained
drowsy mode helps to mitigate the impact on performance. In coarse-grained drowsy mode, we partition each cache into regions of contiguous cache blocks. Upon a cache access, we wake up the whole cache region that is being accessed. This method exploits the temporal and spatial locality of
cache accesses. The delay is incurred only for the first access to a cache region; subsequent accesses to the same region do not incur any delay. This reduces the performance impact of the wake-up delay. Our second optimization technique takes advantage of branch divergence
in GPGPUs. GPGPUs have a Single Instruction Multiple Thread (SIMT) execution model. The SIMT execution model can cause divergence of threads when a control instruction is encountered. GPGPUs execute branch instructions in two phases. Threads in the taken path are active for the first phase,
while the rest of the threads are idle. Threads in the not-taken path are executed in the second phase and the rest of the threads remain idle. Contemporary GPGPUs access all portions of cache blocks even when some of the threads are idle due to branch divergence. Our optimization technique
proposes to access only the portion of a cache block that corresponds to active threads. Disabling access to unnecessary sections of cache blocks helps to reduce dynamic power. Our results show a significant reduction in static and dynamic power of caches using the two optimization techniques
together.
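The two techniques described above can be illustrated with a small behavioural sketch. It is not the authors' implementation: the class and function names, the region size, the wake-up penalty, the idle threshold, and the assumption that thread i of a warp reads word i of a cache block are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CoarseDrowsyCache:
    """Toy model: contiguous blocks form regions that wake up together."""
    num_blocks: int = 64         # hypothetical cache size, in blocks
    blocks_per_region: int = 8   # hypothetical region granularity
    wake_penalty: int = 1        # hypothetical wake-up delay, in cycles
    drowsy_after: int = 100      # hypothetical idle cycles before a region turns drowsy
    last_access: dict = field(default_factory=dict)  # region -> cycle of last access

    def access(self, block: int, cycle: int) -> int:
        """Return the extra wake-up delay (in cycles) paid by this access."""
        if not 0 <= block < self.num_blocks:
            raise IndexError("block index out of range")
        region = block // self.blocks_per_region
        last = self.last_access.get(region)
        # A region is awake only if it was touched recently enough.
        awake = last is not None and cycle - last < self.drowsy_after
        self.last_access[region] = cycle
        return 0 if awake else self.wake_penalty


def active_sectors(active_mask: int, warp_size: int = 32,
                   threads_per_sector: int = 8) -> list:
    """Return one flag per block sector: True if any active thread maps to it.

    Assumes (hypothetically) that thread i reads word i of the block, so
    each group of `threads_per_sector` threads covers one sector.
    """
    sector_bits = (1 << threads_per_sector) - 1
    return [((active_mask >> start) & sector_bits) != 0
            for start in range(0, warp_size, threads_per_sector)]


if __name__ == "__main__":
    cache = CoarseDrowsyCache()
    # First touch of a region pays the penalty; a nearby block does not.
    print(cache.access(0, cycle=0), cache.access(3, cycle=5))
    # Only the lower half of the warp is active after a divergent branch.
    print(active_sectors(0x0000FFFF))
```

In this sketch, only the first access to a drowsy region pays the wake-up penalty, and a divergent warp enables only the block sectors that its active threads touch; the sectors left disabled stand in for the dynamic-power saving described in the abstract.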
Keywords: CACHE; CUDA; DYNAMIC POWER; GPGPU; LEAKAGE POWER; MEMORY HIERARCHY
Document Type: Research Article
Publication date: 01 June 2017
- The electronic systems that can operate with very low power are of great technological interest. The growing research activity in the field of low power electronics requires a forum for rapid dissemination of important results: Journal of Low Power Electronics (JOLPE) is that international forum which offers scientists and engineers timely, peer-reviewed research in this field.