Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units

Authors: Saghir, Ahsan; Atoofian, Ehsan; Manzak, Ali

Source: Journal of Low Power Electronics, Volume 13, Number 2, June 2017, pp. 149-165(17)

Publisher: American Scientific Publishers

DOI: https://doi.org/10.1166/jolpe.2017.1480

General Purpose Graphics Processing Units (GPGPUs) are finding applications in high performance computing domains owing to their massively parallel architecture. However, such applications operate on huge amounts of data, so the memory sub-systems of GPGPUs must serve massive amounts of data to the processing cores without long access delays. For this reason, the architecture of GPGPUs has evolved to include low-latency memory units such as caches and shared memory. The popularity of GPGPUs in high performance applications has pushed manufacturers to increase the number of cores with every generation. A larger number of cores further increases the amount of data that must be serviced by the underlying memory units. To cope with this demand, cache sizes have been growing in newer generations of GPGPUs. However, increased cache sizes exacerbate the problem of power dissipation, which is already a major design constraint in processors. Our work proposes two optimization techniques to reduce power consumption in the L1 caches (data, texture, constant, and instruction), shared memory, and the L2 cache. The two techniques target static and dynamic power, respectively. Analysis of the cache access patterns of several GPGPU applications reveals that consecutive accesses to the same cache block are separated in time by hundreds of clock cycles. This long inter-access interval presents a unique opportunity to reduce static power by putting cache cells in drowsy mode. The reduction in leakage power comes at the cost of increased access time, since the voltage of a drowsy cache cell must be raised before the cell can be accessed. Our novel technique of coarse-grained drowsy mode helps to mitigate this impact on performance. In coarse-grained drowsy mode, we partition each cache into regions of contiguous cache blocks. Upon a cache access, we wake up the whole cache region that is being accessed. This method exploits the temporal and spatial locality of cache accesses: the wake-up delay is incurred only for the first access to a cache region, and subsequent accesses to the same region do not incur any delay. Our second optimization technique takes advantage of branch divergence in GPGPUs. GPGPUs have a Single Instruction Multiple Thread (SIMT) execution model, which can cause threads to diverge when a control instruction is encountered. GPGPUs execute a divergent branch instruction in two phases. Threads on the taken path are active in the first phase, while the rest of the threads are idle. Threads on the not-taken path execute in the second phase, while the rest of the threads remain idle. Contemporary GPGPUs access all portions of a cache block even when some of the threads are idle due to branch divergence. Our technique proposes to access only the portion of a cache block that corresponds to the active threads. Disabling access to the unnecessary sections of cache blocks helps to reduce dynamic power. Our results show a significant reduction in the static and dynamic power of caches when the two optimization techniques are used together.
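The two techniques described in the abstract can be illustrated with a small behavioural sketch in Python. It is not the authors' implementation, and every name and number in it is an assumption made for illustration: the class `DrowsyCache`, the function `active_sectors`, the region size, the one-cycle wake-up penalty, the idle threshold after which a region is treated as drowsy again, and the mapping of eight consecutive threads to one sector of a cache block. The abstract states only that a whole region is woken on access and that only the portion of a block belonging to active threads is accessed.

```python
from dataclasses import dataclass, field


@dataclass
class DrowsyCache:
    """Toy model of coarse-grained drowsy mode (all parameters are assumed)."""
    blocks_per_region: int = 8   # contiguous cache blocks per region (assumed)
    wake_penalty: int = 1        # extra cycles to raise the voltage (assumed)
    drowsy_after: int = 100      # idle cycles before a region is drowsy again (assumed)
    last_access: dict = field(default_factory=dict)  # region -> cycle of last access

    def access(self, block: int, cycle: int) -> int:
        """Return the extra wake-up cycles incurred by this access."""
        region = block // self.blocks_per_region
        last = self.last_access.get(region)
        # A region is awake if it was touched recently; otherwise the whole
        # region is woken at once, so neighbouring blocks pay no further delay.
        awake = last is not None and cycle - last < self.drowsy_after
        self.last_access[region] = cycle
        return 0 if awake else self.wake_penalty


def active_sectors(active_mask: int, warp_size: int = 32,
                   threads_per_sector: int = 8) -> list:
    """List the sectors of a cache block that active threads need.

    Assumes thread i of the warp reads word i of the block, so each group of
    `threads_per_sector` consecutive threads maps to one sector.
    """
    sector_bits = (1 << threads_per_sector) - 1
    sectors = []
    for s in range(warp_size // threads_per_sector):
        # Keep the sector only if at least one of its threads is active.
        if (active_mask >> (s * threads_per_sector)) & sector_bits:
            sectors.append(s)
    return sectors
```

In this sketch, the first access to block 0 pays the wake-up penalty, while an access to block 3 a few cycles later pays nothing because it falls in the same region. Under the assumed thread-to-sector mapping, a divergent warp with only threads 0 to 3 active (mask `0x0000000F`) needs sector 0 alone, so the other three sectors of the block need not be accessed.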

Keywords: CACHE; CUDA; DYNAMIC POWER; GPGPU; LEAKAGE POWER; MEMORY HIERARCHY

Document Type: Research Article

Publication date: 01 June 2017
