Elsevier

Microelectronics Journal

Volume 83, January 2019, Pages 137-146
BRLoop: Constructing balanced retimed loop to architect STT-RAM-based hybrid cache for VLIW processors

https://doi.org/10.1016/j.mejo.2018.11.011

Abstract

Spin Torque Transfer RAM (STT-RAM), an emerging non-volatile memory technology, has been proposed as a replacement for SRAM-based caches. Its commercialization has recently been accelerated by large companies such as Samsung. Although STT-RAM offers advantages such as non-volatility, high density, and extremely low leakage power, it suffers from high dynamic energy and long latency on write operations. To address this problem, researchers have proposed an STT-RAM/SRAM hybrid structure that mitigates the cost of write operations. In hybrid caches, a migration-based technique is often adopted to exploit the advantages of both parts by dynamically moving write-intensive and read-intensive data between STT-RAM and SRAM.

However, migrations also introduce extra reads and writes during data movement. For stencil loops with read and write data dependencies, the migration overhead is observed to be significant, and migrations correlate closely with the interleaved read and write access pattern within a memory block. A loop retiming technique has been proposed to reduce the migration overhead by changing this interleaved access pattern. Loop retiming has also been studied extensively as a way to maximize the instruction-level parallelism (ILP) of multiple function units by rearranging the dependence delays in a uniform loop. Both retiming techniques work by changing the instruction dependence delays in a loop. However, the earlier ILP-aware loop retiming is unaware of its impact on migration in the hybrid cache, while the recent migration-aware loop retiming does not fully consider the parallelism of arithmetic and logical units (ALUs) in VLIW processors.

The impacts of retiming on both the migration overhead of the hybrid cache and the ILP of the VLIW processor should therefore be considered when architecting an STT-RAM-based hybrid cache for VLIW processors. To address this issue, this paper models the impacts of loop retiming on both the ILP of ALUs and the migration overhead in an STT-RAM/SRAM hybrid cache. An overall balanced loop retiming solution, which considers both the ALU part and the memory part, is devised to achieve high performance for VLIW processors. The experimental results across a set of benchmarks show that the proposed optimal and heuristic balanced retiming approaches effectively improve overall system performance over no retiming, pure migration-aware retiming, and pure ILP-aware retiming.

Introduction

As technology scaling continues, architecting emerging non-volatile memories (NVMs) such as STT-RAM (Spin Torque Transfer RAM) [1], PCM (Phase Change Memory), and RRAM (Resistive RAM) as caches is a promising way to achieve low power consumption and high storage density in embedded systems [2], [3], [4], [5]. In recent years, STT-RAM has advanced rapidly in both academia and industry. To solve the readability problem caused by STT-RAM's process variations and thermal fluctuations, Wang Kang et al. proposed a reconfigurable design strategy from a device, circuit, and architecture codesign perspective, based on quantitative analysis at the physical and circuit levels [6,7]. Mengxing Wang et al. analyzed the P-MTJ structure and provided a critical path for the research and development of new-generation STT-RAM [8]. Zhaohao Wang et al. proposed a NAND-SPIN memory with a flash-like write operation for high-density non-volatile memory applications [9].

However, STT-RAM still suffers from relatively high dynamic energy and long latency on write operations [10], [11], [12], [13]. In previous work, a hybrid cache architecture consisting mostly of STT-RAM with a small portion of SRAM has been proposed to combat the write problems of STT-RAM [3], [14], [15], [16], [17], [18], [19]. In addition, a migration technique that dynamically moves write-intensive and read-intensive data between STT-RAM and SRAM is often employed in the hybrid architecture to further improve performance and reduce power.

Migrations also introduce extra reads and writes during data movement. It has been observed that the migration overhead, in both dynamic energy and execution latency, is significant in loops. In particular, this overhead is closely related to read and write transition events [20], which manifest as an interleaved read and write access pattern in loops. Motivated by this observation, a loop retiming technique has been proposed that changes the read and write access pattern of cache lines to reduce the migration overhead in hybrid caches [21].
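To make the notion of read and write transition events concrete, the following minimal sketch counts how often the access type switches within the access sequence of a single cache block. The function name, the 'R'/'W' encoding, and the two sample sequences are assumptions made for illustration; they are not taken from the paper or from [20].

```python
def count_rw_transitions(accesses):
    """Count switches between read ('R') and write ('W') in the
    access sequence of one cache block (illustrative encoding)."""
    return sum(1 for prev, cur in zip(accesses, accesses[1:]) if prev != cur)


# Hypothetical access sequences with the same number of reads and writes.
interleaved = list("RWRWRWRW")  # reads and writes alternate
grouped = list("RRRRWWWW")      # same accesses, regrouped

interleaved_transitions = count_rw_transitions(interleaved)
grouped_transitions = count_rw_transitions(grouped)
print(interleaved_transitions, grouped_transitions)
```

Under this simplified view, a transformation that regroups accesses of the same type lowers the transition count, which is the quantity the text links to migration overhead.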

Very Long Instruction Word (VLIW) processors feature multiple-issue architectures; examples include Intel Itanium [22], Trimedia CPU64 [23], and TMS320C6745 [24]. These processors can deliver high performance at a low energy cost. Instead of runtime dependency-checking mechanisms, they rely on compile-time techniques to schedule instructions for parallel execution. VLIW processors are suitable for safety-critical systems in fields such as automotive, space, and avionics, and they have been developed over several decades. Recently, Anderson L. Sartor et al. proposed a new VLIW-based processor design capable of adapting the execution of the application at run-time in a totally transparent fashion, considering performance, fault tolerance, and energy consumption altogether, in which the weight (priority) of each one can be defined a priori [25]. Debjyoti Bhattacharjee et al. proposed ReVAMP, a general-purpose programmable platform that supports a VLIW-like instruction set and systematically handles the parallelization of computation on ReRAM crossbar array structures [26]. Notably, a VLIW multi-core was recently used by NASA (National Aeronautics and Space Administration) in its Mars rover for image processing [27].

Applications such as signal processing, image processing, and fluid mechanics require very high computing performance. Loops with iterative or recursive computations often account for a significant portion of the execution time in these computation-intensive applications. To improve the execution efficiency of loops, previous studies have proposed loop retiming techniques that enable parallel and/or pipelined processing on processors equipped with multiple arithmetic and logical units (ALUs), such as VLIW processors [28], [29], [30], [31], [32]. Loop retiming reconstructs the loop body by changing the existing dependence delays so that the resulting dependencies allow higher instruction-level parallelism (ILP).
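The effect of changing dependence delays can be sketched on a small data flow graph. The example below assumes one common convention from the loop retiming literature, in which an edge u -> v with delay d becomes d + r(u) - r(v) after retiming and every retimed delay must stay non-negative; it also assumes unit-time nodes. The graph, node names, and function names are hypothetical and are not taken from the paper.

```python
def retime(edges, r):
    """Apply retiming r to a DFG.

    edges: dict mapping (u, v) -> inter-iteration delay on edge u -> v.
    r:     dict mapping node -> retiming value (missing nodes count as 0).
    Assumed convention: d_r(u -> v) = d(u -> v) + r(u) - r(v).
    """
    retimed = {(u, v): d + r.get(u, 0) - r.get(v, 0) for (u, v), d in edges.items()}
    if any(d < 0 for d in retimed.values()):
        raise ValueError("illegal retiming: an edge would get a negative delay")
    return retimed


def critical_path_length(nodes, edges):
    """Longest chain of unit-time nodes linked by zero-delay edges.

    Assumes the zero-delay subgraph is acyclic, as in any legal loop DFG.
    """
    zero_succ = {u: [] for u in nodes}
    for (u, v), d in edges.items():
        if d == 0:
            zero_succ[u].append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            memo[u] = 1 + max((longest_from(v) for v in zero_succ[u]), default=0)
        return memo[u]

    return max(longest_from(u) for u in nodes)


# Hypothetical three-node loop: A -> B -> C within one iteration,
# and C feeds A two iterations later.
nodes = ["A", "B", "C"]
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}

before = critical_path_length(nodes, edges)
retimed_edges = retime(edges, {"A": 1})
after = critical_path_length(nodes, retimed_edges)
print(before, after)
```

In this toy graph, moving one delay from the edge C -> A onto the edge A -> B shortens the intra-iteration dependence chain, so more nodes can be issued in parallel. The total delay around the cycle is unchanged, which is why the chain cannot be shortened further here.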

ALUs and memory are both key parts of an embedded system. Loop retiming can affect both the ALU part and the memory part, yet no existing study takes both impacts into account in one architecture. Traditional ILP-aware loop retiming techniques, which focus on the ALU part, are unaware of the new architectural characteristics of NVM-based hybrid caches, while migration-aware loop retiming approaches, which focus on the memory part, do not fully consider the ILP of ALUs. To address this problem, this work proposes balanced loop retiming, which jointly optimizes the ILP of ALUs and the migration overhead of the hybrid cache, so as to effectively architect the STT-RAM-based hybrid cache for embedded processors with multiple ALUs, such as VLIW processors.

In previous studies, the number of read and write transitions is used to measure the migration overhead, while the number of instructions executed in parallel is used to measure the ILP level. These two measures are not expressed in the same metric. Therefore, to work out an overall balanced loop retiming solution for a complete system, the challenge of this work is to take into account the impacts of retiming on both the ILP of ALUs and the migration overhead of the STT-RAM-based hybrid cache. That is, the measurements of the two parts must be transformed into the same metric.

In this paper, we first derive the impact models of loop retiming on the migration overhead and on the ILP of ALUs by analyzing the data flow graph (DFG) of a loop kernel. Guided by these impact models, a balanced loop retiming technique is then proposed to achieve the maximal overall performance profit. Finally, an optimal algorithm and a heuristic algorithm are proposed to obtain the retiming vector for each computation node of a loop. We conducted experiments to evaluate the proposed loop retiming schemes on different loop kernels. Results across a set of benchmarks show that the proposed optimal and heuristic balanced retiming approaches improve system performance by up to 16.0% (13.1%) and 46.0% (43.1%) compared to the cases of no retiming and pure ILP-aware retiming, respectively.

In summary, we make the following contributions.

  • This paper is the first work to raise the issue of conducting effective loop retiming to optimize the overall performance when architecting STT-RAM-based hybrid caches for VLIW processors.

  • The impacts of loop retiming on both the ILP of ALUs and the migration overhead of the STT-RAM-based hybrid cache are quantitatively analyzed and modeled in one integrated framework with the same metric.

  • Guided by the impact model, the proposed loop retiming solution takes into account both sides of a VLIW processor: the ALU part and the memory part. An optimal algorithm and a heuristic algorithm are proposed to implement the overall balanced loop retiming and improve system performance.

  • The experimental results report a good improvement on the overall system performance. Furthermore, the results reveal the unbalanced effects of loop retiming on ILP and migration overhead.

The rest of this paper is organized as follows. The loop retiming preliminaries as well as its applications for VLIW processors and STT-RAM-based hybrid caches are introduced in Section 2. A motivational example is illustrated to present the effectiveness of the overall balanced loop retiming in Section 3. The impact models of loop retiming on ILP of ALUs and on migration overhead in hybrid cache are studied in detail in Section 4. A heuristic algorithm is proposed to conduct the balanced loop retiming in Section 5. Section 6 presents the evaluation results. We finally conclude the paper in Section 7.

Background information

In this section, we first briefly introduce the loop retiming preliminaries. Then we present its applications for enhancing instruction parallelism and for mitigating migration overhead in STT-RAM-based hybrid caches.

Motivation

For loops with data dependencies, previous research has shown that loop retiming can affect both the ILP of multiple ALUs and the migration overhead of an STT-RAM-based hybrid cache. However, the ILP-aware loop retiming techniques were proposed in the 1990s and 2000s and are unaware of the new characteristics of emerging NVM technologies. The recent migration-aware loop retiming work [21] focuses only on the STT-RAM-based hybrid memory part and pays little attention to the ALU part. Therefore, it is

Impact models

In this section, we first quantitatively model the impacts of loop retiming on ILP and on migration overhead respectively. These models are normalized to execution cycles. Then we present the overall impact model under a loop retiming action.
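The paper's actual impact models are not reproduced in this excerpt. The sketch below only illustrates the general idea of expressing both effects in execution cycles so that retiming candidates can be compared on one scale. The additive cost form, the per-transition penalty, and every number are assumptions invented for illustration; they are not the paper's model or its results.

```python
def overall_cycles(schedule_length, rw_transitions, migration_penalty_cycles):
    """Hypothetical per-iteration cost in cycles: the ALU schedule length
    plus a fixed cycle penalty for each read/write transition."""
    return schedule_length + rw_transitions * migration_penalty_cycles


# Invented candidates: (schedule length in cycles, read/write transitions).
candidates = {
    "no retiming": (6, 4),
    "ILP-aware": (3, 6),
    "migration-aware": (7, 1),
    "balanced": (4, 2),
}
penalty = 2  # assumed cycles charged per transition

costs = {name: overall_cycles(s, t, penalty) for name, (s, t) in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

With both terms in cycles, a candidate that is best on neither ILP nor migration alone can still have the lowest combined cost, which is the situation a balanced retiming is meant to find.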

Balanced loop retiming algorithm

Guided by the overall impact model of retiming, this paper proposes an optimal approach and a heuristic approach to compute the retiming vector of each computation node so that the overall profit is as large as possible.

Experiments

In this section, we first introduce the experimental setup. Then the evaluation results of migration overhead of hybrid cache, ILP of ALUs and the execution time under the balanced loop retiming are presented.

Conclusion

This paper presents a balanced loop retiming technique to effectively improve the overall performance when architecting the emerging STT-RAM-based hybrid cache for VLIW processors. The impacts of loop retiming on both the ILP of ALUs and the migration overhead of the hybrid cache are quantitatively modeled. An optimal and a heuristic retiming algorithm are proposed to derive the retiming vector for each node in a loop. The experimental validation demonstrates that the proposed balanced loop retiming

Acknowledgment

This work is supported by Beijing Advanced Innovation Center for Imaging Technology, Beijing Innovation Center for Future Chip, National Natural Science Foundation of China [Project No. 61502321, 61872251] and the Project of Beijing Municipal Education Commission [Project No. KM201710028016].

References (35)

  • Jingtong Hu et al.

    Write activity reduction on non-volatile main memories for embedded chip multiprocessors

    ACM Trans. Embed. Comput. Syst.

    (2013)
  • STT-MRAM:...
  • Chun Jason Xue et al.

    Emerging non-volatile memories: opportunities and challenges

  • Yiran Chen et al.

    Design margin exploration of Spin-Transfer Torque RAM (STT-RAM) in scaled technologies

    IEEE Trans. Very Large Scale Integr. Syst.

    (2010)
  • Lei Jiang et al.

    Constructing large and fast multi-level cell STT-MRAM based cache for embedded processors

  • Kyle Kuan et al.

    LARS: logically adaptable retention time STT-RAM cache for embedded systems

  • Kang Wang et al.

    Reconfigurable codesign of STT-MRAM under process variations in deeply scaled technology

    IEEE Trans. Electron. Dev.

    (2015)
  • Kang Wang et al.

    Spintronics: emerging ultra-low-power circuits and systems beyond MOS technology

    ACM J. Emerg. Technol. Comput. Syst.

    (2015)
  • Mengxing Wang et al.

    Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance

    Nat. Commun.

    (2018)
  • Zhaohao Wang et al.

    High-density NAND-like spin transfer torque memory with spin orbit torque erase operation

    IEEE Electron. Device Lett.

    (2018)
  • Ping Zhou et al.

    Energy reduction for STT-RAM using early write termination

  • Zhenyu Sun et al.

    Multi retention level STT-RAM cache designs with a dynamic refresh scheme

  • Jie Xu et al.

    Encoding Separately: an energy-efficient write scheme for MLC STT-RAM

  • Xiaoxia Wu et al.

    Design exploration of hybrid caches with disparate memory technologies

    ACM Trans. Archit. Code Optim.

    (2010)
  • Jianhua Li et al.

    STT-RAM based energy-efficiency hybrid cache for CMPs

  • Qingan Li et al.

    Compiler-assisted preferred caching for embedded systems with STT-RAM based hybrid cache

  • Amin Jadidi et al.

    High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement
