BRLoop: Constructing balanced retimed loop to architect STT-RAM-based hybrid cache for VLIW processors
Introduction
As technology scaling continues, architecting emerging nonvolatile memories (NVMs) such as STT-RAM (Spin-Transfer Torque RAM) [1], PCM (Phase Change Memory), and RRAM (Resistive RAM) as caches is a promising technique to achieve low power consumption and high storage density in embedded systems [[2], [3], [4], [5]]. In recent years, STT-RAM has advanced rapidly in both academia and industry. To address the readability problem caused by STT-RAM's process variations and thermal fluctuations, Wang Kang et al. proposed a novel reconfigurable design strategy from a device, circuit, and architecture co-design perspective, based on quantitative analysis at the physical and circuit levels [6,7]. Mengxing Wang et al. analyzed the P-MTJ structure and provided a critical path for the research and development of next-generation STT-RAM [8]. Zhaohao Wang et al. proposed a NAND-SPIN memory with flash-like write operation for high-density non-volatile memory applications [9].
However, STT-RAM still suffers from relatively high dynamic energy and long latency on write operations [[10], [11], [12], [13]]. In previous work, a hybrid cache architecture consisting of a majority of STT-RAM and a minority of SRAM has been proposed to combat the write problems of STT-RAM [3,[14], [15], [16], [17], [18], [19]]. Meanwhile, a migration technique, which dynamically moves write-intensive and read-intensive data between STT-RAM and SRAM, has often been employed in the hybrid architecture to further achieve high performance and low power.
Migrations, however, introduce extra reads and writes during data movement. It has been observed that the migration overhead in dynamic energy as well as execution latency is significant in loops. In particular, this overhead is closely related to read and write transition events [20], which manifest as interleaved read and write access patterns in loops. Motivated by this observation, a loop retiming technique has been proposed to change the read and write access patterns of cache lines and thereby effectively reduce the migration overhead in hybrid caches [21].
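The link between interleaved access patterns and migration overhead can be illustrated with a small sketch. The counting rule below is an illustrative assumption, not the exact model of [20]: it simply treats every read-to-write or write-to-read switch in a cache line's access trace as a point where a migration between SRAM and STT-RAM may be triggered.

```python
def count_transitions(trace):
    """Count read<->write transitions in one cache line's access trace.

    trace: a sequence of 'R'/'W' accesses. Each adjacent pair that
    switches type is counted as one transition event (illustrative
    stand-in for the migration-triggering events described in the text).
    """
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)

# An interleaved pattern produces many transition events...
assert count_transitions("RWRWRWRW") == 7
# ...while the same accesses grouped by type produce only one,
# which is exactly the effect retiming the loop aims to achieve.
assert count_transitions("RRRRWWWW") == 1
```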
Very Long Instruction Word (VLIW) processors feature multiple-issue architectures; examples include Intel Itanium [22], TriMedia CPU64 [23], and TMS320C6745 [24]. These processors deliver high performance at a low energy cost. Instead of runtime dependency-checking mechanisms, they rely on compile-time techniques to schedule instructions for parallel execution. VLIW processors are well suited to safety-critical systems in fields such as automotive, space, and avionics, and they have evolved over decades. Recently, Anderson L. Sartor et al. proposed a VLIW-based processor design capable of adapting application execution at run time in a fully transparent fashion, jointly considering performance, fault tolerance, and energy consumption, where the weight (priority) of each objective can be defined a priori [25]. Debjyoti Bhattacharjee et al. proposed ReVAMP, a general-purpose programmable platform that offers a VLIW-like instruction set and systematically parallelizes computation on ReRAM crossbar array structures [26]. Notably, a VLIW multi-core was recently used by NASA (National Aeronautics and Space Administration) in its Mars rover for image processing [27].
Applications such as signal processing, image processing, and fluid mechanics require very high computing performance. Loops with iterative or recursive computations often occupy a significant portion of execution time in these computation-intensive applications. To improve the execution efficiency of loops, previous studies have employed loop retiming to enable parallel and/or pipelined processing on processors equipped with multiple arithmetic and logic units (ALUs), such as VLIW processors [[28], [29], [30], [31], [32]]. Loop retiming restructures the loop body by changing the existing dependence delays to obtain dependencies friendly to high instruction-level parallelism (ILP).
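In the classic formulation of retiming, each computation node v is assigned an integer retiming value r(v), and the delay of each dependence edge u→v (the number of loop iterations the dependence spans) becomes d_r(u→v) = d(u→v) + r(u) − r(v); a retiming is legal if no edge delay becomes negative. A minimal sketch, with a hypothetical three-node loop kernel whose node names and delays are made up for illustration:

```python
def retime(edges, r):
    """Apply a retiming vector to a data flow graph (DFG).

    edges: dict mapping (u, v) -> delay d(e), the number of loop
           iterations spanned by the dependence u -> v.
    r:     dict mapping node -> integer retiming value.
    Returns the retimed delays d_r(e) = d(e) + r(u) - r(v).
    """
    return {(u, v): d + r[u] - r[v] for (u, v), d in edges.items()}

def is_legal(retimed):
    """A retiming is legal iff every retimed edge delay is non-negative."""
    return all(d >= 0 for d in retimed.values())

# Hypothetical kernel: A -> B -> C with a loop-carried dependence
# C -> A spanning one iteration (delay 1).
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}
r = {"A": 1, "B": 0, "C": 0}      # move the delay from C->A onto A->B
retimed = retime(edges, r)
assert retimed == {("A", "B"): 1, ("B", "C"): 0, ("C", "A"): 0}
assert is_legal(retimed)
```

After this retiming, the intra-iteration chain A→B is broken by a delay, so A and B of the same iteration no longer depend on each other and can be scheduled in parallel on separate ALUs.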
ALUs and memory are both key parts of an embedded system. Loop retiming can impact both the ALU part and the memory part; however, no existing study accounts for both impacts in one architecture. Traditional ILP-aware loop retiming techniques, which focus on the ALU part, are unaware of the architectural characteristics of NVM-based hybrid caches, while migration-aware loop retiming approaches, which focus on the memory part, have not fully considered the ILP of ALUs. To address this problem, this work proposes balanced loop retiming, a comprehensive optimization that considers both the ILP of ALUs and the migration overhead of the hybrid cache, so as to effectively architect the STT-RAM-based hybrid cache for embedded processors with multiple ALUs, such as VLIW processors.
In previous studies, the number of read and write transitions is used to measure migration overhead, while the number of parallelized instructions is used to measure the ILP level. These two measures do not share the same metric. Therefore, to work out an overall balanced loop retiming solution for a complete system, the challenge of this work is to account for the impacts of retiming on both the ILP of ALUs and the migration overhead of the STT-RAM-based hybrid cache; that is, the measurements of the two parts must be transformed into the same metric.
In this paper, we first derive impact models of loop retiming on the migration overhead and on the ILP of ALUs by analyzing the data flow graph (DFG) of a loop kernel. Directed by the impact model, a balanced loop retiming technique is then proposed to achieve the maximal overall performance profit. Finally, an optimal algorithm and a heuristic algorithm are proposed to obtain the retiming vector for each computation node of a loop. We conducted experiments to evaluate the proposed loop retiming schemes on different loop kernels. Results across a set of benchmarks show that the proposed optimal (heuristic) balanced retiming approach improves system performance by up to 16.0% (13.1%) and 46.0% (43.1%) compared to no retiming and pure ILP-aware retiming, respectively.
In summary, we make the following contributions.
- This paper is the first work to raise the issue of conducting effective loop retiming to optimize overall performance when architecting STT-RAM-based hybrid caches for VLIW processors.
- The impacts of loop retiming on both the ILP of ALUs and the migration overhead of the STT-RAM-based hybrid cache are quantitatively analyzed, and the impacts are modeled in one integrated framework with the same metric.
- Guided by the impact model, the proposed loop retiming solution takes into account both sides of a VLIW processor: the ALU part and the memory part. An optimal and a heuristic algorithm are proposed to implement the overall balanced loop retiming and improve system performance.
- The experimental results report a good improvement in overall system performance. Furthermore, the results reveal the unbalanced effects of loop retiming on ILP and migration overhead.
The rest of this paper is organized as follows. Section 2 introduces the loop retiming preliminaries and its applications to VLIW processors and STT-RAM-based hybrid caches. Section 3 presents a motivational example illustrating the effectiveness of overall balanced loop retiming. Section 4 studies in detail the impact models of loop retiming on the ILP of ALUs and on the migration overhead in the hybrid cache. Section 5 presents the balanced loop retiming algorithms. Section 6 presents the evaluation results, and Section 7 concludes the paper.
Section snippets
Background information
In this section, we first briefly introduce the loop retiming preliminaries. Then we present its applications for enhancing instruction parallelism and for mitigating migration overhead in STT-RAM-based hybrid caches.
Motivation
For loops with data dependencies, previous research has shown that loop retiming can impact both the ILP of multiple ALUs and the migration overhead of the STT-RAM-based hybrid cache. However, the ILP-aware loop retiming techniques were proposed in the 1990s–2000s and are unaware of the characteristics of the emerging NVM technologies. The recent migration-aware loop retiming work [21] focuses only on the STT-RAM-based hybrid memory part and pays little attention to the ALU part. Therefore, it is
Impact models
In this section, we first quantitatively model the impacts of loop retiming on ILP and on migration overhead respectively. These models are normalized to execution cycles. Then we present the overall impact model under a loop retiming action.
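Once both effects are normalized to execution cycles, they can be combined into one profit figure for a candidate retiming. The sketch below illustrates this normalization step; the linear combination and the per-migration cycle cost are illustrative assumptions, not the paper's exact model.

```python
def overall_profit(ilp_cycles_saved, migrations_avoided, cycles_per_migration):
    """Overall profit of a candidate retiming, expressed in execution cycles.

    ilp_cycles_saved:    reduction in the ALU schedule length (cycles).
    migrations_avoided:  number of SRAM<->STT-RAM migrations eliminated.
    cycles_per_migration: assumed cycle cost of one migration, used to
                          convert the memory-side effect to the same metric.
    The linear combination here is an illustrative assumption.
    """
    return ilp_cycles_saved + migrations_avoided * cycles_per_migration

# A retiming that shortens the ALU schedule by 4 cycles and avoids
# 3 migrations costing 2 cycles each yields 10 cycles of overall profit.
assert overall_profit(4, 3, 2) == 10
```

The key point this sketch captures is the one stated in the introduction: the ILP gain and the migration saving can only be traded off against each other after both are expressed in the same unit.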
Balanced loop retiming algorithm
Directed by the overall impact model of retiming, this paper proposes an optimal and a heuristic approach to solve for the retiming vector of each computation node, so as to obtain as large an overall profit as possible.
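A heuristic of this kind can be sketched as a greedy local search: repeatedly pick, for one node at a time, the retiming value that maximizes the overall profit under the impact model. This is only an illustrative sketch under assumed interfaces (the `profit` callback stands in for the combined ILP/migration model), not the paper's actual algorithm.

```python
def greedy_balanced_retiming(nodes, candidate_values, profit, rounds=10):
    """Greedy heuristic sketch for choosing a retiming vector.

    nodes:            computation nodes of the loop DFG.
    candidate_values: retiming values tried for each node.
    profit:           callback mapping a retiming vector (dict node->int)
                      to its overall profit in cycles; this stands in for
                      the combined impact model and is an assumption of
                      this sketch.
    """
    r = {v: 0 for v in nodes}          # start from the identity retiming
    for _ in range(rounds):
        improved = False
        for v in nodes:                # locally optimize one node at a time
            best = max(candidate_values, key=lambda x: profit({**r, v: x}))
            if profit({**r, v: best}) > profit(r):
                r[v] = best
                improved = True
        if not improved:               # stop at a local optimum
            break
    return r

# Toy profit function that prefers r(A)=1, r(B)=2 (purely illustrative).
target = {"A": 1, "B": 2}
toy_profit = lambda r: -sum(abs(r[v] - target[v]) for v in r)
assert greedy_balanced_retiming(["A", "B"], range(4), toy_profit) == target
```

A real implementation would additionally reject retiming vectors that make any edge delay negative, i.e. illegal retimings.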
Experiments
In this section, we first introduce the experimental setup. Then we present the evaluation results on the migration overhead of the hybrid cache, the ILP of ALUs, and the execution time under balanced loop retiming.
Conclusion
This paper presents a balanced loop retiming technique to effectively improve overall performance when architecting the emerging STT-RAM-based hybrid cache for VLIW processors. The impacts of loop retiming on both the ILP of ALUs and the migration overhead of the hybrid cache are quantitatively modeled. An optimal and a heuristic retiming algorithm are proposed to derive the retiming vector for each node in a loop. The experimental validation demonstrates that the proposed balanced loop retiming
Acknowledgment
This work is supported by Beijing Advanced Innovation Center for Imaging Technology, Beijing Innovation Center for Future Chip, National Natural Science Foundation of China [Project No. 61502321, 61872251] and the Project of Beijing Municipal Education Commission [Project No. KM201710028016].
References (35)
- et al., Write activity reduction on non-volatile main memories for embedded chip multiprocessors, ACM Trans. Embed. Comput. Syst. (2013)
- STT-MRAM:...
- et al., Emerging non-volatile memories: opportunities and challenges
- et al., Design margin exploration of Spin-Transfer Torque RAM (STT-RAM) in scaled technologies, IEEE Trans. Very Large Scale Integr. Syst. (2010)
- et al., Constructing large and fast multi-level cell STT-MRAM based cache for embedded processors
- et al., LARS: logically adaptable retention time STT-RAM cache for embedded systems
- et al., Reconfigurable codesign of STT-MRAM under process variations in deeply scaled technology, IEEE Trans. Electron. Dev. (2015)
- et al., Spintronics: emerging ultra-low-power circuits and systems beyond MOS technology, ACM J. Emerg. Technol. Comput. Syst. (2015)
- et al., Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance, Nat. Commun. (2018)
- et al., High-density NAND-like spin transfer torque memory with spin orbit torque erase operation, IEEE Electron Device Lett. (2018)
- Energy reduction for STT-RAM using early write termination
- Multi retention level STT-RAM cache designs with a dynamic refresh scheme
- Encoding Separately: an energy-efficient write scheme for MLC STT-RAM
- Design exploration of hybrid caches with disparate memory technologies, ACM Trans. Archit. Code Optim.
- STT-RAM based energy-efficiency hybrid cache for CMPs
- Compiler-assisted preferred caching for embedded systems with STT-RAM based hybrid cache
- High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement
Cited by (3)
- Deep Learning Optimization for Many-Core Virtual Platforms, Communications in Computer and Information Science (2021)
- Energy and performance analysis of STT-RAM caches for mobile applications, Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC 2019)