# Efficient Placement and Migration Policies for an STT-RAM based Hybrid L1 Cache for Intermittently Powered Systems

SatyaJaswanth Badri<sup>1\*</sup>, Mukesh Saini<sup>1</sup> and Neeraj Goel<sup>1</sup>

<sup>1</sup>Computer Science and Engineering, IIT Ropar, Rupnagar, 140001, Punjab, India.

\*Corresponding author(s). E-mail(s): 2018csz0002@iitrpr.ac.in; Contributing authors: mukesh@iitrpr.ac.in; neeraj@iitrpr.ac.in;

#### Abstract

The number of battery-powered devices is rapidly increasing due to the widespread use of IoT-enabled nodes in various fields. Energy harvesters, which power embedded devices from ambient energy, are a feasible alternative to batteries. The energy harvester stores energy in a capacitor until there is enough to power up the embedded device and compute the task. Energy harvesters are unable to supply continuous power to embedded devices, so computation is interrupted by power failures; this type of computation is referred to as intermittent computing. All registers and caches in conventional processors are volatile. We therefore require a Non-Volatile Memory (NVM)-based Non-Volatile Processor (NVP) that can store register and cache contents during a power failure.

NVM-based caches reduce system performance and consume more energy than SRAM-based caches. This paper proposes Efficient Placement and Migration policies for hybrid cache architecture that uses SRAM and STT-RAM at the first level cache. The proposed architecture includes cache block placement and migration policies to reduce the number of writes to STT-RAM. During a power failure, the backup strategy identifies and migrates the critical blocks from SRAM to STT-RAM. When compared to the baseline architecture, the proposed architecture reduces STT-RAM writes from 63.35% to 35.93%, resulting in a 32.85% performance gain and a 23.42% reduction in energy consumption. Our backup strategy reduces backup time by 34.46% when compared to the baseline.

**Keywords:** Hybrid cache architecture, Memory, Single-level cache, SRAM, STT-RAM.

## 1 Introduction

The Internet of things (IoT) has created several fascinating applications that consist of intelligent sensors and systems. IoT may consist of billions of sensors and systems by the end of 2050 [1]. This prediction is exciting and promising, but deciding how to power these IoT devices is the main challenge. The majority of IoT devices are battery-powered. In some areas, such as deep mines, space, and industrial environments, replacing batteries after installation is difficult and expensive.

Furthermore, batteries have a limited lifetime [2]. As a result, a promising alternative is to replace the battery with energy harvesters. Energy-harvesting devices extract energy from their surroundings, such as light, vibration, radio, and many others [3]. The accumulated energy is used for powering up these IoT devices.

The unpredictable nature of energy harvesters causes voltage fluctuation or a power failure. A voltage stabilizer or a capacitor is a standard solution to the voltage fluctuation issue [4]. In a conventional processor, however, power failures result in data loss. The data is lost because registers, caches, and main memories are designed using volatile memories, such as SRAM and DRAM [5], [6]. Data lost includes the application's program state and progress. The contents of registers, cache, and main memory are all part of the program state. As a result, when the power failure occurs, some parts of the application have to re-execute, causing the execution progress to be slow and consuming extra energy. This type of computing is known as intermittent computing [7–9].

The solution is to store the application's program state at a precise restart point before a power failure. The question is, where should the program state be stored? Using Non-volatile memories (NVM), we can save the application's program state during a power failure. There are several new NVMs proposed recently, such as spin-transfer torque RAM (STT-RAM) [10–13], phase-change memory (PCM) [14], resistive random-access memory (Re-RAM) [15], and ferroelectric RAM (FRAM) [16].

Researchers have explored NVM-based non-volatile processors (NVPs) [17], which help in executing the application during the irregular power supply. NVM cache enhances the application's execution progress even during frequent power failures. NVMs have longer read and write latency than SRAM-based caches. Replacing SRAM with NVM is not a good idea; as an alternative, we can integrate both SRAM and NVM to make a hybrid architecture [18] [19] at the cache level. Xie et al. [18] propose a hybrid cache architecture for intermittently powered IoT devices.

This paper builds on previous work [18] to improve the following aspects of a hybrid cache: (a) performance, (b) energy utilization, and (c) reducing writes to NVM.

Most of the literature mentions STT-RAM as an emerging candidate among all NVM technologies for the LLC [20], [21], [22]. STT-RAM promises higher density and less leakage power than existing SRAM. The main memory in conventional processors is implemented using DRAM, but emerging micro-controllers, such as the MSP430FR5969 [16], have a non-volatile main memory. For main memory, PCM is the appropriate memory technology because PCM has endurance features similar to those of STT-RAM and is also cheaper than other NVM technologies. As a result, throughout this paper, we use STT-RAM as a non-volatile cache and PCM as a non-volatile main memory.

At the L1 cache, we propose Efficient Placement and Migration policies for hybrid cache architecture that includes both SRAM and STT-RAM. We assume that a capacitor's energy can backup the processor state during a power failure [23]. Because SRAM cache and SRAM-based register values are volatile, they must be stored in NVM. The proposed architecture identifies the blocks that need to be written to the main memory and STT-RAM at the L1 cache to reduce backup time.

Further, the exception mechanism of the pipelined processor is used to arrive at a precise wake-up point. When power comes back, our proposed architecture works like a regular architecture, i.e., every memory access is first looked up in the L1 cache. Due to the presence of STT-RAM in the L1 cache, it stores frequently accessed blocks, and the proposed architecture has benefits because of hybrid NVM cache architecture during frequent power failures.

In addition, we proposed a prediction table to help in block placement and migration. Compared to the baseline architecture, the proposed architecture improves performance by 32.85% and reduces energy consumption by 23.42%. Compared to the baseline, our proposed backup strategy reduces backup time by 27.91%. The proposed architecture has a storage overhead of only 2.34%.

This paper is organized as follows: Section 2 discusses the related works. Section 3 explains the motivation behind the proposed system architecture and gives an overview of the problem formulation. Section 4 explains the proposed hybrid cache architecture. The experimental setup and results are described in section 5. We conclude this work in section 6.

## 2 Related works

This section reviews the related work on hybrid cache architectures (HCAs) and NVM for last-level caches (LLCs), HCAs for the L1 cache, and architectures for intermittently powered IoT devices.

#### 2.1 NVM for LLC

STT-RAM offers better features than the existing NVM technologies [11–13]. In order to use STT-RAM at LLC, we have two possible and distinct preferences. First, by replacing the whole SRAM cache with STT-RAM at LLC. Second, using HCA (SRAM+STT-RAM) at LLC, where this design takes advantage of both SRAM and STT-RAM.

Wu et al. [24] modeled a 3-level cache architecture by replacing SRAM with STT-RAM at L3 and using SRAM at L1 and L2. This architecture achieved an instructions per cycle (IPC) improvement of around 4% and, compared with the traditional 3-level SRAM cache design, a 63% reduction in power consumption.

Usually, for hybrid caches, block placement and movement between caches are the main challenges. Classifying cache blocks based on write frequencies [17] and write access behavior [25] [26] [27] [28] helps to decide where to place each cache block. Many HCAs use prediction table-based techniques to predict and place a cache block in the appropriate cache region and to migrate it from one region to another [27]. The challenges for an L1 HCA differ from those at the LLC: at the LLC, input traffic is due to L1/L2 misses, while read/write requests at L1 are due to load/store instructions.

#### 2.2 HCA for L1 caches

The write access latency of STT-RAM is higher than that of SRAM, which is the primary limiting factor for using STT-RAM at the L1 cache. To address this concern, there are two possible alternatives. The first is relaxing the non-volatility of STT-RAM to reduce its overall write access latency [21], [22]. Relaxing the STT-RAM's non-volatility is achievable by reducing the MTJ planar area and the MTJ switching current [29]. The second is reducing the STT-RAM's write latency and energy consumption by limiting the number of writes to STT-RAM. Usually, the number of read and write operations in the L1 cache is higher than in the LLC.

Xie et al. [18] introduce an HCA that consists of STT-RAM in the L1 cache. During power failures, they backup the program state from the SRAM cache to the STT-RAM cache. The authors use an access pattern-based predictor that predicts block behavior. Based on the prediction, Xie et al. place the cache block in the respective cache region. During an eviction or on a wrong prediction, Xie et al. propose a migration policy that migrates a cache block from one cache region to another. Whenever power comes back, Xie et al. restore the cache contents from the STT-RAM cache to the SRAM cache.

In these hybrid caches, wrongly placing a cache block in any region causes migration overhead. Migration overhead increases the number of writes to NVM and consumes more write energy. Introducing NVM in the L1 cache shows an impact on performance and energy consumption. Therefore, we proposed an efficient HCA to address the above issues. We use SRAM and STT-RAM at the L1 cache to reduce these migration overheads during both stable and unstable power scenarios. We proposed placement and migration policies, which also have a prediction table to predict the correct placement to reduce these additional overheads.

#### 2.3 Architectures for Intermittent power devices

NVM-based NVPs [30], [31], [32] store the contents of the registers and the volatile on-chip data in non-volatile registers and non-volatile memories, respectively. Whenever power comes back, the system uses data from the NVM region to continue and complete the application execution.

Checkpoint-based approaches for HCA are the other alternative for supporting intermittently powered IoT devices. In these checkpoint-based approaches, volatile data is checkpointed to NVM at regular intervals to store the application program state [33]. Mementos [34] was one of the initial checkpointing techniques. It used periodic voltage checks to decide when to back up the program state. Hibernus [35] extended the work of [34] by introducing NVM. These checkpointing schemes don't consider the timely execution of the applications. TICS [36] overcomes this problem by introducing timely execution, branching, and efficient automatic checkpoints.

Checkpoints are placed using either software procedures or hardware components. Checkpointing approaches like [34], [37], and [38] were proposed to backup and restore a consistent program state. The compiler or software procedures were primarily responsible for placing software-based checkpoints. Whenever a checkpoint is identified, the system initiates a backup procedure that stores the program state to NVM. In [38], checkpoints are placed based on the expiration of a timer. Hardware-based checkpoints were mainly associated with external devices. In [37], hardware-based checkpoints were placed using a voltage detector that triggers a backup mechanism for an NVP.

The main problem with the above architectures and techniques is the extra computation caused by multiple checkpoints. Another issue with checkpointing during a power failure is data inconsistency, which leads to a corrupted output. A further disadvantage of the checkpointing approach is that whenever power comes back, a restoration procedure must copy the saved checkpoint from the non-volatile main memory back to the cache, which introduces additional overhead. We propose an HCA with a backup policy that triggers during a power failure instead of multiple backups of the program state at the desired checkpoints. Instead of restoring the program state after every power failure, our proposed HCA implements an automatic restoration process by accessing the data directly from the NVM.

## 3 Motivation and Problem Formulation

This section discusses observations that motivate us to propose new architecture and techniques. We performed a set of experiments on a system configuration that consists of equal SRAM and STT-RAM at the L1 cache. In section 5, we have given more details of the experimental setup, and table 1 has architectural parameters.



Fig. 1: Comparisons between Pure SRAM and Pure STT-RAM Architectures in terms of (a) Execution time and (b) Dynamic Energy Consumption

#### 3.1 Motivation

Introducing STT-RAM as a cache can deteriorate the system's performance due to its long access time, and it consumes more dynamic energy. We modeled two cache architectures in gem5 [39], a pure SRAM cache (only SRAM at L1) and a pure STT-RAM cache (only STT-RAM at L1), to compare their performance and energy consumption. In figure 1, the performance and energy consumption of both cache architectures are normalized to the pure SRAM cache architecture. Figure 1 (a) shows that the pure STT-RAM cache architecture takes 45.93% more execution time than the pure SRAM cache architecture. Our first observation is that STT-RAM must be used efficiently so that it does not deteriorate the overall system performance and energy consumption. Thus, we need a hybrid cache instead of a pure STT-RAM cache, so that the cache benefits from both SRAM and STT-RAM.

In the case of HCAs, movement between the two cache regions has been explored in the literature, i.e., migration-based policies for hybrid caches [40], [41], [42], [43], [44]. Migrating a cache block from one cache region to another incurs extra overheads, i.e., migration overheads. These overheads increase the number of read and write operations and require additional cycles and energy, which makes the system inefficient and deteriorates its overall performance. Thus, our second observation is that these additional migration overheads must be reduced.

The observations and challenges mentioned above motivated us to propose an HCA that uses both SRAM and STT-RAM efficiently: the proposed architecture benefits from SRAM during regular operation and from STT-RAM during a power failure.

Among the existing architectures, Xie et al. [18] introduced a similar hybrid cache architecture that includes STT-RAM at the L1 cache. Our main observations about the work of Xie et al. and the main challenges associated with the existing HCA [18] are listed below.

1. For the prediction table, Xie et al. used a pattern sampler, which doesn't gather the complete details of the application.
2. The placement and migration policies cannot provide accurate predictions if the prediction information is incomplete. As previously stated, inaccurate predictions increase the number of read and write operations, which consume more execution time and energy.
3. Xie et al. used a checkpointing scheme, which uses more energy because every checkpoint requires writes to and reads from the NVM.
4. Xie et al. used a standard LRU replacement policy to identify the cache block for eviction. If the evicted block turns out to be write-intensive the next time it is accessed, this replacement policy may result in unnecessary writes to NVM and consume more energy.
5. Xie et al. back up all volatile contents during a power failure, which is not always necessary and pushes more writes to NVM during frequent power failures.

All the above challenges and observations motivated us to propose an efficient HCA that considers these issues. In section 4, we discuss the proposed HCA and placement policies in detail.

#### 3.2 Problem definition

We propose a hybrid cache model as shown in figure 2. By introducing NVM at the L1 cache, we observed additional overheads, which were discussed in section 3.1. So we reduce these overheads by introducing placement and migration policies. We formulated our three main objectives below.



- Minimize backup energy.
- Maximize backup efficiency.
- Maximize energy efficiency.

Fig. 2: Proposed System Model

For the hybrid architecture model, our design performs a backup during a power failure, and when the power comes back, it performs a memory restore operation. As a result, we define the energy required to execute the application in equation 1.

$$E_{overall} = E_{exec} + E_{backup} + E_{restore} \tag{1}$$

Where $E_{overall}$ is the energy required to execute the overall application and $E_{exec}$ is the energy required to execute the program. We back up the system state by copying all register contents and SRAM cache blocks to NVM. The energy consumed by the backup procedure is $E_{backup}$, which depends on the number of bytes to be backed up to NVM as shown in equation 2.

$$E_{backup} = N_{w\_L1} * e_{w\_sttram} + N_{w\_main} * e_{w\_pcm} \tag{2}$$

Where $N_{w\_L1}$ is the number of writes to STT-RAM, $N_{w\_main}$ is the number of writes to main memory, $e_{w\_sttram}$ is the energy per write for STT-RAM, and $e_{w\_pcm}$ is the energy per write for PCM. We achieve our first objective by reducing the number of writes, $N_{writes} = N_{w\_L1} + N_{w\_main}$, during both stable and intermittent power supply.

The energy required to restore the volatile contents from NVM is $E_{restore}$, which depends on the number of bytes to be restored from NVM and can be defined as follows:

$$E_{restore} = N_{r\_L1} * e_{r\_sttram} + N_{r\_main} * e_{r\_pcm} \tag{3}$$

Where $N_{r\_L1}$ is the number of reads from STT-RAM, $N_{r\_main}$ is the number of reads from main memory, $e_{r\_sttram}$ is the energy per read for STT-RAM, and $e_{r\_pcm}$ is the energy per read for PCM. Whenever power comes back, the size of the restored contents is the same as the size of the contents that were backed up during the power failure. Thus, equations 2 and 3 are interrelated in terms of sizes. Because our proposed HCA performs automatic restoration, it does not incur any restore overhead.

Our second objective is maximizing backup efficiency  $(\eta)$ , defined and shown in equation 4.

$$\eta = \frac{N_{w\_L1}}{N_{w\_L1} + N_{w\_main}} \tag{4}$$

If we achieve a smaller $N_{writes}$, our $\eta$ increases. Thus, we achieve our second objective by reducing $N_{writes}$.

Lastly, we define energy efficiency as the ratio of energy consumed during normal execution without any power failures to the energy consumed during power failures. Let  $\theta$  be the energy efficiency as defined in the equation 5.

$$\theta = \frac{E_{normal}}{E_{overall}} \tag{5}$$

Where  $E_{normal}$  is the energy required for normal execution without any power interruptions.
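Equations 1 to 5 can be evaluated directly once the access counts and per-access energies are known. The following is a minimal sketch, not the authors' evaluation code; the class name `EnergyParams`, the function names, and the choice to reject an empty backup in `backup_efficiency` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EnergyParams:
    """Per-access energies in a common unit (values are supplied by the caller)."""
    e_w_sttram: float  # energy per STT-RAM write
    e_w_pcm: float     # energy per PCM write
    e_r_sttram: float  # energy per STT-RAM read
    e_r_pcm: float     # energy per PCM read


def backup_energy(n_w_l1, n_w_main, p):
    """Equation 2: energy spent writing the backup to STT-RAM and PCM."""
    return n_w_l1 * p.e_w_sttram + n_w_main * p.e_w_pcm


def restore_energy(n_r_l1, n_r_main, p):
    """Equation 3: energy spent reading the backup from STT-RAM and PCM."""
    return n_r_l1 * p.e_r_sttram + n_r_main * p.e_r_pcm


def backup_efficiency(n_w_l1, n_w_main):
    """Equation 4: share of backup writes that go to STT-RAM (eta)."""
    total = n_w_l1 + n_w_main
    if total == 0:
        # Assumption: eta is undefined when nothing is backed up.
        raise ValueError("no backup writes")
    return n_w_l1 / total


def energy_efficiency(e_normal, e_exec, e_backup, e_restore):
    """Equation 5, with E_overall expanded using equation 1 (theta)."""
    return e_normal / (e_exec + e_backup + e_restore)
```

With automatic restoration as described above, `e_restore` would be passed as zero, so $\theta$ depends only on $E_{exec}$ and $E_{backup}$.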

## 4 Proposed Architecture

This section explains the proposed architecture that uses the proposed placement, migration, and backup policies.

#### 4.1 Hybrid Cache Architecture

The proposed architecture is shown in figure 3.

Every cache set in the proposed architecture contains a mix of SRAM and STT-RAM cache blocks. Along with the valid bit (V), dirty bit (D), tag, and data in each cache block, we added three more entries: i) Read-Intensive Counter (RIC), ii) Write-Intensive Counter (WIC), and iii) Confidence bits (CONF). These three entries are beneficial for cache placement and migration policies. We classified blocks into two types: read-intensive blocks and write-intensive blocks. Read-intensive (RI) blocks have more read accesses than a predefined threshold at a given point in time, while write-intensive (WI) blocks have more write accesses than a predefined threshold at a certain point in time.

Fig. 3: Overview of the Proposed Architecture

We keep two counters for each block, RIC and the WIC. Furthermore, we added a 2-bit CONF field that tracks important blocks; important block information is helpful during power failures. A prediction table has also been included. Each prediction table entry has a previous region (PR) bit. During the replacement/eviction process, this PR bit is updated. The PR bit stores the block's most recent cache region.

#### 4.2 Placement and Migration Policies

We describe the proposed block placement and migration policies in this section. Because STT-RAM has higher read/write latency and consumes more energy than SRAM, the placement policy aims to reduce the number of writes to STT-RAM. STT-RAM write latency is ten times more than its read latency [10]. Therefore, we would like to place write-intensive blocks in the SRAM cache and read-intensive blocks in the STT-RAM cache. We use the PR bit to check the prediction table and place the block in the appropriate cache region based on whether it is read-intensive or write-intensive.

Algorithm 1 describes the placement policy in case of a cache miss. On a read/write miss, line 1 uses the tag to check the prediction table and access the PR bit associated with the tag entry. The PR bit records the previous placement information for that tag entry. If PR = 0, lines 3-5 of algorithm 1 check whether the corresponding STT-RAM cache set is full. If it is full, we replace the block with the lowest RIC value; otherwise, we place the block in the STT-RAM cache. If PR != 0, lines 11-13 of algorithm 1 check whether the SRAM cache set is full. If it is full, we replace the block with the lowest WIC value; otherwise, we place the block in the SRAM cache.

|     | **Algorithm 1** Placement Algorithm in case of Cache miss   |
|-----|-------------------------------------------------------------|
| 1:  | Check Prediction Table.                                     |
| 2:  | if $PR == 0$ then                                           |
| 3:  | if STT-RAM set is full then                                 |
| 4:  | Replace block with lowest b.RIC.                            |
| 5:  | Update the replaced block's PR bit in the Prediction Table. |
| 6:  | else                                                        |
| 7:  | Place in the STT-RAM cache.                                 |
| 8:  | Re-Initialize b.RIC, b.WIC to zero.                         |
| 9:  | end if                                                      |
| 10: | else                                                        |
| 11: | if SRAM set is full then                                    |
| 12: | Replace block with lowest b.WIC.                            |
| 13: | Update the replaced block's PR bit in the Prediction Table. |
| 14: | else                                                        |
| 15: | Place in the SRAM cache.                                    |
| 16: | Re-Initialize b.RIC, b.WIC to zero.                         |
| 17: | end if                                                      |
| 18: | end if                                                      |
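The miss-handling flow of algorithm 1 can be sketched in a few lines of Python. This is an illustrative model only: the names `Block`, `HybridSet`, and `place_on_miss` are hypothetical, the prediction table is modeled as a plain list of bits with a caller-supplied index function, and the rule for updating the victim's PR bit (1 when WIC is at least RIC, otherwise 0, with ties going to SRAM) is an assumption based on section 4.3.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    tag: int
    ric: int = 0   # read-intensive counter
    wic: int = 0   # write-intensive counter
    conf: int = 0  # 2-bit confidence state


@dataclass
class HybridSet:
    sram_ways: int
    stt_ways: int
    sram: list = field(default_factory=list)
    stt: list = field(default_factory=list)


def place_on_miss(cache_set, tag, pr_table, pr_index):
    """Place a missing block; return (region name, evicted block or None)."""
    if pr_table[pr_index(tag)] == 0:
        # PR == 0: predicted read-intensive, so use the STT-RAM ways.
        name, region, ways = "STT-RAM", cache_set.stt, cache_set.stt_ways
        counter = lambda b: b.ric
    else:
        # PR != 0: predicted write-intensive, so use the SRAM ways.
        name, region, ways = "SRAM", cache_set.sram, cache_set.sram_ways
        counter = lambda b: b.wic
    victim = None
    if len(region) >= ways:
        # Set is full: evict the block with the lowest counter value.
        victim = min(region, key=counter)
        region.remove(victim)
        # Assumed PR update rule: 1 if write-dominated, else 0.
        pr_table[pr_index(victim.tag)] = 1 if victim.wic >= victim.ric else 0
    region.append(Block(tag))  # new block starts with zeroed counters
    return name, victim
```

When several blocks share the lowest counter value, `min` returns the first one it finds; the text does not specify a tie-breaking rule, so this choice is arbitrary.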

Algorithm 2 describes the placement and migration policies whenever there is a read hit. Line 1 compares the block's RIC value with a threshold that we fixed empirically. If the block's RIC is equal to the threshold, we call that block an RI block. The proposed placement policy suggests that all RI blocks should be placed in STT-RAM. If the block is present in the SRAM cache, we migrate it from SRAM to STT-RAM and re-initialize RIC, WIC, and CONF to zero. If the block is not in the SRAM cache, it stays in the STT-RAM cache and we increment CONF by 1. If the block's RIC value does not equal the threshold, we increment RIC by 1. The block chosen for replacement has to update its PR bit in the prediction table. If RIC reaches the threshold and CONF is already in the 11 state, then we don't increment RIC.

Algorithm 2 Placement and Migration Algorithm in case of Read hit

```
 1: if b.RIC == threshold then
 2:     if Block is in SRAM then
 3:         if STT-RAM set is full then
 4:             Replace block with lowest b.RIC.
 5:             Update the replaced block's PR bit in the Prediction Table.
 6:             Migrate to STT-RAM.
 7:         else
 8:             Migrate to STT-RAM.
 9:         end if
10:         Re-Initialize b.RIC, b.WIC, b.CONF to zero.
11:     end if
12:     b.CONF = b.CONF + 1
13:     Re-Initialize b.RIC to zero.
14: else
15:     b.RIC = b.RIC + 1
16: end if
```

Algorithm 3 describes the placement and migration policies whenever there is a write hit. Line 1 compares the block's WIC value with the threshold. If the block's WIC equals the threshold, we call that block a WI block. The proposed placement policy suggests that all WI blocks should be placed in SRAM. If the block is already present in the STT-RAM cache, we migrate it from STT-RAM to SRAM and re-initialize RIC, WIC, and CONF to zero. This case reduces the number of writes to the STT-RAM cache. If the block's WIC value does not equal the threshold, we increment WIC by 1. The block chosen for replacement has to update its PR bit in the prediction table. If WIC reaches the threshold and CONF is already in the 11 state, then we don't increment WIC.

Algorithm 3 Placement and Migration Algorithm in case of Write hit

```
 1: if b.WIC == threshold then
 2:     if Block is in STT-RAM then
 3:         if SRAM set is full then
 4:             Replace block with lowest b.WIC.
 5:             Update the replaced block's PR bit in the Prediction Table.
 6:             Migrate to SRAM.
 7:         else
 8:             Migrate to SRAM.
 9:         end if
10:         Re-Initialize b.RIC, b.WIC, b.CONF to zero.
11:     end if
12:     b.CONF = b.CONF + 1
13:     Re-Initialize b.WIC to zero.
14: else
15:     b.WIC = b.WIC + 1
16: end if
```
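Because algorithms 2 and 3 are symmetric, both hit cases can be sketched with one Python function. This is a hedged illustration rather than the authors' implementation: the names `Block`, `on_hit`, and the `migrate` callback are hypothetical, `THRESHOLD = 7` is taken from the worked example in section 4.5, and the sketch follows the printed order (compare with the threshold first, increment otherwise). It also assumes, following the prose, that CONF is reset on a migration and incremented only when the block is already in its preferred region.

```python
from dataclasses import dataclass

THRESHOLD = 7  # assumption: value used in the worked example
CONF_MAX = 3   # the 2-bit CONF field saturates at binary 11


@dataclass
class Block:
    tag: str
    region: str    # "SRAM" or "STT"
    ric: int = 0
    wic: int = 0
    conf: int = 0


def on_hit(block, is_write, migrate):
    """Update counters on a hit; call migrate(block, target) when the block moves.

    The caller-supplied migrate callback is expected to handle victim
    selection and the PR-bit update in the target region.
    """
    counter = "wic" if is_write else "ric"
    target = "SRAM" if is_write else "STT"  # WI -> SRAM, RI -> STT-RAM
    if getattr(block, counter) == THRESHOLD:
        if block.region != target:
            # Block sits in the wrong region: migrate and clear all state.
            migrate(block, target)
            block.region = target
            block.ric = block.wic = block.conf = 0
        else:
            # Block is already well placed: raise confidence (saturating).
            block.conf = min(block.conf + 1, CONF_MAX)
            setattr(block, counter, 0)
    else:
        setattr(block, counter, getattr(block, counter) + 1)
```

Passing the migration step as a callback keeps the counter logic separate from the set-management logic, which differs between the SRAM and STT-RAM regions.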

#### 4.3 Prediction Table Design

The prediction table stores the previous region for the respective tag entry. It has L entries and acts as a direct-mapped buffer, indexed using (Address/block\_size) % L. Each entry in the prediction table has a single PR (Previous Region) bit. To save area, the prediction table does not store the tag bits, so its size is L bits. Initially, all bits in the prediction table are set to 1.

We update the PR field whenever there is a replacement in the cache due to the SRAM/STT-RAM set being full. If PR is 1, the block is a WI block because its WIC was greater than its RIC during the replacement; we place WI blocks into the SRAM cache region. If PR is 0, the block is an RI block because its RIC was greater than its WIC during the replacement; we place RI blocks into the STT-RAM cache region.
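The indexing and update rules above map onto a very small data structure. The sketch below is illustrative only; the class name `PredictionTable` and its method names are hypothetical, and sending ties (RIC equal to WIC) to SRAM is an assumption, since the text covers only the strictly greater cases.

```python
class PredictionTable:
    """Tagless, direct-mapped table holding one PR bit per entry."""

    def __init__(self, entries, block_size):
        self.entries = entries        # L in the text
        self.block_size = block_size  # cache block size in bytes
        self.bits = [1] * entries     # all PR bits start at 1

    def index(self, address):
        # (Address / block_size) % L, using integer division
        return (address // self.block_size) % self.entries

    def predict(self, address):
        # PR == 1 -> SRAM (write-intensive), PR == 0 -> STT-RAM
        return "SRAM" if self.bits[self.index(address)] == 1 else "STT-RAM"

    def update_on_eviction(self, address, ric, wic):
        # Assumption: ties are treated as write-intensive.
        self.bits[self.index(address)] = 1 if wic >= ric else 0
```

Because no tags are stored, two addresses whose block numbers differ by a multiple of L share one PR bit, so an eviction of one block can change the prediction for the other. That aliasing is the price of keeping the table at L bits.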

#### 4.4 Support for Intermittent Power Supply

Our proposed architecture supports intermittent computing and performs well during frequent power failures. We define important blocks as those with high CONF values. We use RIC/WIC values to update the CONF field. When power is restored in a traditional architecture, we begin execution by accessing blocks from the main memory and copying them to the cache. We save important blocks in STT-RAM that help to start execution without restoring blocks from the main memory to SRAM.

We propose a state model to assist in determining the most important blocks. Using the CONF field, we can determine which blocks should be present in STT-RAM during a power failure. Initially, CONF is in 00 state and supports four states, i.e., 00, 01, 10, and 11 states, as shown in figure 4. To represent the proposed state model, we need a 2-bit CONF field. The algorithmic process of updating the CONF field has already been described in algorithm 2 and 3.



Fig. 4: State Diagram for Updating CONF

In summary, if RIC/WIC reaches the threshold, CONF is increased by one and advances to the next state. When CONF is in the 11 state and the counter reaches the threshold again, CONF remains in that state. If a block migrates between the SRAM and STT-RAM regions in either direction, CONF resets to the 00 state along with the RIC and WIC values.

During a power failure, the proposed backup policy triggers to save important blocks from SRAM to STT-RAM. According to the proposed backup policy, the blocks with the CONF field 11 are the most important blocks. Therefore, we prioritize blocks by CONF field in the order 11 > 10 > 01 > 00. If any SRAM block has a CONF field of 11, that block replaces the least-priority block in STT-RAM. If there is no block in the 11 state in the SRAM, we lower our priority level by one.

Now our priority becomes 10. If there are blocks in the 10 state in the SRAM cache line, they replace the least-priority blocks in STT-RAM. If there is no block in the 10 state in the SRAM line, we lower our priority level by one again. Similarly, we check blocks in the 01 and 00 states. At priority 00, we copy the SRAM contents to STT-RAM and the STT-RAM contents to PCM. Whenever power comes back, STT-RAM contents are accessed automatically without copying to SRAM. Our migration policy automatically migrates blocks from STT-RAM to SRAM if needed and vice-versa.
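The priority walk described above can be sketched as a single pass over the SRAM blocks in descending CONF order. This is one possible reading of the policy, not the authors' implementation: the function name `backup_on_power_failure` is hypothetical, blocks are modeled as `(tag, conf)` tuples, and the sketch assumes that every displaced STT-RAM block is written to main memory (the text does not say whether clean blocks can be skipped) and that a block already backed up is never displaced again.

```python
def backup_on_power_failure(sram, stt, stt_ways):
    """Back up SRAM blocks into STT-RAM, highest CONF first.

    sram, stt: lists of (tag, conf) tuples for one cache set.
    Returns (new STT-RAM contents, blocks written to main memory).
    """
    # Original STT-RAM residents, lowest CONF first: these are the victims.
    residents = sorted(stt, key=lambda b: b[1])
    # SRAM blocks in priority order 11 > 10 > 01 > 00.
    incoming = sorted(sram, key=lambda b: b[1], reverse=True)
    free = stt_ways - len(stt)
    saved, to_main = [], []
    for block in incoming:
        if free > 0:
            free -= 1                          # use an empty STT-RAM way
        elif residents:
            to_main.append(residents.pop(0))   # displace least-priority block
        else:
            to_main.append(block)              # no room left in STT-RAM
            continue
        saved.append(block)
    return residents + saved, to_main
```

In terms of equation 2, `len(saved)` corresponds to $N_{w\_L1}$ and `len(to_main)` to $N_{w\_main}$ for this set.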

#### 4.5 Detailed Example

Figure 5 illustrates the detailed working of the proposed architecture. Initially, the SRAM cache has blocks (a,c), and the STT-RAM cache has blocks (b,d). We represent the counters and CONF of each block as a tuple [RIC, WIC, CONF], initialized to [0, 0, 00]. A prediction table has a PR field. We take a sequence of access requests; read requests are labeled as $rd_i$ (i.e., read block i), and write requests are labeled as $wr_i$ (i.e., write block i). We labeled different timing points as A, B, C, ..., K. In this section, we discuss how the proposed architecture works after every timing point.

In Fig A of figure 5, we update the RIC of 'a' to 2 because of two consecutive reads. In Fig B, the WIC of 'b' has become 2. In Fig C, [RIC, WIC] of 'a' updates to [3, 1]. In Fig D, the WIC of 'b' becomes 7, which equals the threshold, so 'b' becomes a write-intensive block. Our placement policy suggests that write-intensive blocks should be placed in SRAM. The SRAM set is full, so we look for the block with the lowest WIC to replace. Block 'c' has the lowest WIC value; we replace 'c' with 'b' and reset all 'b' counters to [0, 0, 00]. In Fig E, the RIC of 'a' becomes 7, which equals the threshold, so 'a' becomes a read-intensive block. Our placement policy suggests that read-intensive blocks should be placed in STT-RAM. The STT-RAM set has one empty slot, so we migrate 'a' from SRAM to STT-RAM. We reset all 'a' counters to [0, 0, 00] and update the WIC of 'b' to 2.

In Fig F, the RIC of 'a' updates to 4. A new request for block 'c' occurs between the F and G timing points. Block 'c' is not present in either cache. We check the prediction table at index 2, associated with $tag_c$, to find the PR field of 'c'. The entry at index 2 has PR = 1. A PR value of 1 means that the block was placed in the SRAM cache at its last eviction.



Fig. 5: Working Example of the Proposed Architecture

We place 'c' in the SRAM cache. In Fig G, the WIC of 'c' updates to 7, which equals the threshold, so 'c' becomes a write-intensive block; the RIC of 'a' is 4. Our placement policy suggests that write-intensive blocks should be placed in SRAM, and 'c' is already in SRAM. We update the CONF of 'c' to 01 and reset its counter values. After H, the WIC of 'c' updates to 3.

A new request for block 'e' occurs between timing points H and I. Block 'e' is not present in either cache. We check the prediction table at index 3, associated with $tag_e$, to find the PR field of 'e'. We find an entry at index 3 with PR = 0. A PR value of 0 indicates that the block resided in the STT-RAM cache when it was last evicted. We place 'e' in the STT-RAM cache. The STT-RAM set is full, so we find the block with the lowest RIC to replace. Block 'd' has the lowest RIC value; we replace 'd' with 'e' and set the counters of 'e' to [1, 0, 00].
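The two prediction-table lookups in this example ('c' with PR = 1, 'e' with PR = 0) can be sketched as follows. This is only an illustration: the function names, the dictionary-based table, and the default region used when no entry exists are assumptions, not details given in the paper.

```python
SRAM, STT_RAM = "SRAM", "STT-RAM"


def place_on_miss(pred_table, index, default=STT_RAM):
    """Choose the cache region for an incoming block from its PR bit.

    pred_table maps a table index to a PR bit (1 = SRAM, 0 = STT-RAM).
    The `default` used when no entry exists is an assumption.
    """
    pr = pred_table.get(index)
    if pr is None:
        return default
    return SRAM if pr == 1 else STT_RAM


def record_eviction(pred_table, index, region):
    """Remember the region a block occupied when it was evicted."""
    pred_table[index] = 1 if region == SRAM else 0


# Entries taken from the worked example: index 2 -> 'c', index 3 -> 'e'.
pred_table = {2: 1, 3: 0}
print(place_on_miss(pred_table, 2))  # SRAM
print(place_on_miss(pred_table, 3))  # STT-RAM
```
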

A power failure (PF) occurs; our backup policy saves the important blocks using the CONF field. The CONF of 'c' is 01 and that of 'a' is 00, and our priority order gives 01 higher priority than 00. We move 'a' to the main memory and back up 'c' to STT-RAM. We prefer write-intensive blocks over read-intensive blocks during a power failure, so 'b' replaces 'e'. In Fig J, 'c' and 'b' are saved to STT-RAM. Whenever power comes back (PB), we don't require any restoration process. Fig K shows that the RIC of 'c' and 'b' updates to 1.
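The backup step at the power failure can be approximated by the sketch below. It is a simplified reading of the example, not the paper's implementation: the `Block` class, the `backup_set` function, and the tie rule (an SRAM block replaces an STT-RAM block with an equal CONF) are assumptions chosen so that the outcome matches Fig J. SRAM blocks that are not selected for backup are left unhandled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    conf: int  # 2-bit CONF value, 0b00 (lowest) to 0b11 (highest)


def backup_set(sram, stt):
    """Back up one cache set on a power failure.

    SRAM blocks are considered from highest to lowest CONF and paired with
    the STT-RAM residents from lowest to highest CONF. An SRAM block takes
    the slot when its CONF is at least the resident's CONF (assumed tie
    rule); the displaced resident is written to main memory.
    Returns (stt_after, to_main_memory).
    """
    candidates = sorted(sram, key=lambda b: b.conf, reverse=True)
    victims = sorted(stt, key=lambda b: b.conf)
    stt_after, to_main_memory = [], []
    for i, victim in enumerate(victims):
        if i < len(candidates) and candidates[i].conf >= victim.conf:
            stt_after.append(candidates[i])
            to_main_memory.append(victim)
        else:
            stt_after.append(victim)
    return stt_after, to_main_memory


# State just before the power failure in the worked example.
sram = [Block("c", 0b01), Block("b", 0b00)]
stt = [Block("a", 0b00), Block("e", 0b00)]
stt_after, to_mm = backup_set(sram, stt)
print([b.name for b in stt_after])  # ['c', 'b']
print([b.name for b in to_mm])      # ['a', 'e']
```
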

### 4.6 Storage Overhead

We analyze the storage overhead because we added extra bits, a prediction table, and backup logic. For the same system configuration shown in table 1, we evaluate the area overhead for the proposed architecture. We showed the area overhead as an example. There are two aspects of the proposed architecture that cause storage overhead.

- The proposed architecture has two 3-bit counters and two confidence bits per block. The data cache has 256 blocks, each with 8 extra bits, so the data cache requires 256\*8 = 2048 bits.
- The proposed prediction table has 4K entries with 1 bit per entry, resulting in a storage overhead of 1024\*4 = 4096 bits.

The overall storage overhead of the proposed architecture is 2048 + 4096 = 6144 bits = 0.75 KB. The total area overhead is therefore about 0.75 KB / 32 KB = 2.34%.
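The arithmetic above can be checked with a few lines of Python. The constant names are illustrative, and the prediction table is read here as 4K one-bit entries, which is what the 4096-bit total implies.

```python
BITS_PER_BLOCK = 3 + 3 + 2   # RIC + WIC counters plus 2 CONF bits
NUM_BLOCKS = 256             # blocks in the 16KB data cache (64-byte blocks)
PRED_ENTRIES = 4 * 1024      # prediction table entries
PR_BITS = 1                  # one PR bit per entry
L1_SIZE_KB = 32              # 16KB D-cache + 16KB I-cache

counter_bits = NUM_BLOCKS * BITS_PER_BLOCK
table_bits = PRED_ENTRIES * PR_BITS
total_bits = counter_bits + table_bits
total_kb = total_bits / 8 / 1024
overhead_pct = 100 * total_kb / L1_SIZE_KB

print(counter_bits, table_bits, total_bits)            # 2048 4096 6144
print(f"{total_kb} KB, {overhead_pct:.2f}% of L1")     # 0.75 KB, 2.34% of L1
```
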

## 5 Experimental Setup and Results

### 5.1 Experimental Setup

We evaluate the proposed architecture using the gem5 [39] simulator and 18 benchmarks from the MiBench suite [45]. Overall micro-architectural parameters used for implementation are shown in table 1.

Table 2 shows the dynamic energy and latency for a single read and write operation to SRAM and STT-RAM, obtained using NVSim [46].

### 5.2 Baseline Architecture

We modeled a baseline architecture to compare with the proposed architecture.

| Component       | Description                                                                                                                                            |
|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| CPU core        | 1-core, 480 MHz                                                                                                                                        |
| L1 Cache        | Block size: 64 bytes; 4-way associative (2-way SRAM, 2-way STT-RAM); private cache (16KB hybrid D-cache and 16KB I-cache)                              |
| Size Parameters | VB: 1 bit; WIC and RIC: 3 bits; CONF: 2 bits; L: 4K bytes; threshold: 7; PR: 1 bit                                                                     |
| Main memory     | 128MB PCRAM                                                                                                                                            |
| Others          | Clock period: 2 ns; SRAM read: 1 cycle; SRAM write: 2 cycles; STT-RAM read: 2 cycles; STT-RAM write: 10 cycles; PCM read: 35 cycles; PCM write: 100 cycles |

Table 1: System Configuration

Table 2: NVSim parameters of SRAM, MRAM Caches, and PCM memory (350K, 22nm)

| Parameter     | 16KB SRAM | 16KB MRAM | 128MB PCRAM                        |
|---------------|-----------|-----------|------------------------------------|
| Read Latency  | 0.792 ns  | 1.994 ns  | 204.584 ns                         |
| Read Energy   | 0.006 nJ  | 0.081 nJ  | 1.553 nJ                           |
| Write Latency | 0.772 ns  | 10.520 ns | RESET: 134.954 ns; SET: 264.954 ns |
| Write Energy  | 0.002 nJ  | 0.217 nJ  | RESET: 6.946 nJ; SET: 6.927 nJ     |
| Leakage Power | 18.972 mW | 3.014 mW  | -                                  |

We first compared the performance and dynamic energy consumption of pure SRAM, pure STT-RAM, and hybrid (SRAM and STT-RAM) cache architectures to determine the baseline architecture. Based on the analysis, we chose the relevant baseline architecture to compare the proposed and existing architectures throughout this work.

- **Pure SRAM cache**: We don't require any placement or migration policies in pure SRAM cache because we have only SRAM at L1.
- **Pure STT-RAM cache**: We don't require any placement or migration policies in pure STT-RAM cache because we have only STT-RAM at L1.
- **Hybrid cache**: At L1, the hybrid cache architecture (HCA) includes both SRAM and STT-RAM. We use a random placement policy in this HCA, in which a block is randomly placed in either SRAM or STT-RAM. We use a migration policy that moves blocks from one cache region to the other based on counters. We empirically determined the threshold as 7 and the size of the counters as 3 bits. If the WIC of a block in the STT-RAM cache region exceeds the threshold, we migrate that block into the SRAM cache region. Similarly, if the RIC of a block in the SRAM cache region exceeds the threshold, we migrate that block into the STT-RAM cache region.
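The baseline hybrid cache behavior (random placement plus counter-based migration) can be sketched as below. The class and method names are hypothetical. Two details are assumptions: migration triggers when a counter reaches the threshold of 7, as in the worked example of section 4.5, and both counters are reset whenever either one reaches the threshold.

```python
import random

THRESHOLD = 7
SRAM, STT_RAM = "SRAM", "STT-RAM"


class BaselineBlock:
    """One cache block in the baseline hybrid L1 (illustrative model)."""

    def __init__(self, region=None, rng=random):
        # Random placement policy: pick either region with equal probability.
        self.region = region if region is not None else rng.choice([SRAM, STT_RAM])
        self.ric = 0  # read-intensive counter
        self.wic = 0  # write-intensive counter

    def access(self, is_write):
        """Update the counters; return True if the block migrated."""
        if is_write:
            self.wic += 1
        else:
            self.ric += 1
        if self.wic < THRESHOLD and self.ric < THRESHOLD:
            return False
        # Write-intensive blocks belong in SRAM, read-intensive in STT-RAM.
        target = SRAM if self.wic >= THRESHOLD else STT_RAM
        self.ric = self.wic = 0  # assumed reset after classification
        migrated = target != self.region
        self.region = target
        return migrated


blk = BaselineBlock(region=STT_RAM)
moves = [blk.access(is_write=True) for _ in range(7)]
print(blk.region, moves.count(True))  # SRAM 1
```
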

We set the L1 size to 32KB in all three architectures. We did not use any prediction mechanism in any of these three cache architectures. In figure 6, the performance and energy consumption of the cache architectures are normalized with the pure SRAM cache architecture. As shown in figure 6, the hybrid architecture performs in between the pure SRAM and pure STT-RAM cache architectures, i.e., the hybrid architecture is better than the pure STT-RAM cache.

(b) Dynamic Energy Consumption

**Fig. 6**: Comparisons between Pure SRAM, Pure STT-RAM, and Hybrid Cache Architectures.

**Baseline Architecture:** We selected hybrid-based architecture as our baseline architecture throughout this paper with the above-mentioned modeling details.

We experimented with the baseline architecture to determine the threshold value. When the respective counter crosses its threshold, we move the block from one cache region to the other and check the energy values. The size of the counters is determined by the threshold value. For example, if the threshold is 3, the counter size is $\log_2 4 = 2$ bits (it counts from 0 to 3). We experimented with threshold values of 1, 3, 7, and 15. We observed that threshold value 7 consumes less energy than the other threshold values on average. We set the threshold to 7 because we noticed that migrations between cache regions increase when the threshold is exceeded. We also observed that NVM gets more writes if the threshold is higher than 7, increasing the HCA's energy consumption. Figure 7 shows that threshold value 7 consumed less energy than the other threshold values. The threshold value in our proposed architecture is set to 7 throughout this work for the system configuration shown in table 1. We also performed experiments to analyze the behavior of the selected threshold on our proposed architecture in section 5.3.1.
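The relation between threshold and counter size can be written down directly: a counter that must count from 0 up to the threshold needs as many bits as the binary representation of the threshold. The helper name below is illustrative.

```python
def counter_bits(threshold):
    """Bits needed for a counter that counts from 0 up to `threshold`."""
    return threshold.bit_length()


for t in (1, 3, 7, 15):
    print(f"threshold {t:2d} -> {counter_bits(t)}-bit counter")
# threshold  1 -> 1-bit counter
# threshold  3 -> 2-bit counter
# threshold  7 -> 3-bit counter
# threshold 15 -> 4-bit counter
```
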



Fig. 7: Dynamic Energy consumption for Various Threshold Values

### 5.3 Results

This section evaluates the proposed architecture under stable power and during intermittent power supply. We also evaluate the efficiency of the proposed architecture w.r.t. the traditional checkpointing approach under stable power and frequent power failures. Lastly, we evaluate the proposed architecture for $\eta$ and $\theta$ w.r.t. the baseline and existing architectures.

#### 5.3.1 Under Stable Power

We compare our proposed architecture with the baseline and the architecture proposed by Xie et al. [18]. We implemented Xie et al. [18] work to analyze both stable power and intermittent power systems. For a fair comparison, all these architectures use the same system configuration shown in table 1 and energy/delay values of STT-RAM and PCM from table 2.



Fig. 8: Write operations to STT-RAM

One of the main objectives of the proposed architecture is to reduce the number of writes to the STT-RAM cache. To achieve this, we place the write-intensive blocks in the SRAM cache. Figure 8 shows the ratio of write operations to STT-RAM to the total write accesses. A lower number of writes to STT-RAM shows the effectiveness of the proposed architecture. The percentage of writes to the STT-RAM cache shown in figure 8 is normalized with the baseline architecture. Overall, the proposed architecture reduces STT-RAM write operations from 63.35% to 35.93% compared to the baseline architecture.



**Fig. 9**: Comparisons between Proposed, Baseline, and Existing Architectures for Execution Time under Stable Power.

Reducing the STT-RAM writes also improves the endurance and lifetime of IoT nodes. The performance and energy consumption values in figures 9 and 10 are normalized with the baseline architecture. Figures 9 and 10 show better execution time and dynamic energy consumption for the proposed architecture than for the baseline and existing architectures. Compared with the architecture of Xie et al., we achieve better values because of more accurate prediction. Xie et al. [18] use a pattern sampler for prediction, whereas we maintain a PR bit for every block in our proposed architecture. The PR bit helps us with efficient block placement. If our prediction accuracy increases, the number of migrations decreases. If the number of migrations decreases, the number of writes to STT-RAM also decreases.



Fig. 10: Comparisons between Proposed, Baseline, and Existing Architectures for Energy Consumption under Stable Power.

Further, the proposed prediction table helps to decrease the number of migrations and accesses. Therefore, our architecture results in 32.85% better execution time and saves 23.42% of dynamic energy consumption than baseline architecture.

During stable power, we performed experiments to compare the traditional checkpointing approach with the proposed architecture. We used a traditional checkpointing method that creates a safe point every 4 million instructions, i.e., we save the program state to the main memory every 4 million instructions. Our proposed architecture outperforms traditional checkpointing. In traditional checkpointing, a backup occurs at each safe point, but in the proposed architecture, a backup occurs only during a power failure. We normalized the performance and energy consumption values with the traditional checkpointing approach. The proposed HCA reduces performance overhead and energy consumption by 21.03% and 22.95%, respectively, as shown in figures 11 and 12.



Fig. 11: Comparison in terms of Performance Overhead during Stable Power



Fig. 12: Comparison in terms of Energy Consumption during Stable Power

**Analysis for the threshold value:** We performed experiments to analyze the selected threshold value with our proposed architecture in detail. For these experiments, we used five MiBench benchmarks with more instructions and writes, as shown in table 3, to better understand the relationship between different thresholds and dynamic energy consumption.

We observed that the number of migrations decreases when the threshold is 7 for three out of five benchmarks. Figure 13 shows the migration energy of these five benchmarks (the total energy required to migrate blocks from SRAM/STT-RAM to STT-RAM/SRAM). For the remaining two benchmarks, the difference between threshold 7 and the other thresholds was minimal. For many benchmarks, the number of incorrect placements increases when we increase the counter size by 1 bit, which increases the number of migrations between cache regions. As a result, for the considered system configuration, threshold 7 is beneficial for many benchmarks. The threshold value depends on the system configuration.

| Benchmark | # of Instructions | # of Mem Reads | # of Mem Writes | Threshold | Dynamic Energy (mJ) |
|-----------|-------------------|----------------|-----------------|-----------|---------------------|
| qsort     | 469467835         | 73791088       | 60602059        | 1         | 89.53               |
|           |                   |                |                 | 3         | 67.19               |
|           |                   |                |                 | 7         | 68.98               |
|           |                   |                |                 | 15        | 76.11               |
| sha       | 49870857          | 13422737       | 5004930         | 1         | 103.15              |
|           |                   |                |                 | 3         | 92.33               |
|           |                   |                |                 | 7         | 71.60               |
|           |                   |                |                 | 15        | 89.79               |
| susan     | 111653876         | 24586223       | 10127992        | 1         | 116.09              |
|           |                   |                |                 | 3         | 109.67              |
|           |                   |                |                 | 7         | 91.45               |
|           |                   |                |                 | 15        | 89.21               |
| dijkstra  | 301988532         | 79999803       | 1045712         | 1         | 102.19              |
|           |                   |                |                 | 3         | 98.77               |
|           |                   |                |                 | 7         | 84.10               |
|           |                   |                |                 | 15        | 96.07               |
| basicmath | 277951743         | 24217872       | 23164606        | 1         | 73.81               |
|           |                   |                |                 | 3         | 44.01               |
|           |                   |                |                 | 7         | 23.68               |
|           |                   |                |                 | 15        | 40.77               |

Table 3: Application Memory Patterns
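As a quick cross-check, the dynamic energy values reported in Table 3 can be scanned for the best threshold per benchmark and on average. The dictionary layout and variable names below are illustrative; the numbers are copied from Table 3. Note that Table 3 reports total dynamic energy, not the migration energy of figure 13.

```python
# Dynamic energy (mJ) per benchmark and threshold, copied from Table 3.
ENERGY_MJ = {
    "qsort":     {1: 89.53,  3: 67.19,  7: 68.98, 15: 76.11},
    "sha":       {1: 103.15, 3: 92.33,  7: 71.60, 15: 89.79},
    "susan":     {1: 116.09, 3: 109.67, 7: 91.45, 15: 89.21},
    "dijkstra":  {1: 102.19, 3: 98.77,  7: 84.10, 15: 96.07},
    "basicmath": {1: 73.81,  3: 44.01,  7: 23.68, 15: 40.77},
}

# Threshold with the lowest dynamic energy for each benchmark.
best = {name: min(e, key=e.get) for name, e in ENERGY_MJ.items()}
wins_7 = sum(1 for t in best.values() if t == 7)

# Average dynamic energy per threshold across the five benchmarks.
average = {
    t: sum(e[t] for e in ENERGY_MJ.values()) / len(ENERGY_MJ)
    for t in (1, 3, 7, 15)
}
best_avg = min(average, key=average.get)

print(best)      # {'qsort': 3, 'sha': 7, 'susan': 15, 'dijkstra': 7, 'basicmath': 7}
print(wins_7)    # 3
print(best_avg)  # 7
```

Threshold 7 gives the lowest dynamic energy for three of the five benchmarks and the lowest average; for qsort and susan it is within about 2.3 mJ of the best value.
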



Fig. 13: Migration Energy consumption for Various Threshold Values

**Analysis for Different Cache Settings:** We also performed experiments with system configurations, such as cache sizes and associativity, that differ from the system configuration shown in table 1. We used 10 different cache settings for these experiments, as shown in table 4.

| Configuration | Cache Setting | Cache Size | Associativity                     |
|---------------|---------------|------------|-----------------------------------|
| 1             | 16K (0:8)     | 16KB       | 8-way (0-way SRAM, 8-way STT-RAM) |
| 2             | 16K (2:6)     | 16KB       | 8-way (2-way SRAM, 6-way STT-RAM) |
| 3             | 16K (4:4)     | 16KB       | 8-way (4-way SRAM, 4-way STT-RAM) |
| 4             | 16K (6:2)     | 16KB       | 8-way (6-way SRAM, 2-way STT-RAM) |
| 5             | 16K (8:0)     | 16KB       | 8-way (8-way SRAM, 0-way STT-RAM) |
| 6             | 32K (0:8)     | 32KB       | 8-way (0-way SRAM, 8-way STT-RAM) |
| 7             | 32K (2:6)     | 32KB       | 8-way (2-way SRAM, 6-way STT-RAM) |
| 8             | 32K (4:4)     | 32KB       | 8-way (4-way SRAM, 4-way STT-RAM) |
| 9             | 32K (6:2)     | 32KB       | 8-way (6-way SRAM, 2-way STT-RAM) |
| 10            | 32K (8:0)     | 32KB       | 8-way (8-way SRAM, 0-way STT-RAM) |

 Table 4: Different Cache Configurations used to Analyze Proposed Policies

Here, we used two different cache sizes; one is 16 KB, and the other is 32 KB. We compared the energy consumption for these 10 configurations under stable power. We used all proposed policies and techniques in these 10 sets of configurations. In figure 14, the energy consumption of the configurations {2-5} is normalized based on configuration-1 (pure STT-RAM-based cache architecture for 16KB cache size).

We observed that, for the 16KB cache size, configuration-1 consumes more energy than the other 4 configurations during stable power because STT-RAM consumes more energy than SRAM. During a stable power supply, the configuration with more SRAM ways consumes less energy than the others, regardless of the cache size. The 16K(8:0) setting consumes less energy than the other 4 configurations (because it behaves like a pure SRAM-based architecture), followed by the 16K(6:2) setting, which consumes less energy than the remaining 3 configurations. After 16K(6:2), the 16K(4:4) setting consumes less energy than the other two configurations, i.e., 16K(2:6) and 16K(0:8). Compared to the pure STT-RAM cache architecture, the 16K(6:2) setting consumes 38.10% less energy, the 16K(4:4) setting consumes 17.97% less energy, and the 16K(2:6) setting consumes 5.82% less energy, as shown in figure 14.

Fig. 14: Dynamic Energy Consumption for Different Cache Configurations under Stable Power, where Cache Size is 16KB

Similarly, for the 32KB cache size, the 32K(8:0) setting consumes less energy than the other 4 configurations (because it behaves like a pure SRAM-based architecture), followed by the 32K(6:2) setting, which consumes less energy than the remaining 3 configurations. The order is the same for the 32KB cache size because a larger SRAM gives more benefit during stable power. Compared to the pure STT-RAM cache architecture, the 32K(6:2) setting consumes 42.91% less energy, the 32K(4:4) setting consumes 26.01% less energy, and the 32K(2:6) setting consumes 13.37% less energy.

The above analysis concludes that the pure SRAM-based architecture, i.e., 16K/32K (8:0), performs best, followed by 16K/32K (6:2). However, this does not imply that the 16K/32K (8:0) and the 16K/32K (6:2) architectures are preferable, because SRAM has relatively higher leakage energy than STT-RAM, whereas STT-RAM has 3x the density of SRAM. As a result, when selecting hybrid architectures, the size of the NVMs (both at the cache and the main memory), associativity, and energy consumption must all be considered.

#### 5.3.2 Under Frequent Power failures

We assume frequent power failures happen for every 2 and 4 million instructions. We perform all experiments for one billion instructions in the gem5 simulator. We modeled three power failure scenarios, as shown in table 5. In case 1, power failures occur for every 2 million instructions. In case 2, power failures occur for every 4 million instructions. In case 3, power failures occur randomly in between 2 to 4 million instructions.

| Configuration            | Power Failure (PF) Scenario                         |
|--------------------------|-----------------------------------------------------|
| Case-1 (Proposed 2M)     | PF for every 2-Million Instructions                 |
| Case-2 (Proposed 4M)     | PF for every 4-Million Instructions                 |
| Case-3 (Proposed Random) | Random PF between every 2 to 4-Million Instructions |

 Table 5: Different Power Failure Scenarios
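The three scenarios can be sketched as a small generator of power-failure points over a one-billion-instruction run. The function name and the seed parameter are illustrative, and the uniform distribution used for case 3 is an assumption; the paper only states that failures occur randomly between 2 and 4 million instructions.

```python
import random

MILLION = 1_000_000
TOTAL_INSTRUCTIONS = 1_000 * MILLION  # one billion instructions


def failure_points(case, total=TOTAL_INSTRUCTIONS, seed=0):
    """Return the instruction counts at which power failures occur."""
    rng = random.Random(seed)
    points, position = [], 0
    while True:
        if case == 1:
            gap = 2 * MILLION
        elif case == 2:
            gap = 4 * MILLION
        elif case == 3:
            # Assumed uniform gap between 2 and 4 million instructions.
            gap = rng.randint(2 * MILLION, 4 * MILLION)
        else:
            raise ValueError("case must be 1, 2, or 3")
        position += gap
        if position > total:
            return points
        points.append(position)


for case in (1, 2, 3):
    print(f"case {case}: {len(failure_points(case))} power failures")
```

Under these assumptions, case 1 yields 500 failures and case 2 yields 250; case 3 falls between the two, depending on the seed.
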

Energy harvesting sources such as piezoelectric and vibration-based sources extract much less energy from the surroundings. In these cases, the capacitor cannot store enough energy, resulting in frequent power failures. Our proposed architecture therefore supports these worst-case scenarios. The existing work by Xie et al. made similar assumptions, assuming that a power failure occurs every 500 ms.

In the figures 15, 16, and 17, we refer to proposed 2M with case-1, proposed 4M with case-2, and proposed random with case-3. We calculated the average backup time  $(B_t)$ , i.e., the time required to backup all the SRAM contents to NVM. We also evaluate a random intermittent power system, where power failure occurs very often and randomly, to check  $B_t$  and the efficiency of the proposed architecture. The performance and energy consumption values are normalized based on the baseline architecture.



Fig. 15: Backup Time

We compare the average $B_t$ w.r.t. the baseline, as shown in figure 15. We also compared against an SRAM+PCM-based architecture to show how much performance improves during intermittent power supply. In the SRAM+PCM architecture, SRAM is the L1 cache, and PCM is the main memory. We introduced power failures randomly and a safe point every 4 million instructions. When a power failure occurs, we back up all SRAM contents to PCM. Whenever power comes back, we restart the application's execution from the nearest safe point. Compared with the SRAM+PCM architecture, the proposed architecture gives better results because it saves data at the L1 cache itself (by using STT-RAM). The proposed architecture saves the re-execution time of the application and reduces the number of writes to PCM during a power failure. The performance and energy consumption values are normalized with the baseline architecture. We compare the execution time and energy consumption with the baseline architecture during these frequent power failures, as shown in figures 16 and 17.



Fig. 16: Comparisons between Proposed, Baseline, and Existing Architectures for Execution Time under Frequent Power Failures.



**Fig. 17**: Comparisons between Proposed, Baseline, and Existing Architectures for Dynamic Energy Consumption under Frequent Power Failures.

We also compared the proposed architecture with the existing work, i.e., Xie et al. They checkpoint only selected dirty blocks from SRAM to STT-RAM during power failures. This type of checkpointing increases writes to PCM, which increases the dynamic energy consumption of their architecture. Thus, the proposed architecture achieves better execution time and energy values than the existing architecture. Whenever power comes back, the proposed architecture uses blocks from STT-RAM directly. In the work of Xie et al., STT-RAM holds fewer blocks than in the proposed architecture, which increases the execution time of the existing work.





Fig. 18: Comparison in terms of Performance Overhead and Energy Consumption during Power Failure

We compare the traditional checkpointing approach with the proposed architecture during power failures. As mentioned earlier, we implemented a traditional checkpointing approach by creating a safe point every 4 million instructions, i.e., we save the program state every 4 million instructions. After a power failure, we retrieve the program state of the nearest safe point from the main memory to continue with the remaining execution of the application. For instance, if a random power failure occurs at the $9^{th}$ million instruction, we re-execute the application from the $8^{th}$ million instruction because the nearest safe point is at the $8^{th}$ million instruction. The performance and energy consumption values are normalized based on the traditional checkpointing approach. Compared with the traditional checkpointing approach, the proposed architecture reduces performance overhead and energy consumption by 36.10% and 31.03%, respectively, as shown in figure 18.
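The re-execution cost of traditional checkpointing follows from simple integer arithmetic: execution resumes from the last safe point at or before the failure. The function names and the 6-million-instruction failure point below are hypothetical and chosen only for illustration.

```python
MILLION = 1_000_000
SAFE_POINT_INTERVAL = 4 * MILLION  # a safe point every 4 million instructions


def last_safe_point(failure_at, interval=SAFE_POINT_INTERVAL):
    """Instruction count of the most recent safe point before a failure."""
    return (failure_at // interval) * interval


def reexecuted_instructions(failure_at, interval=SAFE_POINT_INTERVAL):
    """Instructions that must be executed again after the restore."""
    return failure_at - last_safe_point(failure_at, interval)


failure = 6 * MILLION  # hypothetical failure point
print(last_safe_point(failure) // MILLION)          # 4
print(reexecuted_instructions(failure) // MILLION)  # 2
```
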

**Analysis for Different Cache Settings:** We also performed experiments with system configurations, such as cache sizes and associativity, that differ from the system configuration shown in table 1. We used 10 different cache settings for these experiments, as shown in table 4.

Here, we used two different cache sizes; one is 16 KB, and the other is 32 KB. We compared the energy consumption under an unstable power supply for the 10 configurations. We used all proposed policies and techniques in these 10 sets of configurations. In figure 19, the energy consumption of the configurations {6-9} is normalized based on configuration-10 (pure SRAM-based cache architecture for 32KB cache size).



Fig. 19: Dynamic Energy Consumption for Different Cache Configurations under Unstable Power, where Cache Size is 32KB

We observed that, for the 16KB cache size, configuration-5 consumes more energy than the other 4 configurations during unstable power because of backing up SRAM contents to PCM. During an unstable power supply, the configuration with more STT-RAM ways consumes less energy than the others, regardless of the cache size. The 16K(0:8) setting consumes less energy than the other 4 configurations (because it behaves like a pure STT-RAM-based architecture), followed by the 16K(2:6) setting, which consumes less energy than the remaining 3 configurations. After 16K(2:6), the 16K(4:4) setting consumes less energy than the other two configurations, i.e., 16K(6:2) and 16K(8:0). Compared to the pure SRAM cache architecture, the 16K(2:6) setting consumes 16.70% less energy, the 16K(4:4) setting consumes 12.19% less energy, and the 16K(6:2) setting consumes 7.11% less energy.

Similarly, for the 32KB cache size, the 32K(0:8) setting consumes less energy than the other 4 configurations (because it behaves like a pure STT-RAM-based architecture), followed by the 32K(2:6) setting, which consumes less energy than the remaining 3 configurations. The order is the same as for the 16KB cache size because a larger STT-RAM gives more benefit during unstable power, where it backs up more data and reduces both backup and restore overhead. Compared to the pure SRAM cache architecture, the 32K(2:6) setting consumes 21.10% less energy, the 32K(4:4) setting consumes 15.49% less energy, and the 32K(6:2) setting consumes 9.14% less energy, as shown in figure 19.

The above analysis concludes that the pure STT-RAM-based architecture, i.e., 16K/32K (0:8), performs best, followed by 16K/32K (2:6). However, this does not imply that the 16K/32K (0:8) and the 16K/32K (2:6) architectures are preferable, because STT-RAM has relatively high read/write latency and consumes more dynamic energy than SRAM. Thus, we used and suggest equal partitions of SRAM and STT-RAM throughout this work, which give benefits under both stable and unstable power supplies.



**Fig. 20**: Comparison of Backup Efficiency  $(\eta)$  during Power failures

As shown in figures 20 and 21, we performed experiments to analyze the backup efficiency  $(\eta)$  and energy efficiency  $(\theta)$  for both proposed and existing architectures. The  $\eta$  and  $\theta$  values are normalized with the baseline architecture. Our proposed architecture improves  $\eta$  by 32.52% and  $\theta$  by 43.41% because of the proposed backup strategy. The other reason for the improvement in both  $\eta$  and  $\theta$  is a reduction in both  $E_{backup}$  and  $B_t$ .



**Fig. 21**: Comparison of Energy Efficiency  $(\theta)$  during Power failures

Lastly, as discussed for the SRAM+PCM architecture, there is a safe point every 4 million instructions. Whenever a power failure occurs, we save the state in PCM. This type of backup policy increases writes to PCM. Whenever power comes back, the restore procedure increases the number of accesses from PCM to the SRAM cache. In a hybrid cache, STT-RAM saves some blocks, so PCM observes fewer writes and the restore requires fewer accesses to PCM. We evaluated the 32KB SRAM cache and the hybrid cache (16KB SRAM + 16KB STT-RAM) to check static power. The proposed architecture shows a 17.02% improvement in static power compared to the 32KB SRAM+PCM architecture.

## 6 Conclusions

The proposed architecture is a promising HCA for IoT embedded systems. It is beneficial for IoT applications in which power failures are frequent and unpredictable. Because of its high write latency and energy consumption, NVM introduces overhead in hybrid caches. We proposed an efficient prediction-based placement policy and an intelligent migration policy that efficiently use SRAM and STT-RAM. We reduce the number of writes to STT-RAM by effectively using the proposed prediction table. In comparison to the baseline architecture, the proposed architecture reduces STT-RAM writes from 63.35% to 35.93%. As a result, energy consumption and execution time are reduced.

We compared the proposed architecture to state-of-the-art and baseline architectures. The proposed architecture improves energy and backup efficiency. We proposed a backup strategy to ensure the efficient backup of the program state. During a power failure, the proposed backup strategy helps to recognize important blocks and migrate them to the STT-RAM cache. When compared to the baseline and existing architectures, the proposed architecture requires less backup time. When power comes back, we use the STT-RAM contents without any restoration procedure.

## References

- [1] Golpîra, H., Khan, S.A.R., Safaeipour, S.: A review of logistics internet-of-things: Current trends and scope for future research. Journal of Industrial Information Integration, 100194 (2021)
- [2] Hu, X., Xu, L., Lin, X., Pecht, M.: Battery lifetime prognostics. Joule 4(2), 310–346 (2020)
- [3] Ma, D., Lan, G., Hassan, M., Hu, W., Das, S.K.: Sensing, computing, and communications for energy harvesting iots: A survey. IEEE Communications Surveys & Tutorials 22(2), 1222–1250 (2019)
- [4] Mamen, A., Supatti, U.: A survey of hybrid energy storage systems applied for intermittent renewable energy systems. In: 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 729– 732 (2017). IEEE
- [5] Liu, Y., Li, H., Li, X., Xue, J.C., Xie, Y., Yang, H.: Self-powered wearable sensor node: Challenges and opportunities. In: International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pp. 189–189 (2015). IEEE
- [6] Martinez, B., Monton, M., Vilajosana, I., Prades, J.D.: The power of models: Modeling power consumption for iot devices. IEEE Sensors Journal 15(10), 5777–5789 (2015)
- [7] Lucia, B., Balaji, V., Colin, A., Maeng, K., Ruppel, E.: Intermittent computing: Challenges and opportunities. 2nd Summit on Advances in Programming Languages (SNAPL 2017) (2017)
- [8] Surbatovich, M., Lucia, B., Jia, L.: Towards a formal foundation of intermittent computing. Proceedings of the ACM on Programming Languages 4(OOPSLA), 1–31 (2020)
- [9] Hester, J., Sorber, J.: The future of sensing is batteryless, intermittent, and awesome. In: Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems, pp. 1–6 (2017)
- [10] Jog, A., Mishra, A.K., Xu, C., Xie, Y., Narayanan, V., Iyer, R., Das, C.R.: Cache revive: Architecting volatile stt-ram caches for enhanced performance in cmps. In: Design Automation Conference, pp. 243–252 (2012). IEEE

- [11] Manohar, S.S., Kapoor, H.K.: Capmig: Coherence aware block placement and migration in multi-retention stt-ram caches. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022)
- [12] Sarkar, A., Singh, N., Venkitaraman, V., Singh, V.: Dam: Deadblock aware migration techniques for stt-ram-based hybrid caches. IEEE Computer Architecture Letters 20(1), 62–4 (2021)
- [13] Agarwal, S., Chakraborty, S.: Abaca: Access based allocation on set wise multi-retention in stt-ram last level cache. In: 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 171–174 (2021). IEEE
- [14] Pan, C., Xie, M., Hu, J., Chen, Y., Yang, C.: 3m-pcm: Exploiting multiple write modes mlc phase change main memory in embedded systems. In: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pp. 1–10 (2014)
- [15] Lee, A., Lo, C.-P., et al.: A reram-based nonvolatile flip-flop with self-write-termination scheme for frequent-off fast-wake-up nonvolatile processors. IEEE Journal of Solid-State Circuits 52(8), 2194–2207 (2017)
- [16] Instruments, T.: MSP430FR5969 launchpad development kit (2018)
- [17] Wang, Z., Jiménez, D.A., Xu, C., Sun, G., Xie, Y.: Adaptive placement and migration policy for an stt-ram-based hybrid cache. In: IEEE 20th International Symposium on High Performance Computer Architecture, pp. 13–24. IEEE, ??? (2014). IEEE
- [18] Xie, M., Pan, C., Zhang, Y., Hu, J., Liu, Y., Xue, C.J.: A novel stt-rambased hybrid cache for intermittently powered processors in iot devices. IEEE Micro 39(1), 24–32 (2018)
- [19] Ma, K., Zheng, Y., Li, S., Swaminathan, K., Li, X., Liu, Y., Sampson, J., Xie, Y., Narayanan, V.: Architecture exploration for ambient energy harvesting nonvolatile processors. In: IEEE 21st International Symposium on High Performance Computer Architecture, pp. 526–537. IEEE, ??? (2015). IEEE
- [20] Mao, M., Li, H., Jones, A.K., Chen, Y.: Coordinating prefetching and stt-ram based last-level cache management for multicore systems. In: Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, pp. 55–60. ACM New York, NY, USA, ??? (2013)
- [21] Sun, Z., Bi, X., Li, H., Wong, W.-F., Ong, Z.-L., Zhu, X., Wu, W.: Multi retention level stt-ram cache designs with a dynamic refresh scheme. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on

Microarchitecture, pp. 329–338. ACM New York, NY, USA, ??? (2011)

- [22] Smullen, C.W., Mohan, V., Nigam, A., Gurumurthi, S., Stan, M.R.: Relaxing non-volatility for fast and energy-efficient stt-ram caches. In: IEEE 17th International Symposium on High Performance Computer Architecture, pp. 50–61. IEEE, ??? (2011). IEEE
- [23] Li, H., Liu, Y., Zhao, Q., Gu, Y., Sheng, X., Sun, G., Zhang, C., Chang, M.-F., Luo, R., Yang, H.: An energy efficient backup scheme with low inrush current for nonvolatile sram in energy harvesting sensor nodes. In: Design, Automation & Test in Europe Conference & Exhibition, pp. 7–12. IEEE, ??? (2015). IEEE
- [24] Wu, X., Li, J., Zhang, L., Speight, E., Rajamony, R., Xie, Y.: Design exploration of hybrid caches with disparate memory technologies. ACM Transactions on Architecture and Code Optimization 7(3), 1–34 (2010)
- [25] Kim, N., Ahn, J., Choi, K., Sanchez, D., Yoo, D., Ryu, S.: Benzene: An energy-efficient distributed hybrid cache architecture for manycore systems. ACM Transactions on Architecture and Code Optimization 15(1), 1–23 (2018)
- [26] Zhao, J., Xu, C., Zhang, T., Xie, Y.: Bach: A bandwidth-aware hybrid cache hierarchy design with nonvolatile memories. Journal of Computer Science and Technology **31**(1), 20–35 (2016)
- [27] Ahn, J., Yoo, S., Choi, K.: Prediction hybrid cache: An energy-efficient sttram cache architecture. IEEE Transactions on Computers 65(3), 940–951 (2015)
- [28] Gao, L., Wang, R., Xu, Y., Yang, H., Luan, Z., Qian, D., Zhang, H., Cai, J.: Sram-and stt-ram-based hybrid, shared last-level cache for on-chip cpu–gpu heterogeneous architectures. The Journal of Supercomputing 74(7), 3388–3414 (2018)
- [29] Yao, J., Ma, J., Chen, T., Hu, T.: An energy-efficient scheme for stt-ram l1 cache. In: IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 1345–1350. IEEE, ??? (2013). IEEE
- [30] Xie, M., Zhao, M., Pan, C., Hu, J., Liu, Y., Xue, C.J.: Fixing the broken time machine: Consistency-aware checkpointing for energy harvesting powered non-volatile processor. In: Proceedings of the 52nd Annual Design Automation Conference, pp. 1–6. ACM New York, NY, USA, ??? (2015)

- [31] Liu, Y., Suy, F., Wangy, Z., Yang, H.: Design exploration of inrush current aware controller for nonvolatile processor. In: 2015 IEEE Non-Volatile Memory System and Applications Symposium, pp. 1–6. IEEE, ??? (2015). IEEE
- [32] Zhou, Y., Zhao, M., Ju, L., Xue, C.J., Li, X., Jia, Z.: Energy-aware morphable cache management for self-powered non-volatile processors. In: IEEE 23rd International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 1–7. IEEE, ??? (2017). IEEE
- [33] Xie, M., Zhao, M., Pan, C., Li, H., Liu, Y., Zhang, Y., Xue, C.J., Hu, J.: Checkpoint aware hybrid cache architecture for nv processor in energy harvesting powered systems. In: International Conference on Hardware/-Software Codesign and System Synthesis, pp. 1–10. IEEE, ??? (2016). IEEE
- [34] Ransford, B., Sorber, J., Fu, K.: Mementos: System support for longrunning computation on rfid-scale devices. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 159–170 (2011)
- [35] Balsamo, D., Weddell, A.S., Das, A., Arreola, A.R., Brunelli, D., Al-Hashimi, B.M., Merrett, G.V., Benini, L.: Hibernus++: a self-calibrating and adaptive system for transiently-powered embedded devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35(12), 1968–1980 (2016)
- [36] Kortbeek, V., Yildirim, K.S., Bakar, A., Sorber, J., Hester, J., Pawełczak, P.: Time-sensitive intermittent computing meets legacy software. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 85–99 (2020)
- [37] Su, F., Liu, Y., Wang, Y., Yang, H.: A ferroelectric nonvolatile processor with 46 \μ s system-level wake-up time and 14\μ s sleep time for energy harvesting applications. IEEE Transactions on Circuits and Systems I: Regular Papers 64(3), 596–607 (2016)
- [38] Choi, J., Joe, H., Kim, Y., Jung, C.: Achieving stagnation-free intermittent computation with boundary-free adaptive execution. In: IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 331–344 (2019). IEEE
- [39] Binkert, N., Beckmann, B., Black, G., Reinhardt, S.K., Saidi, A., Basu, A., Hestness, J., Hower, D.R., Krishna, T., Sardashti, S., et al.: The gem5 simulator. ACM SIGARCH computer architecture news 39(2), 1–7 (2011)

- [40] Sun, G., Dong, X., Xie, Y., Li, J., Chen, Y.: A novel architecture of the 3d stacked mram l2 cache for cmps. In: IEEE 15th International Symposium on High Performance Computer Architecture, pp. 239–249 (2009). IEEE
- [41] Wu, X., Li, J., Zhang, L., Speight, E., Rajamony, R., Xie, Y.: Hybrid cache architecture with disparate memory technologies. ACM SIGARCH computer architecture news 37(3), 34–45 (2009)
- [42] Li, J., Xue, C.J., Xu, Y.: Stt-ram based energy-efficiency hybrid cache for cmps. In: IEEE/IFIP 19th International Conference on VLSI and Systemon-Chip, pp. 31–36 (2011). IEEE
- [43] Jadidi, A., Arjomand, M., Sarbazi-Azad, H.: High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement. In: IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 79–84 (2011). IEEE
- [44] Choi, J.-H., Park, G.-H.: Nvm way allocation scheme to reduce nvm writes for hybrid cache architecture in chip-multiprocessors. IEEE Transactions on Parallel and Distributed Systems 28(10), 2896–2910 (2017)
- [45] Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: Mibench: A free, commercially representative embedded benchmark suite. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, pp. 3–14 (2001). IEEE
- [46] Dong, X., Xu, C., Xie, Y., Jouppi, N.P.: Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems **31**(7), 994–1007 (2012)