
RRAM memory systems have been demonstrated to offer low switching power [23] and higher density [29] compared to conventional 6T-SRAM cells. Fabrication of large arrays [7, 8, 19], along with integration into the CMOS process flow, affirms their potential at industry scale [14, 26]. RRAMs also offer an additional power advantage through their non-volatility: the crossbar array can be isolated from the power source when not in use, saving power [24]. However, the application of RRAMs as an alternative to SRAM in caches has been restricted by their high latencies [17]. RRAMs show long write times (10–100 ns), limiting their performance in programs requiring frequent and fast write operations [11, 16]. Additionally, fabricated large arrays have shown high read latencies [7, 8, 19].

Fig. 1. (a) RRAM I–V curve generated using the model of [9]. (b) Circuit configuration during reading: the line capacitance must be charged through the RRAM. (c) Dependence of read latency on array size for SRAM and RRAM; RRAM shows significantly higher latency. SRAM latencies are extracted using CACTI, while for RRAM the capacitance charge-up time is calculated as a function of array size. (d) Write energy vs. read latency trade-off for different RRAM resistance values. The simulation curve assumes a write latency of 10 ns.

Prior explorations of RRAM for memory applications have leveraged its key density advantage [2]. Specifically, RRAM is attractive for L2–L4 caches from an area and energy perspective at a small performance penalty. Thus, RRAMs have mostly been used as last-level caches [6, 20, 30] and as main memory [18], where higher latencies are better tolerated. L1 cache read latency requirements, however, are more stringent, so reducing read latency becomes essential for applying RRAMs in lower-level caches. The L1 instruction cache requires fast reads but sees relatively few writes, making it an excellent application for benchmarking fast reading in RRAM against conventional SRAM.

Typical RRAM characteristics (Fig. 1(a)) show that read and write currents are of similar magnitude. Reducing the write current to lower write energy therefore also reduces the read current, and a lower read current takes longer to charge the line capacitance (\(C_{line}\)) to the level the sense amplifier needs to read (Fig. 1(b)). Naturally, larger arrays have larger read latency, since \(C_{line}\) increases with array size, consistent with the literature [8, 19]. The read latency as a function of array size is shown in Fig. 1(c) for both SRAM and RRAM; RRAM exhibits significantly higher read latency. Thus a write energy vs. read latency trade-off is observed in RRAMs across the literature [10, 22, 23, 25, 28]. Using a low-resistance device for quick reading might seem attractive, with resistance values reported for RRAMs varying over three orders of magnitude (10 k\(\varOmega \)–10 M\(\varOmega \)) [10, 22, 23, 25, 28]. However, such high-current devices increase the energy consumption of the array in proportion to the current, which makes competing with SRAM difficult (Fig. 1(d)). Thus, an attractive RRAM with both low write energy and low read latency is precluded by the write energy vs. read latency trade-off.
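The trade-off can be illustrated with a first-order model: read latency set by RC charging of the line capacitance through the RRAM, and write energy by Joule heating during a fixed-width write pulse. The following sketch uses illustrative constants (the voltages, capacitance, and sense threshold are assumptions, not the paper's simulation values):

```python
import numpy as np

# First-order model behind Fig. 1(d). All constants are illustrative
# assumptions, not the paper's simulation values.
V_WRITE = 0.8       # write voltage (V), the value assumed in Sect. 2.3
T_WRITE = 10e-9     # write pulse width (s), as assumed in Fig. 1(d)
C_LINE = 15e-15     # line capacitance (F), order of a 256x64 array
SENSE_FRAC = 0.5    # fraction of the final value the SA must see (assumed)

for r_rram in [1e4, 1e5, 1e6, 1e7]:            # 10 kOhm .. 10 MOhm
    # Read: RRAM charges C_line; t = R * C * ln(1 / (1 - SENSE_FRAC))
    t_read = r_rram * C_LINE * np.log(1.0 / (1.0 - SENSE_FRAC))
    # Write: Joule heating across the RRAM for the fixed pulse width
    e_write = (V_WRITE ** 2 / r_rram) * T_WRITE
    print(f"R = {r_rram:>10.0f} Ohm | read latency ~ {t_read * 1e9:7.3f} ns"
          f" | write energy ~ {e_write * 1e15:8.2f} fJ")
```

Even this crude model reproduces the tension: a 10 k\(\varOmega \) device reads in well under a nanosecond but burns orders of magnitude more write energy than a 10 M\(\varOmega \) device, which in turn takes on the order of 100 ns to read.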

In this paper, we present a bitcell design that mitigates the write energy vs. read latency trade-off, and we evaluate its impact on instruction-cache-level performance. First, we present a modified 2T1R bitcell, formed by adding one NMOS transistor to the conventional select-transistor (1T) plus RRAM (1R) 1T1R bitcell, to produce a high read current. Second, we show that our proposal ‘breaks’ the write energy vs. read latency trade-off at the cost of increased bitcell area. Third, we analyze the effect at the architectural level for instruction cache replacement by comparing our proposal with the conventional 1T1R bitcell and SRAM in terms of energy-delay product (EDP) for both high-performance and embedded processor configurations.

1 Proposal

1.1 Fast Read Solution

Fig. 2. Bitcells for (a) SRAM, (b) the conventional 1T1R, and (c) the proposed 2T1R scheme. (d–f) Array-level schematics for these schemes. (g) Schematic for architecture-level evaluation, where the smaller RRAM bitcell area accommodates more memory.

Figure 2(a–c) shows the bitcells for the SRAM and RRAM schemes. In the conventional 1T1R bitcell [15], an NMOS select transistor (1T) selects an RRAM (1R) for reading or writing; the read-write scheme is taken from [29]. We propose applying the voltage (\(V_{gate-select}\)) at the node between the 1T and 1R devices to the gate of an additional NMOS transistor, forming the 2T1R bitcell shown in Fig. 2(c). The drain and source of this read transistor are connected to an additional bitline and wordline. With this modification, the read transistor, rather than the RRAM, supplies the read current that charges the line capacitance at the sense amplifier. The read transistor can therefore be designed independently for fast reads, while the RRAM resistance can be increased for low energy, disabling the write energy vs. read latency trade-off.

Array-level schematics of these bitcells are shown in Fig. 2(d–f). The conventional 1T1R scheme has one bitcell at each crossbar intersection, with two wordlines and one bitline. A complementary pass-transistor switch serves as the row selector (RS), applying different voltages for reading and writing. A select line turns on the select transistors when a row is to be read or written, and the bitline connects the bitcell to the sense amplifier through a column selector (CS) during reads. The proposed scheme requires one additional pair of wordline and bitline connecting the source and drain of the read transistor, as shown in Fig. 2(f).

1.2 Targeted Application

Fig. 3. Cache memory schematic with RRAM accommodating more memory in the same area as SRAM. This area advantage, along with the fast read capability, is examined for the potential of RRAM as an L1 instruction cache substitute.

Fast reading achieved by the proposed scheme is tested for L1 instruction cache replacement, since the instruction cache sees frequent reads and infrequent writes. This imposes an aggressive read latency requirement but tolerates slow writes. We exploit the area advantage offered by RRAM to accommodate a larger memory in the same area, and thus intend to compensate for the high write latency of RRAM with a larger cache (Fig. 3): fewer L1 misses mean fewer data fetches from L2, reducing the frequency of write operations. We study the performance of the proposed 2T1R scheme for both high-performance and embedded architectures.

2 New Read Scheme

2.1 Circuit Schematic

Fig. 4. Circuit schematic during reading for (a) the conventional 1T1R and (b) the proposed 2T1R schemes. The additional read transistor in the 2T1R scheme provides a high-current path for the read current, enabling quick reading.

Figure 4 shows the circuit schematics for both RRAM schemes in the reading configuration. For the conventional 1T1R scheme, \(V_{read}\) is applied through the RS switch at the wordline and the CS is grounded as shown. The voltage is divided across the RRAM and \(R_{CS}\) and is sensed by the sense amplifier (SA) [27]. \(C_{line}\) denotes the line capacitance to be charged by the RRAM read current, which sets the charge-up time. The proposed scheme has a similar reading configuration, with RS and CS driving the two wordlines and bitlines to the voltages shown. The select transistor is biased at \(V_{read-select}\) and, together with the RRAM, acts as a resistive voltage divider producing a voltage (\(V_{gate-select}\)) that drives the read transistor. The read transistor conducts a high current, charging the line capacitance (\(C_{line}\)) and the \(SA_{in}\) terminal. A diode-connected transistor in the CS (drain shorted to gate) acts as a current-to-voltage converter to drive the SA.

Fig. 5. Equivalent circuits for (a) the conventional 1T1R and (b) the proposed 2T1R schemes. The decrease in time constant results from the reduction in the load capacitance that must be charged.

In this scheme, the RRAM resistance primarily charges the ‘small’ gate capacitance of the read transistor, while the read transistor charges the substantial line capacitance. Approximate charging time constants show the 1T1R scheme charging the line capacitance (\(\sim \)10 fF) while the proposed scheme charges only the small gate capacitance (\(\sim \)0.1 fF), resulting in faster reading (Fig. 5).
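Given the equivalent circuits of Fig. 5, the speedup follows directly from the two RC products. A back-of-the-envelope sketch using the 1 M\(\varOmega \) LRS of Sect. 2.2 and the approximate capacitances quoted above:

```python
# Back-of-the-envelope time constants for the equivalent circuits of
# Fig. 5. R_RRAM is the 1 MOhm LRS of Sect. 2.2; the capacitances are
# the approximate values quoted in the text.
R_RRAM = 1e6        # Ohm
C_LINE = 10e-15     # F, charged by the RRAM in the 1T1R scheme
C_GATE = 0.1e-15    # F, charged by the RRAM in the 2T1R scheme

tau_1t1r = R_RRAM * C_LINE   # RRAM must charge the full line capacitance
tau_2t1r = R_RRAM * C_GATE   # RRAM charges only the read-transistor gate;
                             # the low-impedance read transistor then
                             # drives C_line, adding little delay
print(f"tau(1T1R) ~ {tau_1t1r * 1e9:.1f} ns")    # ~10 ns
print(f"tau(2T1R) ~ {tau_2t1r * 1e9:.2f} ns")    # ~0.1 ns
```

The roughly 100\(\times \) reduction in the capacitance seen by the RRAM translates directly into a 100\(\times \) smaller time constant.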

2.2 Maximizing Sense Margin

Choosing \(V_{read-select}\) determines the gate voltage of the read transistor. We use the Verilog-A RRAM model [9] and a 45 nm CMOS technology [31] for HSPICE circuit simulations. From the vast range of resistances reported in the literature, we consider a high-resistance RRAM with low resistance state (LRS) = 1 M\(\varOmega \) and high resistance state (HRS) = 10 M\(\varOmega \), as described in Fig. 6. We fix \(R_{RRAM}\) and vary \(V_{read-select}\); the task is to create the maximum voltage difference at the input of the SA for the given LRS and HRS. For this choice of RRAM resistances, \(V_{read-select}\) = 280 mV turns the read transistor on for LRS and keeps it off for HRS (Fig. 6(b)). The read current therefore shows the maximum difference between the two resistance states (Fig. 6(c)). The read current is converted to a voltage and applied to the input of the SA, as shown in Fig. 6(d), giving the maximum sense margin.
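The LRS-on/HRS-off discrimination at the chosen bias can be sketched with a first-order square-law model. The divider orientation, threshold voltage, and prefactor below are assumptions for illustration only; the paper extracts Fig. 6 from HSPICE with the Verilog-A RRAM model [9] in 45 nm CMOS [31]:

```python
# First-order illustration of the LRS-on / HRS-off discrimination at
# the chosen bias. The divider orientation, threshold, and square-law
# prefactor are assumptions; the actual curves come from HSPICE.
R_LRS, R_HRS = 1e6, 10e6   # RRAM states (Ohm), from Sect. 2.2
R_DIV = 3e6                # effective select-path resistance (assumed)
V_TH = 0.15                # read-transistor threshold voltage (assumed)
K = 100e-6                 # square-law prefactor (A/V^2, assumed)
V_READ_SELECT = 0.28       # chosen bias (V), from Fig. 6

for name, r in (("LRS", R_LRS), ("HRS", R_HRS)):
    # Divider between the select path and the RRAM sets the gate
    # voltage (orientation assumed so that LRS pulls the gate high).
    v_gate = V_READ_SELECT * R_DIV / (R_DIV + r)
    i_read = K * max(v_gate - V_TH, 0.0) ** 2   # OFF below threshold
    print(f"{name}: V_gate-select = {v_gate * 1e3:5.1f} mV, "
          f"I_read = {i_read * 1e6:.2f} uA")
```

Under these assumed constants, LRS produces a gate voltage above threshold and a finite read current, while HRS leaves the read transistor off, which is the behaviour the bias point in Fig. 6 is chosen to maximize.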

Fig. 6. Dependence of (a) \(V_{gate-select}\), (b) \(I_{read}\), and (c) \(V_{SA-in}\) on \(V_{read-select}\). \(V_{read-select}=0.28\) V is selected to maximize the swing at the input of the SA.

2.3 Breaking the Trade-off

Figure 7(a) shows the timing diagrams for reading in both the 1T1R and 2T1R schemes. An RRAM switching voltage of 0.8 V is assumed, within the range of switching voltages (0.2–4 V) reported in the literature [10, 22, 23, 25, 28]. An array size of 256\(\,\times \,\)64 is taken for calculating the line capacitance; a line capacitance of 15.07 fF is obtained using the 45 nm interconnect technology manual [21].

The voltage at the input terminal of the SA is shown for both schemes, along with the SA output, in Fig. 7(a). The conventional scheme requires more than 2 ns for \(SA_{in}\) to charge up and the SA output to respond. In contrast, the proposed scheme charges the SA input terminal quickly, with a response time below one nanosecond. Figure 7(b) shows how the SA output response time increases with the device LRS resistance at a fixed LRS/HRS ratio. For the 1T1R scheme, the read time increases drastically with device resistance. In comparison, the proposed bitcell shows no latency degradation with \(R_{LRS}\), enabling fast reads (<1 ns) even with high-resistance devices. However, for very large resistance (100 M\(\varOmega \)) the performance of the proposed scheme degrades, because the extremely low RRAM current takes a long time to charge the gate capacitance of the read transistor. Figure 7(c) shows the reduced read latency of the proposed scheme for various RRAM resistances. The circuit consumes write energy comparable to the conventional scheme at much lower read latencies, thus breaking the trade-off (Fig. 7(d)).
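Both the flat 2T1R latency in Fig. 7(b) and its eventual degradation at 100 M\(\varOmega \) can be reproduced by a two-stage model: the RRAM charges only the read-transistor gate, after which the read transistor charges the line. A sketch with assumed constants (only \(C_{line}\) is the paper's value [21]):

```python
import numpy as np

# Two-stage read-latency model for the 2T1R scheme (sketch): the RRAM
# first charges the read-transistor gate, then the read transistor
# charges the line. Constants other than C_LINE [21] are assumptions.
C_GATE = 0.1e-15     # read-transistor gate capacitance (F)
C_LINE = 15.07e-15   # line capacitance for a 256x64 array (F) [21]
R_READ_TR = 10e3     # read-transistor effective on-resistance (assumed)
LN2 = np.log(2.0)    # charge to half of the final value (assumed)

for r_lrs in [1e5, 1e6, 1e7, 1e8]:
    t_1t1r = r_lrs * C_LINE * LN2                    # RRAM drives the line
    t_2t1r = (r_lrs * C_GATE + R_READ_TR * C_LINE) * LN2
    print(f"R_LRS = {r_lrs:>10.0f} Ohm | 1T1R ~ {t_1t1r * 1e9:8.2f} ns"
          f" | 2T1R ~ {t_2t1r * 1e9:6.3f} ns")
```

In this model the 2T1R latency stays well below 1 ns up to 10 M\(\varOmega \), but the gate-charging term \(R_{LRS} \cdot C_{gate}\) dominates at 100 M\(\varOmega \), mirroring the degradation described above.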

Fig. 7. (a) Faster RC charging in the proposed scheme improves the response time. (b) Read latency increases with the LRS resistance (HRS/LRS ratio kept constant); the degradation is mitigated by the proposed 2T1R scheme. (c) Isolating the RRAM from the critical read path decouples the read and write current paths, breaking the write energy vs. read latency trade-off.

3 Architecture Level Performance Estimation

Fig. 8. Memory banks for SRAM and RRAM with components colour coded (Color figure online).

Figure 8 shows the schematics for the SRAM and RRAM memory banks. The components are divided into four colour-coded categories as shown; the same CMOS peripheral components are used in both cases, while components specific to one of the schemes are shown in blue. An SRAM bank of size 16\(\,\times \,\)8 is simulated using the NCSU SRAM compiler [13] in 45 nm technology [31]; an RRAM bank of the same size is simulated in HSPICE. The energies are calculated and scaled to an array size of 256\(\,\times \,\)64. The comparison of area and energy between the conventional and proposed schemes is explained next.

3.1 Energy

Both the RRAM and SRAM arrays are simulated in HSPICE, and the instantaneous power is recorded for each component. The total energy consumed in a write or read cycle is then calculated. For the RRAM array, the devices are switched from LRS to HRS to obtain the write energy, while the energy spent reading a device in LRS is taken as the read energy. The write latency is assumed to be 10 ns and the read latency 1 ns. The standby (SB) energy is calculated by turning the RS and CS off, isolating the array from the power source. For SRAM, all bits are programmed to state 0 and then flipped to extract the write energy. These per-cycle energy values are used in the architectural simulations to calculate the total energy consumed. Component-wise write, read, and standby energies for the SRAM and RRAM banks are shown in Fig. 9(a).
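The per-cycle extraction amounts to integrating each component's instantaneous power trace over the cycle. A minimal sketch (the trace below is a placeholder; in practice it would be parsed from the HSPICE output):

```python
import numpy as np

# Per-cycle energy extraction sketch: integrate each component's
# instantaneous power trace over the cycle. The trace below is a
# placeholder; in practice it is parsed from the HSPICE output.
def cycle_energy(t, p):
    """Energy (J) from time stamps t (s) and instantaneous power p (W)."""
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))  # trapezoid

t = np.linspace(0.0, 10e-9, 1001)      # assumed 10 ns write cycle
p = 50e-6 * np.ones_like(t)            # placeholder 50 uW power trace
print(f"write energy ~ {cycle_energy(t, p) * 1e15:.0f} fJ")    # ~500 fJ
```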

RRAM consumes significantly more energy than SRAM during writes because of the long write timescales, which makes reducing write energy through highly resistive devices important. The read energies of RRAM and SRAM are comparable. In both reading and writing, however, the proposed scheme consumes more energy than the conventional one due to its additional transistors. The RRAM-specific components (i.e., the row/column selectors) consume the major fraction of power in writing, while the SA consumes the major fraction in reading. In standby mode, SRAM consumes static power, whereas the RRAM array is disconnected from its peripherals by turning off the RS and CS pass-gate switches, resulting in a 20\(\times \) energy reduction in SB mode. These energy values are used in the cache replacement simulations presented in the next section.

Fig. 9. (a) Energy in the SRAM and RRAM banks in different regimes of operation. (b) Component-wise area consumption in the memory banks.

3.2 Area

Figure 9(b) shows the area reduction achieved by using RRAM in place of SRAM in the bitcell array. The SRAM bitcell area is reported to be \(146F^2\) [29], while the RRAM 1T1R structure consumes \(>8F^2\) [5]. We assume conservative estimates of \(10F^2\) for the 1T1R and \(25F^2\) for the 2T1R scheme, giving area advantages of 14.6\(\times \) and 5.84\(\times \) for the conventional and proposed schemes, respectively. Part of this advantage is taken up by the RRAM-specific blocks shown in Fig. 8, such as the new sense amplifiers and drivers. Thus, the 1T1R and 2T1R schemes can conservatively pack 8\(\times \) and 4\(\times \) more memory than SRAM. A more rigorous area estimation is left for future work.
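The headline numbers follow from simple ratios of the bitcell areas:

```python
# Area-advantage arithmetic of Sect. 3.2 (bitcell areas in F^2).
SRAM_CELL = 146                       # 6T-SRAM bitcell area [29]
RRAM_CELL = {"1T1R": 10, "2T1R": 25}  # conservative assumptions above
for name, area in RRAM_CELL.items():
    print(f"{name}: {SRAM_CELL / area:.2f}x bitcell-area advantage")
# After budgeting for the RRAM-specific peripherals of Fig. 8, the text
# rounds these down to 8x (1T1R) and 4x (2T1R) extra memory capacity.
```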

3.3 Cache Replacement

Table I shows the simulation parameters used. We use both high-performance (x86) [3] and embedded (ARM) [4] processor architectures to analyze the applicability of this scheme (Fig. 10). We use 10 programs from the MiBench benchmark suite [12] for the embedded architecture and 10 programs from the SPEC benchmark suite [1] for the high-performance architecture. 2 kB (256\(\,\times \,\)64) memory banks are used, as shown in Fig. 10. We replace the L1 instruction cache in the HP processor and the L0 instruction cache in the embedded processor with RRAM in gem5 simulations [5].

Fig. 10. Processor configuration in (a) high-performance and (b) embedded architectures. (c) Memory specifications for the instruction cache.

Throughput. The throughput (instructions per cycle, IPC) for each case is extracted from gem5 [5] simulations. Figure 11(a) shows that the conventional 1T1R scheme causes a large IPC degradation, while the proposed scheme is comparable to SRAM, with only 0.1% degradation for the HP and 1.6% for the embedded architecture. This is because the proposed scheme, along with faster reading, provides a larger memory size, which reduces the miss rate, i.e., the frequency of requests to the L2 cache for data to be stored in L1. The resulting reduction in high-latency write instances enables the improved performance.

Fig. 11. Comparison of performance parameters for the conventional and proposed schemes, normalized to SRAM, for the high-performance x86 architecture. (a) Marginal degradation in mean throughput relative to the SRAM baseline, attributed to the larger memory size at a read latency equal to that of SRAM. (b) RRAM consumes less energy under both the proposed and conventional schemes. (c) Mean energy-delay product reduces by 82% compared to SRAM.

Energy. The energy per read/write/SB cycle extracted in Sect. 3.1 (Fig. 9(a)) is used to calculate the total energy consumption for the different benchmark programs. The number of read/write accesses is extracted from the simulations to obtain the energy spent in reading and writing; during the remaining cycles, the memory consumes standby energy. The normalized energy plot in Fig. 11(b) shows a reduction in energy for both RRAM schemes: a mean energy reduction of 81% for HP and 53% for embedded programs. The energy saving in RRAMs comes from the large number of cycles spent in standby mode, since RRAMs have very low SB energy consumption compared to SRAMs.
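The accounting behind Fig. 11(b) can be sketched as follows (the access and cycle counts are placeholders; real values come from the gem5 statistics):

```python
# Total-energy accounting behind Fig. 11(b) (sketch). Per-cycle
# energies come from the bank simulations (Fig. 9(a)); access and
# cycle counts come from gem5. All numbers below are placeholders.
def total_energy(n_read, n_write, n_cycles, e_read, e_write, e_sb):
    """Program energy: active read/write cycles plus standby cycles."""
    n_sb = n_cycles - n_read - n_write   # remaining cycles sit in SB
    return n_read * e_read + n_write * e_write + n_sb * e_sb

e = total_energy(n_read=5_000_000, n_write=50_000, n_cycles=10_000_000,
                 e_read=50e-15, e_write=500e-15, e_sb=1e-15)
print(f"total instruction-cache energy ~ {e * 1e9:.1f} nJ")
```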

Energy-Delay Product (EDP). The energy calculated above is used to compare the performance of RRAM and SRAM using the EDP, defined only for the instruction cache. The delay is defined as the duration for which the cache is active for writing or reading. Mathematically,

$$L1_{delay} = L1_{read\ latency} + L1_{miss\ rate} \times (L1_{write\ latency}+ L2_{access\ latency})$$
$$L2_{access\ latency} = L2_{read\ latency} + L2_{miss\ rate}\times (L2_{write\ latency} + L3_{access\ latency})$$
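As a concrete reading of this model, the sketch below evaluates the recursion, truncated at the L3 access latency as in the second equation; the latencies and miss rates are placeholders, not the Table I parameters:

```python
# Direct transcription of the delay model above; the recursion is
# truncated at the L3 access latency. Latencies (ns) and miss rates
# are placeholders, not the Table I parameters.
def l2_access(l2_read, l2_miss, l2_write, l3_access):
    return l2_read + l2_miss * (l2_write + l3_access)

def l1_delay(l1_read, l1_miss, l1_write, l2_acc):
    return l1_read + l1_miss * (l1_write + l2_acc)

l2_acc = l2_access(l2_read=5.0, l2_miss=0.05, l2_write=5.0, l3_access=30.0)
d = l1_delay(l1_read=1.0, l1_miss=0.02, l1_write=10.0, l2_acc=l2_acc)
edp = 280e-9 * d   # EDP = total energy (J) x delay (ns), per the text
print(f"L1 delay ~ {d:.3f} ns, EDP ~ {edp:.3e} J*ns")
```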

The normalized EDP (Fig. 11(c)) reduces by 82% for the HP and 53% for the embedded processor under the proposed scheme. Table II summarizes the cache replacement results.

4 Conclusion

We proposed a circuit modification of the conventional 1T1R RRAM bitcell that boosts the read speed for high-resistance devices, resolving the fast-read vs. low-energy dilemma. We analyzed the performance of this fast-read RRAM memory as an SRAM replacement at the instruction cache level. EDP reductions of 82% and 53% were observed for the x86 and ARM architectures, respectively, showing the potential of RRAM as an SRAM substitute in lower-level caches. Extensive circuit-level exploration of the variability and reliability of the scheme is left for future work.