Abstract
RRAM has emerged as a non-volatile, denser alternative to SRAM. Different RRAM devices exhibit a wide range of write energies, set by their write currents. The write current magnitude is proportional to the read current magnitude, which in turn is inversely related to the read latency. Hence, lower write energy implies higher read latency, producing a fundamental trade-off. This fast-read vs. low-write-power dilemma has hindered the application of RRAM to lower-level caches. In this work, we propose a modified bitcell design to overcome this dilemma and analyze its impact on L1 instruction cache replacement. We modify the conventional one-select-transistor (1T), one-RRAM (1R) 1T1R cell by adding another transistor (i.e. a 2T1R cell) to drive a high current for fast reads irrespective of the RRAM current magnitude. We demonstrate through circuit simulations that the read latency vs. write energy trade-off is mitigated. The impact of the 2T1R bitcell, with its fast read and slow write, is compared with the SRAM and 1T1R schemes for L1 cache replacement. We report an energy-delay product (EDP) reduction of 82% for high-performance and 53% for embedded architectures with SRAM-comparable throughput. The fast-read capability thus establishes the potential of RRAM as a lower-level cache substitute for both high-performance and embedded applications.
RRAM memory systems have been demonstrated to have lower switching power [23] and higher density [29] than regular 6T SRAM cells. Fabrication of large arrays [7, 8, 19] along with integration into the CMOS process flow affirms their potential at industry scale [14, 26]. RRAMs also offer an additional power advantage through their non-volatility: the crossbar array can be isolated from the power source when not in use, saving standby power [24]. However, the application of RRAMs as a cache substitute for SRAM has been restricted by high latencies [17]. RRAMs show long write times (10–100 ns), limiting their performance in programs requiring frequent and fast write operations [11, 16]. Additionally, fabricated large arrays have shown high read latencies [7, 8, 19].
Prior explorations of RRAM for memory applications have leveraged its key density advantage [2]. Specifically, RRAM is attractive for the L2–L4 caches from an area and energy perspective at a small performance penalty. Thus, RRAMs have mostly been used as last-level caches [6, 20, 30] and main memory [18], where higher latencies are better tolerated. The L1 cache, however, has far more stringent read latency requirements, so reducing latency is essential for applying RRAM to lower-level caches. The L1 instruction cache demands fast reads but sees relatively few writes, making it an excellent application for benchmarking fast reading in RRAM against conventional SRAM.
Typical RRAM characteristics (Fig. 1(a)) show that read and write currents are of similar magnitude. Reducing the write current to lower the write energy therefore also reduces the read current. A lower read current needs a longer time to charge the line capacitance (\(C_{line}\)) to a level the sense amplifier can read (Fig. 1(b)). Naturally, larger arrays have larger read latency as \(C_{line}\) increases with array size, consistent with the literature [8, 19]. The read latency with scaling of array size is shown in Fig. 1(c) for both SRAM and RRAM, where RRAM can be seen to have significantly higher read latency. Thus, a write energy vs. read latency trade-off is observed in RRAMs reported in the literature [10, 22, 23, 25, 28]. Using a low-resistance device for quick reading might seem attractive, with RRAM resistance values reported in the literature spanning three orders of magnitude (10 k\(\varOmega \)–10 M\(\varOmega \)) [10, 22, 23, 25, 28]. However, such high-current devices increase the energy consumption of the array in proportion to the current, making competition with SRAM difficult (Fig. 1(d)). Thus, an attractive RRAM with both low write energy and low read latency is precluded by the write energy vs. read latency trade-off.
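This trade-off can be illustrated with a back-of-the-envelope model (not from the simulations above; all component values and bias voltages are illustrative): read latency scales as the RC charge-up time of the bitline through the RRAM, while write energy scales as the Joule heating of a fixed-duration write pulse, so raising the device resistance improves one metric and degrades the other.

```python
import math

def read_latency(r_lrs, c_line=15e-15, v_read=0.5, v_sense=0.2):
    """RC charge-up time for the bitline to reach the sense threshold."""
    return r_lrs * c_line * math.log(1.0 / (1.0 - v_sense / v_read))

def write_energy(r_lrs, v_write=0.8, t_write=10e-9):
    """Joule heating in the cell during a fixed write pulse."""
    return (v_write ** 2 / r_lrs) * t_write

# sweep the three-orders-of-magnitude resistance range from the literature
for r in (10e3, 1e6, 10e6):
    print(f"R={r:9.0f} ohm  t_read={read_latency(r)*1e9:8.3f} ns  "
          f"E_write={write_energy(r)*1e15:10.3f} fJ")
```

The sweep shows the dilemma directly: the 10 k\(\varOmega \) device reads fast but burns orders of magnitude more write energy than the 10 M\(\varOmega \) device.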
In this paper, we present a bitcell design to mitigate the write energy vs. read latency trade-off and evaluate its impact on instruction-cache-level performance. First, we present a modified 2T1R bitcell, adding one NMOS transistor to the conventional select-transistor (1T) plus RRAM (1R) 1T1R bitcell to produce a high read current. Second, we show that our proposal ‘breaks’ the write energy vs. read latency trade-off at the cost of increased bitcell area. Third, we analyze the effect at the architectural level for instruction cache replacement by comparing our proposal with the conventional 1T1R bitcell and SRAM in terms of energy-delay product (EDP) for both high-performance and embedded processor configurations.
1 Proposal
1.1 Fast Read Solution
Figure 2(a–c) shows bitcells for the SRAM and RRAM schemes. In the conventional 1T1R bitcell, an NMOS select transistor (1T) selects an RRAM (1R) for reading or writing [15]; the read-write scheme is taken from [29]. We propose that the voltage (\(V_{read-select}\)) at the node between the 1T and 1R devices be applied to the gate of an additional NMOS transistor, forming the 2T1R bitcell shown in Fig. 2(c). The drain and source of this read transistor connect to an additional bitline and wordline. With this modification, the read transistor, rather than the RRAM, supplies the read current that charges the line capacitance to the sense amplifier. The read transistor can thus be designed independently for a fast read, while the RRAM resistance can be increased for low energy, breaking the write energy vs. read latency trade-off.
Array-level schematics of these bitcells are shown in Fig. 2(d–f). The conventional 1T1R scheme has one bitcell at each crossbar intersection of two wordlines and a bitline. A complementary pass-transistor switch serves as the row selector (RS), applying different voltages for reading and writing. A select line turns on the select transistors when a row is to be read or written. The bitline connects the bitcell to the sense amplifier through a column selector (CS) during reading. The proposed scheme requires one additional wordline-bitline pair connecting the source and drain of the read transistor, as shown in Fig. 2(f).
1.2 Targeted Application
The fast reading achieved by the proposed scheme is tested for L1 instruction cache replacement, since this cache sees infrequent writes and frequent reads. It poses an aggressive read latency requirement but is tolerant of slow writing. We exploit the area advantage offered by RRAM to accommodate a larger memory in the same footprint. Thus, we intend to compensate for the high write latency of RRAM with a larger memory size (Fig. 3): the larger cache suffers fewer L1 misses that trigger fetches from L2, thereby reducing the frequency of write operations. We study the performance of the proposed scheme (2T1R) for both high-performance and embedded architectures.
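The compensation argument can be sketched with a toy model. The miss rates and latencies below are purely illustrative assumptions, not measured values; the point is only that a denser cache can trade a lower miss rate against a slower per-write latency.

```python
def write_stall_cycles(accesses, miss_rate, write_latency_cycles):
    """Cycles lost to cache fills: each L1 miss triggers one (slow) write."""
    return accesses * miss_rate * write_latency_cycles

accesses = 1_000_000
# baseline SRAM cache: illustrative 5% miss rate, 1-cycle write
sram_stall = write_stall_cycles(accesses, 0.05, 1)
# denser RRAM cache: illustrative lower miss rate, but 10-cycle writes
rram_stall = write_stall_cycles(accesses, 0.004, 10)
print(sram_stall, rram_stall)
```

Even with a 10\(\times \) write latency, the RRAM cache loses fewer total cycles once the larger capacity pushes the miss rate down far enough.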
2 New Read Scheme
2.1 Circuit Schematic
Figure 4 shows circuit schematics for both RRAM schemes in the reading configuration. For the conventional 1T1R scheme, \(V_{read}\) is applied through the RS switch at the wordline and the CS is grounded as shown. The voltage divides across the RRAM and \(R_{CS}\) and is sensed by the sense amplifier (SA) [27]. \(C_{line}\) denotes the line capacitance to be charged by the RRAM read current; this sets the charge-up time. The proposed scheme has a similar reading configuration, with RS and CS driving the two wordlines and bitlines to the voltages shown. The select transistor is biased at \(V_{read-select}\) and acts as a resistive voltage divider, producing a voltage (\(V_{gate-select}\)) that drives the read transistor. The read transistor supplies a high current that charges the line capacitance (\(C_{line}\)) and the \(SA_{in}\) terminal. A diode-connected transistor in the CS (drain shorted to gate) acts as a current-to-voltage converter to drive the SA.
In this scheme, the RRAM resistance primarily charges the ‘small’ gate capacitance of the read transistor, while the read transistor charges the substantial line capacitance. The approximate charging time constants show the 1T1R scheme charging the line capacitance (\(\sim \)10 fF) while the proposed scheme charges only the small gate capacitance (\(\sim \)0.1 fF), resulting in faster reading (Fig. 5).
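The two time constants can be compared directly; the capacitance values follow the \(\sim \)10 fF and \(\sim \)0.1 fF figures quoted above, and the 1 M\(\varOmega \) LRS matches the device considered in the next subsection.

```python
R_LRS  = 1e6      # 1 MOhm low-resistance state (Sect. 2.2)
C_LINE = 10e-15   # ~10 fF line capacitance (Fig. 5)
C_GATE = 0.1e-15  # ~0.1 fF read-transistor gate capacitance

tau_1t1r = R_LRS * C_LINE  # 1T1R: the RRAM must charge the full line
tau_2t1r = R_LRS * C_GATE  # 2T1R: the RRAM charges only the small gate
print(tau_1t1r / tau_2t1r)
```

The RRAM-limited time constant shrinks by two orders of magnitude, which is the source of the fast read.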
2.2 Maximizing Sense Margin
The choice of \(V_{read-select}\) determines the gate voltage of the read transistor. We use a Verilog-A RRAM model [9] and 45 nm CMOS technology [31] for HSPICE circuit simulations. From the vast range of resistances reported in the literature, we consider a high-resistance RRAM with a low resistance state (LRS) = 1 M\(\varOmega \) and a high resistance state (HRS) = 10 M\(\varOmega \), as described in Fig. 6. We fix \(R_{RRAM}\) and vary \(V_{read-select}\). The task is to create the maximum voltage difference at the input of the SA between the given LRS and HRS. For this choice of resistances, a \(V_{read-select}\) of 280 mV turns the read transistor on for the LRS while leaving it off for the HRS (Fig. 6(b)). The read current therefore shows the maximum difference between the two states (Fig. 6(c)). This current is converted to a voltage and applied to the input of the SA, as shown in Fig. 6(d), giving the maximum sense margin.
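As a rough illustration of how a bias point separates the two states, the sketch below models the select path as a simple resistive divider and the read transistor as an ideal threshold switch. The topology, \(V_{TH}\), and \(R_{SEL}\) are illustrative assumptions, not extracted from the HSPICE setup, so the sweep does not reproduce the simulated 280 mV operating point — it only shows the shape of the selection problem.

```python
V_TH  = 0.28  # assumed read-transistor threshold (illustrative)
R_SEL = 1e6   # assumed select-transistor channel resistance (illustrative)

def gate_voltage(v_read, r_rram, r_sel=R_SEL):
    # toy divider: the gate node sits between the select device and the RRAM
    return v_read * r_sel / (r_sel + r_rram)

def sense_margin(v_read, r_lrs=1e6, r_hrs=10e6):
    on  = gate_voltage(v_read, r_lrs) > V_TH  # LRS should turn the read FET on
    off = gate_voltage(v_read, r_hrs) > V_TH  # HRS should leave it off
    return on and not off

# sweep the applied read voltage; keep values that separate the two states
ok = [i / 100 for i in range(20, 121) if sense_margin(i / 100)]
print(min(ok), max(ok))
```

Only a window of bias voltages turns the read transistor on for the LRS while keeping it off for the HRS; the simulated 280 mV point is the analogous optimum of the real circuit.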
2.3 Breaking the Trade-off
Figure 7(a) shows the timing diagrams for reading in both the 1T1R and 2T1R schemes. An RRAM switching voltage of 0.8 V is assumed, given the range of switching voltages (0.2–4 V) reported in the literature [10, 22, 23, 25, 28]. An array size of 256\(\,\times \,\)64 is taken for calculating the line capacitance; a line capacitance of 15.07 fF is obtained using the 45 nm interconnect technology manual [21].
The voltage at the input terminal of the SA is shown for both schemes, along with the SA output, in Fig. 7(a). The conventional scheme requires more than 2 ns for \(SA_{in}\) to charge up and the SA output to respond. In contrast, the proposed scheme charges the SA input terminal in under a nanosecond. Figure 7(b) shows that the response time of the SA output increases with the device LRS resistance at a fixed LRS/HRS ratio. The read time increases drastically with device resistance for the 1T1R scheme, whereas the proposed bitcell shows no latency degradation with \(R_{LRS}\), enabling fast reads (<1 ns) even with high-resistance devices. For very large resistances (100 M\(\varOmega \)), however, the performance of the proposed scheme degrades, because the extremely low RRAM current takes a long time to charge the gate capacitance of the read transistor. Figure 7(c) shows the reduced read latency of the proposed scheme for various RRAM resistances. The circuit consumes write energy comparable to the conventional scheme at much lower read latency, thus breaking the trade-off (Fig. 7(d)).
3 Architecture Level Performance Estimation
Figure 8 shows the bank schematics for the SRAM and RRAM memory banks. The components are divided into four colour-coded categories as shown. The same CMOS peripheral components are used in both cases; components specific to one scheme are shown in blue. An SRAM bank of 16\(\,\times \,\)8 size is simulated using the NCSU SRAM compiler [13] in 45 nm technology [31]; an RRAM bank of the same size is also simulated in HSPICE. The energies are calculated and scaled to an array size of 256\(\,\times \,\)64. The comparison of area and energy between the conventional and proposed schemes is explained next.
3.1 Energy
Both RRAM and SRAM arrays are simulated in HSPICE, and the instantaneous power is recorded for each component. The total energy consumed in a write or read cycle is then calculated. For the RRAM array, the devices are switched from LRS to HRS to obtain the write energy, while the energy spent reading a device in the LRS is taken as the read energy. The write latency is assumed to be 10 ns and the read latency 1 ns. The standby (SB) energy is calculated by turning the RS and CS off and isolating the array from the power source. For SRAM, all bits are programmed to state 0 and then flipped to extract the write energy. These energy-per-cycle values are used in the architectural simulations to calculate the total energy consumed. Component-wise write, read, and standby energies for the SRAM and RRAM banks are shown in Fig. 9(a).
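Extracting a per-cycle energy from a recorded power trace can be sketched as a trapezoidal integration. This is the generic method, not the specific HSPICE post-processing used here, and the trace below is a synthetic sanity check.

```python
def cycle_energy(times, powers):
    """Trapezoidal integration of a simulated power trace over one cycle."""
    return sum((powers[i] + powers[i + 1]) / 2 * (times[i + 1] - times[i])
               for i in range(len(times) - 1))

# sanity check: a constant 1 mW drawn for 10 ns should integrate to 10 pJ
t = [i * 1e-9 for i in range(11)]
p = [1e-3] * 11
print(cycle_energy(t, p))
```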
RRAM consumes significantly more energy than SRAM during writes because of the long time scales involved; this makes reducing write energy through highly resistive devices important. The read energies of RRAM and SRAM are comparable. In both reading and writing, however, the proposed scheme consumes more energy than the conventional one due to the additional transistors. The RRAM-specific components (i.e. the row/column selectors) consume the major fraction of power during writing, while the SA dominates during reading. In standby mode, SRAM consumes static power, whereas the RRAM array is disconnected from its peripherals by turning off the RS and CS passgate switches, yielding a 20\(\times \) energy reduction in SB mode. These energy values are used in the cache replacement simulation presented in the next section.
3.2 Area
Figure 9(b) shows the area reduction achieved by using RRAM in place of SRAM in the bitcell array. The SRAM bitcell area is reported to be \(146F^2\) [29], while the RRAM 1T1R structure consumes \(>8F^2\) [5]. We assume conservative estimates of \(10F^2\) for the 1T1R and \(25F^2\) for the 2T1R scheme, giving area advantages of 14.6\(\times \) and 5.84\(\times \) for the conventional and proposed schemes respectively. Part of this advantage is consumed by the RRAM-specific blocks shown in Fig. 8, such as the new sense amplifiers and drivers. Thus the 1T1R and 2T1R schemes can conservatively pack 8\(\times \) and 4\(\times \) more memory than SRAM. A more rigorous area estimation is left for future work.
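The quoted density advantages follow directly from the bitcell areas:

```python
F2_SRAM = 146  # SRAM bitcell area in units of F^2 [29]
F2_1T1R = 10   # conservative estimate for the 1T1R bitcell
F2_2T1R = 25   # conservative estimate for the proposed 2T1R bitcell

print(F2_SRAM / F2_1T1R)  # raw density advantage of 1T1R over SRAM
print(F2_SRAM / F2_2T1R)  # raw density advantage of 2T1R over SRAM
```

Halving these raw ratios to budget for the RRAM-specific peripherals gives the conservative 8\(\times \) and 4\(\times \) capacity figures used in the cache simulations.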
3.3 Cache Replacement
Table I shows the simulation parameters used. We use both high-performance (x86) [3] and embedded (ARM) [4] processor architectures to analyze the applicability of this scheme (Fig. 10). We use 10 programs from the MiBench benchmark suite [12] for the embedded architecture and 10 programs from the SPEC benchmark suite [1] for the high-performance architecture. Memory banks of 2 kB (256\(\,\times \,\)64) are used as shown in Fig. 10. We replace the L1 instruction cache in the HP processor and the L0 instruction cache in the embedded processor with RRAM in gem5 simulations [5].
Throughput. The throughput (instructions per cycle, IPC) for each case is extracted from gem5 [5] simulations. Figure 11(a) shows that the conventional 1T1R scheme causes a huge IPC degradation, while the proposed scheme is comparable to SRAM, with only 0.1% degradation for the HP and 1.6% for the embedded architecture. This is because the proposed scheme, along with faster reading, provides a larger memory size that reduces the miss rate, i.e., the fraction of accesses that must fetch data from the L2 cache into L1. The resulting reduction in high-latency write instances enables the improved performance.
Energy. The energy per read/write/SB cycle extracted above is used to calculate the total energy consumption for the different benchmark programs. The number of read/write accesses is extracted from the simulations to obtain the energy spent in reading and writing; during the remaining cycles, the memory consumes standby energy. The normalized energy plot in Fig. 11(b) shows a reduction in energy for both RRAM schemes: a mean energy reduction of 81% for HP and 53% for embedded programs. The saving arises because a large number of cycles are spent in standby mode, where RRAMs consume very little energy compared to SRAMs.
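The per-program accounting described above can be sketched as follows. The per-cycle energy values are illustrative placeholders; only the roughly 20\(\times \) standby gap is taken from the bank-level results.

```python
def total_energy(n_read, n_write, n_cycles, e_read, e_write, e_sb):
    """Program energy: active read/write cycles plus standby for the rest."""
    n_sb = n_cycles - n_read - n_write
    return n_read * e_read + n_write * e_write + n_sb * e_sb

cycles, reads, writes = 1_000_000, 200_000, 2_000
# illustrative per-cycle energies (arbitrary units); RRAM writes cost more,
# but its standby energy is ~20x lower than SRAM's
sram = total_energy(reads, writes, cycles, e_read=1.0, e_write=1.0, e_sb=0.5)
rram = total_energy(reads, writes, cycles, e_read=1.2, e_write=5.0, e_sb=0.025)
print(sram, rram)
```

Because most cycles are standby cycles, the low-SB RRAM wins overall despite its costlier writes.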
Energy-Delay Product (EDP). The energy calculated above is used to compare the performance of RRAM and SRAM using the EDP, defined only for the instruction cache. The delay is defined as the duration for which the cache is active for writing or reading. Mathematically, \(\text{EDP} = E_{total} \times D\), where \(E_{total}\) is the total energy consumed and \(D\) is the total time spent in read and write accesses.
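The EDP — total energy multiplied by the active read/write time — can be sketched numerically as follows; the access counts and per-cycle values are purely illustrative.

```python
def edp(n_read, n_write, e_read, e_write, t_read, t_write, e_sb_total=0.0):
    """EDP = total energy x delay, with delay = active read/write time only."""
    energy = n_read * e_read + n_write * e_write + e_sb_total
    delay = n_read * t_read + n_write * t_write
    return energy * delay

# toy numbers: 10 reads at 1 pJ / 1 ns, 5 writes at 2 pJ / 10 ns
print(edp(10, 5, 1e-12, 2e-12, 1e-9, 10e-9))
```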
The normalized EDP (Fig. 11(c)) is reduced by 82% for the HP and 53% for the embedded architecture with the proposed scheme. Table II summarizes the cache replacement results.
4 Conclusion
We proposed a circuit modification to the conventional 1T1R RRAM bitcell that boosts the read speed for high-resistance devices, resolving the fast-read vs. low-write-energy dilemma. We analyzed the performance of this fast-read RRAM memory as an SRAM replacement at the instruction cache level. EDP reductions of 82% and 53% were observed for the x86 and ARM architectures respectively, showing the potential of RRAM as an SRAM substitute in lower-level caches. Extensive circuit-level exploration of the variability and reliability of the scheme may be taken up in future work.
References
SPEC: Standard Performance Evaluation Corporation benchmarks (2000)
Binkert, N., et al.: The gem5 simulator. ACM SIGARCH Comput. Archit. News 39(2), 1–7 (2011)
Catanzaro, M., Kudithipudi, D.: Reconfigurable RRAM for LUT logic mapping: a case study for reliability enhancement. In: 2012 IEEE International SOC Conference, pp. 94–99. IEEE (2012)
Chang, M.F., et al.: 17.5 A 3T1R nonvolatile TCAM using MLC ReRAM with sub-1ns search time. In: 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pp. 1–3. IEEE (2015)
Chang, M.F., et al.: 19.4 Embedded 1 Mb ReRAM in 28 nm CMOS with 0.27-to-1 V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 332–333. IEEE (2014)
Chen, P.Y., Yu, S.: Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design. IEEE Trans. Electron Devices 62(12), 4022–4028 (2015)
Cheng, C.H., Chin, A., Yeh, F.: Ultralow switching energy Ni/GeO\(_x\)/HfON/TaN RRAM. IEEE Electron Device Lett. 32(3), 366–368 (2011)
Govoreanu, B., et al.: 10\(\times \)10 nm\(^2\) Hf/HfO\(_x\) crossbar resistive RAM with excellent performance, reliability and low-energy operation. In: 2011 International Electron Devices Meeting, pp. 31–6. IEEE (2011)
Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench: a free, commercially representative embedded benchmark suite. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, WWC-4 (Cat. No. 01EX538), pp. 3–14. IEEE (2001)
Guthaus, M.R., Stine, J.E., Ataei, S., Chen, B., Wu, B., Sarwar, M.: OpenRAM: an open-source memory compiler. In: 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–6. IEEE (2016)
Hsieh, M.C., et al.: Ultra high density 3D via RRAM in pure 28nm CMOS process. In: 2013 IEEE International Electron Devices Meeting, pp. 10–3. IEEE (2013)
Huang, J.J., Tseng, Y.M., Luo, W.C., Hsu, C.W., Hou, T.H.: One selector-one resistor (1S1R) crossbar array for high-density flexible memory applications. In: 2011 International Electron Devices Meeting, pp. 31–7. IEEE (2011)
Ielmini, D.: Resistive switching memories based on metal oxides: mechanisms, reliability and scaling. Semicond. Sci. Technol. 31(6), 063002 (2016)
Jokar, M.R., Arjomand, M., Sarbazi-Azad, H.: Sequoia: a high-endurance NVM-based cache architecture. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(3), 954–967 (2015)
Jung, M., Shalf, J., Kandemir, M.: Design of a large-scale storage-class RRAM system. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, pp. 103–114 (2013)
Kawahara, A., et al.: An 8 Mb multi-layered cross-point ReRAM macro with 443 MB/s write throughput. IEEE J. Solid-State Circuits 48(1), 178–185 (2012)
Kotra, J.B., Arjomand, M., Guttman, D., Kandemir, M.T., Das, C.R.: Re-NUCA: a practical NUCA architecture for ReRAM based last-level caches. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 576–585. IEEE (2016)
Kuhn, K., et al.: Managing process variation in Intel’s 45nm CMOS technology. Intel Technol. J. 12(2) (2008)
Lashkare, S., Chouhan, S., Chavan, T., Bhat, A., Kumbhare, P., Ganguly, U.: PCMO RRAM for integrate-and-fire neuron in spiking neural networks. IEEE Electron Device Lett. 39(4), 484–487 (2018)
Lee, H., et al.: Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust HfO\(_2\)-based RRAM. In: 2008 IEEE International Electron Devices Meeting, pp. 1–4. IEEE (2008)
Sheu, S.S., et al.: A 5ns fast write multi-level non-volatile 1 k bits RRAM memory with advance write scheme. In: 2009 Symposium on VLSI Circuits, pp. 82–83. IEEE (2009)
Shih, C.C., et al.: Ultra-low switching voltage induced by inserting SiO\(_2\) layer in indium-tin-oxide-based resistance random access memory. IEEE Electron Device Lett. 37(10), 1276–1279 (2016)
Wang, C.H., et al.: Three-dimensional 4F\(^2\) ReRAM cell with CMOS logic compatible process. In: 2010 International Electron Devices Meeting, pp. 29–6. IEEE (2010)
Wang, Y.T., Razavi, B.: An 8-bit 150-MHz CMOS A/D converter. IEEE J. Solid-State Circuits 35(3), 308–317 (2000)
Wu, Y., Lee, B., Wong, H.S.P.: Ultra-low power Al\(_2\)O\(_3\)-based RRAM with 1 \(\mu \)A reset current. In: Proceedings of 2010 International Symposium on VLSI Technology, System and Application, pp. 136–137. IEEE (2010)
Xu, C., Dong, X., Jouppi, N.P., Xie, Y.: Design implications of memristor-based RRAM cross-point structures. In: 2011 Design, Automation and Test in Europe, pp. 1–6. IEEE (2011)
Zhang, J., Donofrio, D., Shalf, J., Jung, M.: Integrating 3D resistive memory cache into GPGPU for energy-efficient data processing. In: 2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 496–497. IEEE (2015)
Zhao, W., Cao, Y.: New generation of predictive technology model for sub-45 nm early design exploration. IEEE Trans. Electron Devices 53(11), 2816–2823 (2006)
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Lele, A., Jandhyala, S., Gangurde, S., Singh, V., Subramoney, S., Ganguly, U. (2022). Disrupting Low-Write-Energy vs. Fast-Read Dilemma in RRAM to Enable L1 Instruction Cache. In: Shah, A.P., Dasgupta, S., Darji, A., Tudu, J. (eds) VLSI Design and Test. VDAT 2022. Communications in Computer and Information Science, vol 1687. Springer, Cham. https://doi.org/10.1007/978-3-031-21514-8_41