

# A 64 $\times$ 32 bit 4-read 2-write low power and area efficient register file in 65 nm CMOS

Jun Han<sup>1</sup>, Xingxing Zhang<sup>1</sup>, Yi Li<sup>1</sup>, Baoyu Xiong<sup>1</sup>, Yuejun Zhang<sup>2</sup>, Zhang Zhang<sup>3</sup>, Zhiyi Yu<sup>1a)</sup>, Jun Han<sup>1</sup>, Xu Cheng<sup>1</sup>, and Xiaoyang Zeng<sup>1</sup>

<sup>1</sup> State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China

<sup>2</sup> Institute of Circuits and Systems, Ningbo University, Ningbo 315211, China

<sup>3</sup> School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei, 230009, Anhui, China

a) zhiyiyu@fudan.edu.cn

**Abstract:** This paper details the design of a $64 \times 32$ bit 4-read 2-write register file in the TSMC 65 nm LP process. The register file avoids cell banking by using a pseudo-differential sensing scheme. This approach also enables a fully shareable and completely symmetric cell layout with a competitive cell area. A non-full-swing technique is proposed to avoid over-design and improve energy efficiency. In the timing control module, a clocked pull-down circuit cuts off a possible short-circuit current path at high clock frequency. A prototype is implemented in TSMC 65 nm LP technology. The measured results demonstrate operation at 0.77 GHz, consuming 7.08 mW at 1.2 V and occupying 0.018 mm<sup>2</sup>.

**Keywords:** register file, 65 nm, pseudo-differential sensing, low power, area efficient

**Classification:** Integrated circuits

#### References

- [1] V. Zyuban and P. Kogge, "The energy complexity of register files," *Int. Symp. Low Power Electronics and Design*, pp. 305–310, Aug. 1998.
- [2] R.P. Preston, R.W. Badeau, D.W. Bailey, et al., "Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading," *IEEE Int. Solid-State Circuits Conference*, pp. 334–472, Feb. 2002.
- [3] K.K. Ran, A. Atila, B. Ganesh, et al., "A 130-nm 6-GHz 256 × 32 bit leakage-tolerant register file," *IEEE J. Solid-State Circuits*, vol. 37, no. 5, pp. 624–632, 2002.
- [4] G. Burda, Y. Kolla, J. Dieffenderfer, et al., "A 45 nm CMOS 13-port 64-word 41b fully associative content-addressable register file," *IEEE Int. Solid-State Circuits Conference*, pp. 286–287, Feb. 2010.
- [5] X. Zhang, Y. Li, B. Xiong, et al., "Robust and low power register file in 65 nm technology," *J. Semiconductors*, vol. 33, no. 3, pp. 035010-5, 2012.
- [6] A. Alvandpour, R.K. Krishnamurthy, K. Soumyanath, et al., "A sub-130nm conditional keeper technique," *IEEE J. Solid-State Circuits*, pp. 633–638, May 2002.
- [7] S. Hsu, A. Agarwal, M. Anders, et al., "An 8.8 GHz 198 mW 16 × 64 b 1 R/1 W variation tolerant register file in 65 nm CMOS," *IEEE Int. Solid-State Circuits Conference*, pp. 1785–1797, Feb. 2006.
- [8] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, "Yield and speed optimization of a latch-type voltage sense amplifier," *IEEE J. Solid-State Circuits*, vol. 39, no. 7, pp. 1148–1158, July 2004.
- [9] N. Verma and A.P. Chandrakasan, "A 256 kb 65 nm 8T Subthreshold SRAM Employing Sense-Amplifier Redundancy," *IEEE J. Solid-State Circuits*, vol. 43, no. 1, pp. 141–149, Jan. 2008.
- [10] A. Agarwal, S. Hsu, S. Mathew, et al., "A 32 nm 8.3 GHz 64-entry × 32 b variation tolerant near-threshold voltage register file," *IEEE Symp. VLSI Circuits (VLSIC)*, pp. 105–106, June 2010.
- [11] G.S. Ditlow, R.K. Montoye, S.N. Storino, et al., "A 4R2W register file for a 2.3 GHz wire-speed POWER<sup>TM</sup> processor with double-pumped write operation," *IEEE Int. Solid-State Circuits Conference*, pp. 256–258, Feb. 2011.

## 1 Introduction

The register file consumes a large part of a processor's power and occupies a significant portion of its total area. A previous study [1] shows that the register file can be responsible for about 25% of a processor's total power consumption, and the register file of the Alpha 21464 occupies over five times the area of the 64 KB primary data cache [2]. It is therefore essential to design a low-power and area-efficient multi-port register file.

To achieve a compact layout and a low-power design, a single-ended, hierarchical bit-line structure is often used, and both dynamic and static approaches have been reported [3, 4, 5]. In the dynamic hierarchical bit-line scheme, a keeper is indispensable to compensate for leakage current and improve the noise immunity of the bit-line. However, the keeper introduces contention, which degrades performance and increases short-circuit power [6]; it may also significantly reduce the system's reliability. The static hierarchical bit-line scheme does not have this reliability problem, but it requires a larger cell area to improve the cell's pull-up ability, and an extra signal is needed to multiplex the signal on the local bit-line [5]. In addition, a hierarchical bit-line structure means dividing the cells into at least 4 banks for a $64 \times 32$ b 4-read 2-write register file, which incurs considerable power and area overhead [5]. Finally, although cell banking appears to reduce the bit-line capacitance and hence the read delay, the additional wiring it requires in the bit-line direction can degrade the layout, especially for multi-port register files.

In this paper, we present a $64 \times 32$ b 4-read 2-write register file that avoids the problems above. The pseudo-differential sensing scheme results in a high-density array without cell banking, and the proposed fully shareable cell layout further improves the area efficiency of the design. The non-full-swing approach consumes less power on the bit-line. A careful tradeoff between power, area and reliability avoids over-design of the register file. This paper is organized as follows. Section 2 introduces the design in detail. Measured results are presented in Section 3. Section 4 concludes the paper.

## 2 Design details

Fig. 1 (a) shows the structure of the $64 \times 32$ b 4-read 2-write register file. The module on the left is a two-stage NAND-type decoder. Addresses entering the decoder are latched to avoid unpredictable changes. The decoder output is driven by a clocked driver, whose output is the word-line signal connected to the cells. The discharge caused by the selected cell is amplified by the sense amplifier, which produces the final output.

#### A Robust Cell Design

The proposed 4-read 2-write cell is shown in Fig. 1 (b). It consists of a cross-coupled inverter pair (P0, P1, N0, N1, N2 and N3) for data retention. Write pass-gates (N12, N13, N14 and N15) and read pass-gates (N8, N9, N10 and N11) connect the cell to the write bit-line (WBL) and the read bit-line (RBL), respectively. While the read bit-line is single-ended, the write bit-line remains differential to improve the cell's write ability. The isolating NMOS transistors (N4, N5, N6 and N7) in the figure eliminate the impact of the read process on cell stability.

Read static noise margin (SNM<sub>read</sub>) is often used to indicate a cell's stability. Both the mean value $\mu_{\rm SNM}$ and the standard deviation $\sigma_{\rm SNM}$ can be obtained from Monte Carlo simulation. Fig. 2 (a) shows the read margin at a given failure rate (fr) of $10^{-7}$. Assume that the cell's SNM<sub>read</sub> follows a Gaussian distribution f(SNM, $\mu$, $\sigma$) and that a read fails when SNM<sub>read</sub> falls to zero. The required $\mu_{\rm SNM}$ can then be obtained by solving f(0, $\mu$, $\sigma$) = $10^{-7}$; for fr = $10^{-7}$, this gives $\mu_{\rm SNM}$ = 5.33 $\sigma_{\rm SNM}$. Fig. 2 (a) indicates that the margin is sufficient for reliable reads over a wide operating range, so the cell design is robust.
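The relation between the failure rate and the required ratio $\mu_{\rm SNM}/\sigma_{\rm SNM}$ can be checked numerically. The sketch below is illustrative only and rests on one assumption: the failure rate is read as a Gaussian tail probability. The function name `required_mean_in_sigmas` and the `two_sided` flag are hypothetical and do not come from the paper. Under this reading, a one-sided tail of $10^{-7}$ gives a ratio of roughly 5.2, while splitting $10^{-7}$ over both tails gives roughly 5.33, which agrees with the figure quoted above.

```python
from statistics import NormalDist


def required_mean_in_sigmas(failure_rate, two_sided=False):
    """Return k such that mu_SNM = k * sigma_SNM meets the failure-rate target.

    Assumption (not stated in the paper): a read fails when the Gaussian
    SNM sample is <= 0, so the failure rate equals the tail probability
    Phi(-mu/sigma). With two_sided=True the target is split over both tails.
    """
    tail = failure_rate / 2 if two_sided else failure_rate
    # inv_cdf(tail) is negative for small tails; negate it to get mu/sigma.
    return -NormalDist().inv_cdf(tail)


k_one_sided = required_mean_in_sigmas(1e-7)
k_two_sided = required_mean_in_sigmas(1e-7, two_sided=True)
print(f"one-sided tail: mu_SNM = {k_one_sided:.2f} sigma_SNM")
print(f"two-sided tail: mu_SNM = {k_two_sided:.2f} sigma_SNM")
```

Given $\sigma_{\rm SNM}$ from Monte Carlo runs, multiplying it by the returned ratio gives the mean SNM needed at the chosen failure rate.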

#### B Fully Shareable and Symmetric Cell Layout

The use of duplicate-pull-down (DPD) NMOS transistors (N0, N1 and N2, N3) enables the completely symmetric layout shown in Fig. 1 (b). Each of the two middle polysilicon lines in the figure is shared by five transistors, and all the pass-gates (N8-N15) are placed around the cell. The symmetric layout improves the manufacturability of the design. Furthermore, the proposed layout can share contacts with all surrounding cells, which makes the design area efficient. Table I compares the cell areas. According to the normalized cell area (ACN) defined by (1), this fully shareable layout yields the smallest cell.

$$ACN = \frac{Area_{cell} \times 65^2}{L_{process}^2 \times N_{port}}$$
(1)
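As a quick check of (1), the following sketch recomputes ACN from the cell areas listed in the comparison table. It assumes that $N_{port}$ is the total number of read plus write ports (2 for 1R1W, 6 for 4R2W), an interpretation inferred because it reproduces the tabulated values. The function name `normalized_cell_area` is hypothetical.

```python
def normalized_cell_area(area_cell_um2, process_nm, n_ports):
    """ACN per equation (1): cell area scaled to 65 nm and divided by port count."""
    return area_cell_um2 * 65**2 / (process_nm**2 * n_ports)


# (cell area in um^2, process in nm, total ports) transcribed from the comparison table.
# Port totals assume N_port = read ports + write ports.
CELLS = {
    "ISSCC'2006 [7]": (4.14, 65, 2),
    "ISSCC'2011 [11]": (2.78, 45, 6),
    "JOS'2012 [5]": (20.52, 65, 6),
    "This work": (3.15, 65, 6),
}

for label, (area, process, ports) in CELLS.items():
    print(f"{label:16s} ACN = {normalized_cell_area(area, process, ports):.3f} um^2")
```

The 45 nm entry shows the effect of the scaling term: its 2.78 µm<sup>2</sup> cell is first scaled by $(65/45)^2$ before being divided among six ports.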

#### C Non-full-swing Pseudo-differential Sensing Scheme









Fig. 1. (a) Structure of the register file. (b) Schematic and layout of the 4-read 2-write cell. (c) Schematic of the sense amplifier [8]. (d) Circuit of the timing control module.





The pseudo-differential sensing scheme avoids cell banking and enables an area-efficient cell design, but it requires much more attention to reliable sensing. To ensure robustness, some previous work adopts a full-swing technique [9], in which the bit-line is discharged completely; in other words, the input dc voltage ($V_{INDC}$) of the sense amplifier is 0.5 VDD. This scheme works, but it is over-designed for a register file, whose bit-lines are much shorter than those of an SRAM. Because noise has less influence on the bit-lines of a register file, a non-full-swing technique is adopted in this design. Non-full-swing signals consume less power, and the lower requirement on the cell's discharge ability in turn reduces the cell area.

Fig. 1 (c) shows the schematic of the sense amplifier (SA). Compared to conventional latch-type sense amplifiers, this one has a high-impedance input differential stage [8]. An input dc voltage of about 0.85 VDD is chosen here as a tradeoff between power, area and system reliability. Fig. 2 (b) illustrates how the sensing margin changes with VDD. $\Delta V_{min}$ is the minimum voltage difference between the SA inputs required for a sensing yield of 99.8%. The available voltage difference decreases with decreasing VDD, while the required $\Delta V_{min}$ increases. For VDD < 0.8 V, the sensing margin falls to 0, and for VDD > 0.9 V, there is sufficient margin for reliable sensing.

#### D High-speed Timing Control Organization

Fig. 1 (d) depicts the circuit of the timing control module. This circuit is suited to generating long pulse signals, which ensure enough time for the cell's discharge and the sensing process.

The circuit works as follows. When clk is "0", node L1 is charged through T0. When clk rises abruptly, the charge on L1 does not disappear immediately, so the NAND outputs "0". After the propagation delay of several inverters, rd\_en goes to "1". At this point, clk and rd\_en together turn on the pull-down path to generate a signal pulse. The clocked pull-down scheme in Fig. 1 (d) cuts off a possible short-circuit current path: without the clocked T1, short-circuit current would arise when clk goes low while rd\_en stays high.
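The behaviour described above can be illustrated with a coarse discrete-time model. This is only a sketch under several assumptions that the paper does not confirm: the inverter chain is treated as a pure delay of `chain_delay` time steps, the pull-down path is assumed to discharge node L1, and contention is flagged whenever the precharge device and the pull-down path conduct in the same step. The names `simulate` and `square_wave` are hypothetical.

```python
from collections import deque


def simulate(clk_wave, chain_delay, clocked_pulldown=True):
    """Return (rd_en trace, short-current trace) for a list of 0/1 clock samples."""
    l1 = 1                                  # node L1, assumed precharged at start
    chain = deque([0] * chain_delay)        # inverter chain modelled as a pure delay line
    rd_en_trace, short_trace = [], []
    for clk in clk_wave:
        rd_en = chain.popleft()             # value launched chain_delay steps earlier
        precharge_on = clk == 0             # T0 charges L1 while clk is low
        # With the clocked T1 in series, the pull-down conducts only while clk is high.
        pulldown_on = rd_en == 1 and (clk == 1 or not clocked_pulldown)
        short = precharge_on and pulldown_on  # both paths conduct at the same time
        if precharge_on and not pulldown_on:
            l1 = 1
        elif pulldown_on and not precharge_on:
            l1 = 0
        nand_out = 0 if (clk == 1 and l1 == 1) else 1
        chain.append(1 - nand_out)          # the chain turns NAND "0" into rd_en "1"
        rd_en_trace.append(rd_en)
        short_trace.append(short)
    return rd_en_trace, short_trace


def square_wave(low_steps, high_steps, cycles):
    """Clock samples: low_steps zeros followed by high_steps ones, repeated."""
    return ([0] * low_steps + [1] * high_steps) * cycles


slow_clk = square_wave(4, 8, 2)   # high phase longer than the loop delay
fast_clk = square_wave(2, 2, 4)   # high phase shorter than the loop delay
for name, wave in (("slow", slow_clk), ("fast", fast_clk)):
    for clocked in (False, True):
        _, shorts = simulate(wave, chain_delay=3, clocked_pulldown=clocked)
        print(f"{name} clock, clocked pull-down={clocked}: short current={any(shorts)}")
```

In this toy model, contention appears only for the fast clock without the clocked pull-down, where rd\_en is still high after clk has gone low. This is consistent with the role the text assigns to T1.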

## 3 Prototype and measured results

A prototype of the register file, including a test circuit and a PLL, is implemented in TSMC 65 nm LP technology. A die photograph of the prototype is shown in Fig. 2 (d). The test circuit, which is specially designed to measure the performance of the register file, occupies a large part of the chip. The PLL acts as the clock source for the whole system. The register file has an independent supply voltage so that its power consumption can be measured.

The measured results are shown in Fig. 2 (c), which presents both performance and power consumption versus VDD. At the nominal voltage of 1.2 V, the prototype operates correctly at 768 MHz, consuming 7.08 mW. At VDD = 1.2 V and T = $25^{\circ}$C, the leakage current of the register file is 2.6 µA.

Fig. 2. (a) Cell read margin at a given failure rate of $10^{-7}$. (b) Sensing margin versus VDD ($\Delta V_{min}$ is obtained for a yield of 99.8%). (c) Measured results of the prototype. (d) Die photograph of the prototype.

**Table I.** Comparison of the register file with some published work.

| Paper | Capacity | Ports | Process (nm) | Power (mW) | Freq. (GHz) | Area_cell (µm<sup>2</sup>) | Area (mm<sup>2</sup>) | VDD (V) | ACN (µm<sup>2</sup>) | PN (µW/[GHz*V<sup>2</sup>]) | AN (µm<sup>2</sup>) |
|-------|----------|-------|--------------|------------|-------------|------------------|------------|---------|------------|-------------------|-----------|
| ISSCC'2006 [7] | 16×64 | 1R1W | 65 | 198 | 8.8 | 4.14 | 0.017 | 1.2 | 2.07 | 11.0 | 8.3 |
| VLSI'2010 [10] | 64×32 | 1R1W | 32 | 83 | 8.3 | | 0.0067 | 1.0 | | 3.52 | 6.75 |
| ISSCC'2011 [11] | 144×78 | 4R2W | 45 | 59 | 2.76 | 2.78 | 0.088 | 0.9 | 0.97 | 0.564 | 2.72 |
| JOS'2012 [5] | 32×32 | 4R2W | 65 | 7.2 | 0.8 | 20.52 | 0.046 | 1.2 | 3.42 | 1.46 | 7.49 |
| This work | 64×32 | 4R2W | 65 | 7.08 | 0.77 | 3.15 | 0.018 | 1.2 | 0.53 | 0.748 | 1.46 |

Table I compares this design with some previous work. To make the comparison more explicit, PN and AN, defined by (2) and (3), are listed in the table.

$$PN = \frac{Power \times 1.2^2}{N_{port} \times Capacity \times Freq \times VDD^2}$$
(2)

$$AN = \frac{Area \times 65^2}{L_{process}^2 \times N_{port} \times Capacity}$$
(3)
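The normalizations in (2) and (3) can be reproduced from the raw columns of the comparison table. The sketch below makes three assumptions that are inferred because they match the tabulated PN and AN values, not because the paper states them: power is converted from mW to µW, area from mm<sup>2</sup> to µm<sup>2</sup>, and Capacity is the total number of bits (words times bits per word). $N_{port}$ is taken as read plus write ports. All function and variable names are hypothetical.

```python
# Rows transcribed from the comparison table:
# (label, words, bits per word, total ports, process nm, power mW, freq GHz, area mm^2, VDD V)
ROWS = [
    ("ISSCC'2006 [7]", 16, 64, 2, 65, 198.0, 8.8, 0.017, 1.2),
    ("VLSI'2010 [10]", 64, 32, 2, 32, 83.0, 8.3, 0.0067, 1.0),
    ("ISSCC'2011 [11]", 144, 78, 6, 45, 59.0, 2.76, 0.088, 0.9),
    ("JOS'2012 [5]", 32, 32, 6, 65, 7.2, 0.8, 0.046, 1.2),
    ("This work", 64, 32, 6, 65, 7.08, 0.77, 0.018, 1.2),
]


def normalized_power(power_mw, ports, capacity_bits, freq_ghz, vdd):
    """PN per equation (2), with power converted to uW (assumed unit)."""
    power_uw = power_mw * 1e3
    return power_uw * 1.2**2 / (ports * capacity_bits * freq_ghz * vdd**2)


def normalized_area(area_mm2, process_nm, ports, capacity_bits):
    """AN per equation (3), with area converted to um^2 (assumed unit)."""
    area_um2 = area_mm2 * 1e6
    return area_um2 * 65**2 / (process_nm**2 * ports * capacity_bits)


def compare(rows=ROWS):
    """Map each design label to its (PN, AN) pair."""
    result = {}
    for label, words, bits, ports, process, power, freq, area, vdd in rows:
        capacity = words * bits  # assumed: capacity counted in bits
        result[label] = (
            normalized_power(power, ports, capacity, freq, vdd),
            normalized_area(area, process, ports, capacity),
        )
    return result


for label, (pn, an) in compare().items():
    print(f"{label:16s} PN = {pn:6.3f}  AN = {an:5.2f}")
```

Under these assumptions, the computed values agree with the PN and AN columns to within rounding. Both metrics are per port and per bit, with power referred to 1.2 V and area referred to a 65 nm process.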

The comparison shows that this design is the most area efficient. In terms of power efficiency, its results are competitive with the state-of-the-art design in [11].





## 4 Conclusion

This paper details the design of a $64 \times 32$ bit 4-read 2-write register file using a pseudo-differential sensing scheme. The scheme results in a high-density array without cell banking, and the proposed fully shareable cell layout further improves the area efficiency of the design. The non-full-swing approach consumes less power on the bit-line. In the timing control module, a clocked pull-down circuit cuts off a possible short-circuit current path at high clock frequency. A prototype is implemented in TSMC 65 nm LP technology; the measured results demonstrate operation at 0.77 GHz with a power consumption of 7.08 mW at 1.2 V.

## Acknowledgments

This work is supported by a grant from National Significant Science and Technology Projects -01 Special 2010ZX01030-001-001-03.

