# A 500-MHz Low-Power Five-Port CMOS Register File

Jiajing Wang

Microelectronics Dept. Fudan University Shanghai, CHINA 200433 Tel: +86-21-65642765 Fax: +86-21-65644158 e-mail: jjwang\_ee@yahoo.com

Qianling Zhang

Microelectronics Dept. Fudan University Shanghai, CHINA 200433 Tel: +86-21-65642765 Fax: +86-21-65644158 e-mail: qlzhang@fudan.edu.cn

Abstract: The current-mode approach gives a register file high operating speed at low power consumption. In 0.35-$\mu$m CMOS technology, we present a $32 \times 32$-bit, five-port register file with a 1.85-ns read access time that uses a modified current-sense amplifier. A delay match circuit and a TSPC flip-flop are also used to improve speed. Energy is saved by current-mode sensing, pulsed control signals and clock gating. Simulation results show that it consumes 640 mW at 500 MHz and 3.3 V, and that it still works at 2.3 V, running at 350 MHz with 300 mW.

#### I. INTRODUCTION

In current microprocessors, the register file is one of the key components that determine overall performance. High data throughput requires fast, simultaneous multiple reads and writes in the register file. At the same time, the demand for energy-efficient processors calls for a low-power register file, because it accounts for a substantial portion of the power budget in modern microprocessors. For example, in Motorola's M.CORE architecture, the register file consumes 16% of the total processor power and 42% of the data path power [1].

Many studies so far have focused on three-port (two reads and one write) high-speed register files with few power constraints [2][3]. However, high-performance microprocessors such as ARM10 and Intel XScale usually use pipelines deeper than five stages, which require register files with more read and write ports, and power reduction has become the main design issue in these microprocessors. We therefore set out to design a five-port register file that combines high speed with low power consumption and can be used in a high-performance RISC microprocessor.

Voltage-mode operation used to be popular in register files and SRAM. In this mode, power is reduced by lowering the swing on the bitlines. In deep sub-micron technology, however, the small-swing signal can be lost under the influence of parasitic bitline capacitance and coupling noise from interconnect lines. The current-mode approach overcomes these disadvantages, and with it a multi-port SRAM can achieve high operating speed with low power consumption [4][5].

In this paper, we design a five-port register file that uses a current-mode sense amplifier based on [6]. To shorten the idle time of the sense amplifier, a delay match circuit is proposed, and to further improve the read access time, a high-speed TSPC D-FF is applied. Power reduction is also considered: besides the current sensing scheme, pulsed control signals and clock gating are used. In $0.35\,\mu$m, 3.3-V CMOS technology, circuit simulations indicate that this register file has a 1.85-ns read access time with 0.64 W power consumption. Compared with a conventional voltage-mode register file based on [9], the delay is reduced by 0.8 ns and 28% of the power is saved.

## II. THE REGISTER FILE OVERVIEW

A block diagram of the five-port $32 \times 32$-bit register file is shown in Fig.1. The memory cell consists of cross-coupled inverters and pass transistors for every port. A differential sensing structure is used because sensing a small swing between a pair of bitlines gives lower energy and higher speed.

In the register file, the write operation takes place during the first half of the cycle and the read data is accessed during the second half. This avoids the need for a bypass path from the write-back stage to the register-read stage of the microprocessor pipeline.
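The write-then-read ordering within one cycle can be illustrated with a small behavioral model. This is only a sketch: the class name `RegisterFile` and its `cycle` method are hypothetical, and because the paper does not state how the five ports are divided between reads and writes, the model accepts any number of each.

```python
class RegisterFile:
    """Behavioral model: writes land in the first half-cycle, reads in the second."""

    def __init__(self, words=32, width=32):
        self.mask = (1 << width) - 1
        self.cells = [0] * words

    def cycle(self, writes, read_addrs):
        # First half of the cycle: commit every (address, data) write.
        for addr, data in writes:
            self.cells[addr] = data & self.mask
        # Second half of the cycle: reads observe the freshly written data.
        return [self.cells[addr] for addr in read_addrs]


rf = RegisterFile()
# Write r5 and read r5 in the same cycle; the read already sees the new value.
values = rf.cycle(writes=[(5, 0xDEADBEEF)], read_addrs=[5, 6])
print([hex(v) for v in values])  # ['0xdeadbeef', '0x0']
```

Because the read in the second half-cycle already returns the value written in the first half, the model needs no forwarding logic for a same-cycle write and read of one register.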



Fig.1. Architecture of 32×32-bit five-port register file

To achieve a high-performance register file, we mainly consider solutions that optimize delay and power. Fig.1 shows the read path, which includes address latching and decoding, the memory cell, the sense amplifier and the output buffer. The speed of a register file is usually determined by the delay of the read path, so we focus on reducing the delays along it. The approaches to improving read speed are described in Section III, and Section IV gives details of the key ideas for reducing power consumption.

## **III. HIGH-SPEED APPROACHES**

## A. High-Speed Current-Sense Amplifier

The sense amplifier is the most important part of the read access, and a high-speed sense amplifier can greatly reduce the read access delay. Because of the small voltage swing, conventional voltage-mode sense amplifiers usually use a hierarchical structure to improve speed. This has disadvantages: delay and power dissipation accumulate across the stages, and the circuits become much more complicated. A current-mode sense amplifier is therefore preferable for high-speed SRAM [7]. In this article, we use a new high-speed current-mode sense amplifier based on the circuit of [6].

Fig.2 shows the circuit of this sense amplifier used in the register file. It includes a controlled cross-coupled transistor structure (MP1, MP2, MN1, MN2), control transistors (MP3, MP4, MN3, MN4), state control transistor MN5, recovery transistor MP5, output balance transistor MP6, charge transistors MP7 and MP8, and input balance transistor MP9. SAC is the control signal of whole sense amplifier and WL is the wordline pulse.

There are two working phases: balancing and amplifying. When WL and SAC are both low, the memory cell is isolated from the bitlines and MP6 is turned on, so the circuit is in the balancing phase. At this time, MP5 is turned on and node S goes high. Then MN3, MN4, MP1 and MP2 turn on successively. Because MP7, MP8 and MP9 remain on, the voltages on the two bitlines stay high, so the output terminals O and NO are also held at the source voltage. When WL and SAC rise, a differential current forms and the circuit enters the amplifying phase. Node S quickly goes low, which turns on MP3 and MP4. MP1, MP2, MN1 and MN2 then form the cross-coupled structure. In Fig.2, because of the change in current on bitline BL, output O quickly goes low, while NO stays high.

The circuit in Fig.2 is based on [6], but adds transistors for data balancing and for providing source current. These modifications help to improve the performance of our register file. As the transistors that provide the constant source current, MP7 and MP8 keep charging the two bitlines and hold them near the source voltage. MP9 provides input balance: it equalizes the pair of bitlines and makes the swing extremely small, so power dissipation is greatly reduced. In the worst case of continuously reading alternating '0' and '1' data from the cell, the output balance transistor MP6 helps O and NO to charge and discharge in advance, which shortens the worst-case read access time. In addition, in the cross-coupled structure, MN1 (MN2) should be larger than MP1 (MP2) because the sensing speed is mostly determined by MN1 (MN2). As the switching transistors for the phase change, MN5 and MP5 should be as large as possible to switch faster. With all the transistors optimized, this current-mode sense amplifier performs better.

We simulate the circuit with HSPICE in $0.35\,\mu$m CMOS technology. Fig.3 shows the sensing delay at different source voltages. Here the sensing delay is defined as the interval from the point at which a 20 mV differential voltage is applied between the bitlines to the point at which a 100 mV differential voltage is available between the two output nodes. When the source voltage is 3.3 V, the delay is less than 0.4 ns.

#### B. The Delay Match Circuit

In Fig.2, when the wordline WL rises, the cell is connected to the bitlines and the differential current immediately appears on the pair of bitlines. SAC should rise at the same time so that the sense amplifier enters the amplifying phase. However, because of the high-speed flip-flop and address decoding circuit, WL rises much earlier than SAC, so the sense amplifier stays idle for a period of time. To improve speed, this idle time should be reduced as much as possible, so the delay match circuit shown in Fig.4 is proposed.

Fig.2. Circuit of high-speed current-sense amplifier

Fig.3. Delay of current-sense amplifier vs. supply voltage

Fig.4. Circuit of optimum timing chain

The clock and its delayed inverse pass through a NOR gate to generate a basic pulse signal. The read enable signal is ANDed with this pulse to produce the SAC pulse, and the row decoder signal is ANDed with it to produce the WL pulse. By adjusting the gate delays in the two paths, the rising edge of SAC can be brought much closer to that of WL, which shortens the idle time of the sense amplifier.
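The pulse generation and delay matching can be sketched with a discrete-time model. Everything here is illustrative rather than taken from the paper: waveforms are lists with one sample per time step, the function names are hypothetical, the path delays (2 and 5 steps) are made-up numbers, and the row-select and read-enable inputs of the AND gates are assumed to be asserted.

```python
def nor(a, b):
    return int(not (a or b))

def basic_pulse(clk, delay):
    """NOR of the clock and its delayed inverse, one sample per time step."""
    out = []
    for t, c in enumerate(clk):
        past = clk[t - delay] if t >= delay else clk[0]
        delayed_inv = 1 - past
        out.append(nor(c, delayed_inv))
    return out

def delayed(signal, steps):
    """Shift a waveform later by `steps` samples (models gate delay)."""
    return [signal[0]] * steps + signal[:len(signal) - steps]

def first_rise(signal):
    return next(t for t in range(1, len(signal)) if signal[t] and not signal[t - 1])

def idle_time(clk, pulse_width, wl_delay, sac_delay):
    pulse = basic_pulse(clk, pulse_width)
    wl = delayed(pulse, wl_delay)    # row-decoder AND path, row assumed selected
    sac = delayed(pulse, sac_delay)  # read-enable AND path, read assumed enabled
    return first_rise(sac) - first_rise(wl)

CLK = ([1] * 10 + [0] * 10) * 2      # two clock cycles, 20 samples each

print(basic_pulse(CLK, 3)[8:14])     # [0, 0, 1, 1, 1, 0]: pulse after the falling edge
print(idle_time(CLK, 3, wl_delay=2, sac_delay=5))  # 3 steps of idle time
print(idle_time(CLK, 3, wl_delay=2, sac_delay=2))  # 0 once the paths are matched
```

In this model the pulse starts at the falling clock edge and its width equals the delay applied to the inverted clock, which is the knob the later pulse-width discussion relies on.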

## C. High-Speed TSPC D-FF

In Fig.1, all addresses are first latched into flip-flops before decoding, so a high-speed flip-flop also improves the speed of the whole register file. Our design uses a high-speed TSPC (True Single-Phase Clock) D-FF [8]. Its delay is about 0.25 ns less than that of a conventional master-slave flip-flop. In addition, this D-FF has complementary outputs (Q and NQ), so the complementary address signals required by the decoding stage are available ahead of time, which also reduces the address decoding delay.
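To show why complementary outputs help the decoder, the following sketch models a 5-bit row decoder for 32 words in which every row term is an AND over Q or NQ of each address bit, so no extra inverter sits in the decode path. The function names are hypothetical, and the gate-level structure is an assumption, since the paper does not give the decoder circuit.

```python
def tspc_dff_outputs(bit):
    """Return (Q, NQ) for a latched address bit."""
    return bit, 1 - bit

def decode_row(address, bits=5):
    """One-hot wordline select built from AND terms over Q/NQ pairs."""
    pairs = [tspc_dff_outputs((address >> i) & 1) for i in range(bits)]
    select = []
    for row in range(1 << bits):
        # Each row ANDs Q where its own address bit is 1 and NQ where it is 0.
        term = 1
        for i, (q, nq) in enumerate(pairs):
            term &= q if (row >> i) & 1 else nq
        select.append(term)
    return select

row_select = decode_row(19)
print(row_select.index(1), sum(row_select))  # 19 1
```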

#### **IV. LOW POWER SOLUTIONS**

#### A. Reduction of Bit Line Energy

The bitlines are large nodes in the register file because they are connected to all the cells along one column of the memory array. During a read access, the energy consumed on one bitline is:

$$E_{bl} = V_{dd}V_{swing}C_{bl}$$

To reduce $E_{bl}$, we can decrease $V_{swing}$, the signal swing required by the sensing scheme, and $C_{bl}$, the capacitance on the bitline. Section III described the current-mode read, in which the current-mode sense amplifier has very low input impedance. The differential current signal can be sensed without charging or discharging the bitline capacitance, so the voltage change on the bitline, $V_{swing}$, can be very small. The current-mode approach is therefore superior to the conventional voltage mode in both speed and energy efficiency. For large register files, reducing $C_{bl}$ is also valuable. For a small register file such as $32 \times 32$ bits, however, it is not worthwhile to add extra logic to reduce $C_{bl}$, because the additional control circuits would themselves consume considerable power.
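A short numerical illustration of the formula $E_{bl} = V_{dd}V_{swing}C_{bl}$ follows. The bitline capacitance and the 20 mV swing are hypothetical values chosen only to show the scaling; the paper does not report either quantity for the actual bitlines.

```python
def bitline_energy(vdd, vswing, cbl):
    """E_bl = Vdd * Vswing * Cbl, in joules for volts and farads."""
    return vdd * vswing * cbl

VDD = 3.3        # supply voltage used in the paper
C_BL = 0.5e-12   # hypothetical bitline capacitance (not from the paper)

full_swing = bitline_energy(VDD, VDD, C_BL)    # rail-to-rail bitline swing
small_swing = bitline_energy(VDD, 0.02, C_BL)  # hypothetical 20 mV swing
print(f"ratio = {full_swing / small_swing:.0f}x")
```

Since $E_{bl}$ is linear in $V_{swing}$, the ratio depends only on the two swings and not on the assumed capacitance.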

#### B. Pulsed Wordline and Sense Amplifier Control Signal

As described for the delay match circuit, WL and SAC are both generated from the basic pulse, so a narrower pulse shortens the active period of the cell and sense amplifier. By adjusting the delay of the inverter chain in Fig.4, we obtain the optimum pulse width, the minimum time required for reading and writing in the cell array. Less power is thus spent on unnecessary circuit activation.

## C. Clock Gating

In practice, most instructions do not require all five ports; only some complex ones such as MAC do. To make the register file more energy-efficient, we use clock gating: the clock is gated by the port-enable signal. When a port is unused, the gated clock prevents irrelevant switching activity on the internal nodes and clock line. For example, gating the clock of the write-data flip-flops reduces write power.
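The effect of gating on switching activity can be illustrated by counting clock transitions for one port. This is a simplified sketch with hypothetical names and a made-up usage pattern (the port is active in 2 of 8 cycles); it models the gate as a plain AND and ignores the latch a real gating cell would typically use to avoid glitches.

```python
def gated_clock(clk, enable):
    """AND the clock with a port-enable signal, sample by sample."""
    return [c & e for c, e in zip(clk, enable)]

def transitions(signal):
    """Count level changes, a rough proxy for dynamic switching activity."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a != b)

clk = [0, 1] * 8                      # 8 clock cycles, 2 samples per cycle
enable = [1, 1] * 2 + [0, 0] * 6      # port used in the first 2 cycles only
gated = gated_clock(clk, enable)

print(transitions(clk), transitions(gated))  # 15 4
```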

## V. THE SIMULATION RESULT AND CHIP IMPLEMENTATION

## A. Simulation Result

Using the TSMC $0.35\,\mu$m, 3.3-V CMOS technology model, the five-port register file is simulated with HSPICE. Fig.5 shows the simulation waveforms. In this figure, A and NA are the two internal nodes of the memory cell, and OUT is the output node of the read port. Simulation results show that the write time (from the positive edge of CLK to the point at which the input data is written into the cell) is 1.75 ns, while the read access time (from the negative edge of CLK to the point at which the output data is stable) is 1.85 ns. The power dissipation is about 640 mW. At a low voltage of 2.3 V, the register file still works at high frequency with low power consumption. Fig.6 shows how the delay and power consumption change with source voltage.
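For comparison across operating points, the reported power and frequency figures can be converted to average energy per clock cycle. The sketch below uses the two operating points stated in the paper (640 mW at 500 MHz and 3.3 V; 300 mW at 350 MHz and 2.3 V); the helper names are hypothetical, and energy per cycle is a derived figure, not one the paper reports.

```python
def energy_per_cycle_nj(power_mw: float, freq_mhz: float) -> float:
    """Average energy per clock cycle in nJ (mW divided by MHz gives nJ)."""
    return power_mw / freq_mhz

def clock_period_ns(freq_mhz: float) -> float:
    return 1000.0 / freq_mhz

nominal = energy_per_cycle_nj(640, 500)   # 3.3 V operating point
low_vdd = energy_per_cycle_nj(300, 350)   # 2.3 V operating point
print(f"{clock_period_ns(500):.2f} ns period at 500 MHz")
print(f"{nominal:.2f} nJ/cycle at 3.3 V, {low_vdd:.2f} nJ/cycle at 2.3 V")
```

On these numbers, the energy per cycle at 2.3 V is roughly one third lower than at 3.3 V.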



Fig.5. Simulation waveforms of register-file read and write



Fig.6. Delay and power consumption of register file vs. supply voltage

Using the same technology, we developed another five-port register file that uses the conventional voltage sensing structure proposed in [9]. Table I lists the read-path delays of each register file. Thanks to the current-mode approach and the other high-speed, low-power solutions, the new register file performs better than the conventional one: the total delay is reduced by 0.8 ns and the power consumption by 28%.
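The stage delays in Table I can be checked against the quoted totals with a few lines of arithmetic. The values are copied from Table I (address register, address decoding and SAC generator, sense amplifier and output buffer); the variable names and the percentage figure are additions for illustration.

```python
READ_PATH_NS = {
    "conventional [9]": (0.45, 0.95, 1.25),
    "current-mode": (0.20, 0.80, 0.85),
}

# Sum the three stages of each read path, rounded to the table's precision.
totals = {name: round(sum(stages), 2) for name, stages in READ_PATH_NS.items()}
saved = round(totals["conventional [9]"] - totals["current-mode"], 2)
percent = 100 * saved / totals["conventional [9]"]

print(totals)                                     # {'conventional [9]': 2.65, 'current-mode': 1.85}
print(f"{saved} ns saved ({percent:.0f}% of the conventional read access)")
```

The stage sums reproduce the 2.65 ns and 1.85 ns read access times, and the 0.8 ns saving corresponds to about 30% of the conventional read access.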

## B. Chip Implementation

The circuit is fabricated in TSMC $0.35\,\mu$m, 3.3-V, 1P4M CMOS technology. The register file macro measures 0.852 mm × 0.861 mm. Because a multi-port register file has many pins, which would increase the package cost and the difficulty of testing, a serial-to-parallel converter and scan test circuits are included. The layout of the register file and its test circuit is shown in Fig.7. The whole chip measures 1.680 mm × 2.008 mm.

## VI. CONCLUSIONS

High-speed, low-power approaches have been applied to a five-port, $32 \times 32$-bit CMOS register file. A modified current-mode sense amplifier based on [6] was used, and a delay match circuit and a TSPC D-FF were applied to further improve read speed. To reduce overall power consumption, pulsed control signals and clock gating were also presented. In TSMC $0.35\,\mu$m, 3.3-V CMOS technology, HSPICE simulation results indicate that the new register file has a read access time of 1.85 ns and a power consumption of 640 mW. At a low voltage of 2.3 V, it still works at 350 MHz with 300 mW. It can be used in high-performance microprocessors.

TABLE I DELAY OF READ PATH (NS)

| Design       | Address register | Address decoding & SAC generator | Sense amplifier & output buffer | Read access |
|--------------|------------------|----------------------------------|---------------------------------|-------------|
| Based on [9] | 0.45             | 0.95                             | 1.25                            | 2.65        |
| Ours         | 0.20             | 0.80                             | 0.85                            | 1.85        |

Fig.7. Layout of the Register File Chip with Test Circuit

#### REFERENCES

- [1] D. R. Gonzales, "Micro-RISC architecture for the wireless market," *Motorola M-Core Technology Center*, 1999.
- [2] M. Nomura, M. Yamashina, K. Suzuki, M. Izumikawa, H. Igura, H. Abiko, et al., "A 500-MHz, 0.4-um CMOS, 32-word by 32-bit 3-port register file," *IEEE 1995 Custom Integrated Circuits Conference*, pp. 151, 1995.
- [3] R. L. Franch, J. Ji, C. L. Chen, "A 640-ps, 0.25-um CMOS, 16×64-b three-port register file," *IEEE J. Solid-State Circuits*, Vol. 32, pp. 1288-1292, 1997.
- [4] M. Izumikawa, M. Yamashina, "A current direction sense technique for multiport SRAM's," *IEEE J. Solid-State Circuits*, Vol. 31, pp. 546-551, 1996.
- [5] M. M. Khellah, M. I. Elmasry, "A low-power high-performance current-mode multiport SRAM," *IEEE Trans. VLSI Systems*, Vol. 9, pp. 590-598, 2001.
- [6] G. V. Kristovski, Y. L. Pogrebnoy, "New sense amplifier for small-swing CMOS logic circuit," *IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing*, Vol. 47, pp. 573-576, 2000.
- [7] E. Seevinck, P. J. Beers, H. Ontrop, "Current-mode techniques for high-speed VLSI circuits with applications to current sense amplifier for CMOS SRAM's," *IEEE J. Solid-State Circuits*, Vol. 26, pp. 525-536, 1991.
- [8] Q. Huang, R. Rogenmoser, "Speed optimization of edge-triggered CMOS circuits for gigahertz single-phase clocks," *IEEE J. Solid-State Circuits*, Vol. 31, pp. 456-465, 1996.
- [9] S. Teruo, I. Eisaku, F. Chiaka, "A 6-ns, 1-Mb CMOS SRAM with latched sense amplifier," *IEEE J. Solid-State Circuits*, Vol. 28, pp. 478-483, 1993.