# **UC Santa Cruz**

# **UC Santa Cruz Previously Published Works**

# **Title**

Low-power clock distribution using a current-pulsed clocked flip-flop

# **Permalink**

https://escholarship.org/uc/item/8gt5v1b4

# **Journal**

IEEE TCAS-I, 62(4)

# **Authors**

Islam, Riadul; Guthaus, Matthew R.

# **Publication Date**

2015-04-01

Peer reviewed

# Low-Power Clock Distribution Using a Current-Pulsed Clocked Flip-Flop

Riadul Islam, Student Member, IEEE, and Matthew R. Guthaus, Senior Member, IEEE

Abstract—We propose a new paradigm for clock distribution that uses current, rather than voltage, to distribute a global clock signal with reduced power consumption. While current-mode (CM) signaling has been used in one-to-one signals, this is the first usage in a one-to-many clock distribution network. To accomplish this, we create a new high-performance current-mode pulsed flip-flop with enable (CMPFFE) using 45 nm CMOS technology. When the CMPFFE is combined with a CM transmitter, the first CM clock distribution network exhibits 62% lower average power compared to traditional voltage mode clocks.

Index Terms—Clock distribution network, crosstalk, current-mode, flip-flop, low-power.

### I. Introduction

PORTABLE electronic devices require long battery lifetimes which can only be obtained by utilizing low-power components. Recently, low-power design has become quite critical in synchronous application specific integrated circuits (ASICs) and system-on-chips (SOCs) because interconnect in scaled technologies is consuming an increasingly significant amount of power. Researchers have demonstrated that the major consumers of this power are global buses, clock distribution networks (CDNs), and synchronous signals in general [1]. The CDN in the POWER4 microprocessor, for example, dissipates 70% of total chip power [2].

In addition to power, interconnect delay poses a major obstacle to high-frequency operation. Technology scaling reduces transistor and local interconnect delay while increasing global interconnect delay [3], [4]. Moreover, conventional CDN structures are becoming increasingly difficult for multi-GHz ICs because skew, jitter, and variability are often proportional to large latencies [5].

Prior to and in early CMOS technologies, current-mode (CM) logic was an attractive high-speed signaling scheme [6]. CM logic, however, consumes significant static power to offer these high speeds. Because of this, standard CMOS voltage-mode (VM) signaling has been the *de facto* standard logic family for several decades.

Low-swing and current-mode signaling, however, are highly attractive solutions to help address the interconnect power and variability problems [1], [4], [7]–[9]. Traditionally, the static power dominates dynamic power consumption in a CM signaling scheme. However, the static power is often significantly less than VM dynamic power and latency is significantly improved over VM in global CM interconnect. CM signaling schemes also offer higher reliability since they are less susceptible to single-event transient upsets due to the absence of buffers with source/drain diffusion areas that can be hit by high-energy particles.

Manuscript received August 31, 2014; revised December 10, 2014; accepted January 26, 2015. Date of current version March 27, 2015. This work was supported in part by the National Science Foundation under Grant CCF-1053838. This paper was recommended by Associate Editor V. Chandra.

The authors are with the Department of Computer Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064 USA (e-mail: rislam@ucsc.edu; mrg@ucsc.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2015.2402938

Previous CM schemes have commonly been used for off-chip signals. Standard logic signals, however, have remained VM to benefit from the low static power of CMOS logic. In our proposed scheme, it is not practical to make each individual point-to-point segment of the CDN CM, but the clock signal should still benefit from the power and reliability of CM signaling. Instead, the power savings are maximized by creating a high-fanout, physically or electrically symmetric distribution [5] that feeds many CM flip-flop (FF) receivers. Logic signals on the FF receivers retain VM compatibility with low-power CMOS logic in the remainder of the chip.

In this paper, we present the first true CM CDN and a new CM pulsed D-type FF where the clock (CLK) input is a CM receiver and the data input (D), an active low enable  $(\overline{EN})$ , and output (Q) are VM. In particular, the key contributions of this paper are:

- The first demonstration of a CM clocked FF.
- The effective integration of the CM FF with VM CMOS logic.
- Power consumption comparison of CM CDN and VM CDN at different frequencies.
- Noise and variability analysis of CM and VM CDN.

The rest of the paper is organized as follows. Section II gives a brief overview of some existing CM signaling schemes. Section III proposes our CM FF and CDN. Section IV compares our new FF and CDN with existing schemes in terms of power and noise immunity. Finally, Section V concludes the paper.

### II. OVERVIEW OF EXISTING CM SIGNALING SCHEMES

In a CM signaling scheme, a transmitter (Tx) utilizes a VM input signal to transmit a current with minimal voltage swing into an interconnect (transmission line), while a receiver (Rx) converts current to voltage, providing a full-swing output voltage. The representative CM scheme in Fig. 1 uses a CMOS inverter as the Tx while the Rx is based on a transimpedance amplifier [10]. This scheme provides delay improvement over VM schemes, but the Rx voltage swings around a common-mode voltage $(V_{CM})$ and any $V_{CM}$ shift would cause a large CDN skew [11].

Fig. 1. Previous CM schemes used an expensive transimpedance amp Rx which could result in significant skew due to $V_{CM}$ shift if applied to CDNs [10].

Fig. 2. Expensive variation tolerant CM signaling scheme [8] consumes large static and dynamic power when compared to the other CM techniques.

Other researchers have used a dynamic over-driving Tx with a strong and weak driver alongside a low-gain inverter amplifier Rx and a controlled current source that addresses the previous  $V_{CM}$  problem [4]. However, this scheme results in rise- and fall-time mismatch at the output [8] which can be problematic in CDNs.

Variation-tolerant CM signaling schemes have used a CM Tx with corner-aware bias circuitry [8]. Fig. 2 shows the variation tolerant CM scheme including Rx and Tx circuits [8]. In this scheme, the inverter amplifier Rx circuit provides low-impedance to ground and holds the terminal point at the switching threshold. However, this comes at the expense of large static and dynamic power when compared to the other CM techniques and makes it unattractive compared to existing VM signaling.

### III. CURRENT-MODE CLOCKING

All of the previous CM signaling schemes perform current-to-voltage conversion and then use the buffered VM clock signal. However, driving the lowest level of a CDN with a full-swing voltage results in large dynamic power in addition to significant buffer area to drive the clock pin capacitances. Our CM scheme is highly integrated into the FFs that directly receive the CM signal to reduce overall power consumption and silicon area.

### A. Current-Mode Pulsed Flip-Flop With Enable (CMPFFE)

Fig. 3 and Fig. 4 show the circuit and simulation data of the proposed current-mode pulsed DFF with enable (CMPFFE). The CMPFFE is similar to our previously published CMPFF [12], but uses an active-low enable $(\overline{EN})$ signal. The CMPFFE uses an input current-comparator (CC) stage, a register stage, and a static storage cell. The CC stage compares the input push-pull current with a reference current and conditionally amplifies the clock to a full-swing voltage pulse that triggers the data to latch at the register stage. The feedback pulsed FF is in stark contrast to the previous CM schemes, which utilized expensive Rx circuits and buffers to drive the final FFs.

Fig. 3. The proposed CMPFFE uses a current comparator and a feedback connection to generate a voltage pulse that triggers a register stage to store data in the storage cell.

Fig. 4. Simulation waveforms confirm the internal current-to-voltage pulse generation (clk\_p) that triggers input data capture.

The choice of push-pull current enables a simple Tx circuit (discussed further in Section III-B) while maintaining a constant (or at least low-swing) bias voltage on the CDN interconnect. The CMPFFE in Fig. 3 is only sensitive to unidirectional push current, which provides the positive-edge-triggered operation of the FF. This design is easily modified into a negative-edge-triggered FF that uses the pull current by using a complementary current comparator.

In order to efficiently receive an input pulse current, a CM Rx requires a low input impedance  $(Z_{in})$ . A small signal analysis at the input of the proposed CMPFFE ensures the low  $Z_{in}$  according to

$$Z_{in} = \frac{1}{g_{m1} + g_{m2}} \tag{1}$$

where $g_{m1}$ and $g_{m2}$ are the transconductances of transistors M1 and M2, respectively. The input impedance of the proposed CM FF is also identical to that of the previously reported variation-tolerant CM signaling Rx [8].
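As a rough numerical illustration of (1), the following sketch evaluates $Z_{in}$ for a pair of transconductance values. The 1 mS values used for $g_{m1}$ and $g_{m2}$ are placeholders chosen for illustration only; the paper does not report them.

```python
def input_impedance(gm1: float, gm2: float) -> float:
    """Small-signal input impedance Z_in = 1 / (g_m1 + g_m2), as in Eq. (1)."""
    total = gm1 + gm2
    if total <= 0:
        raise ValueError("total transconductance must be positive")
    return 1.0 / total


# Hypothetical transconductances in siemens (assumed, not from the paper).
z_in = input_impedance(1e-3, 1e-3)
print(f"Z_in = {z_in:.0f} ohm")
```

Because the two transconductances add, increasing either device's $g_m$ lowers the input impedance seen by the clock wire.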

Traditionally, CM Rx/logic circuits consume a significant amount of static power even when the circuits are in sleep mode. Our CMPFFE incorporates an active-low enable $(\overline{EN})$ signal that, when low, connects PMOS (M4) to vdd for normal operation. On the other hand, when high, it disables the static current I1 in stand-by mode. Since internal node B is decoupled in this stand-by mode, an additional transistor M7 is required to ground the internal clock node and prevent any unintentional latching of input data. Transistor M7 is disabled during normal operation. Adding an extra OFF transistor introduces a stacking effect in the CC [13], which in turn reduces the leakage current in M4 significantly. The peak CMPFFE leakage current is 134 $\mu$A in active mode. However, global $\overline{EN}$ routing requires extra metal resources. Since the proposed CM scheme does not require buffers in the CDN, it is not difficult to globally route $\overline{EN}$.

Fig. 5. The proposed CMPFFE generates an output voltage pulse depending on the input current, consistent with the edge-triggered operation.

In the input stage, the reference voltage generator (Mr2–Mr3) creates a reference current (Iref1) that is mirrored by M4 and generates I1. Similarly, the M1–M2 pair creates the FF reference current (Iref2) which is combined with the input current (i\_in); this current is then mirrored by M5 to I2. A PMOS (Mr1) is added to replicate the voltage drop of M3.

It is possible to use a local or global reference voltage generator for the input gate voltage of M4. Using a global reference can increase the robustness by reducing transistor mismatch between FFs. Hence, when we integrate the CMPFFE with the CM CDN, we use a global reference voltage generator that is distributed across the whole chip. This also saves two transistors per FF and reduces static power with a negligible performance penalty. Unlike corner-aware reference voltage generators [8], we used a simple three-transistor global reference voltage generator as shown in Fig. 3. In addition, CM signaling eliminates the requirement of CDN buffers, which significantly reduces active area and makes global reference routing easier.

The mirrored currents I1 and I2 are compared using the inverting amplifier (A1) at node B and further extended to a CMOS logic level at node C by another inverting amplifier (A2). The inverter pair (X1–X2) generates the required voltage pulse duration before the feedback connection in M6.

The feedback connection from the generated voltage pulse with M6 quickly pulls down the current comparator node B which facilitates generating a small voltage pulse and results in fewer transistors in the register stage. In addition, we properly size the X2 inverter so that it can efficiently drive the clock capacitance of register stage without affecting circuit performance.



Fig. 6. (a) The proposed CM Tx and CDN convert a VM input signal to a push-pull current with minimal interconnect voltage swing and distribute current equally to the CMPFFEs, and (b) simulation waveforms confirm that a VM input is converted to a constant CDN voltage and a representative push-pull current at each CMPFFE.

Fig. 5 shows the transfer characteristics of the proposed CMPFFE based on input current and voltage pulse (clk\_p) generation, and identifies three regions of operation of the proposed FF. In region 1, the input current is $\leq 0$, and node B starts discharging from steady state, resulting in a high voltage (very low swing, 980 mV–850 mV) at the A1 output. Hence, the clk\_p signal stays at 0. In region 2, the input current is in the range $0 < i\_in < 1.5 \, \mu\text{A}$, and node B starts moving from steady state toward high. However, the swing is not large enough, resulting in a low clk\_p signal. In region 3, the input current is $\geq 1.5 \, \mu\text{A}$, and the voltage swing at node B is large enough that the amplifiers and inverter chain can generate the required voltage pulse (clk\_p goes from low to high in Fig. 4) for the register stage.
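The three regions can be summarized as a simple threshold classification. The sketch below is only a behavioral abstraction of the transfer characteristic, using the 1.5 µA boundary quoted in the text; the function names are hypothetical, and the model ignores pulse width and all analog dynamics.

```python
THRESHOLD_A = 1.5e-6  # region 2/3 boundary quoted in the text (1.5 uA)


def operating_region(i_in: float) -> int:
    """Classify an input current (in amperes) into region 1, 2 or 3."""
    if i_in <= 0:
        return 1  # pull or zero current: clk_p stays at 0
    if i_in < THRESHOLD_A:
        return 2  # push current too small: swing at node B is insufficient
    return 3      # push current large enough: a clk_p pulse is generated


def clk_p_fires(i_in: float) -> bool:
    """True when the behavioral model predicts a clk_p pulse."""
    return operating_region(i_in) == 3


for current in (-2.3e-6, 1.0e-6, 2.3e-6):
    print(f"{current * 1e6:+.1f} uA -> region {operating_region(current)}")
```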

The register stage is similar to a single-phase register [14], but requires fewer transistors and has a reduced clock load compared to other pulsed FFs. The current-generated voltage pulse triggers storing data in the output storage cell.

The sizing of M6 is critical to the voltage pulse; we use a minimum sized NMOS transistor with unity aspect ratio. The width of the generated clk\_p is also sensitive to the width and amplitude of the input current (i\_in). The amplitude of i\_in strongly affects the FF performance by changing the operating point of M5 and adding extra delay to the generated clk\_p signal. In order to achieve minimum CLK-to-Q delay, the ideal input current has a $\pm 2.3~\mu A$ amplitude and 70 ps pulse width. This can be guardbanded to tolerate noise and variation.

### B. Current-Mode Transmitter and Distribution

In order to integrate the CMPFFE, a Tx provides a push-pull current into the clock network and distributes the required amount of current to each CMPFFE. Our proposed CM CDN with Tx, interconnect, and the CMPFFE is shown in Fig. 6(a). The Tx receives a traditional voltage CLK from a PLL/clock divider at the root of the H-tree network and supplies a pulsed current to the interconnect which is held at a near constant voltage. The clock distribution is a symmetric H-tree with equal impedances in each branch so that current is distributed equally to each CMPFFE leaf node.

The pulsed current Tx in Fig. 6(a) is similar to previous Tx circuits [4], [8], but uses a NAND-NOR design. The NAND gate uses the CLK signal and a delayed inverted CLK signal, clkb, as inputs to generate a small negative pulse to briefly turn on M1. Hence, the PMOS transistor briefly sources charge from the supply while the NMOS is off. Similarly, the NOR gate utilizes the negative edge of the CLK and clkb signals to briefly turn on M2. Hence, the NMOS transistor briefly sinks current while M1 is off. The non-overlapping input signals from the NAND-NOR gates remove any short-circuit current from the Tx.
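The NAND-NOR pulse generation can be illustrated with a small discrete-time logic model. This is a sketch under idealized assumptions (zero gate delay, a clkb delay expressed in whole samples, and a hypothetical function name); it is not a circuit simulation of the Tx in Fig. 6(a).

```python
def tx_gate_signals(clk, delay):
    """Return (m1_on, m2_on) for a sampled 0/1 clock waveform.

    clkb is the inverted clock delayed by `delay` samples.
    M1 (PMOS) conducts when the NAND output is low.
    M2 (NMOS) conducts when the NOR output is high.
    """
    m1_on, m2_on = [], []
    for t, c in enumerate(clk):
        past = clk[t - delay] if t >= delay else clk[0]
        clkb = 1 - past
        nand_out = 1 - (c & clkb)
        nor_out = 1 - (c | clkb)
        m1_on.append(nand_out == 0)  # brief push pulse after a rising edge
        m2_on.append(nor_out == 1)   # brief pull pulse after a falling edge
    return m1_on, m2_on


clk = [0] * 4 + [1] * 4 + [0] * 4 + [1] * 4
m1, m2 = tx_gate_signals(clk, delay=2)
print("M1 on at samples:", [t for t, on in enumerate(m1) if on])
print("M2 on at samples:", [t for t, on in enumerate(m2) if on])
```

In this model the pulse width equals the clkb delay, and M1 and M2 are never on in the same sample, which corresponds to the absence of short-circuit current described above.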

The Tx M1 and M2 device sizes are adjusted to supply/sink charge into/from the CDN. Depending on the size of the load (number of sinks) and the size of the chip, the device sizes need to be adjusted (discussed further in Section IV-C). The root wires of the CDN carry current that is distributed to all branches, so the sizing of the CDN wires is critical for both performance and reliability. If the resistance of the wire is too high, the current waveform magnitude and period will be distorted and affect the performance of the CMPFFEs. The wire width must also account for electromigration effects while carrying the total current needed to drive all the FFs with the required current amplitude and duration.

### IV. EXPERIMENTS

### A. Experimental Setup

We implemented our proposed CMPFFE, a traditional VM master-slave DFF (MS DFF), a traditional VM pulsed FF (Tra. PFF) [15], a high-performance conditional pulse-enhancement FF (CPEFF) [16], and a recently reported low-power dual dynamic node pulsed hybrid FF (DDPFF) [17] in FreePDK 45 nm CMOS technology [18]. Each FF is compatible with a standard cell library height of 12 horizontal M2 tracks. The layout areas, maximum clock-to-Q (CLK-Q) delays, setup times $(t_s)$, hold times $(t_h)$, and total power are listed in Table I. The performance of the FFs was evaluated using post-layout SPICE simulation at clock frequencies from 2–5 GHz with less than 10 ps slew and a 1 V supply voltage. The power figures assume input data at 100% activity and a load of four minimum-size inverters.

In order to validate the functionality of the CM Tx and the proposed CMPFFE in a CDN, we implemented a symmetric H-tree network spanning 1.2 mm $\times$ 1.2 mm. Each branch of the clock tree is modeled as a lumped 3-component $\Pi$-model and then connected together to make a distributed CDN model. The interconnect unit capacitance and resistance values are as suggested by the 2009–2010 ISPD Clock Synthesis contest [19]. In addition, it is reasonable to model the clock network as RC wires instead of RLC wires, as suggested by the 2010 ISPD Clock Synthesis contest [19]. The primary reason is that the total clock network resistance is much higher than the total inductive reactance [20] for the nominal global clock frequency range ($\leq$ 5 GHz). The functional simulation results with the resulting output current are shown in Fig. 6(b).

TABLE I
THE PROPOSED CMPFFE IS 83% FASTER AND SIMILAR AREA COMPARED TO
THE TRA. PFF BUT CONSUMES MORE STATIC POWER

| Types of FF   | Area (μm²) | Delay (ps) |       |       | Total Power (static + dynamic) ( $\mu W$ ) |       |       |       |
|---------------|------------|------------|-------|-------|--------------------------------------------|-------|-------|-------|
|               |            | CLK-Q      | $t_s$ | $t_h$ | 2 GHz                                      | 3 GHz | 4 GHz | 5 GHz |
| MS DFF        | 5.03       | 37.0       | 21.0  | 5.0   | 49                                         | 73    | 98    | 122   |
| Tra. PFF [15] | 7.48       | 75.5       | -46.0 | 87.0  | 77                                         | 103   | 137   | 171   |
| CPEFF [16]    | 7.75       | 25.0       | -10.0 | 130   | 60                                         | 89    | 117   | 149   |
| DDPFF [17]    | 9.86       | 33.0       | -5.0  | 14    | 62                                         | 95    | 123   | 155   |
| CMPFFE        | 7.34       | 40.3       | -15.8 | 46.6  | 141                                        | 151   | 168   | 183   |

Fig. 7. Using standard cell height, the proposed CM FF consumes lower silicon area compared to the recently reported VM pulsed FFs [16], [17].

### B. CMPFFE Analysis

The CMPFFE consumes 5.3% and 26% less silicon area compared to the recently reported CPEFF and DDPFF, respectively. The proposed FF uses 25 transistors, the VM Tra. PFF and MS DFF use 26 and 20 transistors, respectively, and the CPEFF and DDPFF use 23 and 22 transistors, respectively. In order to work in all process corners, we used 4 extra transistors in the pulse generation of the latter two FFs. Fig. 7 shows the layout of the proposed CMPFFE.

The CLK-Q delays of the FFs are measured under relaxed timing conditions, i.e., the data is stable sufficiently before the arrival of the clock edge. This applies both to the rising edge of the VM signal and the current pulse for the CM clock. For a VM FF, we define the CLK-Q delay as the time from the 50% input clock transition to the 50% FF output (Q) transition. Similarly, for the CM FF, we define the CLK-Q delay as the time from the 50% transition of the ideal input current (2.3 $\mu$A) to the 50% Q transition. Table I shows the maximum CLK-Q delay for both high-to-low and low-to-high Q transitions. Among all the FFs, the CPEFF has the lowest CLK-Q delay. However, low CLK-Q delay and negative setup time also introduce large hold times for an FF. The CMPFFE has a lower CLK-Q delay than the Tra. PFF and is only slightly slower than the MS DFF. The DDPFF has 18% lower CLK-Q delay, but the proposed FF has 13% lower data-to-Q delay.



Fig. 8. The resiliency of the proposed CM scheme is demonstrated through non-uniform Monte-Carlo process variations and mismatch simulations.

Fig. 8 shows the Monte-Carlo simulations of CLK-Q delay of the proposed CMPFFE under varying process and mismatch conditions at 25°C.

We also measured the $t_s$ and $t_h$ times for each FF. These use the common definition: the time margin that causes a CLK-Q delay increase of 10% beyond nominal. The $t_s$ and $t_h$ of the CMPFFE are -15.8 ps and 46.6 ps, respectively. The setup time of the CMPFFE is $1.75\times$ lower than that of the traditional MS DFF. In addition, the recently reported CPEFF has $2.8\times$ more $t_h$ compared to the proposed CMPFFE. The CMPFFE has $3.2\times$ better $t_s$, but also has $3.3\times$ more $t_h$ compared to the DDPFF.
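The 10% criterion can be applied to a simulated sweep as sketched below. The sweep data, the nominal delay of 40 ps, and the function name are hypothetical values made up for illustration; they are not measurements from the paper. A negative result means the data may change after the clock edge and still be captured.

```python
def setup_time(sweep, nominal_delay, tolerance=0.10):
    """Estimate setup time from (data_lead_ps, clk_q_ps) pairs.

    data_lead_ps is how long the data arrives before the clock edge.
    The result is the smallest lead, approached from relaxed timing, whose
    CLK-Q delay is still within `tolerance` of the nominal delay.
    """
    limit = nominal_delay * (1 + tolerance)
    last_ok = None
    for lead, clk_q in sorted(sweep, key=lambda p: p[0], reverse=True):
        if clk_q > limit:
            break  # first point that violates the 10% criterion
        last_ok = lead
    if last_ok is None:
        raise ValueError("no sweep point meets the delay criterion")
    return last_ok


# Hypothetical sweep (ps): CLK-Q delay degrades as the data arrives later.
sweep = [(30, 40.0), (10, 40.2), (0, 41.0), (-10, 42.5),
         (-16, 43.5), (-20, 47.0), (-24, 60.0)]
print(setup_time(sweep, nominal_delay=40.0), "ps")
```

Hold time can be extracted the same way by sweeping how long the data remains stable after the clock edge.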

Table I presents the total power including both static and dynamic components. At low frequencies the CMPFFE consumes higher power than the Tra. PFF, CPEFF, DDPFF, and MS DFF due to a high static power overhead. However, the dynamic power of the CMPFFE increases with frequency at a slower rate than that of the VM FFs. At high frequencies, the power consumption of the CMPFFE is comparable to the Tra. PFF and the CPEFF.
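The different growth rates can be read directly from Table I. The sketch below computes the average power increase per GHz between 2 GHz and 5 GHz for each FF; treating the trend as linear between the endpoints is a simplifying assumption made here, not an analysis from the paper.

```python
# Total FF power (uW) at 2, 3, 4 and 5 GHz, copied from Table I.
FREQS_GHZ = [2, 3, 4, 5]
POWER_UW = {
    "MS DFF": [49, 73, 98, 122],
    "Tra. PFF": [77, 103, 137, 171],
    "CPEFF": [60, 89, 117, 149],
    "DDPFF": [62, 95, 123, 155],
    "CMPFFE": [141, 151, 168, 183],
}


def slope_uw_per_ghz(powers):
    """Average power increase per GHz between the first and last frequency."""
    return (powers[-1] - powers[0]) / (FREQS_GHZ[-1] - FREQS_GHZ[0])


slopes = {name: slope_uw_per_ghz(p) for name, p in POWER_UW.items()}
for name, slope in sorted(slopes.items(), key=lambda kv: kv[1]):
    print(f"{name:9s} {slope:5.1f} uW/GHz")
```

With these endpoint values the CMPFFE has the smallest slope, which is why its total power approaches that of the VM pulsed FFs at 5 GHz despite the higher starting point at 2 GHz.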

The FF power, however, does not represent the overall power consumption of a CDN because interconnect and buffers are major contributors. In Section IV-C, we show that the power savings in the CDN is worth the increase in CMPFFE total power despite the additional static power.

### C. CM CDN Analysis

Total system power consumption of a CDN includes the CDN interconnect power, the buffer power, and the FF power. When measuring the total power consumption, we considered different numbers of sinks distributed in chips of different sizes, following the references provided by the 2009–2010 ISPD Clock Synthesis contest (i.e., the number of sinks per unit area is the same in each case) [19]. In order to supply the required amount of current to each sink, we used Txs of different sizes depending on the size of the chip and the number of sinks. Table II presents the Tx sizing for different numbers of loads and chip sizes. Theoretically, the Tx size should increase $4\times$, since we are increasing the number of sinks in the same manner. However, the chip edge also doubles in each case, resulting in an approximately 6× increase in Tx size. The control circuitry in the Tx may require size increases or buffers to drive a larger capacitive load when the M1/M2 sizes in Fig. 6(a) are increased.

TABLE II
THE RELATIVE SIZING OF THE CURRENT-MODE TRANSMITTER IN FIG. 6(a) INCREASES $6 \times$ IN EACH CASE

| No. of sinks | Chip-edge (mm) | Tx relative sizing                         |
|--------------|----------------|--------------------------------------------|
| 4            | 0.48           | $W_{M1} = 1, W_{M2} = 1$                   |
| 16           | 0.96           | $W_{M1} \approx 6, W_{M2} \approx 6$       |
| 64           | 1.92           | $W_{M1} \approx 36, W_{M2} \approx 36$     |
| 256          | 3.84           | $W_{M1} \approx 216, W_{M2} \approx 216$   |
| 1024         | 7.69           | $W_{M1} \approx 1296, W_{M2} \approx 1296$ |
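The sizing pattern in Table II follows a simple geometric progression: each step multiplies the sink count by 4 and the relative Tx width by about 6. The sketch below reproduces the table entries from that rule; the function name is hypothetical, and the exact factor of 6 is the approximation stated in the table, not a derived design equation.

```python
def tx_sizing_rows(levels: int):
    """Return (number_of_sinks, relative_tx_width) pairs for each step."""
    return [(4 ** (k + 1), 6 ** k) for k in range(levels)]


rows = tx_sizing_rows(5)
for sinks, width in rows:
    print(f"{sinks:5d} sinks -> W_M1 = W_M2 ~ {width}")
```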



Fig. 9. The average power savings of the CM CDN system increase with frequency compared to the other VM FF based CDN schemes (Table III).

In a VM CDN, the dynamic switching power of the interconnect and clock load capacitances along with clock buffers dominate the power consumption. In a CM CDN, the power due to small fluctuations in  $V_{CM}$  and the Tx power contribute, but the static power of the CMPFFE dominates. In both cases, the number of sinks and chip dimensions increase the total power consumption.

We use the same H-tree model in both the CM and VM CDN, but buffers drive the VM CDN instead of the CM Tx circuit. The VM buffered network is optimized for an output clock signal with less than 20 ps slew from 2–5 GHz. Since the proposed CM FF is pulsed by nature, the VM CDN considers several pulsed FFs (Tra. PFF [15], CPEFF [16], DDPFF [17]) and also considers the MS DFF as a reference. In order to facilitate normal CM FF operation, we used an active-low $(\overline{EN})$ signal and also included the required routing power in the CM CDN power calculation.

Table III shows the power breakdown from simulations of the VM and CM CDNs at clock frequencies ranging from 2–5 GHz. The total power consumption of the CMPFFE system includes $\overline{EN}$ signal routing, global reference routing, the CM Tx, the CMPFFE power, and the CM CDN power. On average, the CM CDN consumes less power than the VM CDN for all sizes of CDN at different frequencies. This is due to the large dynamic power consumption caused by the voltage swing (0-to- $V_{dd}$ ) in the VM CDN, whereas the CM CDN has negligible voltage swing as shown in Fig. 6(b).
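As a sanity check on how the system totals are composed, the sketch below adds the CDN power to the per-FF power from Table I multiplied by the number of sinks, using the 1024-sink, 2 GHz entries of Table III (184.90 mW for the VM CDN, 18.25 mW for the CM CDN). Treating the system total as exactly this sum is an assumption made for illustration, so small rounding differences from the table are expected.

```python
def system_power_mw(cdn_mw: float, n_sinks: int, ff_uw: float) -> float:
    """Total power in mW: CDN power plus n_sinks flip-flops of ff_uw each."""
    return cdn_mw + n_sinks * ff_uw / 1000.0


# 1024 sinks at 2 GHz: CDN power from Table III, FF power from Table I.
vm_total = system_power_mw(184.90, 1024, 49)    # MS DFF system
cm_total = system_power_mw(18.25, 1024, 141)    # CMPFFE system
saving_pct = 100.0 * (vm_total - cm_total) / vm_total

print(f"MS DFF system: {vm_total:.2f} mW")
print(f"CMPFFE system: {cm_total:.2f} mW")
print(f"Saving:        {saving_pct:.2f} %")
```

The example shows that the CMPFFE system comes out ahead even though each CMPFFE draws almost three times the power of an MS DFF at 2 GHz, because the CM CDN itself consumes roughly a tenth of the VM CDN power.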

Among different FF systems, the CM FFs consume higher power than the other VM FFs. However, VM interconnect power dominates the CM FF power even at small sizes. The real advantage is that the CM CDN power does not increase like the VM CDN power over frequency. Since the fluctuation of $V_{CM}$ is relatively small, the dynamic power consumption of the CM CDN is negligible. At a low 2 GHz clock frequency, the CM CDN system with the number of CMPFFEs ranging from 4 to 1024 exhibits total power savings of 9% to 32% compared to an MS DFF system. At the same frequency, the proposed system with 1024 sinks shows a total power savings of 33% and 38% compared to the Tra. PFF system and CPEFF system, respectively. As expected and suggested by Table I, we observed a linear increase in total power savings with increasing frequency using the CM CDN compared to a VM CDN, as in Fig. 9. At 5 GHz in particular, the CM CDN system exhibits 51% to 67% total power savings considering 4 to 1024 sinks. The primary reason is that at high frequencies the power consumption of the VM FFs and the CMPFFE is nearly equal. At 2 GHz the CM CDN system saves up to 33% average power compared to the other VM CDNs, while at 5 GHz it saves 59% to 62% average power compared to the other VM FF (MS DFF, CPEFF, Tra. PFF, and DDPFF) systems, as shown in Fig. 9.

TABLE III
POWER SAVING INCREASES WITH THE INCREASE OF FREQUENCY UTILIZING OUR CM CDN COMPARED TO OTHER VM CDNS

| Freq. (GHz) | No. of sinks | CE<sup>1</sup> (mm) | Total CDN power (mW) |        | Total power consumption including FFs and CDNs (mW) |               |            |            |                         | % Saving compared to |               |            |            |
|-------------|--------------|---------------------|----------------------|--------|------------------------------------------------------|---------------|------------|------------|-------------------------|----------------------|---------------|------------|------------|
|             |              |                     | VM CDN               | CM CDN | MS DFF sys.                                          | Tra. PFF sys. | CPEFF sys. | DDPFF sys. | CMPFFE sys.<sup>2</sup> | MS DFF sys.          | Tra. PFF sys. | CPEFF sys. | DDPFF sys. |
| 2           | 4            | 0.48                | 0.72                 | 0.26   | 0.92                                                 | 1.03          | 0.96       | 0.97       | 0.83                    | 9.82                 | 19.63         | 13.95      | 14.66      |
| 2           | 16           | 0.96                | 3.03                 | 0.54   | 3.81                                                 | 4.26          | 3.99       | 4.02       | 2.79                    | 26.8                 | 34.49         | 30.03      | 30.58      |
| 2           | 64           | 1.92                | 10.62                | 1.13   | 13.75                                                | 15.54         | 14.46      | 14.58      | 10.16                   | 26.14                | 34.66         |            |            |
| 2           | 256          | 3.84                | 44.28                | 4.52   | 56.82                                                | 63.99         | 59.64      | 60.15      | 40.62                   | 28.52                | 36.53         | 31.9       | 32.48      |
| 2           | 1024         | 7.69                | 184.90               | 18.25  | 235.07                                               | 263.74        | 246.34     | 248.38     | 162.63                  | 30.81                | 38.34         | 33.98      | 34.52      |
| 3           | 4            | 0.48                | 1.07                 | 0.30   | 1.36                                                 | 1.48          | 1.43       | 1.45       | 0.90                    | 33.89                | 39.23         | 36.85      | 37.89      |
| 3           | 16           | 0.96                | 4.23                 | 0.58   | 5.40                                                 | 5.88          | 5.66       | 5.76       | 3.00                    | 44.42                | 48.96         | 46.94      | 47.82      |
| 3           | 64           | 1.92                | 15.56                | 1.15   | 20.23                                                | 22.15         | 21.26      | 21.64      | 10.72                   | 47.00                | 51.59         | 49.55      | 50.45      |
| 3           | 256          | 3.84                | 66.51                |        | 85.19                                                | 92.88         | 89.29      | 90.83      | 43.64                   | 48.77                | 53.01         |            | 51.95      |
| 3           | 1024         | 7.69                | 270.02               | 19.78  | 344.77                                               | 375.49        | 361.16     | 367.30     | 174.71                  | 49.33                | 53.47         | 51.62      | 52.43      |
| 4           | 4            | 0.48                | 1.13                 |        |                                                      | 1.68          |            | 1.62       |                         | 34.26                | 40.37         | 37.38      |            |
| 4           | 16           | 0.96                | 4.24                 | 1.04   | 5.81                                                 | 6.43          | 6.11       | 6.21       | 3.73                    | 35.81                | 42.04         | 39.00      | 39.95      |
| 4           | 64           | 1.92                | 21.88                | 1.20   | 28.15                                                | 30.65         | 29.37      | 29.75      | 11.94                   | 57.58                | 61.03         | 59.34      | 59.86      |
| 4           | 256          | 3.84                | 89.38                | 5.29   | 114.47                                               | 124.45        | 119.33     | 120.87     | 48.30                   | 57.81                | 61.19         | 59.63      | 60.04      |
| 4           | 1024         | 7.69                | 361.37               | 20.03  | 461.72                                               | 501.66        | 481.18     | 487.32     | 192.06                  | 58.40                |               |            |            |
| 5           | 4            | 0.48                | 1.70                 | 0.34   | 2.19                                                 | 2.38          | 2.30       | 2.32       | 1.06                    | 51.46                | 55.45         | 53.75      | 54.22      |
| 5           | 16           | 0.96                |                      | 1.38   | 9.13                                                 | 9.92          |            |            |                         | 52.83                | 56.56         |            | 55.40      |
| 5           | 64           | 1.92                | 27.14                | 1.62   | 34.95                                                | 38.08         | 36.68      | 37.06      | 13.33                   | 61.85                | 64.99         | 63.65      | 64.03      |
| 5           | 256          | 3.84                | 112.05               | 5.78   | 143.28                                               | 155.83        | 150.19     | 151.73     | 52.65                   | 63.26                | 66.21         | 64.95      | 65.30      |
| 5           | 1024         | 7.69                | 453.62               | 20.37  | 578.55                                               | 628.72        | 606.20     | 612.34     | 207.76                  | 64.09                | 66.95         | 65.73      | 66.07      |

CE<sup>1</sup>: Chip-edge, CMPFFE sys.<sup>2</sup>: Total power consumption of CMPFFE system including $\overline{EN}$ signal routing, global reference routing, CM Tx, CMPFFEs power, and CM CDN power.

In addition to the dynamic power consumption of the VM and CM CDNs, we also measured the static power consumption of the largest CDN with 1024 sinks. The total static power consumption of the CM CDN with no clock activity is 154 $\mu$W. Under the same conditions, the total static power consumption of the VM CPEFF system is 186 $\mu$W. The results are nearly the same and the difference is negligible compared to the dynamic power consumption of each CDN.

### D. Reliability Analysis

Unlike an exponentially tapered H-tree [21], we used homogeneous wire sizing from the root to each sink, and verified that the maximum current density of the CM CDN in the root wire is $0.275~\mathrm{MA/cm^2}$, which is less than that of the VM CDN ($0.53~\mathrm{MA/cm^2}$). This more than satisfies the ITRS suggestion that current density be limited to $1.5~\mathrm{MA/cm^2}$ [22]. Therefore, electromigration is not a problem for the demonstrated sizes.

### E. Noise Analysis

In order to measure the noise immunity, we compare crosstalk noise simulations for both CM and VM. Fig. 10 shows the test-bench used to analyze the effects of crosstalk noise on traditional VM buffer driven interconnects. This experiment is commonly used to quantify the effect of coupling capacitance on dynamic delay due to the switching activity of neighboring nets that have significant coupling to the original circuit. In scaled technologies, traditional VM schemes are most susceptible when the aggressors are 180° out of phase compared to the victim line. Fig. 10 mimics the worst-case crosstalk by considering 3 parallel interconnections (5 mm long) driven by variable impedance drivers/buffers (VM). Hence, the victim line experiences an effective capacitance that is double the original coupling capacitance. Each 5 mm interconnect line was buffered/segmented every 1 mm. In this case, simulation shows that the victim line delay can increase by up to 35%. In the CM design, two aggressors are driven by VM buffers, while the victim line is driven by a CM Tx. Simulations suggest that the CM scheme exhibits negligible performance penalty and more robustness to noise because the CM victim line has a much larger capacitance without buffering. This means that the relatively short neighboring VM aggressor lines have less crosstalk coupling and therefore less influence on CM delay. Unlike the VM CDN, the CM CDN requires global reference voltage and active-low enable $(\overline{EN})$ signal routing for the CMPFFEs. Since the centralized reference voltage and $\overline{EN}$ signal are both constant voltages, they are minimally affected by crosstalk noise. In addition, the wire capacitance is large, so it is not affected much.

Fig. 10. Traditional VM schemes are most susceptible to crosstalk noise when the aggressors are 180° out of phase compared to the victim line.
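The doubling of the coupling capacitance for out-of-phase aggressors is commonly captured with a switching (Miller) factor per aggressor. The sketch below uses that textbook model with made-up capacitance values of 100 fF to ground and 50 fF per coupling; the numbers and names are assumptions for illustration and do not come from the paper's test-bench.

```python
# Switching factor applied to the coupling capacitance of each aggressor.
MILLER_FACTOR = {
    "in_phase": 0.0,      # aggressor switches with the victim
    "quiet": 1.0,         # aggressor holds a constant voltage
    "out_of_phase": 2.0,  # aggressor switches 180 degrees out of phase
}


def effective_capacitance(c_ground, c_couple, aggressors):
    """Victim capacitance seen by its driver, in the units of the inputs."""
    return c_ground + sum(MILLER_FACTOR[a] * c_couple for a in aggressors)


# Hypothetical values in fF for a victim with two neighbors.
worst = effective_capacitance(100.0, 50.0, ["out_of_phase", "out_of_phase"])
quiet = effective_capacitance(100.0, 50.0, ["quiet", "quiet"])
print(f"worst case: {worst:.0f} fF, quiet neighbors: {quiet:.0f} fF")
```

In this model, constant-voltage neighbors such as the reference and enable wires fall in the quiet case, which is consistent with their small crosstalk impact noted above.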

### F. Variability Analysis

Transistor threshold voltage  $(V_{TH})$  may be affected by variations in doping concentration, gate oxide thickness, gate length effective dimension, etc. [23]. Unlike crosstalk noise,  $V_{TH}$  mismatch can introduce large skew in clock network. Hence, quantifying  $V_{TH}$  induced clock skew is very critical for reliability of the clock network.

We considered the worst case corner for both the CM and VM CDN. For CM, this is with  $V_{TH}$  variation only in the CM Tx and CM FFs because it does not use other buffers. However, the CM Tx is shared and adds zero skew. For VM, this includes variation in the VM FFs and the clock buffers. Traditionally, clock skew is measured at the clock pins of the FFs. However, we wanted to include the impact of variability on our new FF so skew is measured at the FF output. This effectively includes CLK-Q variation in addition to normal clock skew variation. Fig. 11 shows an example to calculate skew due to  $V_{TH}$  variation at ss-ff corner. In CM CDN, we calculated the time delay considering input CLK signal transition of the CM Tx and the output of both CMPFFEs with ss  $V_{TH}$  and ff  $V_{TH}$ . The delay difference is the skew in CM case. Similarly, we calculated the



Fig. 11. At the ss-ff corner, the proposed CM CDN has up to 60% less skew than the VM CDNs.

TABLE IV
THE PROPOSED CM CDN HAS LOWER SKEW DUE TO SUPPLY VOLTAGE AND THRESHOLD VARIATION COMPARED TO RECENTLY REPORTED PULSED FF BASED VM CDN SCHEMES

| CDN with      | Skew (ps), supply voltage variation, $V_{dd} = 0.9$ V | Skew (ps), supply voltage variation, $V_{dd} = 1.1$ V | Skew (ps), threshold voltage variation, ff-ss |
|---------------|-----|-----|----|
| MS DFF        | 10  | -18 | 33 |
| Tra. PFF [15] | 12  | -21 | 43 |
| CPEFF [16]    | 11  | -17 | 35 |
| DDPFF [17]    | 13  | -17 | 34 |
| CMPFFE        | -4  | 15  | 17 |

skew in the VM CDN from the CLK transition at the root buffer to the outputs of the VM FFs with ss $V_{TH}$ and ff $V_{TH}$.
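
The skew measurement described above can be summarized in one expression. This is a sketch in notation introduced here, not the paper's: $t^{(ss)}$ and $t^{(ff)}$ denote the delay from the clock source transition (the CM Tx input or the VM root buffer) to the output $Q$ of a FF simulated with ss $V_{TH}$ and ff $V_{TH}$, respectively.

$$
t_{\mathrm{skew}} = t^{(ss)}_{\mathrm{CLK} \to Q} - t^{(ff)}_{\mathrm{CLK} \to Q}
$$

Because each delay ends at $Q$ rather than at the FF clock pin, the CLK-Q variation of the FF itself is included in $t_{\mathrm{skew}}$.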

Table IV shows the effect of worst-case corner $V_{TH}$ variation on the skew of the different CDNs. The traditional VM MS DFF, CPEFF, and DDPFF based CDNs show comparable skew under all of the variations considered. In the ff-ss corner, the CM CDN has 17 ps of skew while the classic MS DFF based VM CDN has 33 ps. In addition, the proposed CMPFFE-based CM CDN exhibits 51% and 60% less skew than the CPEFF and Tra. PFF based CDNs, respectively. This is due to the fact that the VM CDNs use buffers to distribute the highly capacitive clock to the sinks.
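
The percentages quoted above can be reproduced directly from the ff-ss column of Table IV. The following is a minimal sketch; the function and variable names are hypothetical helpers chosen for illustration, and only the skew values come from the table.

```python
# Skew values (ps) at the ff-ss threshold-voltage corner, copied from Table IV.
FF_SS_SKEW_PS = {
    "MS DFF": 33,
    "Tra. PFF": 43,
    "CPEFF": 35,
    "DDPFF": 34,
    "CMPFFE": 17,
}


def skew_reduction_percent(baseline_ps, proposed_ps):
    """Relative skew reduction of the proposed scheme versus a baseline, in percent."""
    return 100.0 * (baseline_ps - proposed_ps) / baseline_ps


def reductions_vs(proposed="CMPFFE", table=FF_SS_SKEW_PS):
    """Skew reduction of `proposed` against every other entry in the table."""
    proposed_ps = table[proposed]
    return {
        name: skew_reduction_percent(ps, proposed_ps)
        for name, ps in table.items()
        if name != proposed
    }


if __name__ == "__main__":
    for name, pct in reductions_vs().items():
        print(f"{name}: CMPFFE-based CDN has {pct:.1f}% less skew")
```

For the CPEFF and Tra. PFF baselines this gives (35 - 17)/35 and (43 - 17)/43, which match the 51% and 60% figures in the text after rounding.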

As mentioned earlier, the performance of the CMPFFE is sensitive to the width and amplitude of its input current ($i_{in}$). We performed numerous simulations to determine the sensitivity of the CLK-Q delay of the CMPFFE as a function of the input current. Fig. 12 shows the variation of this CLK-Q delay with input current amplitude and pulse width (PW). We define the current sensitivity of the CLK-Q delay as the slope of the linear trendline fitted to the CLK-Q delay curves. We utilized the minimum



Fig. 12. The variation of the CMPFFE CLK-Q delay with input current is within the nominal CLK-Q delay of the traditional VM MS DFF and Tra. PFF.

input current (i.e., $\pm 2.3~\mu\mathrm{A}$) and varied it up to $2\times$ for different PWs. At $PW = 70~\mathrm{ps}$, the current sensitivity of the CLK-Q delay is the highest, while the CLK-Q delay is the lowest among the PWs considered. On the other hand, at $PW = 75~\mathrm{ps}$ the current sensitivity of the CLK-Q delay is the lowest, but the CLK-Q delay is the highest. The delay variation, however, is within the nominal CLK-Q delay of the traditional VM MS DFF and Tra. PFF. Hence, the proposed CMPFFE tolerates a wide input current range while maintaining performance. This current sensitivity analysis helps in understanding the performance tradeoffs of the proposed CMPFFE with respect to the input current, and guides the early-stage design of the current Tx.
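
The current sensitivity defined above, the slope of a linear trendline through the CLK-Q delay curve, can be computed with an ordinary least-squares fit. The sketch below is illustrative only: the function name is hypothetical, and the delay samples in the demo are made-up placeholders, not values read from Fig. 12. Only the 2.3 µA to 4.6 µA sweep range is taken from the text.

```python
def trendline_slope(currents_ua, delays_ps):
    """Least-squares slope of CLK-Q delay (ps) versus input current (uA)."""
    n = len(currents_ua)
    if n != len(delays_ps) or n < 2:
        raise ValueError("need at least two (current, delay) pairs of equal length")
    mean_i = sum(currents_ua) / n
    mean_d = sum(delays_ps) / n
    # Sum of squared deviations of the current samples.
    sxx = sum((i - mean_i) ** 2 for i in currents_ua)
    if sxx == 0:
        raise ValueError("current samples must not all be identical")
    # Sum of cross deviations between current and delay.
    sxy = sum((i - mean_i) * (d - mean_d) for i, d in zip(currents_ua, delays_ps))
    return sxy / sxx


if __name__ == "__main__":
    # Sweep from the minimum current up to 2x, as in the text.
    currents = [2.3, 2.875, 3.45, 4.025, 4.6]
    # PLACEHOLDER delays for one pulse width; not measured or simulated data.
    delays = [30.0, 29.0, 28.2, 27.1, 26.0]
    print(f"sensitivity: {trendline_slope(currents, delays):.2f} ps/uA")
```

Running the fit once per pulse width gives one slope per curve, and those slopes can then be compared in the way the text compares the 70 ps and 75 ps cases.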

### G. Supply Voltage Fluctuation

Due to spatial variation, the supply voltage $V_{dd}$ can differ across locations on the chip. Traditionally, designers assume a $\pm 10\%$ supply voltage fluctuation around the nominal value. Table IV shows the effect of supply voltage fluctuation ($\pm 10\%$ deviation from the 1 V supply) on the performance of the various CDNs. As in the $V_{TH}$ variation analysis, the performance metric is the delay variation from the root to the FF outputs. When the supply voltage is low (0.9 V), the VM CDNs and VM FFs have a positive skew relative to the nominal supply. The primary reason is the lower overdrive voltage $(V_{GS} - V_{TH})$. On the other hand, a high supply voltage (1.1 V) produces a negative skew relative to the nominal case in the VM CDNs. In contrast, at a 0.9 V supply the proposed CM CDN shows a negative skew relative to the nominal supply voltage, while at 1.1 V it exhibits a positive skew. This is due to the operating point variation of the CMPFFE and also validates our current sensitivity analysis. Overall, the proposed CM CDN has lower or comparable skew compared to the VM CDNs.
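
The sign of the VM skew under supply variation can be illustrated with a first-order delay model. The sketch below uses the Sakurai-Newton alpha-power law, which is a standard textbook approximation and not a model used in the paper. The threshold voltage, the exponent `alpha`, and the scale factor `k` are assumed illustrative values, not FreePDK45 parameters, and the model says nothing about the CM CDN, whose skew has the opposite sign.

```python
def alpha_power_delay(vdd, vth=0.4, alpha=1.3, k=1.0):
    """Gate delay ~ k * Vdd / (Vdd - Vth)^alpha (alpha-power law, arbitrary units).

    vth, alpha and k are ASSUMED illustrative values, not taken from the paper.
    """
    if vdd <= vth:
        raise ValueError("model only applies for Vdd above Vth")
    return k * vdd / (vdd - vth) ** alpha


def skew_vs_nominal(vdd, vnom=1.0, **model):
    """Delay at `vdd` minus delay at the nominal supply (same arbitrary units)."""
    return alpha_power_delay(vdd, **model) - alpha_power_delay(vnom, **model)


if __name__ == "__main__":
    for vdd in (0.9, 1.0, 1.1):
        print(f"Vdd = {vdd:.1f} V: skew vs nominal = {skew_vs_nominal(vdd):+.3f} (a.u.)")
```

Under these assumptions the lower overdrive at 0.9 V makes the buffered VM path slower (positive skew) and the higher overdrive at 1.1 V makes it faster (negative skew), matching the signs of the VM rows in Table IV.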

### V. CONCLUSION

In this paper, we presented the first true CM FF and its usage in a fully CM CDN. The proposed CMPFFE is 87% faster, requires similar silicon area and consumes only 7% more power compared to a traditional PFF at 5 GHz. Better yet, the CMPFFE enables a 24% to 62% power reduction on average when used in a CM CDN compared to conventional VM CDNs. The CMPFFE also eliminates the need for complex CM Rx circuitry and/or local VM buffers to drive highly capacitive clock sinks as in previously proposed CM signaling schemes.

### REFERENCES

- [1] H. Zhang, G. Varghese, and J. M. Rabaey, "Low swing on-chip signaling techniques: Effectiveness and robustness," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 8, no. 3, pp. 264–272, Jun. 2000.
- [2] C. Anderson, J. Petrovick, J. Keaty, J. Warnock, G. Nussbaum, J. Tendler, C. Carter, S. Chu, J. Clabes, J. DiLullo, P. Dudley, P. Harvey, B. Krauter, J. LeBlanc, P.-F. Lu, B. McCredie, G. Plum, P. Restle, S. Runyon, M. Scheuermann, S. Schmidt, J. Wagoner, R. Weiss, S. Weitzel, and B. Zoric, "Physical design of a fourth-generation POWER GHz microprocessor," in *Proc. ISSCC*, Feb. 2001, pp. 232–233.
- [3] D. Sylvester and C. Hu, "Analytical modeling and characterization of deep-submicrometer interconnect," *Proc. IEEE*, vol. 89, no. 5, pp. 634–664, May 2001.
- [4] A. Katoch, H. Veendrick, and E. Seevinck, "High speed current-mode signaling circuits for on-chip interconnects," in *Proc. ISCAS*, May 2005, pp. 4138–4141.
- [5] M. R. Guthaus, G. Wilke, and R. Reis, "Revisiting automated physical synthesis of high-performance clock networks," *ACM Trans. Design Autom. Electron. Syst.*, vol. 18, no. 2, pp. 31:1–31:27, Apr. 2013.
- [6] M. Yamashina and H. Yamada, "An MOS current mode logic (MCML) circuit for low-power sub-GHz processors," *IEICE Trans. Electron.*, vol. E75-C, no. 10, pp. 1181–1187, 1992.
- [7] E. Seevinck, P. J. V. Beers, and H. Ontrop, "Current-mode techniques for high-speed VLSI circuits with application to current sense amplifier for CMOS SRAM's," *J. Solid-State Circuits*, vol. 26, no. 4, pp. 525–536, Apr. 1991.
- [8] M. Dave, M. Jain, S. Baghini, and D. Sharma, "A variation tolerant current-mode signaling scheme for on-chip interconnects," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. PP, no. 99, pp. 1–12, Jan. 2012.
- [9] F. Yuan, *CMOS Current-Mode Circuits for Data Communications*. New York: Springer, Apr. 2007.
- [10] A. Narasimhan, S. Divekar, P. Elakkumanan, and R. Sridhar, "A low-power current-mode clock distribution scheme for multi-GHz NoC-based SoCs," in *Proc. 18th Int. Conf. VLSI Design*, Jan. 2005, pp. 130–135.
- [11] N. K. Kancharapu, M. Dave, V. Masimukkula, M. S. Baghini, and D. K. Sharma, "A low-power low-skew current-mode clock distribution network in 90 nm CMOS technology," in *Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI)*, Jul. 2011, pp. 132–137.
- [12] R. Islam and M. Guthaus, "Current-mode clock distribution," in *Proc. ISCAS*, Jun. 2014, pp. 1203–1206.
- [13] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, "Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits," *Proc. IEEE*, vol. 91, no. 2, pp. 305–327, Feb. 2003.
- [14] J. Yuan and C. Svensson, "High-speed CMOS circuit technique," *J. Solid-State Circuits*, vol. 24, no. 1, pp. 62–70, 1989.
- [15] S. Kozu, M. Daito, Y. Sugiyama, H. Suzuki, H. Morita, M. Nomura, K. Nadehara, S. Ishibuchi, M. Tokuda, Y. Inoue, T. Nakayama, H. Harigai, and Y. Yano, "A 100 MHz, 0.4 W RISC processor with 200 MHz multiply adder, using pulse-register technique," in *Proc. ISSCC*, 1996, pp. 140–141.
- [16] Y.-T. Hwang, J.-F. Lin, and M.-H. Sheu, "Low-power pulse-triggered flip-flop design with conditional pulse-enhancement scheme," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 2, pp. 361–366, Feb. 2012.
- [17] K. Absel, L. Manuel, and R. Kavitha, "Low-power dual dynamic node pulsed hybrid flip-flop featuring efficient embedded logic," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 21, no. 9, pp. 1693–1704, Sep. 2013.
- [18] NCSU, FreePDK45 [Online]. Available: http://www.eda.ncsu.edu/wiki/FreePDK45
- [19] ISPD, ISPD 2009 Clock Network Synthesis Contest [Online]. Available: http://ispd.cc/contests/09/ispd09cts.html
- [20] L. Zhang, J. Wilson, R. Bashirullah, L. Lei, J. Xu, and P. Franzon, "Voltage-mode driver preemphasis technique for on-chip global buses," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 15, no. 2, pp. 231–236, Feb. 2007.
- [21] M. El-Moursy and E. Friedman, "Exponentially tapered H-tree clock distribution networks," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 8, pp. 971–975, Aug. 2005.
- [22] Semiconductor Industry Association, *The International Technology Roadmap for Semiconductors*, 2012 ed.
- [23] J. de Gyvez and R. Rodriguez-Montanes, "Threshold voltage mismatch ($\Delta V_T$) fault modeling," in *Proc. 21st VLSI Test Symp.*, Apr. 2003, pp. 145–150.



Riadul Islam received his B.Sc. degree in electrical and electronic engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2007, and the M.A.Sc. degree in electrical and computer engineering from Concordia University, Montreal, QC, Canada, in 2011. From 2007 to 2009, he worked as a full-time faculty member in the Department of Electrical and Electronic Engineering of the University of Asia Pacific, Dhaka, Bangladesh. Currently he is working towards his Ph.D. at the University of California, Santa Cruz, CA, USA, in the Computer Engineering Department. His research interests include low-power clock network design, variability-aware low-power/high-speed digital/mixed-signal circuit design, and fault-tolerant memory/flip-flop design.



Matthew R. Guthaus (SM'10) received his B.S.E. in computer engineering and the M.S.E. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1998, 2000, and 2006, respectively. He is currently an Associate Professor at the University of California, Santa Cruz, CA, USA, in the Computer Engineering department. He is a Senior Member of ACM and a member of IFIP Working Group 10.5. His research interests are in low-power computing including applications in mobile health systems. This includes new circuits, architectures, and sensors along with their application to mobile and clinical health systems. He is the recipient of a 2011 NSF CAREER award and a 2010 ACM SIGDA Distinguished Service Award. He is also the Director of the UCSC Summer Undergraduate Research Fellowship in IT (SURF-IT), a National Science Foundation sponsored Research Experience for Undergraduates (REU) site.