# Inverting versus Non-Inverting Dynamic Logic for Two-Phase Latch-free Nanopipelines

Héctor J. Quintero, Manuel Jiménez, María J. Avedillo and Juan Núñez

Instituto de Microelectrónica de Sevilla, IMSE-CNM (CSIC/Universidad de Sevilla), Av. Américo Vespucio s/n, 41092 Seville (Spain)

{quintero, jimenez, avedillo, jnunez}@imse-cnm.csic.es

Abstract: Very fine-grained latch-free pipelines are successfully used in critical parts of high-performance systems. These approaches are based on Domino logic and multi-phase clock schemes. Reducing the number of logic levels per clock phase and the number of phases to the minimum is a potential way to push the limits of speed. However, implementing such architectures with just one logic level per clock phase and two clock phases is a challenge that requires extensive full-custom design and raises robustness concerns. In this paper we show that the non-inverting feature of Domino plays a critical role in these difficulties. We analyze and compare the performance of two-phase gate-level pipelines implemented with Domino and with ILP, an inverting dynamic gate we have proposed. Our experiments confirm that ILP pipelines are much more robust and could simplify design.

Keywords: Nanopipeline, Dynamic logic, Robust design techniques.

#### I. INTRODUCTION

Unlike conventional logic styles that alternate flip-flops with combinational logic blocks, the design of pipeline architectures without memory elements for high-performance applications requires unconventional design techniques based on logic circuits that can inherently block data propagation. In this sense, dynamic logic has been used successfully in the design of high-performance VLSI circuits. Specifically, [1],[2] present a comprehensive study of the implementation of Domino logic using an interconnection scheme with several clock phases and without the need to place latches between consecutive clock phases.
Variants of these high-performance multiphase architectures have been reported [3], some of which are used to accelerate critical parts of microprocessors. Nanopipeline architectures, with only one gate per clock phase and two clock phases [4]-[8], exhibit significant speed advantages because only two gates evaluate in each clock period. In addition, the distribution of clock signals is considerably simplified because only two clock phases are used.

Although this design style exhibits important advantages, its limitations must also be pointed out. First, only non-inverting stages can be interconnected, because a static inverter must be added at the output of each stage to ensure that, prior to the precharge phase, all inputs of the next stage are set to 0. Second, it has been reported [1],[3] that the implementation of Domino pipelines is not straightforward, requiring full-custom designs with significant limitations. Furthermore, robustness must be taken into account [9], [10].

The paper is organized as follows. Section II gives background on Domino logic and latch-free multi-phase pipelines. Section III analyzes the challenges of two-phase Domino pipelines resulting from the non-inverting feature of those gates. Section IV explores the implementation of inverting dynamic gates. Section V compares inverting versus non-inverting dynamic gates for the target pipeline operation. Finally, conclusions are given in Section VI.

## II. BACKGROUND

A conventional dynamic gate (or Domino gate) is shown in Figure 1. It consists of a dynamic stage and a static output stage. A keeper transistor $(W_K)$ is added to protect the dynamic node against leakage and noise. The gate operates in two phases called precharge $(V_{CLK} \text{ low})$ and evaluation $(V_{CLK} \text{ high})$.
During the precharge phase, the dynamic node $(V_{DYN})$ is precharged to $V_{DD}$ through $M_{PREC}$ (and, thus, the output is discharged), whereas during the evaluation phase, the pull-down network (PDN) and the footer transistor $W_{FOOTER}$ conditionally discharge the dynamic node depending on the applied input combination.

Fig 1. Domino gate.

Figure 2a depicts an overlapping-clock latch-free pipeline operated with three clock phases. Two pipeline stages are depicted. There are three clock phases in each pipeline stage and two gate levels per clock phase, so there are six gate levels per pipeline stage. Figure 2b shows the clock phases required to operate the circuit in Figure 2a. Let T denote the cycle time or period. Note that each phase is delayed by T/3 with respect to its previous one. For the general case of N phases, consecutive clock phases are delayed by T/N. $T_{low}$ and $T_{high}$ are not necessarily identical.

Fig 2. Pipeline operation with overlapping clocks. (a) Architecture. (b) Clock phases.

This architecture is advantageous with respect to traditional Domino pipelines with latches, not only because of the savings obtained by removing the latches, but also in terms of clock skew tolerance and potential for time borrowing [1].

Advantages in terms of operating frequency, energy, noise immunity or tolerance to process parameter variations, as well as competitive trade-offs among these design criteria, can be obtained by using a single gate per clock phase (nanopipeline). In addition, a clock scheme with only two phases is attractive for two reasons: the clock period must accommodate the evaluation of just two gates, and the distribution of the clock signals is simplified. The rest of the paper focuses on these two-phase nanopipelines.

# III. ANALYSIS OF TWO-PHASE DOMINO NANOPIPELINES

Figure 3 shows the block diagram of the target gate-level pipeline interconnection scheme. Each stage is a Domino gate.

Fig 3. Two-phase Domino nanopipeline.
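
As a rough numerical illustration of the overlapping clock phases described in Section II, the sketch below places N phases shifted by T/N and computes how long two consecutive phases are simultaneously high. It assumes ideal square clocks with a single duty cycle shared by all phases; the function names `phase_windows` and `overlap_ps` and the example values (250 ps period, 60% duty cycle) are hypothetical and chosen only for illustration.

```python
def phase_windows(period_ps, duty_cycle, n_phases=2):
    """Return (rise, fall) times in ps of each phase's high pulse in the first cycle.

    Assumes ideal square clocks; phase k is delayed by k * T / N.
    """
    shift = period_ps / n_phases
    high = duty_cycle * period_ps
    return [(k * shift, k * shift + high) for k in range(n_phases)]


def overlap_ps(period_ps, duty_cycle, n_phases=2):
    """Time (ps) during which two consecutive phases are both high."""
    windows = phase_windows(period_ps, duty_cycle, n_phases)
    (_, fall_first), (rise_second, _) = windows[0], windows[1]
    # No overlap if the first phase falls before the second one rises.
    return max(0.0, fall_first - rise_second)


if __name__ == "__main__":
    # Illustrative values only: two phases, 250 ps period, 60% duty cycle.
    print(phase_windows(250.0, 0.6))
    print(overlap_ps(250.0, 0.6))
    # With a 50% duty cycle the two phases do not overlap at all.
    print(overlap_ps(250.0, 0.5))
```

Under these assumptions the overlap in a two-phase scheme is (duty cycle − 0.5) · T, so it is positive only for duty cycles above 50%.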
# Throughout failure

The latch-free pipeline operation relies on the overlap of consecutive clock phases. To obtain this overlap, the duty cycle in a two-phase scheme must be larger than 50%. However, an excessive overlap can cause throughout or sliding failures [1], as illustrated in Figure 4. Assume each stage in Figure 3 is a buffer. We expect to obtain the input sequence (IN) at the output of STG10 (OUT\_CHAIN) after some latency, but a wrong operation is observed. To analyze the problem, the outputs of the first three stages and the clock phases are also shown. STG1 evaluates with the positive pulse of CLK1. Since the gate evaluates very fast, the one reaches STG2 before the falling edge of CLK2, and so STG2 evaluates the new input data, which should not be evaluated until the next pulse of CLK2 for correct pipeline operation. Note that this narrow one is fully transmitted to STG3 (OUT3).

Fig. 4. Throughout failure in a two-phase Domino pipeline interconnection scheme due to excessive overlap.

#### Robustness concerns

We have observed that the operating frequency of Domino nanopipelines is not independent of the number of pipeline stages, as it should be. This behavior arises from the fact that, in order to produce a logic one, a non-inverting gate requires one or more of its inputs to be at logic one as well. As a result, non-ideal logic ones get worse as they propagate through the logic network, eventually leading to a functional failure. Non-ideal logic ones could be the result of variations in parameters or operating conditions [9], [10]. Figure 5 illustrates this behavior.

Fig 5. Functional failure in a two-phase Domino pipeline interconnection scheme due to the propagation of non-ideal logic ones through the logic network.

This behavior can be explained on the basis of the input combination producing a zero-to-one transition. In Domino, being non-inverting, this output transition is associated with input combinations that discharge the dynamic node.
Discharging the dynamic node requires one or more inputs to be at logic one. "Good" ones are required to fully discharge the dynamic node and produce a "good" output one. Moreover, the non-ideal behavior of consecutive stages accumulates. A non-ideal one prevents the dynamic node from being fully discharged. This leads to a faster precharge of the dynamic node and, therefore, to an even narrower logic one at the output. Thus, the dynamic node of the next stage is discharged to a higher voltage level. These results suggest advantages related to the use of inverting stages.

Fig 6. ILP gate.

| CLK1 | CLK2 | CLK1 | CLK2 | CLK1 | CLK2 | CLK1 | CLK2 | CLK1 | CLK2 |
|------|------|------|------|------|------|------|------|------|------|
| S    | S    | S    | S    | C    | C    | C    | S    | S    | S    |

Fig 7. Block diagram of a two-phase nanopipeline with simple ( $\mathcal{S}$ ) and complex ( $\mathcal{C}$ ) stages.

#### IV. INVERTING DYNAMIC GATES

The most straightforward way to implement an inverting dynamic gate is to remove the static output inverter from the Domino topology or to add a second static inverting stage to it. However, in general, such gates cannot be chained. Their outputs precharge high, which can unintentionally discharge the dynamic nodes of successive gates when these enter evaluation, before the driving gate has actually evaluated. This is a well-known limitation of basic dynamic logic gates, and it motivates the addition of the static inverter to form the widely used Domino gate, from which working circuits can be built.

# Proposed Topology

Figure 6 depicts the schematic of the proposed topology, in which the static inverter of the output stage of the Domino gate has been replaced by a static NAND gate and one static inverter. Note that the inputs of the NAND are the dynamic node and the clock, $V_{CLK}$. For $V_{CLK}$ low, $V_{NAND}$ is pulled up independently of $V_{DYN}$.
The static inverter is added to guarantee that the precharge value of the gate output ( $V_{OUT}$ ) is low, as in Domino logic. For $V_{CLK}$ high, the NAND gate evaluates its input. For input combinations which discharge the dynamic node, the pull-down network of the NAND is off and the gate output remains low. For input combinations which do not discharge the dynamic node, the NAND output node is pulled down and $V_{OUT}$ is pulled up.

This topology resembles the Delayed Output Evaluation (DOE) topology that we proposed in [11] to solve the problem of obtaining wide fan-in dynamic gates (required in some applications) with practical speed-noise tolerance trade-offs, a challenge in scaled technologies. In DOE, the NAND gate uses a delayed version of the clock, which requires a pair of extra inverters. This delay can be eliminated in the moderate fan-in gates used in logic circuits. The simplified DOE gate exhibits the features we are looking for in two-phase pipelines: it is inverting (I) and its output precharge value is low (LP). Since the evaluation of the output is no longer delayed, we call it the ILP gate.

In the next section we explore the suitability of the proposed topology as a building block for two-phase latch-free nanopipeline circuits, which could benefit from its inverting nature to improve on the classical Domino ones.

# V. NON-INVERTING VERSUS INVERTING PIPELINE STAGES COMPARISON

The nanopipeline depicted in Figure 7 has been implemented both in Domino (non-inverting) and with the proposed topology. Note that it consists of gates of different complexity, fan-out and/or load associated with the interconnections. That is, there are two types of gates: a simple one (S) and a complex one (C). The different situations of interest are thus covered: consecutive simple stages, consecutive complex stages, a simple stage followed by a complex one and a complex stage followed by a simple one.
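
The coverage claim above can be checked mechanically. The following minimal sketch takes the stage sequence as read from the Figure 7 block diagram (S S S S C C C S S S) and enumerates the types of every pair of adjacent stages; the helper name `adjacent_pairs` is hypothetical and introduced only for this illustration.

```python
from collections import Counter

# Stage types read left to right from the Figure 7 block diagram.
STAGES = "SSSSCCCSSS"


def adjacent_pairs(stages):
    """Return the (driver, receiver) type of every pair of consecutive stages."""
    return [a + b for a, b in zip(stages, stages[1:])]


pairs = adjacent_pairs(STAGES)
counts = Counter(pairs)

# All four situations of interest should appear at least once.
assert set(counts) == {"SS", "SC", "CC", "CS"}

for kind in ("SS", "SC", "CC", "CS"):
    print(kind, counts[kind])
```

Each of the four driver/receiver combinations appears at least once in the ten-stage chain, which is what the comparison requires.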
Beyond the quantitative results for each topology, the goal of this work is to highlight the qualitative differences between them. Both nanopipelines have been designed using the UMC 1.2V 0.13µm CMOS commercial technology. The same transistor sizing has been used for both designs except for the keeper, which has been selected to suit each gate topology. Although only simulation results are reported, an integrated circuit has been fabricated, from which experimental measurements will be obtained for comparison with the results presented in this work.

First, we have evaluated the minimum and maximum overlap for each design. The overlap is determined by the duty cycle (*DTC*) of the clock signal. Table I shows the minimum and maximum *DTC* with a 4GHz clock signal. The valid range is indeed smaller for Domino than for the ILP version. In the case of Domino, the upper limit is associated with the occurrence of throughout failures. In the case of ILP, it is determined by an excessive reduction of the precharge phase, which does not allow the precharge to be completed satisfactorily. The latter is less restrictive. In addition to this quantitative difference, there is an important qualitative difference associated with the distinct origin of the upper limit for the *DTC*, or maximum allowed overlap between phases, which we will explain once the following experiment has been described.

We have carried out simulations in which a skew is introduced between the two clock phases and the maximum allowed skew is evaluated. In this experiment, the *DTC* used for each topology is chosen to maximize skew tolerance. Again, the frequency is 4GHz. The last two columns in Table I show the results obtained. For each topology, the maximum value of the skew is given in units of time and as a percentage of the clock period. We say that a circuit tolerates a certain skew if it operates correctly with clock phase CLK2 both delayed and advanced by that amount of time.
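
For reference, the relation between a skew expressed in time and as a percentage of the period is a simple ratio. The sketch below applies it to the ±25 ps and ±56 ps values listed in Table I at the 4GHz simulation frequency; the function names `period_ps` and `skew_percent` are hypothetical helpers, not part of any design flow used in this work.

```python
def period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def skew_percent(skew_ps, freq_ghz):
    """Skew expressed as a percentage of the clock period."""
    return 100.0 * skew_ps / period_ps(freq_ghz)


if __name__ == "__main__":
    freq = 4.0  # GHz, the simulation frequency stated in the text
    for name, skew in (("Domino", 25.0), ("ILP", 56.0)):
        print(f"{name}: {skew_percent(skew, freq):.1f}% of a {period_ps(freq):.0f} ps period")
```

At 4GHz the period is 250 ps, so the two skew values correspond to roughly 10% and 22% of the period.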
The results show that, as expected, the ILP topology is more robust with respect to clock non-idealities than the Domino topology. Note that the optimal *DTC* value in the case of the ILP topology is above the maximum allowed for Domino.

TABLE I. DTC VALID RANGE AND SKEW CHARACTERIZATION

|        | Min. DTC (% period) | Max. DTC (% period) | Skew (time) | Skew (% period) |
|--------|---------------------|---------------------|-------------|-----------------|
| Domino | 53                  | 61                  | ±25ps       | 10%             |
| ILP    | 52                  | 76                  | ±56ps       | 22%             |

It could be thought that these differences are due to the fact that the frequency at which the measurements were taken is more favourable to the ILP architecture. That is, if the maximum frequency of the Domino circuit were below that of the ILP circuit, we would expect the former to have less capacity to handle skew. However, this is not the case, because of the qualitative difference mentioned above. In general, one way to increase the tolerated skew is to reduce the operating frequency. However, in the case of the Domino nanopipeline we are analyzing, decreasing the frequency is not a valid solution. In absolute terms, the minimum and maximum overlap of the clock phases do not change with the frequency, and therefore the amount by which the edges of one phase can move with respect to the other without causing a malfunction does not increase.

Fig 8. Simulation results with non-ideal clocks. (a) Domino nanopipeline. (b) ILP nanopipeline.

Figures 8 (a) and (b) show simulations for Domino (at 4GHz) and ILP (at 3GHz and 4GHz), respectively. For each topology, simulations at the higher frequency are shown with one skew value that is tolerated and another that is not. The chosen input sequence alternates groups of consecutive zeros and consecutive ones. Correct (wrong) operation is observed for the Domino pipeline with 20ps (30ps) of skew and for the ILP one with a skew of 50ps (70ps).
We have verified that correct Domino operation cannot be achieved for the second skew value (30ps), even when varying the *DTC* at the reduced frequency. However, it is achieved for ILP, as shown in Figure 8b, in which a simulation at the lower frequency with the second value of skew (70ps) is also depicted (waveform at the bottom of the figure). The solution in the case of Domino would require a detailed design of each of the gates to, for example, slow down both the evaluation and the precharge of the simple gates.

## VI. CONCLUSIONS

We have analyzed and compared the operation of latch-free two-phase gate-level pipelines implemented with Domino and with a proposed inverting dynamic gate (ILP). Our experiments have shown the advantages of the ILP topology in terms of robustness. The superiority is due to the inverting feature of ILP. Because of it, ILP does not suffer from sliding failures, and so the upper limit on the maximum tolerated clock overlap is less restrictive than for Domino. In addition, ILP does not exhibit the cumulative effect of the variations in the individual gates. This robustness improvement translates into a simpler design of these high-performance architectures, which require a full-custom approach in Domino.

# ACKNOWLEDGMENT

This work has been funded by Ministerio de Economía y Competitividad del Gobierno de España with support from FEDER under Projects TEC2013-28302 and TEC2017-87052-P.

#### REFERENCES

- [1] D. Harris and M.A. Horowitz, "Skew-tolerant domino circuits", *IEEE J. of Solid-State Circuits*, vol. 32, no. 11, pp. 1702-1711, Nov. 1997.
- [2] D. Harris, "Skew-Tolerant Circuit Design", Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
- [3] R. Hossain, "High Performance ASIC Design", Cambridge, 2008.
- [4] W. Belluomini, D. Jamsek, A. Martin, C. McDowell, R. Montoye, T. et al., "An 8 GHz floating point multiply", *IEEE International Solid-State Circuits Conference*, pp. 374-604, 2005.
- [5] S. Horne, D. Glowka, S. McMahon, P. Nixon, M. Seningen and G. Vijayan, "Fast14 Technology: design technology for the automation of multi-gigahertz digital logic", *International Conference on Integrated Circuit Design and Technology*, pp. 165-173, 2004.
- [6] Z. Owda, Y. Tsiatoushas and T. Haniotakis, "High Performance and Low Power Dynamic Circuit Design", *IEEE New Circuits and System Conference*, pp. 502-505, 2011.
- [7] C.K. Jerry, W.-H. Ma, S. Kim and M. Papaefthymiou, "2.07 GHz floating-point unit with resonant-clock precharge logic", *IEEE Asian Solid State Circuits Conference (A-SSCC)*, pp. 1-4, Nov. 2010.
- [8] R.J. Sung and D.G. Elliot, "Clock-logic domino circuits for high-speed and energy-efficient microprocessor pipelines", *IEEE Trans. on Circuits and Systems II: Express Briefs*, vol. 54, no. 5, pp. 460-464, 2007.
- [9] H.J. Quintero, M.J. Avedillo and J. Núñez, "Improving robustness of dynamic logic based pipelines", *Proceedings Conference on Design of Circuits and Integrated Systems*, 2015.
- [10] J. Núñez, M.J. Avedillo and H.J. Quintero, "DOE Based High-Performance Gate-Level Pipelines", *Proceedings Int. Workshop on Power and Timing Modeling, Optimization and Simulation*, 2014.
- [11] J. Núñez, M.J. Avedillo, J.M. Quintana and H.J. Quintero, "Improving delay-noise trade-off of dynamic gates for fine-grained pipelined applications", *Proceedings Conference on the Design of Circuits and Integrated Systems*, 2013.