# A 56-Gbps PAM-4 Wireline Receiver With 4-Tap Direct DFE Employing Dynamic CML Comparators in 65 nm CMOS

Dengjie Wang, Ziqiang Wang, Hao Xu, Jiawei Wang, Zeliang Zhao, Chun Zhang, Senior Member, IEEE, Zhihua Wang, Fellow, IEEE, and Hong Chen, Senior Member, IEEE

Abstract—This paper presents a four-level pulse amplitude modulation (PAM-4) receiver that incorporates a continuous-time linear equalizer, a variable gain amplifier, a phase-interpolator-based clock and data recovery circuit, and a 4-tap direct decision feedback equalizer (DFE) for moderate-channel-loss wireline applications. A dynamic current-mode logic comparator (DCMLC) is proposed and employed in the DFE. By adopting dynamic logic, the DCMLC breaks the trade-off between bandwidth and clock-to-Q delay in the traditional current-mode logic comparator (CMLC). Compared with the traditional CMLC, the DCMLC reduces the clock-to-Q delay by 36%, which allows the implementation of a 4-tap direct DFE. Moreover, the first-tap feedback signals are tapped directly from the output of the DCMLC, allowing the first-tap feedback current to start 0.5 UI before the decision clock. The PAM-4 receiver prototype is fabricated in a 65 nm CMOS process. At a data rate of 56 Gbps, it compensates for up to 20.17 dB of loss and achieves a bit error rate below 1E-10 with a power efficiency of 4.75 pJ/bit.

*Index Terms*—Receiver (RX), four-level pulse amplitude modulation (PAM-4), decision feedback equalizer (DFE), clock and data recovery (CDR).

# I. INTRODUCTION

THE ever-increasing bandwidth requirements of communication systems have driven wireline transceivers to operate at speeds of up to 56 Gbps, which has promoted the recent development of high-speed I/O standards using four-level pulse amplitude modulation (PAM-4) [1], [2]. In PAM-4 signaling, four levels are used to represent 2 bits of information (LSB and MSB), which doubles the bandwidth utilization compared with non-return-to-zero (NRZ) signaling. The increase of

The authors are with the School of Integrated Circuits, Tsinghua University, Beijing 100084, China (e-mail: hongchen@tsinghua.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3125355.

Digital Object Identifier 10.1109/TCSI.2021.3125355

bandwidth utilization means that, within the same bandwidth, a PAM-4 signal can achieve twice the data rate of an NRZ signal, which may lead to higher data rates and lower equalization requirements. However, the eye swing of a PAM-4 signal decreases to 1/3 of that of an NRZ signal, which reduces the signal-to-noise ratio (SNR) by 9.5 dB. The multilevel nature of the PAM-4 signal also introduces new challenges in designing PAM-4 transceivers. First, PAM-4 signaling increases the circuit complexity of the transmitter and receiver. At the transmitter, separate multiplexer paths are required for the LSB and MSB signals, and the implementation of a PAM-4 feedforward equalizer also requires a large number of output stage segments. At the receiver, the multi-level decoding requires more hardware (at least three times more samplers than the NRZ receiver), which brings high power dissipation and heavy loading. Second, the design of PAM-4 equalization is another challenge. In particular, it is difficult to realize a PAM-4 DFE with high energy efficiency and performance. Finally, an analog front-end (AFE) with sufficient linearity is required in the receiver to avoid further deterioration of the SNR. This paper aims to achieve an energy-efficient PAM-4 receiver by addressing the foregoing design challenges, especially those of the DFE.
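The 9.5 dB figure follows directly from the 1/3 eye height. As a quick check, the Python sketch below computes the penalty for an M-level signal at the same peak-to-peak swing; the function name `pam_snr_penalty_db` is only illustrative, and the calculation assumes equally spaced levels and unchanged noise.

```python
import math

def pam_snr_penalty_db(levels: int) -> float:
    """SNR penalty of an M-level PAM signal relative to NRZ.

    Assumes equally spaced levels and the same peak-to-peak swing, so each
    of the (levels - 1) stacked eyes has 1 / (levels - 1) of the NRZ height.
    """
    if levels < 2:
        raise ValueError("a PAM signal needs at least two levels")
    return 20.0 * math.log10(levels - 1)

penalty_pam4 = pam_snr_penalty_db(4)
print(f"PAM-4 eye-height penalty: {penalty_pam4:.2f} dB")
```

For four levels the result is 20·log10(3), which rounds to the 9.5 dB quoted in the text.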

Two receiver architectures have been employed: the mixed-signal receiver and the analog-to-digital converter (ADC)-digital signal processing (DSP)-based receiver. ADC-DSP-based receivers [3]–[6] have powerful equalization capabilities owing to the DSP and are widely used in high-insertion-loss applications. Mixed-signal PAM-4 receivers [7]–[9], however, are more power-efficient for moderate-channel-loss applications (less than 20 dB), which are the target applications of this paper. In a mixed-signal wireline receiver, continuous-time linear equalizers (CTLEs) are often used to boost the main cursor. However, the high-frequency peaking cannot be too large because it also amplifies noise and crosstalk. The decision feedback equalizer (DFE), which eliminates the residual post-cursors without amplifying noise and crosstalk, is therefore often adopted in the receiver.

However, given the UI timing constraints imposed by the feedback nature of the DFE, the design of a high-speed DFE is challenging, and the circuit complexity of a PAM-4 receiver aggravates the problem. In [10] and [11], the loop-unrolled first tap, originally proposed for NRZ signaling, is used to satisfy the UI timing

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Manuscript received June 22, 2021; revised September 23, 2021 and October 30, 2021; accepted October 31, 2021. Date of publication November 12, 2021; date of current version February 25, 2022. This work was supported in part by the National Science and Technology Major Project from the Ministry of Science and Technology, China, under Grant 2018AAA0103100; in part by the National Natural Science Foundation of China under Grant U19B2041; in part by the Shenzhen Science and Technology Program under Grant JCYJ20180306170609470; in part by the Beijing Engineering Research Center under Grant BG0149; and in part by the Tsinghua National Laboratory for Information Science and Technology under Grant 042003266. This article was recommended by Associate Editor P. Rombouts. (*Corresponding author: Hong Chen.*)



Fig. 1. A direct DFE design with the first tap and its timing constraint.

constraint. However, in PAM-4 signaling, each additional unrolled DFE tap increases the circuit complexity by a factor of four. For example, a full-rate PAM-4 DFE with one unrolled tap requires 12 comparators, which introduces a large capacitive load and degrades the power efficiency. In addition, the multiplexer adds extra delay to the loops of the non-unrolled taps.
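The factor-of-four growth can be made concrete with a short count. The sketch below assumes that a full-rate PAM-4 slicer needs three thresholds and that every unrolled tap must precompute a decision for each of the four possible values of the corresponding previous symbol; the function name is illustrative.

```python
def unrolled_comparator_count(unrolled_taps: int, levels: int = 4) -> int:
    """Data comparators needed by a full-rate PAM DFE with loop-unrolled taps.

    Assumption: (levels - 1) thresholds per decision, replicated once for
    every combination of the previous symbols covered by the unrolled taps.
    """
    if unrolled_taps < 0:
        raise ValueError("number of unrolled taps cannot be negative")
    thresholds = levels - 1
    return thresholds * levels ** unrolled_taps

for taps in range(3):
    print(taps, "unrolled tap(s):", unrolled_comparator_count(taps), "comparators")
```

With one unrolled tap the count is 3 × 4 = 12, matching the example above, while a direct DFE (no unrolled taps) keeps the three comparators per slicer.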

Fig.1 shows a direct DFE design with the first tap (h1). As illustrated in Fig.1, the h1 loop delay, which comprises $T_{cq}$, $T_{settle}$, and $T_{setup}$, must fit within 1 UI. $T_{cq}$ is the clock-to-Q delay of the comparator, $T_{settle}$ is the settling time of the DFE summer, and $T_{setup}$ is the setup time of the comparator. According to the DFE timing analysis in [14], the comparator delay ($T_{cq}$) occupies a considerable part of the one-UI timing budget of the h1 loop. Direct DFEs adopting a Strong-Arm comparator (SAC) [7] or a CMOS track-and-regenerate slice [8] have been demonstrated at over 56 Gbps, fabricated in 16 nm [7] and 28 nm [8] CMOS technology. Current-mode-logic (CML) comparator-based direct DFEs operating at 56 Gbps have been presented in [12] and [13], using 40 nm and 65 nm CMOS technology, respectively. Therefore, this paper aims to realize a power-efficient PAM-4 receiver with a 4-tap direct DFE by decreasing the delay of the CML comparator (CMLC).

This paper presents a 56-Gbps mixed-signal PAM-4 receiver for moderate-channel-loss applications (<20 dB). The rest of the paper is organized as follows. Section II provides an overview of the PAM-4 receiver architecture, and its subsections describe the key circuit designs in the AFE and the clock and data recovery (CDR). Section III describes the operation and features of the proposed dynamic CML comparator (DCMLC) in detail and introduces the optimizations made in this paper to meet the timing constraints of the DFE. In Section IV, the measurement results of the 65 nm CMOS prototype are presented and discussed. Finally, Section V concludes this paper.

# II. RECEIVER ARCHITECTURE

Fig.2 presents the overall architecture of the PAM-4 receiver. The T-coil network consists of ESD protection diodes, peaking inductors, and on-die termination resistors. After the T-coil network, the PAM-4 signal enters the AFE path, which consists of a CTLE, a variable gain amplifier (VGA) stage, and a buffer. The PAM-4 signal equalized by the AFE then enters the 4-tap DFE block, which performs DFE equalization and makes the data and edge decisions. The DFE adopts a quarter-rate architecture and consists of four identical DFE slicers, each of which contains three samplers with different thresholds. Each sampler contains a proposed DCMLC, which serves as the data comparator and performs DFE summing and data slicing, and an edge comparator, which consists of a pass transistor and a SAC and samples the edge data. The DFE is clocked by an eight-phase clock controlled by a phase interpolator (PI). The sampled data and edge streams are sent to the 1:2 deserializer. After deserialization, the data are converted from thermometer code to binary code. Meanwhile, the 1/8-baud-rate data and clock are also output through PADs for off-chip bit error rate (BER) testing. All the data and edge samples are then further deserialized to 1/16 of the baud rate to lower the operating frequency of the CDR logic, which rotates the PI output clock phase so that the DFE samples the data at the center of the eye. The clock path receives an external half-baud-rate single-ended clock signal, which is converted to differential CML levels. A CML frequency divider generates the four-phase quadrature clocks. A triangle circuit reshapes the PI input clocks into triangular waveforms.
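The thermometer-to-binary conversion mentioned above can be sketched as follows. This is only an illustration: the argument names are hypothetical, the three inputs stand for the outputs of the samplers with the low, middle, and high thresholds, and a plain binary level numbering (not Gray coding) is assumed because the text does not specify the bit mapping.

```python
def thermometer_to_binary(t_low: int, t_mid: int, t_high: int) -> tuple:
    """Convert three PAM-4 slicer outputs to an (MSB, LSB) pair.

    t_low, t_mid, t_high are 1 when the input exceeds the low, middle,
    and high threshold, respectively. A valid thermometer code never has
    a 1 above a 0.
    """
    code = (t_low, t_mid, t_high)
    valid_codes = ((0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1))
    if code not in valid_codes:
        raise ValueError(f"not a thermometer code: {code}")
    level = sum(code)              # PAM-4 level index 0..3
    return level >> 1, level & 1   # (MSB, LSB) under the assumed binary mapping

for code in ((0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)):
    print(code, "->", thermometer_to_binary(*code))
```

Under this mapping the middle sampler alone gives the MSB, which is why invalid codes such as (0, 1, 0) are rejected rather than silently decoded.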

The following parts describe the design details of CTLE in Section II-A, VGA, buffer, and AFE linearity in Section II-B, and CDR loop in Section II-C.

## A. CTLE Design

Both pre-cursor inter-symbol interference (ISI) and long post-cursor ISI are mitigated by a single-stage CTLE, shown in Fig.3 (a). The CTLE is an RC source-degenerated differential amplifier with adjustable degeneration resistance and capacitance. The source-degeneration capacitor $C_S$ is manually configured through a 4-bit switched MOS varactor array. The source-degeneration resistor $R_S$ consists of a poly resistor and a MOS resistor. The gates of the MOS resistor and the varactors are connected to a bias voltage, $V_{CTLE}$, whose value is set through a PAD connected to an off-chip voltage source. Inductive peaking is employed to broaden the bandwidth, enlarge the peaking, and achieve reasonable power efficiency.

The AC response of the CTLE is simulated with the number of switched-on varactors varying from 0 to 15 at $V_{CTLE}$ = 0.5 V, as shown in Fig.3 (b). The minimum DC gain is -7.57 dB. Tuning the capacitor array changes the medium- and high-frequency response, providing up to 14.4 dB of peaking at the Nyquist frequency of 14 GHz. Fig.3 (c) presents the simulated AC response of the CTLE with $V_{CTLE}$ varying from 0.5 V to 1 V when all varactors in the CTLE are turned on. Tuning the control voltage $V_{CTLE}$ changes the whole frequency response, but it mainly affects the low-frequency gain. In this design, changing $V_{CTLE}$ provides close to 7 dB of gain control at low frequency. With these two degrees of freedom, the CTLE has great flexibility to adapt to channels with different insertion loss.
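The two tuning knobs can be illustrated with the textbook first-order model of an RC source-degenerated stage. The Python sketch below is not the simulated circuit: it ignores the inductive peaking and the voltage dependence of the MOS resistor and varactors, and every component value is a hypothetical placeholder. In this model `r_s` and `c_s` are the total resistance and capacitance between the two sources, the zero sits at 1/(r_s·c_s), and the degeneration factor 1 + gm·r_s/2 sets both the DC gain and the first pole.

```python
import math

def ctle_gain_db(f_hz, gm, r_d, r_s, c_s, c_l):
    """Gain in dB of an idealized RC source-degenerated CTLE stage."""
    s = 2j * math.pi * f_hz
    k = 1.0 + gm * r_s / 2.0              # degeneration factor
    zero = 1.0 + s * r_s * c_s            # zero created by the R_S/C_S network
    pole = 1.0 + s * r_s * c_s / k        # corresponding pole, k times higher
    out_pole = 1.0 + s * r_d * c_l        # output pole from the load
    h = (gm * r_d / k) * zero / (pole * out_pole)
    return 20.0 * math.log10(abs(h))

# Hypothetical values, chosen only to show the trends
GM, R_D, C_L = 20e-3, 100.0, 20e-15
for r_s, c_s in ((400.0, 100e-15), (400.0, 200e-15), (200.0, 200e-15)):
    dc_db = ctle_gain_db(0.0, GM, R_D, r_s, c_s, C_L)
    nyq_db = ctle_gain_db(14e9, GM, R_D, r_s, c_s, C_L)
    print(f"R_S={r_s:.0f} ohm, C_S={c_s * 1e15:.0f} fF: "
          f"DC {dc_db:.2f} dB, 14 GHz {nyq_db:.2f} dB")
```

In this simplified picture, changing the capacitance moves the mid- and high-frequency response while leaving the DC gain untouched, and changing the resistance shifts the low-frequency gain, which mirrors the roles of the varactor array and $V_{CTLE}$ described above.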

## B. VGA and Buffer

A relatively fixed input swing is required for PAM-4 DFE design, which is directly related to the threshold of



Fig. 2. The PAM-4 receiver architecture.



Fig. 3. (a) Circuit schematic of the CTLE; (b) Simulated AC response of the CTLE for different digital settings of the varactor array at $V_{CTLE} = 0.5V$; (c) Simulated AC response of the CTLE for different settings of $V_{CTLE}$ when the whole varactor array is switched on.

comparators and coefficients of the DFE. A VGA is often designed for swing adjustment of the DFE input. The most popular VGA circuit is based on a degenerated differential pair with a programmable degeneration resistor [15], [16]. This kind of VGA has the advantages of good linearity and low input capacitance due to its simple structure. However, it is difficult to achieve fine and linear gain control with it. Furthermore, the capacitance at the source of the input differential pair introduces an undesirable high-frequency boost. An alternative VGA implementation uses a current-steering structure, such as a Gilbert cell [17]. In each Gilbert cell, only one differential pair is activated to achieve positive or negative gain. This circuit is preferable because of its constant output common-mode voltage, gain-independent bandwidth, and wide tuning range. However, the VGA with the current-steering structure has the following disadvantages: bandwidth reduction due to large



Fig. 4. Schematic of VGA circuit.

input and output capacitances, poorer gain compression, and increased power consumption.

To take advantage of both kinds of VGA, a VGA that combines the two structures above is designed, as depicted in Fig.4. A Gilbert cell is placed in parallel with the degenerated differential pair array, so the total gain of the VGA is the sum of the current gains of all differential pairs multiplied by the load resistance $R_D$. The current gain of the Gilbert cell is either $g_{m5}$ or $-g_{m5}$, depending on its configuration. The degenerated differential pair array is similar to that in [18], and its degeneration network is purely resistive. The degenerated pairs are segmented into 15 cells, in each of which the degeneration network consists of a switch and a resistor in parallel. Therefore, the current gain of each cell in the array can be switched between



Fig. 5. (a) Simulated AC response of the VGA with control code from 0 to 15; (b) Simulated AC response of the VGA with control code from 16 to 31.

two different values according to the state of the switch. Let N denote the number of degenerated differential pairs whose switch is on. The current gain of the degenerated pair array is then given by (1), and the total gain of the VGA can be expressed as (2). The gain of the VGA is controlled by a 5-bit binary code (RC<4:0> in Fig.4, where RC<4> is the control bit of the Gilbert cell). An inductor is added to provide bandwidth extension.

$$g_1 = N \cdot g_m + \frac{(15 - N)g_m}{1 + g_m R_1} \tag{1}$$

$$G_{VGA} = [g_1 \pm g_{m5}] \times R_D \tag{2}$$
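Equations (1) and (2) can be evaluated over the whole control code with a few lines of Python. The sketch below makes several assumptions that are not stated in the text: N is taken directly from RC<3:0>, RC<4> = 1 selects the negative Gilbert-cell gain (consistent with Fig.5, where codes 0 to 15 use the positive gain), and all device values are hypothetical placeholders rather than the designed ones.

```python
import math

def vga_gain_db(code, gm=2e-3, gm5=5e-3, r1=500.0, r_d=50.0, cells=15):
    """DC gain in dB of the VGA from (1) and (2) for a 5-bit control code."""
    if not 0 <= code <= 31:
        raise ValueError("control code must be in 0..31")
    n_on = code & 0b1111                        # assumed: N = RC<3:0>
    gilbert_negative = bool(code & 0b10000)     # assumed: RC<4> = 1 means -g_m5
    # Equation (1): N undegenerated cells plus (15 - N) degenerated cells
    g1 = n_on * gm + (cells - n_on) * gm / (1.0 + gm * r1)
    # Equation (2): add or subtract the Gilbert-cell current gain
    g_total = g1 - gm5 if gilbert_negative else g1 + gm5
    if g_total <= 0:
        raise ValueError("net current gain must be positive for a dB value")
    return 20.0 * math.log10(g_total * r_d)

gains_db = [vga_gain_db(code) for code in range(32)]
print(f"code 15: {gains_db[15]:.2f} dB, code 16: {gains_db[16]:.2f} dB")
```

With these placeholder values the gain rises monotonically within each half of the code range but drops when the code steps from 15 to 16, because N returns to 0 at the same time as the Gilbert-cell term changes sign.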

Fig.5 (a) and (b) show the AC response of the VGA as the number of switched-on degenerated pairs varies from 0 to 15 when the Gilbert cell is configured with a positive and a negative gain, respectively. As stated before, a high-frequency gain boost in the VGA should be avoided as much as possible. Therefore, a 1 dB gain boost at 14 GHz is set as the limit in the VGA design. Configurations whose gain boost at 14 GHz exceeds 1 dB are not used, such as the curves with a DC gain of less than 0.78 dB in Fig.5 (a). The overall VGA gain varies from -2.17 dB to 4.95 dB within a 1 dB boost at 14 GHz. In practice, attention should be paid to the non-monotonicity of the VGA gain when the code changes from 15 to 16.

Due to the large capacitive loading at the DFE inputs, a CML buffer with RC source degeneration and inductive peaking is implemented before the DFE, which provides parasitic isolation and bandwidth extension. The bandwidth of the buffer is around 40 GHz with 2 dB of peaking at 16 GHz.

Since the linearity of the AFE is essential for a PAM-4 signal, the linearity of the CTLE, VGA, and buffer is examined via the input 1 dB gain compression point (1dBcp), expressed as the input differential swing (Vpp). Generally, the 1dBcp improves when the gain is reduced. Therefore, the simulated and required input 1dBcp of the CTLE at 14 GHz, with all varactors switched on, are depicted in Fig.6 (a). The required input 1dBcp is obtained as follows. First, a 1.2 V differential swing of the signal before the channel is assumed. Then, the required 1dBcp is calculated with the gain boost of the CTLE at the corresponding frequency. The VGA's input 1dBcp at 14 GHz is examined over all control codes. The simulated and required input 1dBcp over process, voltage, and temperature (PVT) are presented in Fig.6 (b), in which the jump is caused by the gain decrease of the VGA when the control code changes from 15 to 16. Similarly, the required 1dBcp is estimated by



Fig. 6. (a) Simulated input linearity of the CTLE at 14GHz over PVT with all varactors switched on; (b) Simulated input linearity of the VGA at 14GHz over PVT versus VGA control code.



Fig. 7. The CDR loop.



Fig. 8. PAM-4 transitions. (a), (b) Good transitions; (c) Bad transitions with "small late"; (d) Bad transitions with "large late".

TABLE I PD LOGIC FOR CDR

| Transition (Fig. 8) | (a) | (b) | (c) | (d) |
|---------------------|-----|-----|-----|-----|
| BBPD<sub>H</sub>    | +1  | -1  | -1  | +1  |
| BBPD<sub>Z</sub>    | 0   | +1  | +1  | +1  |
| BBPD<sub>L</sub>    | 0   | +1  | 0   | 0   |
| PD                  | +1  | +1  | 0   | +1  |

assuming that the required differential swing of the PAM-4 signal to DFE is 0.4V. In addition, the input 1dBcp of the buffer is higher than 1 Vppd over PVT.

## C. CDR Design

A PI-based CDR (Fig.7) provides a flexible and energy-efficient solution for adjusting the PI output clock phase to track the data. A phase detector (PD) takes in the 1/8-baud-rate data and edge samples and generates phase error information. Similar to that in [19], the proposed PD uses all transitions, and the final phase error is determined through a bang-bang PD (BBPD) voter. As shown in Fig.8 and Table I, $BBPD_H$, $BBPD_Z$, and $BBPD_L$ represent the BBPD results of the three edge slicers with thresholds of $+V_T$, $V_0$, and $-V_T$, respectively. For good transitions (see Fig.8 (a) and (b)), the PD behaves the same as a conventional BBPD. For bad



Fig. 9. The schematic of (a) TWG circuit, and (b) PI cell.

transitions, there are two situations: "small late/early" (shown in Fig.8 (c)) and "large late/early" (Fig.8 (d)). In the "small late/early" situation, the PD logic ignores the extracted phase information, and in the "large late/early" situation, the PD logic behaves the same as a conventional BBPD. A voter is adopted to reduce the rate of the phase error information to one compatible with the digital filter. The second-order digital filter [20] with adjustable $K_p$ and $K_i$, shown in Fig.7, is designed to generate stable PI control words. A triangular-modulated PI is adopted to guarantee the linearity of the PI over a wide frequency range. A triangular waveform generator (TWG) (Fig.9 (a)) converts the input quadrature clocks to the high slew-rate clocks for the PI using current sources and a capacitor. The PI circuit, depicted in Fig.9 (b), interpolates the phase between *ip* and *qp* by controlling the number of segments assigned to *ip* or *qp*.
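The four cases of Table I can be checked against a simple combining rule. The Python sketch below tests one candidate, the sign of the sum of the three BBPD outputs, which reproduces the PD row for all four transitions of Fig.8. The text does not spell out the gate-level logic, so this rule and the name `combine_bbpd` are assumptions made only for illustration.

```python
# Table I: case -> ((BBPD_H, BBPD_Z, BBPD_L), PD)
TABLE_I = {
    "a": ((+1, 0, 0), +1),
    "b": ((-1, +1, +1), +1),
    "c": ((-1, +1, 0), 0),
    "d": ((+1, +1, 0), +1),
}

def combine_bbpd(bbpd_h: int, bbpd_z: int, bbpd_l: int) -> int:
    """Assumed combining rule: sign of the summed slicer votes (-1, 0, or +1)."""
    total = bbpd_h + bbpd_z + bbpd_l
    return (total > 0) - (total < 0)

for case, (votes, expected_pd) in TABLE_I.items():
    print(case, votes, "->", combine_bbpd(*votes), "(Table I:", expected_pd, ")")
```

Case (c) shows the intended behavior for a "small late" transition: the conflicting votes cancel and the PD output is 0, so the phase information is ignored.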

# III. DFE LOOPS

## A. Comparator Design

Voltage comparators (also called sense amplifiers), which form the interface between the analog and digital domains, are widely used in mixed-signal circuits and systems. A comparator functions like a regenerative amplifier: it samples the input signal at a certain moment and then determines whether the voltage is lower or higher than the threshold voltage. A variety of comparators have been demonstrated in previous works. According to their logic levels, comparators can be divided into CMLCs and CMOS latch-type comparators. The CMOS latch-type comparator, such as the SAC originally presented for memory circuits [21], is appealing for its CMOS-level outputs and zero dc power. Variants of the SAC have also been proposed; for example, the double-tail latch-type comparator (DTLC) in [22] enables operation at lower input common-mode and supply voltages. However, the CMOS latch-type comparator is sensitive to supply variations and input common-mode levels. In a DFE, the comparator's inputs are the outputs of the DFE summer, whose common-mode level varies with the DFE feedback currents. Therefore, the variants of the DTLC [23]–[26] attempt to meet the delay requirement over an extended common-mode range. However, an unwanted common-mode restoration circuit [7], [8] is still required in



Fig. 10. Schematic of the traditional CMLC.

the DFE summer to ensure the performance of the CMOS latch-type comparator.

On the other hand, a single-stage CMLC can provide higher bandwidth and lower delay, and it has been employed in 56-Gbps data receivers [12], [13]. The CMLC also has higher sampling gain and input sensitivity than the CMOS latch-type comparator, and it offers higher immunity to supply variation and input common-mode voltage. However, the CMLC has some drawbacks. First, it suffers from static power consumption. Second, it presents large output capacitances due to the cross-coupled MOS pair, which lower its bandwidth. Third, its output levels are not directly compatible with the relatively power-efficient CMOS logic circuits. The CMLC in [16] attempts to jointly optimize the output swing and the DFE tap sizes, without much optimization to reduce the CMLC delay. Nevertheless, given that the goal of this paper is a data rate of 56 Gbps in a 65 nm process, the CMLC is adopted.

Fig.10 shows a traditional CMLC circuit, which consists of an input tracking stage ($M_1$ and $M_2$) for detecting and tracking the input data, a threshold stage ($M_5$ and $M_6$) for setting the slicer threshold, and a cross-coupled regeneration pair ($M_3$ and $M_4$) for regenerating the data. The tracking and regeneration modes are selected by CKP and CKN, which drive the $M_7$ and $M_8$ differential pair. When CKP is "1", the tail current ($I_{SS1}$) flows entirely through the tracking path, which allows the CMLC to track the input data. Meanwhile, $M_9$ is on, and the tail current ($I_{SS2}$) flows through the threshold stage. Therefore, the tail currents ($I_{SS1}$, $I_{SS2}$) are summed at the output node through the load resistor $R_D$, so that $V_{in}$ and $V_{th}$ are compared in the current domain. In the regeneration mode, CKP is "0" and CKN is "1": the tracking stage is disabled, whereas the regeneration pair is enabled and regenerates toward a logic state through the positive feedback of the cross-coupled pair.

Besides the disadvantages mentioned above, the traditional CMLC has a primary limitation that the same tail current is used for tracking and regeneration pairs. Moreover, the parasitic capacitances of  $M_3$  and  $M_4$  degrade the bandwidth for a proper tracking operation and severely limit the size



Fig. 11. (a) Small-signal model of the CMLC in tracking mode; (b) Small-signal model of the CMLC in regeneration mode.

of the cross-coupled pair for a reliable regeneration operation. In [27], inductive peaking is adopted to broaden the bandwidth and maximize the speed. However, the PAM-4 DFE requires three times more hardware than the NRZ DFE, making the inductor solution inefficient in area and power. On the other hand, the key parameter in the CMLC design is the load resistance ($R_D$), which determines both the tracking bandwidth and the regeneration speed. The small-signal models of the comparator in tracking and regeneration mode are given in Fig.11 (a) and (b), respectively, from which two conclusions can be drawn. One is that the traditional CMLC has a settling time constant of $R_DC_L$ in the tracking mode. The other is that the negative conductance of the CMLC in the regeneration mode is

$$G_{m,regen} = \frac{1}{R_D} - G_{m2},\tag{3}$$

where $G_{m2}$ is the transconductance of the cross-coupled pair. Therefore, there is a trade-off between the bandwidth of the CMLC in the tracking mode and its regeneration speed in the regeneration mode. In other words, reducing $R_D$ to improve the settling time in the tracking mode makes $G_{m,regen}$ less negative and thus increases the comparator's regeneration time constant $\tau_{regen}$. Moreover, the output swing is directly determined by $I_{SS1} \times R_D$; as a result, reducing $R_D$ requires more power to maintain the swing.
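The trade-off can be illustrated numerically. The sketch below takes the tracking time constant as $R_DC_L$ and, as a first-order assumption, the regeneration time constant as $C_L$ divided by the magnitude of the net negative conductance in (3), using the same $C_L$ in both modes. The 100 fF load is borrowed from the comparator simulations described later in this section; the value of $G_{m2}$ and the swept resistances are hypothetical.

```python
def tracking_tau(r_d: float, c_l: float) -> float:
    """Settling time constant in tracking mode: R_D * C_L."""
    return r_d * c_l

def regeneration_tau(r_d: float, c_l: float, g_m2: float) -> float:
    """First-order regeneration time constant from equation (3)."""
    g_net = 1.0 / r_d - g_m2          # equation (3); negative when regenerating
    if g_net >= 0:
        raise ValueError("no regeneration: G_m2 must exceed 1/R_D")
    return c_l / -g_net

C_L = 100e-15    # 100 fF load, as in the comparator simulations
G_M2 = 20e-3     # hypothetical cross-coupled pair transconductance

for r_d in (75.0, 100.0, 200.0):
    print(f"R_D={r_d:.0f} ohm: tau_track={tracking_tau(r_d, C_L) * 1e12:.2f} ps, "
          f"tau_regen={regeneration_tau(r_d, C_L, G_M2) * 1e12:.2f} ps")
```

As the load resistance falls, the tracking time constant shrinks while the regeneration time constant grows, which is the coupling that a single fixed load resistor cannot avoid.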

To solve the problems mentioned above, a DCMLC is proposed to reduce the clock-to-Q delay while maintaining high bandwidth. When the DFE is designed with the DCMLC, the bandwidth requirement of the DFE summer is relaxed as a result of the reduced comparator delay, thereby realizing an area- and energy-efficient DFE design that works at high data rates. Fig.12 shows the circuit schematic of the proposed DCMLC with a negative capacitance. The proposed DCMLC realizes independent load resistances in the tracking and regeneration modes.

As shown in Fig.12, when the DCMLC works in the tracking mode, that is, when clock *COP* is "1" (*CON* is "0"), $M_{9-10}$ and $M_{13-14}$ turn on, $M_{15}$ turns off, and $M_{9-10}$ and $M_{11-12}$ are connected in parallel to achieve a small resistance. The small load resistance allows the DCMLC to achieve a relatively large bandwidth. Note that the smaller the resistance, the more power is needed to maintain a reasonable swing of the PAM-4 signal. Moreover, MOS transistors are used as resistors instead of polysilicon resistors to achieve a small layout area and thus decrease the parasitic capacitance. Furthermore, $M_{7-8}$, $M_{16-17}$, and $C_C$ in the dashed box are equivalent to the negative capacitor proposed in [28], which is designed to compensate for the self-loading effect of the cross-coupled pair and extend the bandwidth at the output node of the DCMLC. $V_C$, which is connected to a PAD, is used to adjust the negative capacitance value to compensate for PVT variation. Simulation results show that the AC gain of the DCMLC in the tracking mode drops by 2.38 dB at 14 GHz without the negative capacitor, and by 0.1 dB with it.

On the other hand, when the DCMLC works in the regeneration mode (*COP* is "0" and *CON* is "1"), $M_{9-10}$ and $M_{13-14}$ turn off and $M_{15}$ turns on. Only $M_{11-12}$ serve as the load resistors, realizing a large resistance, which reduces the regeneration time constant and shortens the regeneration process.

In theory, the output swing of the DCMLC in the regeneration mode is proportional to the load resistance. However, the output swing of the DCMLC is lower than that of the CMLC. The reason is that the input pair and the threshold pair, $M_{1-4}$, connect the outputs together, which provides leakage current paths. Fig.13 illustrates this phenomenon when the DCMLC works in the regeneration mode with output "0". It can be seen that $M_{1-4}$ provide $M_{11}$ with current paths (arrow in Fig.13) to ground through $M_6$ and $M_{15}$. Therefore, the output voltage level of "1" is determined by the voltage division between $M_{11}$ and $M_{1-4}$, $M_6$, and $M_{15}$. A closer analysis shows that this current helps the positive feedback of the cross-coupled pair and further reduces the regeneration delay. Hence, the smaller $M_{11}$ is, the lower the voltage level of "1" is. To reduce the delay of the DCMLC while ensuring enough output swing, the size ratio of $M_{9-10}$ to $M_{11-12}$ in the DCMLC is designed to be 3/2.

The simulation results comparing the "track-regenerate" comparators (DCMLC and CMLC) with the "reset-regenerate" comparators (SAC and DTLC) are presented in Fig.14. The CMLC has the same input transistor size, the same tail current, and the same load resistor as the DCMLC. The power consumption of the SAC and the DTLC is about 6 mW, chosen so that a DFE slicer built with them consumes the same power as the DFE slicer designed with the DCMLC. All the clock-to-Q delay simulations are post-layout. In addition, each comparator drives a 100 fF load capacitor, which represents the estimated parasitic and load capacitance of the tap. As shown in Fig.14 (a), the input signal of the comparators in the simulation is a worst-case data pattern, in which a weak symbol sits within a long run of strong symbols. For example, a $V_{-1}$ symbol lies within a long sequence of $V_3$ symbols. This pattern is chosen because high-level symbols tend to cause relatively large ISI for the subsequent symbols. $\Delta V$ represents the voltage difference between PAM-4 levels "$V_{-1}$" and "$V_1$". In this comparison, the Q point is defined as $\pm$600 mV, and $\Delta V$ is set to 60 mV. Fig.14 (b) gives the clock-to-Q delay of each comparator, from which it can be concluded that the delays of the proposed DCMLC, DTLC, CMLC, and SAC are 20.2 ps, 27.7 ps, 31.5 ps, and 38.2 ps, respectively.



Fig. 12. Circuit schematic of the proposed DCMLC with a negative capacitor.



Fig. 13. The leakage current path of the DCMLC in regeneration mode with output "0".

Compared with the last three comparators, the DCMLC delay is reduced by 27%, 36%, and 47%, respectively. In addition, the swing loss of the DCMLC relative to the CMLC is negligible.

Besides, the DCMLC has high immunity to supply variation and common-mode voltage. With a 100 mV drop from the 1.2 V supply, simulation results (Fig.14 (c)) show that the delays of the proposed DCMLC, DTLC, CMLC, and SAC are 21.1 ps, 31.9 ps, 32.7 ps, and 45.6 ps, respectively. Compared with the delays at the 1.2 V supply, these are increases of 4.5%, 15%, 3.8%, and 19%, respectively. As verified in [22], the input common-mode level seriously affects the delay of the DTLC, which changes by 20% when the common-mode level varies from 0.6 V to 0.75 V. The DFE using the comparator proposed in this paper performs the DFE summation inside the DCMLC, as described in detail later. Therefore, the common-mode level variation occurs at the output of the DCMLC. Fig.15 (a) shows the simulated delay of the DCMLC versus the output common-mode level: the clock-to-Q delay varies by 1.3 ps when the common-mode voltage changes from 0.82 V to 0.93 V. This verifies that the delay of the DCMLC is immune to the common-mode change caused by the DFE taps.
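The percentages quoted for the delay reduction and the supply sensitivity can be reproduced from the simulated delays. The Python sketch below uses only the delay values reported for Fig.14 (b) and (c); the dictionary layout and the helper name `percent_change` are illustrative.

```python
# name: (delay at 1.2 V, delay at 1.1 V) in ps, from Fig.14 (b) and (c)
DELAY_PS = {
    "DCMLC": (20.2, 21.1),
    "DTLC": (27.7, 31.9),
    "CMLC": (31.5, 32.7),
    "SAC": (38.2, 45.6),
}

def percent_change(reference: float, value: float) -> float:
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference

dcmlc_nominal = DELAY_PS["DCMLC"][0]
for name, (nominal, low_supply) in DELAY_PS.items():
    saving = -percent_change(nominal, dcmlc_nominal)    # DCMLC reduction vs this comparator
    slowdown = percent_change(nominal, low_supply)      # increase at the 1.1 V supply
    print(f"{name}: DCMLC is {saving:.1f}% faster, "
          f"delay grows {slowdown:.1f}% at 1.1 V")
```

The rounded results match the 27%, 36%, and 47% reductions and the 4.5%, 15%, 3.8%, and 19% increases stated in the text.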



Fig. 14. (a) Input signals to the comparators; (b) Clock-to-Q delay of the SAC, DTLC, CMLC, and the proposed DCMLC under 1.2V supply; (c) Clock-to-Q delay of the SAC, DTLC, CMLC, and the proposed DCMLC under 1.1V supply.

In addition to reducing the clock-to-Q delay, the proposed DCMLC also offers excellent input sensitivity compared with the SAC and DTLC. The input sensitivity is defined as the minimum differential input swing required for the comparator's output swing to exceed 600 mV within a delay of 1 UI. The input of the comparators used in the simulation is shown in Fig.14 (a), and Fig.15 (b) plots the input sensitivity under the same simulation conditions as the clock-to-Q delay simulation. It can be observed that the sensitivity of the SAC and DTLC becomes very poor at high



Fig. 15. (a) Clock-to-Q delay vs. output common-mode voltage of the DCMLC; (b) Simulated input sensitivity of the comparators at different baud rates.



Fig. 16. The DFE architecture.

baud rates due to the reset phase. Thanks to the negative capacitance that broadens its bandwidth, the proposed DCMLC has better sensitivity than the CMLC, and it has the best sensitivity among all the compared comparators.

## B. Time Constraints of DFE Loops

As the data rate increases, meeting the unit interval (UI) timing constraints of DFE taps, especially the timing constraints of the first tap, poses a serious challenge to the design of the DFE circuit. In this section, the timing constraint in the 4-tap PAM-4 DFE, designed with the proposed comparator, is analyzed in detail.

As shown in Fig.2, the quarter-rate DFE architecture is adopted to reduce the clock frequency and thereby lower the clock power. Fig.16 shows the overall architecture of the DFE circuit. As described in Section II, the input data is sampled by four identical time-interleaved DFE slicers. Each DFE slicer consists of three samplers with different thresholds, and each sampler is composed of a data sampler and an edge sampler. The data sampler performs DFE summation and data decision by using the DCMLC. The edge sampler is designed with a pass-transistor and a SAC to sample the edge information for the CDR. Take slicer 270P as an example to illustrate the connection of the DFE taps. In the first tap (h1) loop, the DCMLC's outputs in DFE slicer 270P are directly connected to the h1 tap in DFE slicer 0P. The outputs of the DCMLC are also amplified to CMOS level by a two-stage inverter buffer to drive the second tap (h2) in DFE slicer 90P. The signal is then converted to NRZ signaling by the SR latch, which drives the third tap (h3) in DFE slicer 180P. Finally, the data is buffered by inverters to drive the fourth tap (h4) in DFE slicer 270P. With the quarter-rate architecture, a 4-tap DFE can be achieved without a latch or D flip-flop (DFF). Therefore, there are four DFE loops, h1, h2, h3, and h4, whose time constraints are expressed in (4), (5), (6), and (7), respectively.

$$T_{cq} + T_{settle} + T_{setup} < 1UI \quad (4)$$

$$T_{cq} + T_{settle} + T_{setup} + T_{buf1} < 2UI \quad (5)$$

$$T_{cq} + T_{settle} + T_{setup} + T_{buf1} + T_{SR} < 3UI \quad (6)$$

$$T_{cq} + T_{settle} + T_{setup} + T_{buf1} + T_{SR} + T_{buf2} < 4UI \quad (7)$$

$T_{cq}$ is the clock-to-Q delay of the proposed DCMLC, $T_{settle}$ is the settling time of the DCMLC in the tracking mode (serving as the DFE summer), $T_{setup}$ is the setup time of the DCMLC, $T_{buf1}$ is the delay of the two-stage inverter buffer after the DCMLC, $T_{SR}$ is the delay of the SR latch, $T_{buf2}$ is the delay of the last buffer, and 1UI is 35.71ps for 56-Gbps PAM-4 signaling. In addition, $T_{buf1}$, $T_{SR}$, and $T_{buf2}$ are all less than 1UI, so each additional stage consumes less than the extra 1UI of budget granted to its loop. Therefore, the tightest timing constraint is that of the *h1* loop.
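The bookkeeping behind (4)-(7) can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions, and every delay value passed in below is a hypothetical placeholder rather than a simulated or measured number. The only quantity taken from the text is the 56-Gbps PAM-4 data rate, from which the 35.71ps UI follows.

```python
# Timing margins of the four direct-feedback loops, following (4)-(7).
# All delay values used in the example run are HYPOTHETICAL placeholders
# chosen only to illustrate the bookkeeping; they are not from the paper.

def unit_interval_ps(data_rate_gbps: float, bits_per_symbol: int = 2) -> float:
    """UI in ps: PAM-4 carries 2 bits per symbol, so 56 Gb/s is 28 Gbaud."""
    baud_rate_gbaud = data_rate_gbps / bits_per_symbol
    return 1000.0 / baud_rate_gbaud


def loop_margins_ps(ui, t_cq, t_settle, t_setup, t_buf1, t_sr, t_buf2):
    """Return the slack (k*UI budget minus path delay) for loops h1..h4."""
    core = t_cq + t_settle + t_setup
    paths = [
        core,                            # (4): h1, direct feedback
        core + t_buf1,                   # (5): h2, via the inverter buffer
        core + t_buf1 + t_sr,            # (6): h3, via the SR latch
        core + t_buf1 + t_sr + t_buf2,   # (7): h4, via the last buffer
    ]
    return [k * ui - path for k, path in enumerate(paths, start=1)]


if __name__ == "__main__":
    ui = unit_interval_ps(56.0)
    margins = loop_margins_ps(ui, t_cq=20.0, t_settle=8.0, t_setup=3.0,
                              t_buf1=15.0, t_sr=20.0, t_buf2=15.0)
    for tap, slack in enumerate(margins, start=1):
        print(f"h{tap}: slack = {slack:.2f} ps")
```

Because each added stage delay is below 1UI, the slack grows from one loop to the next, which is why the *h1* loop sets the tightest requirement.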

Fig.17 depicts the closed h1 loop with the proposed DCMLC, in which the output of the DCMLC is directly connected to the *h1* tap. The current of each DFE tap is set by the gate voltage (connected to pads) of the tail MOS transistor of the corresponding tap path. The time constraint of the h1 loop is given in (4), and $T_{cq}$ is directly affected by the swing at the Q point. This swing should be the differential input swing that makes the input differential pair of the h1 tap work in switching mode. To reduce $T_{cq}$, a small swing at the Q point is required, which means that a large feedback stage is needed and a large capacitive load is introduced at the output of the DCMLC. In this design, a feedback stage of 1.2µm/0.06µm is chosen to achieve robust DFE feedback. Fig.18 plots the *h1* tail current (*Itap*) and the *h1* coefficient current (*Idiff*) versus the input differential swing of the feedback stage over PVT. It can be seen that when the input swing of the stage is above 500mV, more than 95% of the tail current is used for the DFE feedback. Besides, to make the feedback almost noise-free, the swing at the Q point is set to 600mV. Meanwhile, as shown in Fig.19, the worst-case $T_{cq}$ of the DCMLC is 26.6ps over PVT.

The setup time ($T_{setup}$) is a concept borrowed from the digital DFF, where it is defined as the time for which the digital input must be stable before the triggering clock edge. For a comparator, this concept is equivalent to the sampling aperture. The impulse sensitivity function (ISF) describes the sensitivity of the output of a clocked comparator to an impulse input at a given arrival time, and the width of the ISF can be used to characterize the sampling aperture time. More fundamentals and details can be found in [26], [29]. Here, the approach presented in [29] is adopted to simulate the ISF of the comparator. First, a small step with offset voltage $V_{os}$ is applied as input to the comparator, and $V_{MS}(\tau)$ is obtained by sweeping, at each step arrival time $\tau$, the $V_{os}$ that makes the comparator metastable. Then, the ISF is derived as the derivative of $V_{MS}(\tau)$. Fig.20 shows the simulated normalized $V_{MS}(\tau)$ (Fig.20 (a)) and



Fig. 17. The h1 loop of the designed DFE with the proposed DCMLC.



Fig. 18. Simulated h1 tap current and h1 coefficient vs. differential input swing of h1 tap over PVT.

normalized ISF (Fig.20 (b)) of the DCMLC at 28 Gbaud operation. The time range over which the integral of the curve accounts for more than 80% of the total integral of the ISF curve represents the sampling aperture of the DCMLC [30], that is, the time range between $t_0$ and $t_1$ in Fig.20 (b). Therefore, the sampling aperture of the proposed DCMLC is 15ps. $t_0$ and $t_1$ specify the valid timing window during which the input signal can affect the output signal. Time 0 in Fig.20 represents the falling/rising edge of the clock signal. Therefore, $T_{setup}$ can be considered to be 3ps.

Substituting the simulated values of $T_{cq}$ and $T_{setup}$ into (4), $T_{settle}$ should be less than 6.11ps ($35.71 - 26.6 - 3$). However, the simulation result shows that the 95% settling time of the DFE summer is about 13ps, with a $-3$dB bandwidth of about 35GHz. Nevertheless, it is worth pointing out that, with the proposed DCMLC, the settling of the *h1* tap has already started before the decision clock.
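The two numbers in this paragraph can be cross-checked with a short calculation. The sketch below assumes, purely for illustration, that the DFE summer behaves as a single-pole system, which the text does not state; the function names are likewise assumptions. Under that model, a 35GHz bandwidth gives a 95% settling time close to the simulated value of about 13ps, well above the 6.11ps budget left by (4).

```python
import math


def settle_budget_ps(ui_ps: float, t_cq_ps: float, t_setup_ps: float) -> float:
    """Time left for T_settle in (4) once T_cq and T_setup are subtracted."""
    return ui_ps - t_cq_ps - t_setup_ps


def settling_time_ps(bandwidth_ghz: float, settle_fraction: float = 0.95) -> float:
    """Step-response settling time of a single-pole system (an ASSUMED model)."""
    tau_ps = 1000.0 / (2.0 * math.pi * bandwidth_ghz)   # time constant in ps
    return -tau_ps * math.log(1.0 - settle_fraction)


if __name__ == "__main__":
    budget = settle_budget_ps(ui_ps=35.71, t_cq_ps=26.6, t_setup_ps=3.0)
    settle = settling_time_ps(bandwidth_ghz=35.0)
    print(f"T_settle budget from (4): {budget:.2f} ps")
    print(f"single-pole 95% settling at 35 GHz: {settle:.1f} ps")
    print(f"shortfall if settling started at the clock edge: {settle - budget:.1f} ps")
```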

As demonstrated in Fig.21, taking the *h1* tap in the DFE *0P* slicer as an example, $\beta_1$ is the DFE coefficient of the *h1*



Fig. 19. Simulated clock-to-Q delay of the DCMLC with distinct differential output swing over PVT.



Fig. 20. (a) Simulated $V_{MS}(\tau)$ of the proposed DCMLC at 28Gbaud; (b) Simulated ISF of the proposed DCMLC at 28Gbaud.

tap. When *C0P* is "1", the DCMLC in the DFE *0P* slicer enters the tracking mode and tracks both the input data and the tap data. The h1 tap data comes from the DFE *270P* slicer (Fig.16), and the clocks *C0P* and *C270P* have a 2-UI overlap. Therefore, the timing of the data decision is similar to that of the soft-decision approach [31]. Before the decision clock edge (the falling edge of *C270P*), the DCMLC in the DFE *0P* slicer has already tracked the h1 data (*D*-1 in Fig.21). That is, the h1 tap feedback current has already taken effect 0.5UI before the decision clock



Fig. 21. Timing diagram of h1 loop in the quarter-rate DFE.



Fig. 22. Simulation results of the DFE at 56-Gbps with distinct DFE settings. (a) Input eye diagram of the DFE; (b) Output eye diagram of the DCMLC with all DFE taps disabled; (c) Output eye diagram of the DCMLC with h1 tap enabled and others disabled; (d) Output eye diagram of the DCMLC with h1, h2 taps enabled and others disabled; (e) Output eye diagram of the DCMLC with h1, h2, h3 taps enabled and others disabled; (f) Output eye diagram of the DCMLC with h1, h2, h3, h4 taps enabled.

edge, which effectively reduces the timing requirement for DFE summer settling. In addition, the settling of DFE summer is carried out simultaneously with the DCMLC regeneration process. Therefore, the *h1* time constraint can be considered as  $T_{cq} + T_{setup} < 1UI$ , as shown in Fig.21. The *h1* loop can be closed at 56-Gbps PAM-4.

Even if $T_{settle}$ does not fully meet the h1 loop timing constraint, as previously stated, the h1 loop can still be closed as long as the feedback signal arrives within the sampling aperture (9ps after the clock edge). However, this situation decreases the DFE summing accuracy because the settling of the DFE summer is not fully complete.

To verify that the direct 4-tap DFE loops designed with the DCMLC can be closed, we simulate the 4-tap DFE at circuit level, and the simulation results are presented in Fig.22. As shown in Fig.22 (a), the input PAM-4 signal, generated from two PRBS7 NRZ signals, has been filtered by a $0.6+0.2Z^{-1}+0.1Z^{-2}+0.05Z^{-3}+0.05Z^{-4}$ channel. Fig.22 (b)-(f) show the simulated eye diagrams at 56-Gbps without/with DFE taps. It can be seen that the eye opening expands progressively as the h1, h2, h3, and h4 taps are turned on, from 18mV with only the h1 tap on to 80mV with the h1-h4 taps on.
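The trend in Fig.22 can be reproduced qualitatively with a worst-case (peak-distortion) estimate on the same channel polynomial. The sketch below is a behavioral idealization and not the circuit-level simulation: it assumes normalized PAM-4 levels of ±1 and ±3, error-free decisions, and perfectly matched tap weights, and the function name is invented for illustration. Its normalized outputs are therefore not comparable to the 18mV and 80mV values above.

```python
# Worst-case eye height of a PAM-4 signal through the channel used in the
# simulation, with an IDEAL DFE cancelling the first N post-cursors.
# Levels, ideal cancellation, and names are assumptions for illustration.

CHANNEL = [0.6, 0.2, 0.1, 0.05, 0.05]   # main cursor, then post-cursors h1..h4
PAM4_LEVELS = [-3, -1, 1, 3]            # normalized symbol levels


def worst_case_eye_height(channel, dfe_taps_enabled):
    """Peak-distortion eye height with the first N post-cursors cancelled."""
    main, post = channel[0], channel[1:]
    level_spacing = 2 * main                        # gap between adjacent levels
    residual = post[dfe_taps_enabled:]              # ISI the DFE does not remove
    peak_symbol = max(abs(level) for level in PAM4_LEVELS)
    peak_isi = peak_symbol * sum(abs(h) for h in residual)
    return level_spacing - 2 * peak_isi             # negative means a closed eye


if __name__ == "__main__":
    for n_taps in range(len(CHANNEL)):
        height = worst_case_eye_height(CHANNEL, n_taps)
        state = "closed" if height <= 0 else "open"
        print(f"{n_taps} DFE tap(s): worst-case eye height {height:+.2f} ({state})")
```

In this idealized model the eye is closed without the DFE and opens step by step as each further tap is enabled, consistent with the progression from Fig.22 (b) to (f).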

# IV. EXPERIMENTAL RESULTS

The receiver prototype chip is fabricated in 65nm CMOS technology. Fig.23 (a) shows the measurement setup. The chip is bonded on a PCB, where the 1/8 baud rate data, clock, configuration signals, and power supply are wire-bonded to the PCB, and the half-rate input clock and full-speed differential PAM-4 data are AC coupled to the receiver chip through the probe. A high-speed clock and pattern generator (Anritsu MP1900A) generates the half-rate clock and PAM-4 data signals (PRBS7) and transmits them to the chip through cables and probes. The channel loss is mainly contributed by the cables (the 6.5m cable shown in Fig.23 (a)) and is measured to be 16.78dB and 20.17dB at 10GHz and 14GHz, respectively, excluding the probe. One of the 1/8 rate differential data/clock outputs is cabled and AC coupled to a BERT (Tektronix BSX 320) for the BER measurement. The others are AC coupled to the oscilloscope (Keysight DSAZ594A) to monitor the recovered clock and data signals.

Fig.24 (a) and (c) show the pre-channel eye diagrams of the 40-Gbps and 56-Gbps PAM-4 signals with a 1.2-V<sub>ppd</sub> setting, respectively. Even with the 1-tap FFE equalization (0.6dB) of the pattern generator, the eye diagrams of the PAM-4 signals are completely closed after the channel (Fig.24 (b) and (d)).

Fig.25 demonstrates the receiver performance measured with the cable channel. Since there are no dedicated eye-scan samplers, the eye diagram at the DFE sampling node is obtained by scanning the DFE sampling clock phase while simultaneously scanning the thresholds of the data samplers of the middle, upper, and lower eyes. The non-linearity of the PAM-4 eye diagram in Fig.25 is caused by various non-idealities in the actual measurement environment, such as channel mismatch, level mismatch in the AFE path, and mismatch between data samplers.

To verify the effectiveness of the direct DFE loops, the BER bathtubs and four sets of BER eye scans in Fig.25 are measured by turning the DFE taps on and off at 40-Gbps and 56-Gbps, respectively. The progressively increasing eye opening and lower BER graphically verify the effectiveness of the direct 4-tap DFE. At 40-Gbps, as shown in Fig.25 (a) (only CTLE is used) and (b) (with CTLE and 4-tap DFE), the horizontal opening of the eye for BER = 1E-12 is improved from 0.19UI to 0.38UI with the help of the DFE taps. Also, the minimum vertical opening of the eye for BER = 1E-12 at the center is expanded from 17 to 57, measured as a difference in the DAC control code (which maps linearly to voltage). The 4-tap DFE coefficients are estimated to be $\beta_1 = -0.03$, $\beta_2 = -0.1$, $\beta_3 = -0.009$, and $\beta_4 = -0.005$, relative to $\beta_0$. When the receiver works at 56-Gbps, from Fig.25 (c) (only CTLE is used) to Fig.25 (d) (DFE taps are enabled), a BER improvement of more than four orders of magnitude is achieved. The DFE coefficients are estimated to be $\beta_1 = -0.115$, $\beta_2 = 0$, $\beta_3 = -0.09$, and $\beta_4 = -0.006$. Moreover, the recovered clock and data are measured at 1/8 baud rate, as shown in Fig.25 (g) and (h).

The chip micrograph is shown in Fig.23 (a), in which the key building blocks are highlighted, including the AFE, the 4-tap DFE with the proposed comparators, the CDR logic and the data demultiplexer (CDR), the PI, the CML



Fig. 23. (a) Measurement setup; (b) Power breakdown of the receiver.

TABLE II Performance Summary and Comparison

|                  | This work          | [10]                                  | [7]                  | [32]                | [16]               | [13]                               | [5]                             |
|------------------|--------------------|---------------------------------------|----------------------|---------------------|--------------------|------------------------------------|---------------------------------|
| Technology       | 65nm               | 7nm                                   | 16nm                 | 40nm                | 40nm               | 65nm                               | 65nm                            |
| Architecture     | Analog             | Analog                                | Analog               | Analog              | Analog             | Analog                             | ADC                             |
| Data rate        | 56-Gbps            | 56-Gbps                               | 56-Gbps              | 40-Gbps             | 56-Gbps            | 56-Gbps                            | 52-Gbps                         |
| Comparator       | DCMLC              | SAC                                   | SAC                  | CMLC                | CMLC               | CMLC                               | -                               |
| Equalization     | CTLE<br>+4-tap DFE | CTLE<br>+10-tap DFE<br>with h1 unloop | CTLE<br>+ 10-tap DFE | CTLE<br>+ 2-tap DFE | CTLE<br>+3-tap DFE | CTLE + 1-tap<br>FIR & 1-tap<br>IIR | CTLE<br>+ 3-tap<br>embedded FFE |
| Log(BER)         | -10                | -6                                    | -12                  | -7                  | -12                | -12                                | -6                              |
| Supply (V)       | 1.2                | 1,1.2                                 | 0.9,1.2,1.8          | 1                   | 1                  | 1.2                                | 1,1.2                           |
| Channel          | 20dB@14G           | 25dB@14G                              | 10dB@14G             | 10dB@10G            | 24dB@14G           | 20dB@14G                           | 31dB@13G                        |
| Power/DFE (mW)   | 266/160            | 378/189                               | 230*/129             | 242*/195            | 382/-              | 259/166                            | 419/-                           |
| Power efficiency | 4.75pJ/bit         | 8.04pJ/bit                            | 4.12pJ/bit           | 6.05pJ/bit          | 6.82pJ/bit         | 4.63pJ/bit                         | 8.06pJ/bit                      |
| FOM(pJ/bit/dB)** | 0.24               | 0.32                                  | 0.41                 | 0.61                | 0.28               | 0.23                               | 0.41                            |

\*Excluding CDR.

\*\*FOM (pJ/bit/dB)=Power/Data rate/Channel loss.



Fig. 24. (a) Measured 40-Gbps PAM-4 data eyes before channel; (b) Measured 40-Gbps PAM-4 data eyes after channel; (c) Measured 56-Gbps PAM-4 data eyes before channel; (d) Measured 56-Gbps PAM-4 data eyes after channel.

driver of 1/8 rate data and clock (*tb\_buf*), CKBUFs, the clock input path (*clk\_buf*), and the voltage DAC (*VDAC*) blocks. The total chip area measures 1.4mm  $\times$  1.6 mm.

The receiver consumes 266 mW at 56-Gbps for 20.17 dB insertion loss, including the high-speed local clock buffers and PI. Fig.23 (b) details the power breakdown of the receiver, where the DFE and clock buffers consume most of the power (85%). An energy efficiency of 4.75 pJ/bit is achieved at 56-Gbps.
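The energy efficiency and the FOM listed for this work in Table II follow directly from these numbers and the FOM definition in the table footnote. The short sketch below reproduces the arithmetic; the function names are assumptions introduced for illustration.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy efficiency in pJ/bit, since mW divided by Gb/s equals pJ/bit."""
    return power_mw / data_rate_gbps


def fom_pj_per_bit_per_db(power_mw: float, data_rate_gbps: float,
                          channel_loss_db: float) -> float:
    """FOM as defined under Table II: Power / Data rate / Channel loss."""
    return energy_per_bit_pj(power_mw, data_rate_gbps) / channel_loss_db


if __name__ == "__main__":
    efficiency = energy_per_bit_pj(power_mw=266.0, data_rate_gbps=56.0)
    fom = fom_pj_per_bit_per_db(266.0, 56.0, channel_loss_db=20.17)
    print(f"energy efficiency: {efficiency:.2f} pJ/bit")
    print(f"FOM: {fom:.2f} pJ/bit/dB")
```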

The performance comparison with state-of-the-art works is summarized in Table II. The DFE in this work consumes less power than the CMLC-based DFEs in [13], [32]. Also, the DCMLC enables a 4-tap DFE, in contrast to the design in [13]. Compared with the DFE using SACs in [7], the same data rate is realized in an older 65-nm process. The power consumption of the direct DFE in this work is lower than that of the speculative DFE in [10], which is fabricated in a 7nm FinFET process. Moreover, better receiver power efficiency is achieved relative to [10], [32], [16], [5]. Compared with [7], the presented work achieves the same data rate and stronger equalization in a 65nm CMOS process.



Fig. 25. PAM-4 eye-diagram and BER bathtub curves. (a) Measured eye scan at 40-Gbps PRBS7 PAM-4 with only CTLE; (b) Measured eye scan at 40-Gbps PRBS7 PAM-4 with CTLE and all DFE taps; (c) Measured eye scan at 56-Gbps PRBS7 PAM-4 with only CTLE; (d) Measured eye scan at 56-Gbps PRBS7 PAM-4 with CTLE and all DFE taps; (e) Measured BER bathtub curves at 40-Gbps; (f) Measured BER bathtub curves at 56-Gbps; (g) Measured 1/8 baud rate eyes of recovered clock and data at 40-Gbps PRBS7 PAM-4; (h) Measured 1/8 baud rate eyes of recovered clock and data at 56-Gbps PRBS7 PAM-4.

# V. CONCLUSION

A dynamic CML comparator is proposed and designed to reduce the clock-to-Q delay of the conventional CML comparator. It applies dynamic logic to make the load resistance in the tracking mode independent of that in the regeneration mode of the CMLC. The proposed DCMLC reduces the delay by 36% compared with the traditional CMLC and effectively addresses the timing constraint of the first tap in the DFE. A quarter-rate PAM-4 receiver with a 4-tap DFE, employing the proposed DCMLC, benefits from the relaxed settling time constraint with the reduced decision delay. The prototype fabricated in the 65nm CMOS process achieves a power efficiency of 4.75 pJ/bit at 56-Gbps over a channel with 20.17dB loss at the Nyquist frequency, demonstrating an energy-efficient PAM-4 receiver with a 4-tap DFE.

## REFERENCES

- [1] CEI-56G-VSR\_PAM4 Very Short Reach Interface, document OIF 2014.230.07, Optical Internetworking Forum, Jun. 2016.
- [2] IEEE P802.3bs 200 Gb/s and 400 Gb/s Ethernet Task Force. Accessed: Nov. 2016. [Online]. Available: http://www.ieee802.org/3/bs/

- [3] Y. Frans *et al.*, "A 56-Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
- [4] T. Ali et al., "6.2 A 460 mW 112Gb/s DSP-based transceiver with 38dB loss compensation for next-generation data centers in 7nm FinFET technology," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2020, pp. 118–120.
- [5] S. Kiran, S. Cai, Y. Luo, S. Hoyos, and S. Palermo, "A 52-Gb/s ADC-based PAM-4 receiver with comparator-assisted 2-bit/stage SAR ADC and partially unrolled DFE in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 3, pp. 659–671, Mar. 2019.
- [6] P. Upadhyaya *et al.*, "A fully adaptive 19-to-56Gb/s PAM-4 wireline transceiver with a configurable ADC in 16nm FinFET," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 108–110.
- [7] J. Im et al., "A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
- [8] K.-C. Chen, W. W.-T. Kuo, and A. Emami, "A 60-Gb/s PAM4 wireline receiver with 2-tap direct decision feedback equalization employing track-and-regenerate slicers in 28-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 56, no. 3, pp. 750–762, Mar. 2021.
- [9] L. Tang, W. Gai, L. Shi, X. Xiang, K. Sheng, and A. He, "A 32Gb/s 133 mW PAM-4 transceiver with DFE based on adaptive clock phase and threshold voltage in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 114–116.

- [10] W.-C. Chen, S.-C. Yang, Y.-N. Shih, W.-H. Huang, C.-C. Tsai, and K. C.-H. Hsieh, "A 56Gb/s PAM-4 receiver with voltage pre-shift CTLE and 10-tap DFE of tap-1 speculation in 7nm FinFET," in *Proc. Symp. VLSI Circuits*, Jun. 2019, pp. C272–C273.
- [11] A. Cevrero et al., "6.1 A 100Gb/s 1.1pJ/b PAM-4 RX with dual-mode 1-tap PAM-4/ 3-tap NRZ speculative DFE in 14nm CMOS FinFET," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2019, pp. 112–114.
- [12] J. Lee et al., "Design of 56 Gb/s NRZ and PAM4 SerDes transceivers in CMOS technologies," *IEEE J. Solid-State Circuits*, vol. 50, no. 9, pp. 2061–2073, Sep. 2015.
- [13] A. Roshan-Zamir *et al.*, "A 56-Gb/s PAM4 receiver with low-overhead techniques for threshold and edge-based DFE FIR- and IIR-tap adaptation in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 3, pp. 672–684, Mar. 2019.
- [14] Y. Lu and E. Alon, "Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
- [15] D. Cui et al., "3.2 A 320 mW 32Gb/s 8b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 58–59.
- [16] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, "6.1 A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 110–111.
- [17] E. Mammei *et al.*, "Analysis and design of a power-scalable continuous-time FIR equalizer for 10 Gb/s to 25 Gb/s multi-mode fiber EDC in 28 nm LP CMOS," *IEEE J. Solid-State Circuits*, vol. 49, no. 12, pp. 3130–3140, Dec. 2014.
- [18] P. W. de Abreu Farias Neto *et al.*, "A 112–134-Gb/s PAM4 receiver using a 36-way dual-comparator TI-SAR ADC in 7-nm FinFET," *IEEE Solid-State Circuits Lett.*, vol. 3, pp. 138–141, 2020.
- [19] B. Dehlaghi *et al.*, "A 1.41-pJ/b 56-Gb/s PAM-4 receiver using enhanced transition utilization CDR and genetic adaptation algorithms in 7-nm CMOS," *IEEE Solid-State Circuits Lett.*, vol. 2, no. 11, pp. 248–251, Nov. 2019.
- [20] J. L. Sonntag and J. Stonick, "A digital clock and data recovery architecture for multi-gigabit/s binary links," *IEEE J. Solid-State Circuits*, vol. 41, no. 8, pp. 1867–1875, Aug. 2006.
- [21] T. Kobayashi, K. Nogami, T. Shirotori, Y. Fujimoto, and O. Watanabe, "A current-mode latch sense amplifier and a static power saving input buffer for low-power architecture," in *Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 1992, pp. 28–29.
- [22] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta, "A double-tail latch-type voltage sense amplifier with 18ps Setup+Hold time," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, Feb. 2007, pp. 314–605.
- [23] P. A. Francese et al., "23.6 A 30Gb/s 0.8pJ/b 14nm FinFET receiver data-path," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 408–409.
- [24] K. C. Chen and A. Emami, "A 25-Gb/s avalanche photodetector-based burst-mode optical receiver with 2.24-ns reconfiguration time in 28-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 6, pp. 1682–1693, Jun. 2019.
- [25] I. Ozkaya et al., "A 64-Gb/s 1.4-pJ/b NRZ optical receiver data-path in 14-nm CMOS FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3458–3473, Dec. 2017.
- [26] J. Kim, B. S. Leibowitz, J. Ren, and C. J. Madden, "Simulation and analysis of random decision errors in clocked comparators," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 8, pp. 1844–1857, Aug. 2009.
- [27] S. Ibrahim and B. Razavi, "Low-power CMOS equalizer design for 20-Gb/s systems," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1321–1336, Jun. 2011.
- [28] D. Lee, J. Han, G. Han, and S. M. Park, "10 Gbit/s 0.0065 mm<sup>2</sup> 6 mW analogue adaptive equaliser utilising negative capacitance," *Electron. Lett.*, vol. 45, no. 17, p. 863, 2009.
- [29] M. Jeeradit *et al.*, "Characterizing sampling aperture of clocked comparators," in *Proc. IEEE Symp. VLSI Circuits*, Jun. 2008, pp. 68–69.
- [30] H. O. Johansson and C. Svensson, "Time resolution of NMOS sampling switches used on low-swing signals," *IEEE J. Solid-State Circuits*, vol. 33, no. 2, pp. 237–245, Feb. 1998.
- [31] K. L. J. Wong, A. Rylyakov, and C. K. K. Yang, "A 5-mW 6-Gb/s quarter-rate sampling receiver with a 2-tap DFE using soft decisions," *IEEE J. Solid-State Circuits*, vol. 42, no. 4, pp. 881–888, Apr. 2007.
- [32] C.-T. Hung, Y.-P. Huang, and W.-Z. Chen, "A 40 Gb/s PAM-4 receiver with 2-tap DFE based on automatically non-even level tracking," in *Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC)*, Nov. 2018, pp. 213–214.



**Dengjie Wang** was born in Hebei, China. He received the B.S. degree in microelectronics from Xidian University, Xi'an, China, in 2015. He is currently pursuing the Ph.D. degree with the School of Integrated Circuits, Tsinghua University. His research interest includes high-speed wireline communication systems.



**Ziqiang Wang** was born in Beijing, China, in 1975. He received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, in 1999 and 2006, respectively. Since 2015, he has been an Associate Professor with the School of Integrated Circuits. He currently works as a Research Assistant with the Institute of Microelectronics, Tsinghua University. His research interest includes analog circuit design.



**Hao Xu** received the B.S. degree from Zhejiang University, China, in 2020. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University. His research interest includes high-speed wireline communication systems.



**Jiawei Wang** received the B.S. degree from the University of Electronic Science and Technology of China, China, in 2019. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University. His research interest includes high-speed wireline communication systems.



**Zeliang Zhao** received the B.S. degree from Xi'an Jiaotong University, China, in 2019. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University. His research interest includes high-speed wireline communication systems.



**Chun Zhang** (Senior Member, IEEE) received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1995 and 2000, respectively.

Since 2000, he has been with Tsinghua University, where he was with the Department of Electronic Engineering from 2000 to 2004. Since 2005, he has been an Associate Professor with the School of Integrated Circuits. His research interests include mixed-signal integrated circuits and systems, embedded microprocessor design, digital signal processing, and radio-frequency identification.



**Zhihua Wang** (Fellow, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983, 1985, and 1990, respectively.

He was a Visiting Scholar with CMU, Pittsburgh, PA, USA, from 1992 to 1993, and KU Leuven, Leuven, Belgium, from 1993 to 1994. He has been working as a Full Professor and the Deputy Director of the School of Integrated Circuits, Tsinghua University, since 1997 and 2000, respectively. From September 2014 to March 2015, he was a Visiting Professor with HKUST, Hong Kong. He has coauthored 13 books/chapters, more than 197 (514) articles in international journals (conferences), over 246 (29) articles in Chinese journals (conferences). He holds 118 Chinese and nine U.S. patents. His current research interests include CMOS RFIC and biomedical applications, involving RFID, PLL, low-power wireless transceivers, and smart clinic equipment combined with leading edge RFIC and digital image processing techniques.

Dr. Wang has served as a Technology Program Committee Member for the IEEE ISSCC from 2005 to 2011 and has been a Steering Committee Member for the IEEE A-SSCC since 2005. He has served as the Chairperson for the IEEE SSCS Beijing Chapter from 1999 to 2009, and an AdCom Member for the IEEE SSCS from 2016 to 2019. He has served as the Technical Program Chair for A-SSCC 2013, a Guest Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS Special Issues in December 2006, December 2009, and November 2014, an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: REGULAR PAPERS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, and the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, and other administrative/expert committee positions in China's national science and technology projects.



**Hong Chen** (Senior Member, IEEE) received the Ph.D. degree from the Department of Electronic Engineering, Tsinghua University, in 2005. From 2005 to 2007, she worked as a Post-Doctoral Fellow with the Institute of Microelectronics of Tsinghua University (IMETU). Since 2007, she has been working with IMETU. She worked as a Visiting Scholar at the Medical Center, Nebraska University, and the Department of Electronics and Computer Engineering, Georgia Tech, in 2006 and 2016, respectively. She has been with the School of Integrated Circuits since 2021, where she is currently an Associate Professor. Her research interests include monitoring-system design for TKR/THR surgery, low-power digital integrated-circuit design, asynchronous circuit design, PZT power electronics, low-power mixed-signal SoC design, and serial transceiver.