

# A 7.5-mW 10-Gb/s 16-QAM Wireline Transceiver with Carrier Synchronization and Threshold Calibration for Mobile Inter-chip Communications in 16-nm FinFET

Jieqiong Du<sup>†</sup> University of California, Los Angeles, USA <u>du.jieqiong@ucla.edu</u>

Wei-Han Cho University of California, Los Angeles, USA <u>weihan.cho@ucla.edu</u>

Po-Tsang Huang National Chiao Tung University, Taiwan <u>bug.ee91g@nctu.edu.tw</u> Special Session Paper

Chien-Heng Wong University of California, Los Angeles, USA <u>kenonearth@ucla.edu</u>

Yilei Li University of California, Los Angeles, USA <u>ylli1986@ucla.edu</u>

Sheau-Jiung Lee TSVLink Corp, Santa Clara, USA <u>sjlee@tsvlink.com</u>

Yo-Hao Tu National Central University, Taoyuan, Taiwan <u>100581002@cc.ncu.edu.tw</u>

Yuan Du University of California, Los Angeles, USA <u>yuandu@ucla.edu</u>

Mau-Chung Frank Chang University of California, Los Angeles, USA <u>mfchang@ee.ucla.edu</u>

# ABSTRACT

A compact, energy-efficient 16-QAM wireline transceiver with carrier synchronization and threshold calibration is proposed to leverage high-density fine-pitch interconnects. Utilizing frequency-division multiplexing, the transceiver transfers four-bit data through one RF band to reduce intersymbol interference. A forwarded clock is transmitted simultaneously with the data through the same interconnect to enable low-power, PVT-insensitive symbol clock recovery. A carrier synchronization algorithm is proposed to overcome nontrivial current and phase mismatches by including DC offset calibration and dedicated I/Q phase adjustments. Along with this carrier synchronization, a threshold calibration process allows the transceiver to tolerate channel and circuit variations. The transceiver, implemented in 16-nm FinFET, occupies only 0.006 mm<sup>2</sup> and achieves 10 Gb/s with 0.75-pJ/bit efficiency and <2.5-ns latency.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from <u>Permissions@acm.org</u>. *NOCS '19*, October 17–18, 2019, New York, NY, USA

© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6700-4/19/10 \$15.00

https://doi.org/10.1145/3313231.3352381

### **CCS CONCEPTS**

#### • Hardware → Interconnect

# **KEYWORDS**

Frequency-division Multiplexing (FDM), Quadrature-amplitude Modulation (QAM), carrier synchronization, threshold calibration, wireline, forwarded clock.

#### **ACM Reference format:**

Jieqiong Du, Chien-Heng Wong, Yo-Hao Tu, Wei-Han Cho, Yilei Li, Yuan Du, Po-Tsang Huang, Sheau-Jiung Lee and Mau-Chung Frank Chang. 2019. A 7.5-mW 10-Gb/s 16-QAM Wireline Transceiver with Carrier Synchronization and Threshold Calibration for Mobile Inter-chip Communications in 16-nm FinFET. In *International Symposium on Networks-on-Chip (NOCS'19), October 17-18, 2019, New York, NY, USA.* ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3313231.3352381

# 1 Introduction

Energy efficiency, bandwidth, area, and latency are key design parameters for mobile inter-chip communications. To support ever-growing data traffic, advanced packaging technologies such as integrated fan-out (InFO) have been introduced to provide high pin counts and high-density integration [1]. To take full advantage of these packaging technologies, the input-output (I/O) circuitry must likewise be fast, compact, energy-efficient, and low-latency.



Figure 3. TX output spectrum (horizontal axis: frequency in GHz; the data DQ[3:0] occupy the 5 GHz to 10 GHz band).

Frequency-division multiplexing (FDM) wireline transceivers show great potential to satisfy these requirements, according to recent publications [2,3]. Unlike conventional time-division multiplexing (TDM) transceivers that serialize data in time, FDM transceivers transfer multiple low-speed data streams simultaneously through orthogonal frequencies, avoiding high-speed, power-hungry serializers/deserializers. The double-sideband signaling used in FDM transceivers also self-equalizes the channel and enables high-speed communication without additional equalization circuitry [2]. Because they need neither the high-speed serializers/deserializers nor the equalizers that are common in TDM transceivers and that add energy consumption and latency, FDM transceivers can be energy- and area-efficient with low latency. In [3], a tri-band 16-QAM transceiver achieved an energy efficiency of 0.95 pJ/bit while occupying only 0.01 mm<sup>2</sup> in 28-nm CMOS.

However, prior FDM architectures faced the challenges of a high linearity requirement and high inter-band interference owing to the adoption of multiple frequency bands [3,6]. In addition, two issues not handled in previous works must be resolved in coherent multi-level modulated systems: 1) carrier synchronization and 2) threshold calibration under channel and circuit variations.

To address these issues, this paper proposes a 16-QAM transceiver with carrier synchronization and threshold calibration that occupies a small area and consumes little energy. The FDM transceiver transfers data over one RF band at 7.5 GHz with a symbol rate of 2.5 GBaud. The symbol clock is forwarded along with the data through the same channel and identical circuit paths, which provides robust clock and data tracking. Efficient carrier synchronization and threshold calibration methods are also introduced to improve the transceiver performance. With fully differential current-mode signaling, the transceiver consumes 7.5 mW at 10 Gb/s and occupies only 0.006 mm<sup>2</sup> of die area with a latency below 2.5 ns.
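The headline figures above are related by simple arithmetic, which the following minimal sketch reproduces. The constants are taken from the text; the function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the text.
BITS_PER_SYMBOL = 4        # 16-QAM carries log2(16) = 4 bits per symbol
SYMBOL_RATE_HZ = 2.5e9     # symbol rate stated in the text
POWER_W = 7.5e-3           # power at 10 Gb/s stated in the text


def data_rate_bps(bits_per_symbol, symbol_rate_hz):
    """Aggregate data rate of one lane."""
    return bits_per_symbol * symbol_rate_hz


def energy_per_bit_pj(power_w, rate_bps):
    """Energy efficiency in pJ/bit."""
    return power_w / rate_bps * 1e12


rate = data_rate_bps(BITS_PER_SYMBOL, SYMBOL_RATE_HZ)
unit_interval_ps = 1e12 / SYMBOL_RATE_HZ
print(rate / 1e9, "Gb/s")
print(energy_per_bit_pj(POWER_W, rate), "pJ/bit")
print(unit_interval_ps, "ps per symbol")
```

The same relation gives the 400-ps unit interval quoted later for the transmitter.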

#### 2 System Architecture of the Transceiver

#### 2.1 System Overview

As shown in Figure 1, the system architecture includes a transmitter (TX) that performs 16-QAM modulation and clock-data combining, a receiver (RX) that demodulates the data and recovers the forwarded clock, and a carrier generation block for TX/RX In-phase (I)/Quadrature-phase (Q) carrier generation and distribution. A carrier-phase and comparator-threshold controller adjusts the receiver carrier phases and comparator thresholds.

At the transmitter, data comes from either a PRBS generator or a set of pre-defined symbols for calibration purposes. At full speed, the PRBS generator runs at 2.5 GHz. Two current digital-to-analog converters map the data to 16-QAM symbols. The baseband symbols are modulated to the RF band by two orthogonal I/Q carriers at 7.5 GHz. Along with the modulated signals, a clock that toggles at one half the symbol rate is transmitted through the same channel for receiver clock recovery. Figure 2 shows the baseband symbol timing. Figure 3 shows the transmitter output spectrum.
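The signal path described above can be illustrated with an idealized numerical model. This is a sketch only: the binary bit-to-level mapping, the sample count, the unit amplitudes, and the integrate-over-one-symbol demodulator are assumptions made for illustration, not the circuit's actual implementation.

```python
import math

CARRIER_HZ = 7.5e9         # RF carrier from the text
SYMBOL_RATE_HZ = 2.5e9     # symbol rate from the text
SAMPLES_PER_SYMBOL = 120   # illustrative time resolution (assumption)
DT = 1.0 / (SYMBOL_RATE_HZ * SAMPLES_PER_SYMBOL)


def bits_to_level(b1, b0):
    """Map two bits to -3, -1, +1 or +3 (plain binary order, an assumption)."""
    return 2 * (2 * b1 + b0) - 3


def carrier_phase(n):
    """Carrier phase in radians at sample index n."""
    return 2.0 * math.pi * CARRIER_HZ * n * DT


def modulate(nibble, clock_level=0.0):
    """One symbol period of I*cos - Q*sin plus a baseband clock level."""
    i_lvl = bits_to_level(nibble[0], nibble[1])
    q_lvl = bits_to_level(nibble[2], nibble[3])
    return [
        i_lvl * math.cos(carrier_phase(n))
        - q_lvl * math.sin(carrier_phase(n))
        + clock_level
        for n in range(SAMPLES_PER_SYMBOL)
    ]


def demodulate(wave):
    """Ideal coherent demodulation: mix with I/Q carriers, average over a symbol."""
    i_sum = sum(2.0 * s * math.cos(carrier_phase(n)) for n, s in enumerate(wave))
    q_sum = sum(-2.0 * s * math.sin(carrier_phase(n)) for n, s in enumerate(wave))
    return i_sum / len(wave), q_sum / len(wave)


i_est, q_est = demodulate(modulate((1, 1, 0, 0), clock_level=0.5))
print(round(i_est, 6), round(q_est, 6))
```

Because one 400-ps symbol spans exactly three cycles of the 7.5-GHz carrier, the baseband clock level averages out of both mixer outputs in this idealized model.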

At the receiving end, the combined signal is distributed to three paths. In the data paths, the transmitted data are demodulated by I/Q mixers, low-pass filters, and comparators. An I/Q and polarity swapping block is introduced so that the carrier phase interpolators only need to cover a 90° range for synchronization. In the clock path, the forwarded clock is recovered by removing RF-band interference with a low-pass filter.

This architecture offers several advantages. Transmitting data in an RF band lowers the inter-symbol interference. By transmitting four bits simultaneously, the symbol rate is reduced to ¼ of that of NRZ signaling, which also relaxes the timing constraints for baseband signal processing. Compared to prior FDM transceivers, having only one RF band also helps to reduce the peak-to-average power ratio (PAPR) of the transmitted signal and relaxes the linearity requirement. It also makes the transceiver more compact, reduces the inter-band interference level, and eases the design of the low-pass filters. Therefore, better area and energy efficiency can be achieved than in previous work. In addition, because the clock is transmitted along with the data and passes through identical circuitry, the clock tracks the data without a DLL.

#### 2.2 Carrier Synchronization

For coherently modulated 16-QAM symbols, the carrier phases must be synchronized to recover the information. However, carrier phase offsets change dramatically under channel and circuit variations. A hardware-efficient single-tone synchronization system using a comparator is shown in Figure 4 [5]. During synchronization, only the Q path is turned on and a DC signal is applied to the DAC. The optimal carrier phase is found when the output of either the I-path or the Q-path low-pass filter becomes zero. Figure 5(a) shows an example of the optimal phase sweep on the constellation. As the phase increases from $\theta_1$ to $\theta_2$, the I-amplitude crosses zero; detecting this zero-crossing point with a comparator yields the optimal phase. At the optimal phase, either I<sub>I</sub> or I<sub>Q</sub> is 0 and the phase error reduces to $k \cdot 90^{\circ}$ $(k=0,1,2,3)$. When k is not 0, inverting and swapping the I/Q data outputs in the data path further compensate the phase error by shifting the data phase by 180° and 90°/270°, respectively.
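The zero-crossing search can be sketched with an idealized model in which a DC level sent on the TX Q path appears at the RX low-pass filter outputs as sine and cosine projections of the phase error. The model, the uniform 6-bit step over 90°, and all names below are assumptions for illustration; offsets and noise are ignored.

```python
import math

PI_BITS = 6                          # phase interpolator resolution (assumed 6 bits)
STEP_DEG = 90.0 / (1 << PI_BITS)     # assumed uniform steps across a 90-degree range


def lpf_outputs(rx_phase_deg, channel_phase_deg, amplitude=1.0):
    """Idealized I/Q low-pass filter outputs for a DC level sent on the TX Q path."""
    err = math.radians(rx_phase_deg - channel_phase_deg)
    return amplitude * math.sin(err), amplitude * math.cos(err)


def find_zero_crossing_code(channel_phase_deg):
    """Sweep the interpolator code until either comparator output flips sign."""
    prev = None
    for code in range(1 << PI_BITS):
        i_out, q_out = lpf_outputs(code * STEP_DEG, channel_phase_deg)
        signs = (i_out >= 0.0, q_out >= 0.0)   # one comparator per path
        if prev is not None and signs != prev:
            return code
        prev = signs
    return (1 << PI_BITS) - 1                  # crossing lies past the last code


def residual_error_deg(code, channel_phase_deg):
    """Distance of the remaining phase error from the nearest multiple of 90 degrees."""
    err = (code * STEP_DEG - channel_phase_deg) % 90.0
    return min(err, 90.0 - err)


for theta_d in (37.0, 200.0):
    code = find_zero_crossing_code(theta_d)
    print(theta_d, code, residual_error_deg(code, theta_d))
```

In this model the residual error stays within one interpolator step of a multiple of 90°, and the remaining k·90° ambiguity is what the I/Q swap and polarity inversion resolve.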

However, mismatches degrade the effectiveness of this algorithm by introducing significant DC offsets at the RX baseband and an I/Q carrier phase mismatch. Figure 5(b) shows the effect on the system, where the zero crossing shifts. To first order, the phase error of the algorithm when current offsets are present is $\arcsin(I_{err}/I_{sig,max})$. When $I_{err}/I_{sig,max}$ is larger than 0.1, this phase error can exceed 5°. Mismatches in the DACs, mixers, low-pass filters, and comparators, as well as clock feedthrough, all contribute to receiver output DC offsets. However, the offset in the TX DAC is in phase or anti-phase with the transmitted signal, so its contribution is zero at the optimal phase. Offsets from the receiver circuitry are more significant, but they are quasi-invariant and can therefore be cancelled by a foreground calibration. Note that I/O buffer mismatches do not contribute to the output DC offset since they are modulated to the RF frequency. On the other hand, the carrier I/Q phase mismatch requires that the phases of the I and Q paths be calibrated separately.
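The first-order error expression is easy to evaluate numerically. The short sketch below, with an illustrative function name, shows how the phase error grows with the offset-to-signal ratio.

```python
import math


def phase_error_deg(offset_ratio):
    """First-order phase error arcsin(I_err / I_sig,max), in degrees."""
    return math.degrees(math.asin(offset_ratio))


for ratio in (0.02, 0.05, 0.1, 0.2):
    print(ratio, round(phase_error_deg(ratio), 2))
```

At a ratio of 0.1 the result is about 5.7°, consistent with the "more than 5°" figure above.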

During the offset calibration shown in Figure 6, the transmitter is turned off to remove transmitter mismatches and the receiver carrier generation is turned on. In this way, I/O buffer mismatches are removed, and the offset resulting from clock feedthrough is included. The offset is then compensated by tuning the comparator threshold until the sampled digital outputs are logic high for around 50% of the time. The complete flow includes three steps: 1) receiver DC offset calibration; 2) Q-path synchronization; and 3) I-path synchronization. To deal with receiver I/Q phase mismatches, the I and Q carrier phases are calibrated separately using the above zero-crossing detection method, which minimizes I/Q interference in each path. The overall calibration flowchart is shown in Figure 7.
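The 50%-duty criterion used for offset cancellation can be sketched as a threshold sweep over a noisy comparator. The Gaussian noise model, the threshold grid, the sample count, and the arbitrary units are all assumptions for illustration.

```python
import random


def fraction_high(offset, threshold, noise_sigma, n_samples, rng):
    """Fraction of sampled comparator outputs that are logic high."""
    highs = sum(
        1 for _ in range(n_samples)
        if offset + rng.gauss(0.0, noise_sigma) > threshold
    )
    return highs / n_samples


def calibrate_offset(offset, thresholds, noise_sigma=1.0, n_samples=2000, seed=1):
    """Pick the threshold whose output is closest to being high 50% of the time."""
    rng = random.Random(seed)   # fixed seed keeps the sketch repeatable
    return min(
        thresholds,
        key=lambda th: abs(
            fraction_high(offset, th, noise_sigma, n_samples, rng) - 0.5
        ),
    )


# Hypothetical threshold DAC grid: -8 to +8 in steps of 0.5 (arbitrary units).
dac_thresholds = [k * 0.5 for k in range(-16, 17)]
best = calibrate_offset(offset=2.3, thresholds=dac_thresholds)
print(best)
```

The selected threshold lands within about one DAC step of the true offset, which is the resolution limit of this kind of search.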

The implementation of the algorithm employs two phase interpolators (each covering 90° phase range), low-speed DACs setting the comparator threshold, and a finite state machine.



Figure 4. Simplified system of single-tone carrier synchronization.

Figure 5. Constellation of single-tone synchronization, plotted as quadrature-phase amplitude versus in-phase amplitude: (a) without DC offset; (b) with DC offset.



Figure 6. Simplified offset cancellation schematics and illustration.

### 2.3 Threshold Calibration

In addition to carrier synchronization, the comparator thresholds must be calibrated to compensate for circuit and channel loss variations. A foreground calibration algorithm using low-speed DACs is proposed here. Because carrier phase offsets affect the signal amplitude, carrier synchronization must be performed before the comparator thresholds are calibrated.

The algorithm is developed based on these factors: 1) quasi-invariant signal amplitudes; 2) linear FDM system operations; and 3) small inter-symbol interferences. The threshold calibrations for I and Q paths are performed separately.

When calibrating the I path, the Q path is turned off in the TX to remove possible I/Q interference caused by residual carrier phase errors after synchronization. Four signal levels ( $I_3 > I_1 > I_{-1} > I_{-3}$ ) are then sent on the TX I path, each for a period. During each phase, for example when $I_3$ is sent, the threshold $I_H$ of the comparator that distinguishes the $I_{3, RX}$ and $I_{1, RX}$ levels is swept to find the value at which the sampled comparator output is closest to being high 50% of the time. This threshold value is recorded as $I_{HH}$ and is expected to equal $I_{3, RX}$. When $I_1$ is sent, the same threshold is tuned to find another value $I_{HL}$ that matches $I_{1, RX}$. The resulting threshold is the average of the two, $I_H = (I_{HH} + I_{HL})/2$.
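A minimal sketch of this procedure follows. The text describes only the upper threshold $I_H$; extending the same midpoint rule to the middle and lower comparators, as well as the noise model, DAC grid, and channel gain used here, are assumptions for illustration.

```python
import random

LEVELS = (3, 1, -1, -3)   # transmitted I-path levels, in arbitrary units


def estimate_level(rx_level, codes, rng, noise_sigma=0.05, n_samples=500):
    """Sweep the threshold and return the code nearest a 50% high output."""
    def high_fraction(threshold):
        hits = sum(
            rx_level + rng.gauss(0.0, noise_sigma) > threshold
            for _ in range(n_samples)
        )
        return hits / n_samples

    return min(codes, key=lambda th: abs(high_fraction(th) - 0.5))


def calibrate_thresholds(channel_gain, seed=7):
    """Return (I_H, I_M, I_L) as midpoints between adjacent received levels."""
    rng = random.Random(seed)
    codes = [k * 0.05 for k in range(-80, 81)]   # hypothetical threshold DAC grid
    est = {lvl: estimate_level(channel_gain * lvl, codes, rng) for lvl in LEVELS}
    i_h = (est[3] + est[1]) / 2.0     # average of I_HH and I_HL, as in the text
    i_m = (est[1] + est[-1]) / 2.0    # assumed: same rule for the middle comparator
    i_l = (est[-1] + est[-3]) / 2.0   # assumed: same rule for the lower comparator
    return i_h, i_m, i_l


print(calibrate_thresholds(channel_gain=0.6))
```

With a hypothetical channel gain of 0.6 the received levels are ±1.8 and ±0.6, so the thresholds settle near +1.2, 0, and -1.2.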



Figure 7. Carrier synchronization flow chart.

# **3** Circuit Design of the Dual-band 16-QAM Transceiver

The transceiver uses fully differential architecture to reduce simultaneous switching noise, crosstalk, and even-order nonlinearity. The transmitter and receiver are detailed as follows.

# 3.1 Transmitter

As shown in Figure 8, the dual-band transmitter consists of three parallel branches: two branches for the in-phase and quadrature-phase data paths that modulate baseband symbols to the RF band, and one branch that forwards the clock for clock recovery at the receiver. Each branch comprises a 2-bit thermometer-coded current-steering digital-to-analog converter (DAC) that maps 2-bit data into 4-level symbols and a double-balanced mixer that upconverts the baseband signal to the RF pass band. A summer follows the mixers, sums the current signals from all three branches, and drives a pair of differential channels. The transmitter thus sends a total of four data bits plus the clock simultaneously through one common electrical lane within one unit interval (400 ps for a total data rate of 10 Gb/s). In addition, to accommodate different channel losses, the transmitter output power can be adjusted by programming the reference current level of the DAC.

To simplify clock recovery at the receiver, the clock path includes a dummy mixer that tracks the time delay of the data path, although no frequency mixing is performed. The clock toggles at one half the symbol rate and lags the data by half a data period, so that the clock transitions fall around the optimum sampling instant for data recovery at the receiver. Because the clock propagates along with the data through the same physical interconnect, the clock signal tracks the data signal in spite of chip or interconnect variations. The clock is therefore self-tracking and requires neither the delay-locked loop nor the other de-skew circuits found in most source-synchronous systems.

### 3.2 Receiver

At the receiving end, a current amplifier amplifies the received current signal and provides a wideband  $100-\Omega$  differential channel termination to reduce reflections. The amplifier then distributes the received current signal to three parallel signal paths. Each path comprises a double-balanced mixer to down-convert the RF signal to baseband, a third-order low-pass filter to remove adjacent-band interference, three parallel continuous-time comparators to decode the received symbol, and a decoder to recover the transmitted data.

The current amplifier employs a gain-reused regulated cascode structure with active-inductor shunt peaking to improve energy efficiency, as Figure 9 shows. A first-order analysis of the circuit reveals that the differential input impedance of the current amplifier is set by $2/g_{m1} - 2/g_{m2}$, the difference between the inverse NMOS and PMOS transconductances. The differential input impedance can therefore be reduced to 100 $\Omega$ without burning too much current in the input stage. In addition, the active-inductor shunt peaking improves the bandwidth of the current amplifier. However, the input impedance is PVT-sensitive; if it deviates too far from 100 $\Omega$, the resulting reflections compromise signal integrity.

Since the impedance is inversely proportional to the transistor transconductance, it is also inversely proportional to the square root of I<sub>bias</sub>. To compensate for PVT variation of the transistors, the bias current of the input current amplifier can be programmed through a low-speed current DAC. A thirty-percent variation of the reference current enables a tuning range of about fifteen percent for the input impedance. Consuming around 1 mA, the current amplifier keeps the input reflection coefficient below -10 dB from DC to 10 GHz.
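The inverse-square-root dependence stated above can be turned into a quick estimate of the tuning range. The sketch assumes the first-order relation holds exactly over the whole bias range; the function name and the sampled bias scales are illustrative.

```python
def relative_impedance(bias_scale):
    """Z / Z_nominal, assuming Z is proportional to 1 / sqrt(I_bias)."""
    return bias_scale ** -0.5


for scale in (0.7, 1.0, 1.3):
    print(scale, round(relative_impedance(scale), 3))
```

Under this assumption, raising the bias by 30% lowers the impedance by roughly 12%, and lowering it by 30% raises the impedance by roughly 20%, which brackets the quoted figure of about fifteen percent.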

A double-balanced mixer follows the current amplifier and down-converts the RF signal to baseband. A double-balanced structure is used at both the transmitter and the receiver to reduce LO leakage. As in the transmitter, a dummy mixer is also present in the receiver clock path to match the time delay of the RF data paths.

Following the double-balanced mixer, a third-order Bessel Gm-C low-pass filter removes the high-frequency interference resulting from the adjacent band and from frequency mixing. A Bessel-type filter is adopted because it provides a relatively constant group delay within the band of interest, which minimizes signal distortion and inter-symbol interference. The filter has a cut-off frequency of 1.5 GHz. Figure 10 shows the simulated eye diagrams at the low-pass filter output.
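The constant-group-delay property can be checked on the textbook third-order Bessel prototype, normalized to unit delay at DC. This sketch uses the standard prototype polynomial rather than the actual Gm-C component values, which are not given in the text, and works in normalized angular frequency.

```python
import cmath


def bessel3_response(omega):
    """Delay-normalized third-order Bessel low-pass prototype H(j*omega)."""
    s = 1j * omega
    return 15.0 / (s**3 + 6.0 * s**2 + 15.0 * s + 15.0)


def group_delay(omega, d_omega=1e-4):
    """Group delay -d(phase)/d(omega) by central difference (valid below the phase wrap)."""
    ph_hi = cmath.phase(bessel3_response(omega + d_omega))
    ph_lo = cmath.phase(bessel3_response(omega - d_omega))
    return -(ph_hi - ph_lo) / (2.0 * d_omega)


for w in (0.0, 0.5, 1.0, 1.5):
    print(w, round(abs(bessel3_response(w)), 4), round(group_delay(w), 4))
```

The delay stays close to its DC value across most of the passband while the magnitude rolls off gently, which is the behavior that motivates the choice of a Bessel response.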

The baseband symbol restored by the low-pass filter is then converted to digital information by three parallel continuous-time comparators. The thresholds of the comparators are set by three different reference currents, which can be programmed to handle different channel attenuations. A decoder then converts the 3-bit signal into 2-bit data, which is later sampled at the clock transition edges.
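The comparator and decoder stage can be sketched as a thermometer-to-binary conversion. The plain binary output order and the example threshold values are assumptions; the text does not specify the bit mapping.

```python
def compare(level, thresholds):
    """Three comparators produce a thermometer code (high, middle, low)."""
    i_h, i_m, i_l = thresholds
    return (level > i_h, level > i_m, level > i_l)


def decode(thermo):
    """Convert the 3-bit thermometer code to 2 bits by counting the ones."""
    count = sum(thermo)                    # 0 to 3 comparators tripped
    return (count >> 1) & 1, count & 1     # assumed plain binary order


example_thresholds = (1.2, 0.0, -1.2)      # hypothetical calibrated values
for rx_level in (1.8, 0.6, -0.6, -1.8):
    print(rx_level, decode(compare(rx_level, example_thresholds)))
```

Counting ones rather than matching exact patterns also gives a defined output for non-monotonic codes, although how the actual decoder treats such cases is not stated in the text.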

# **3.3** Carrier Generation

The carrier generation block uses current-mode logic and provides the TX/RX I/Q carriers by dividing an external RF source. Two 6-bit phase interpolators provide a tunable phase delay for RX I/Q synchronization with 0.8-ps ( $\sim 2^{\circ}$ ) resolution at 7.5 GHz.



Figure 8. Schematics of transmitter.



Figure 9. Schematics of the current amplifier.



Figure 10. Simulated eye diagram at the RX LPF output.



Figure 11. Micrograph of the test chip.

### 4 Measurement Results

A test chip comprising a carrier generation block, a digital baseband controller, and four-lane transceiver front-ends is fabricated in TSMC 16-nm FinFET; the active area per lane is only 0.006 mm<sup>2</sup>. Figure 11 shows the micrograph of the fabricated chip and Figure 12 shows the test environment. The fabricated chip is wire-bonded to a PCB for characterization and is tested with 2<sup>31</sup>-1 PRBS data under two fine-pitch channel conditions: 1-inch and 5-inch FR-4 PCB differential traces (3-mil width and 3-mil spacing).



Figure 12. Experiment Platform.



Figure 13. 10-Gb/s 2<sup>31</sup>-1 PRBS eye diagram before (left) and after (right) carrier synchronization and threshold calibration



Figure 14. Left: 4-Gb/s 2<sup>31</sup>-1 PRBS eye diagram of clock and Q channel. Right: transmitted and received clock (upper); PRBS data from TX generator (middle); demodulated data at receiver (lower).


#### **TABLE I. Comparison with Prior Art**

| Metric                | ISSCC'15<br>[6]      | CICC'15<br>[2]        | ISSCC'16<br>[3]      | ISSCC'17<br>[8]        | This Work             |                |
|-----------------------|----------------------|-----------------------|----------------------|------------------------|-----------------------|----------------|
| Architecture          | TDM                  | FDM                   | FDM                  | N/A                    | FDM                   |                |
| Tech                  | 65nm                 | 40nm                  | 28nm                 | 14nm                   | 16nm                  |                |
| Supply                | 0.7 V                | 0.9 V                 | 1.2 V                | N/A                    | 1 V                   |                |
| Channel Type          | FR-4<br>1.5-inch     | FR-4<br>2-inch        | FR-4<br>2-inch       | EMIB<br>1.1mm          | FR-4<br>1-inch        | FR-4<br>5-inch |
| Data Rate Per<br>Lane | 6 Gb/s               | 4 Gb/s                | 10 Gb/s              | 2 Gb/s                 | 10 Gb/s               | 4 Gb/s         |
| Power                 | 3.4 mW               | 5.4 mW                | 9.5 mW               | 2.4 mW                 | 7.5 mW                | 5.3 mW         |
| Energy<br>Efficiency  | 0.58 pJ/bit          | 1.35 pJ/bit           | 0.95 pJ/bit          | 1.2 pJ/bit             | 0.75 pJ/bit           | 1.3 pJ/bit     |
| Area                  | 0.15 mm <sup>2</sup> | 0.008 mm <sup>2</sup> | 0.01 mm <sup>2</sup> | 0.0013 mm <sup>2</sup> | 0.006 mm <sup>2</sup> |                |

The transceiver operates at up to 10 Gb/s over the 1-inch FR-4 traces and up to 4 Gb/s over the 5-inch FR-4 traces while consuming 7.5 mW and 5.3 mW, respectively. Without calibration, the demodulated eye diagram is closed, as Figure 13 shows. After both carrier synchronization and threshold calibration, the output eye diagram opens, and the transceiver achieves BER<10<sup>-12</sup> at 4 Gb/s and BER<10<sup>-8</sup> at 10 Gb/s. The test results also show that the data path latency is tracked by the clock path; the clock transitions still track the center of the data eye after demodulation, as shown in Figure 14. In addition, the measured latency is less than 2.5 ns (measured with the 5-inch FR-4 PCB trace). The latency is measured as the time delay between the output of the PRBS generator and the receiver output data, as shown in Figure 14.

# CONCLUSION

In summary, this paper presents an area-compact and energy-efficient 16-QAM FDM wireline transceiver. The demonstrated transceiver can transfer multi-bit data simultaneously over an RF band with minimum intersymbol interferences and concurrently forward data symbol clocks over the baseband via the same physical channel for effective symbol clock recovery. Unique carrier synchronization and threshold calibration methods are also developed and verified to mitigate channel and circuit loss variations. The realized transceiver has achieved 0.75-pJ/bit transmission efficiency at 10 Gb/s with < 2.5 ns latency over 1-inch differential FR-4 traces. It has also achieved 4-Gb/s data rate over 5-inch differential FR-4 traces. It consumes only 0.006 mm<sup>2</sup> die area per lane.

# ACKNOWLEDGMENTS

The authors would like to thank TSMC for chip fabrication.

# REFERENCES

- [1] C. T. Wang and D. Yu, "Signal and Power Integrity Analysis on Integrated Fan-Out PoP (InFO\_PoP) Technology for Next Generation Mobile Applications," 2016 IEEE 66th Electronic Components and Technology Conference (ECTC), Las Vegas, NV, 2016, pp. 380-385.
- [2] W. H. Cho et al., "A 5.4-mW 4-Gb/s 5-band QPSK transceiver for frequency-division multiplexing memory interface," 2015 IEEE Custom Integrated Circuits Conference (CICC), San Jose, CA, 2015, pp. 1-4.
- [3] W. H. Cho et al., "10.2 A 38mW 40Gb/s 4-lane tri-band PAM-4 / 16-QAM transceiver in 28nm CMOS for high-speed Memory interface," 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2016, pp. 184-185.
- [4] Y. Du et al., "A 16-Gb/s 14.7-mW Tri-Band Cognitive Serial Link Transmitter With Forwarded Clock to Enable PAM-16/256-QAM and Channel Response Detection," in IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 1111-1122, April 2017.

- [5] Y. Li et al., "Carrier synchronisation for multiband RF interconnect (MRFI) to facilitate chip-to-chip wireline communication," in *Electronics Letters*, vol. 52, no. 7, pp. 535-537, April 2016. doi: 10.1049/el.2015.3966
- [6] W. S. Choi et al., "3.8 A 0.45-to-0.7V 1-to-6Gb/s 0.29-to-0.58pJ/b source-synchronous transceiver using automatic phase calibration in 65nm CMOS," 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, San Francisco, CA, 2015, pp. 1-3.
- [7] D. Greenhill et al., "3.3 A 14nm 1GHz FPGA with 2.5D transceiver integration," 2017 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2017, pp. 54-5