# Invited Paper

# Low Power VLSI Circuit Design with Fine-Grain Voltage Engineering

MAKOTO TAKAMIYA<sup>†1</sup> and TAKAYASU SAKURAI<sup>†2</sup>

<sup>†1</sup> VLSI Design and Education Center, the University of Tokyo
<sup>†2</sup> Institute of Industrial Science, the University of Tokyo

In order to cope with the increasing leakage power and the increasing device variability in VLSI's, the required control granularity in both the space domain and the time domain is becoming finer. This paper presents several recent fine-grain voltage engineering techniques for low power VLSI circuit design. The space-domain fine-grain voltage engineering includes the fine-grain power supply voltage with 3D-structured on-chip buck converters with the maximum power efficiency up to 71.3% in 0.35-$\mu$m CMOS and the fine-grain body bias control to reduce power supply voltage in 90-nm CMOS. The time-domain fine-grain voltage engineering includes accelerators for the power supply voltage hopping with a 5-ns transition time in 0.18-$\mu$m CMOS, the power supply noise canceller with the 32% power supply noise reduction in 90-nm CMOS, and backgate bias accelerators for fast wake-up with 1.5-V change of backgate voltage in 35 ns in 90-nm CMOS.

#### 1. Introduction

Several low power VLSI design techniques such as power gating, clock gating, threshold voltage ($V_{TH}$) control by the body bias, power supply voltage ($V_{DD}$) hopping, and dynamic voltage and frequency scaling (DVFS) have been proposed. In order to cope with the increasing leakage power and the increasing device variability, the required control granularity in both the space domain and the time domain is becoming finer. **Figure 1** shows the recent trend for low power VLSI circuit design with the fine-grain voltage engineering. The voltage engineering includes the $V_{DD}$ control and the body bias control. In the conventional digital LSI, the clock frequency, $V_{DD}$, and $V_{TH}$ are fixed and common within a chip. In contrast, the future digital LSI will have hundreds of domains within a chip, each with a different clock frequency, $V_{DD}$, and $V_{TH}$ controlled on the order of nanoseconds, in order to compensate for device variations and degradation and to adapt to environmental (voltage and temperature) changes.

Fig. 1 Recent trend for the low power VLSI circuit design with the fine-grain voltage engineering.

This paper presents several recent fine-grain voltage engineering techniques for low power VLSI circuit design. Section 2 presents the space-domain fine-grain voltage engineering, including the fine-grain power supply voltage with 3D-structured on-chip buck converters and the fine-grain body bias control to reduce $V_{\rm DD}$. Section 3 describes the time-domain fine-grain voltage engineering, including accelerators for $V_{\rm DD}$ hopping, a power supply noise canceller, and backgate bias accelerators for fast wake-up. Finally, some concluding remarks are given in Section 4.

#### 2. Space-Domain Fine-Grain Voltage Engineering

This section presents the space-domain fine-grain voltage engineering techniques, including the fine-grain power supply voltage with 3D-structured on-chip buck converters and the fine-grain body bias control to reduce $V_{\rm DD}$.

## 2.1 Fine-Grain Power Supply Voltage with 3D-Structured On-Chip Buck Converters

Fig. 2 Distributed power supply for the fine grain V<sub>DD</sub> in SiP's.

Fine grain $V_{\rm DD}$ implementation is required in low power and high performance systems. Moreover, the supply voltage is sometimes tuned in time to achieve low power consumption, which is called dynamic voltage scaling. Supplying many different and dynamically scaled voltages from outside the package gives rise to a large area overhead. The power integrity, including IR drop and noise, becomes an issue as well. Distributed on-chip power supply circuits are useful for solving these problems.
The concept $^{1)}$ of the distributed power supply in SiP's for the fine grain $V_{\rm DD}$ is shown in Fig. 2. High voltage is distributed by a main power grid and is then converted to lower voltages in the vicinity of the target blocks by distributed on-chip voltage converters. This approach reduces cost and power integrity issues.

For DC-DC converters, linear regulators, buck converters, and switched capacitor converters are well known circuits. A buck converter requires large passive elements of inductance and capacitance (LC) for an output filter, but it shows higher power efficiency than a linear regulator. A switched capacitor converter also needs large capacitors. Another drawback is that its output voltage levels are limited by the ratios of the prepared capacitors, which makes it poorly suited to low-power dynamic voltage scaling systems. In the case of the buck converter, a high switching frequency is preferable for smaller L and C, but the power efficiency is degraded by the dynamic power dissipated by the switching transistors at high frequency. The low quality factor (Q) of air-core and on-chip inductors also degrades the power efficiency. High inductance is good for high Q but is not easy to obtain on a chip because of the area limitation, and even if a high magnetic permeability material is introduced on a chip, the high-$\mu$ property is usually lost at high frequency.

It is a reasonable choice to implement the active elements and the output filter on separate dies whose process technologies are different <sup>1)</sup>. By stacking two chips face-to-face and connecting them via metal bumps, a buck converter for on-chip distributed power supply systems can be fabricated in a well balanced manner for the best cost and power trade-off.

Fig. 3 Stacked-chip buck converter.

Fig. 4 Chip microphotograph of the output filter on the upper chip in 0.35-$\mu$m CMOS.
An on-chip buck converter with stacked-chip implementation for the fine-grain $V_{DD}$ has been designed and fabricated in 0.35-$\mu$m CMOS for both the upper and lower chips <sup>1)</sup>. The lower chip could be manufactured in 90-nm or more advanced technology for higher efficiency, but this test chip is intended to show the feasibility of the stacked-chip approach. **Figure 3** shows the circuit diagram of the test converter. **Figure 4** shows the chip microphotograph of the output filter on the upper chip. The filter is $2\,\mathrm{mm}\times 2\,\mathrm{mm}$, assuming that a $10\,\mathrm{mm}$-square chip can have 25 voltage domains. The calculated inductance is $22\,\mathrm{nH}$. The open space at the center of the inductor is filled with a MOS capacitor for the output filter. Area efficiency is more important than linearity for the filter capacitor, because the output voltage does not change dynamically in normal operation. From that aspect, a MOS capacitor is more suitable than other types of on-chip capacitors such as Metal-Insulator-Metal (MIM) capacitors or polysilicon capacitors. The obtained capacitance is about $1\,\mathrm{nF}$. Under those conditions, the switching frequency was chosen to be $200\,\mathrm{MHz}$. The gate widths of the switching transistors were optimized at the load current of $60\,\mathrm{mA}$.

**Figure 5** (a) shows the cross-sectional diagram of the 3D-structured buck converter and Fig. 5 (b) shows the photograph. The pad size and the effective bump diameter of this experimental setup are $200\,\mu\mathrm{m}\times200\,\mu\mathrm{m}$ and $150\,\mu\mathrm{m}$, respectively. Micro bumps whose diameter is $30\,\mu\mathrm{m}$ and whose resistance is as low as $14\,\mathrm{m}\Omega/\mathrm{bump}$ have been realized in industry environments <sup>2)</sup> and can be used instead to further reduce the area.

Fig. 5 3D-structured buck converter. (a) Cross-sectional diagram. (b) Photograph.

Fig. 6 Measured power efficiency with the input voltage of $3.3\,\mathrm{V}$ and the output voltage of $2.3\,\mathrm{V}$.

Fig. 7 Inductor array on FR-4 glass epoxy interposer.

**Figure 6** shows the measured power efficiency with the input voltage of $3.3\,\mathrm{V}$ and the output voltage of $2.3\,\mathrm{V}$ for an output current range from $20\,\mathrm{mA}$ to $70\,\mathrm{mA}$. The maximum efficiency of 62% is achieved for $70\,\mathrm{mA}$ output current.

In order to further increase the efficiency, it is effective to use inductors with low parasitic resistance. A thin-film inductor surrounded by magnetic core material can be a solution but is expensive. Implementing the inductor on a glass epoxy interposer is an effective yet inexpensive solution. **Figure 7** shows an inductor array on a generic Flame Retardant 4 (FR-4) glass epoxy interposer with two metal layers. The circled inductor in the array, which achieved the minimum metal spacing in the trial manufacture, is used for the measurement. The metal thickness on the interposer is $30\,\mu\mathrm{m}$, the substrate thickness is $100\,\mu\mathrm{m}$, and the diameter of the through-hole via is $100\,\mu\mathrm{m}$. This implementation decreases the parasitic resistance to one-thirtieth of that of an on-chip inductor. The outer diameter of the inductor is increased by 10% to achieve the same inductance as the on-chip inductor, because the minimum spacing of metal lands on glass epoxy is larger than that of on-chip interconnects. The measured inductance was $18\,\mathrm{nH}$. The resistance increases rapidly because of the skin effect in the high frequency region over $200\,\mathrm{MHz}$; however, the characteristics below $200\,\mathrm{MHz}$ are the important ones in this application.

Figure 6 shows the comparison of measured power efficiency between the on-chip inductor and the inductor on FR-4. The input voltage is $3.3\,\mathrm{V}$ and the output voltage is $2.3\,\mathrm{V}$.
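As a rough cross-check of the operating point quoted above, the following sketch computes two idealized reference values from the stated numbers ($22\,\mathrm{nH}$, about $1\,\mathrm{nF}$, $200\,\mathrm{MHz}$, $3.3\,\mathrm{V}$ in, $2.3\,\mathrm{V}$ out). It assumes a lossless second-order LC low-pass filter and an ideal linear regulator with zero quiescent current; both are textbook idealizations introduced here for illustration, and the function names are hypothetical, not taken from the paper.

```python
import math

# Values quoted in the text for the on-chip output filter and test conditions.
L_FILTER = 22e-9    # H, calculated on-chip inductance
C_FILTER = 1e-9     # F, approximate MOS capacitance
F_SWITCH = 200e6    # Hz, chosen switching frequency
V_IN, V_OUT = 3.3, 2.3  # V


def lc_corner_frequency(inductance, capacitance):
    """Corner frequency of an ideal (lossless) LC low-pass filter, in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


def ideal_linear_regulator_efficiency(v_in, v_out):
    """Upper bound on linear-regulator efficiency (quiescent current ignored)."""
    return v_out / v_in


f_corner = lc_corner_frequency(L_FILTER, C_FILTER)
print(f"LC corner frequency: {f_corner / 1e6:.1f} MHz")
print(f"Switching frequency / corner frequency: {F_SWITCH / f_corner:.1f}")
print(f"Ideal linear regulator efficiency: "
      f"{100 * ideal_linear_regulator_efficiency(V_IN, V_OUT):.1f} %")
```

Under these assumptions the switching frequency sits several times above the filter corner, and the linear-regulator bound $V_{\rm OUT}/V_{\rm IN}$ gives a reference point against which the measured buck converter efficiencies in Fig. 6 can be read. Parasitic resistance, which the text identifies as the main efficiency limiter, is deliberately left out of this sketch.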
The power efficiency with the inductor on FR-4 is improved by $5{\text -}14\%$ depending on the output current compared with the on-chip implementation. The maximum power efficiency of 71.3% is achieved at an output current of $60\,\mathrm{mA}$.

#### 2.2 Fine-Grain Body Bias Control to Reduce V<sub>DDmin</sub>

In this section, the fine-grain body bias control to compensate for the intra-die variations at ultra low $V_{\rm DD}$ is discussed. Very low voltage operation of VLSI's is effective in reducing both dynamic and leakage power, and the maximum energy efficiency is achieved at low $V_{DD}$ (e.g., $320\,\mathrm{mV}^{\,3)}$). Thus many works have been carried out on the subthreshold operation of logic circuits $^{3)-7)}$ and SRAM's $^{8)}$, where $V_{DD}$ is less than $V_{TH}$ of the transistors. However, the number of transistors in the previously reported subthreshold circuits is small (e.g., $70\,\mathrm{k}$ transistor logic circuits at $V_{DD}$ of $230\,\mathrm{mV}^{\,3)}$, a $32\,\mathrm{kbit}$ SRAM at $V_{DD}$ of $160\,\mathrm{mV}^{\,8)}$, and a 1000 stage inverter chain at $V_{DD}$ of $60\,\mathrm{mV}^{\,6)}$), and the feasibility of mega-gate-scale subthreshold circuits is not clear.

$V_{\rm DDmin}$ is the minimum power supply voltage at which the circuits operate without function errors. Ring oscillators (RO's) are useful $V_{\rm DDmin}$ detectors <sup>9)</sup>, because RO's stop oscillating when the first function error in the logic circuits happens. **Figure 8** shows the simulated waveform of a 5-stage CMOS inverter RO. $V_{\rm DD}$ is varied from 0.2 V to 0 V. At $V_{\rm DDmin}$ of 50 mV, the RO stops oscillating.

The origin of $V_{\rm DDmin}$ is analyzed with Monte Carlo SPICE simulations. **Figure 9** (a) shows the schematic of the simulated 11-stage RO's where each transistor has random $V_{\rm TH}$. The inverter chain with the input of $V_{\rm DD}$ is simulated.
Figure 9 (b) shows the node voltages ($V_1$ to $V_{11}$) and the inversion voltages ($V_{\rm INV}$'s) of the inverters. Normally, the logical low of $V_1$ to $V_{11}$ is lower than $V_{\rm INV}$ and the logical high of $V_1$ to $V_{11}$ is higher than $V_{\rm INV}$. The inverter chain, however, has a function error at the #7 and #8 inverters: because the #7 inverter has a slow nMOS and a fast pMOS, $V_{\rm INV}$ of the #7 inverter is high, and the logical low of $V_7$ ($V_{\rm OUT\_LOW\_7}$) is higher than $V_{\rm INV}$ of the #8 inverter. The function error stops the RO oscillation.

Fig. 8 Simulated waveform of 5-stage CMOS RO. Definition of $V_{\rm DDmin}$ is shown.

Fig. 9 (a) Simulated 11-stage inverter chain where each transistor has random $V_{TH}$. (b) Node voltages ($V_1$ to $V_{11}$) and inversion voltages ($V_{INV}$'s) of the inverters.

Fig. 10 Measured dependence of the average $V_{DDmin}$ of RO's on the number of stages.

In order to emulate recent SoC's, mega-stage-scale RO's are required, because recent SoC's have $10-100\,\mathrm{M}$ logic gates. With technology scaling and the increased number of transistors on a chip, $V_{\mathrm{DDmin}}$ will increase, because the more gates there are, the more likely it is that the worst-case condition will occur, and thus a higher $V_{\mathrm{DD}}$ will be required.

In order to measure $V_{\rm DDmin}$, 90-nm CMOS RO's with varied numbers of stages were fabricated. **Figure 10** shows the measured dependence of the average die-to-die $V_{\rm DDmin}$, with $\pm 1\,\sigma$ error bars, of inverter RO's on the number of stages <sup>10)</sup>. As the number of stages is increased, the average $V_{\rm DDmin}$ increases, because $V_{\rm DDmin}$ is determined by the worst inverter(s) in each RO. For example, the average $V_{\rm DDmin}$ increases from 90 mV to 343 mV when the number of RO stages increases from 11 to 1 Mega. A $V_{\rm DDmin}$ of 343 mV means above-$V_{\rm TH}$ operation.
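The worst-case argument above can be illustrated with a toy Monte Carlo sketch. It assumes, purely for illustration, that every inverter has an independent Gaussian "local minimum voltage" and that the $V_{\rm DDmin}$ of an RO is the maximum over its stages. This is not the SPICE model or the measured data of the paper; the mean and sigma values and the function name are hypothetical.

```python
import random
import statistics


def average_worst_case(n_stages, n_trials=200, mean_mv=60.0, sigma_mv=10.0, seed=0):
    """Average, over n_trials ring oscillators, of the worst per-stage voltage.

    Each stage draws a hypothetical local minimum operating voltage from a
    Gaussian (mean_mv, sigma_mv); the RO fails at the largest of them.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    worst_per_ro = [
        max(rng.gauss(mean_mv, sigma_mv) for _ in range(n_stages))
        for _ in range(n_trials)
    ]
    return statistics.mean(worst_per_ro)


for stages in (11, 101, 1001):
    print(f"{stages:5d} stages: average worst case "
          f"{average_worst_case(stages):.1f} mV")
```

Even with identical per-stage statistics, the average worst case grows with the number of stages, which is the qualitative trend of Fig. 10. The absolute values printed by this sketch carry no meaning for the 90-nm test chips.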
The results indicate that $V_{\rm DDmin}$ for logic circuits depends on the scale of the circuits and that large scale logic circuits have high $V_{\rm DDmin}$.

Such an increase of $V_{\rm DDmin}$ with the number of stages is not acceptable. The fine-grain adaptive body bias control is an effective technique to compensate for the intra-die systematic $V_{\rm TH}$ variations $^{11)}$. Its effectiveness for the intra-die random $V_{\rm TH}$ variations, however, is not clear. The required circuit block size for the fine-grain control is also unclear. Therefore, $V_{\rm DDmin}$ has been extracted by SPICE simulations for different grain sizes. **Figure 11** shows the initial and compensated $V_{\rm DDmin}$ for the 11-stage RO. The body bias of pMOS is adaptively controlled to minimize $V_{\rm DDmin}$ while the body bias of nMOS is fixed. When a common body bias is applied to the 11 inverters (Fig. 11 (b)), $V_{\rm DDmin}$ is improved from 89 mV to 87 mV.

Fig. 11 Initial and compensated V<sub>DDmin</sub> by the various fine-grain adaptive body bias controls for the 11-stage RO. (a) No body bias. (b) Common body bias. (c) Body bias for every 2 inverters. (d) Inverter-by-inverter body bias.

The $V_{\rm DDmin}$ reduction by common body bias control is also verified by measurement. **Figure 12** shows the measured $V_{\rm DDmin}$ dependence on the body bias of both nMOS and pMOS for an 11-stage RO in the same 90-nm CMOS. When $V_{\rm TH}$ of nMOS and that of pMOS are balanced, $V_{\rm DDmin}$ is low. In contrast, when they are unbalanced, $V_{\rm DDmin}$ is high $^{5),6)}$. The initial $V_{\rm DDmin}$ is 91 mV when both body biases are 0 V. Common body bias control reduces $V_{\rm DDmin}$ to 87 mV, i.e., by only 4 mV. This is consistent with the simulation results and shows that the coarse-grain body bias control is not effective in significantly reducing $V_{\rm DDmin}$.

Fig. 12 Measured $V_{\rm DDmin}$ dependence on the body bias of both nMOS and pMOS for an 11-stage RO.

When independent body bias is applied for every 2 inverters, $V_{\rm DDmin}$ lowers to 85 mV as shown in Fig. 11 (c). In contrast, when inverter-by-inverter body bias is applied, $V_{\rm DDmin}$ is drastically reduced to 43 mV as shown in Fig. 11 (d). Despite the significant improvement, the inverter-by-inverter body bias control is impractical due to the large area penalty. Therefore, fine-grain adaptive body bias control is not effective in compensating for the intra-die random $V_{\rm TH}$ variations in ultra low-voltage logic circuits.

## 3. Time-Domain Fine-Grain Voltage Engineering

This section presents the time-domain fine-grain voltage engineering techniques, including accelerators for $V_{\rm DD}$ hopping, a power supply noise canceller, and backgate bias accelerators for fast wake-up.

#### 3.1 Fast Change of Power Supply Voltage

# 3.1.1 Accelerators for Power Supply Voltage Hopping

It has been known that reducing $V_{\rm DD}$ decreases the power consumption when the required speed is low. To implement this concept, $V_{\rm DD}$-hopping has been introduced, where $V_{\rm DD}$ is changed among discrete levels adaptively to the required performance to reduce power consumption while maintaining the real-time feature $^{12),13)}$.

Fig. 13 Ideal waveform (dotted line) and actual waveform (solid line) for V<sub>DDINT</sub>.

Fig. 14 Concept of the $V_{\rm DD}$-hopping accelerator.

Since $V_{DD}$-hopping should be executed for each circuit block, the distributed power supply circuits should have the capability of changing the voltage in time. In the $V_{DD}$-hopping system, the load circuit should not be used during the transition from one voltage level to another, because the operation of the load circuit block between the voltage levels has not been verified in the test sequence.
Therefore, a high-speed transition among the different levels is important so that the voltage hopping does not consume much time. Figure 13 shows an example of ideal and actual waveforms of the internal $V_{\rm DD}$ ($V_{\rm DDINT}$). A long transition time consumes much time in the $V_{\rm DD}$-hopping and hence reduces the performance of the system. In addition, if the transition time is long, it will be difficult to apply $V_{\rm DD}$-hopping to very quick real-time systems such as servomechanism control systems. To solve the problem, a technique to reduce the transition time, namely a $V_{\rm DD}$-hopping accelerator, is proposed $^{14)}$, and its effectiveness is verified through experiments.

Figure 14 shows the basic concept of the $V_{\rm DD}$-hopping accelerator. The PMOS/NMOS transistor labeled "quick raiser"/"quick dropper," which is added at the output of a distributed voltage regulator on a chip, accelerates the $V_{\rm DD}$-hopping process. The schematic waveforms with and without the quick raiser/dropper are shown in Fig. 15. The transition time depends on the RC time constant of $C_L$ and the effective resistance of the quick raiser/dropper, $R_{\rm ON\_P}/R_{\rm ON\_N}$. Since the quick raiser/dropper charges/discharges $C_L$ aiming not at $V_{\rm DDH}/V_{\rm DDL}$ but at the much higher/lower voltage of $V_{\rm AH}/V_{\rm AL}$, the charging/discharging is greatly accelerated. The acceleration is achieved without any extra power supply lines, since $V_{\rm AH}$ and $V_{\rm AL}$ are available as global power grids. Basically, the acceleration is achieved by aiming at a goal beyond the target value and stopping at the target value. This "aim-high" is the basic concept for the acceleration.

Fig. 15 Waveforms of quick raiser and quick dropper.

A chip microphotograph of the fabricated 0.18-$\mu$m CMOS linear regulator with the quick dropper is shown in **Fig. 16**.
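The "aim-high" idea can be made concrete with a first-order RC sketch. It assumes an ideal exponential response $V(t) = V_{\rm A} + (V_0 - V_{\rm A})e^{-t/RC}$, where $V_0$ is the starting voltage and $V_{\rm A}$ is the voltage the node is driven toward, so that the time to cross a target $V_{\rm T}$ is $t = RC\,\ln\frac{V_0 - V_{\rm A}}{V_{\rm T} - V_{\rm A}}$. The voltage levels below are hypothetical and the same RC is used in both cases, which ignores the difference between the regulator output resistance and $R_{\rm ON\_P}/R_{\rm ON\_N}$; the function name is also an assumption of this sketch.

```python
import math


def time_to_reach(v_start, v_target, v_aim, rc):
    """Time for a first-order RC node to go from v_start to v_target
    while being driven toward v_aim (v_target must lie between them)."""
    numerator = v_start - v_aim
    denominator = v_target - v_aim
    if denominator == 0 or numerator / denominator < 1:
        raise ValueError("v_target must lie between v_start and v_aim")
    return rc * math.log(numerator / denominator)


RC = 1.0  # normalized time constant (hypothetical)

# Quick dropper: aim at 0 V, stop when the node crosses 0.5 V.
t_aim_low = time_to_reach(v_start=1.0, v_target=0.5, v_aim=0.0, rc=RC)

# No accelerator: aim at 0.5 V itself, count as settled within 5 mV (1 % of step).
t_plain = time_to_reach(v_start=1.0, v_target=0.505, v_aim=0.5, rc=RC)

print(f"aiming beyond the target: {t_aim_low:.2f} RC")
print(f"aiming at the target:     {t_plain:.2f} RC")
```

With these illustrative numbers, aiming beyond the target and stopping at it takes well under one time constant, whereas settling exponentially onto the target takes several. The much larger speed-up reported for the fabricated circuit also reflects the low on-resistance of the added transistor, which this sketch does not model.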
The quick dropper area is $20\,\mu\mathrm{m} \times 20\,\mu\mathrm{m}$, while the linear regulator area is $30\,\mu\mathrm{m} \times 70\,\mu\mathrm{m}$. The area overhead of the quick dropper can be as small as 2% of the load circuit. Figure 17 shows the measured waveform for $V_{\rm DDINT}$. The transition time from $V_{\rm DDH}$ to $V_{\rm DDL}$ is shorter than 5 ns, which is more than two orders of magnitude faster than the case without the accelerator circuit.

Fig. 16 Chip microphotograph of the fabricated 0.18-$\mu$m CMOS.

Fig. 17 Measured waveform for $V_{\rm DDINT}$. The transition time from $V_{\rm DDH}$ to $V_{\rm DDL}$ is smaller than 5 ns.

#### 3.1.2 Power Supply Noise Canceller

Recent low power VLSI design techniques such as power gating, clock gating, and dynamic voltage and frequency scaling (DVFS) generate a rapid and large change of the power supply current at the moment of wake-up from the sleep mode to the active mode. The large power supply noise generated by such a current change is a serious problem for low power digital VLSI's <sup>15),16)</sup>. **Figure 18** shows an LSI including multiple power domains with power gating. When an "aggressor" block wakes up, the neighboring "victim" block suffers from the power supply noise, which can cause a malfunction and makes it difficult to effectively sleep/wake-up circuit blocks at high frequency.

Fig. 18 Power supply noise in multiple power domains with the power gating.

Fig. 19 Proposed power supply noise canceller with high voltage supply lines.

The noise is in the nanosecond range, i.e., its frequency is usually from 100 MHz to 500 MHz <sup>17)</sup>, and is determined by the resonance of the package parasitic inductance and the on-chip decoupling capacitor. Suppressing the noise by decreasing the package inductance and increasing the on-chip decoupling capacitance leads to a large area penalty.
Conventional clock dithering <sup>15)</sup> and power switch control <sup>16)</sup>, which slow the current change, increase the wake-up time and are not useful for frequent wake-up and power-down.

To solve these problems, an on-chip noise canceller with small area penalty and fast wake-up is proposed. Figure 19 shows the schematic of the proposed circuit <sup>18)</sup>. A high voltage supply ($V_{DDH}$) and a switch between $V_{DDH}$ and the normal power supply ($V_{DD}$) are added to the normal power supply circuit. When the logic circuit wakes up, the switch between $V_{DDH}$ and $V_{DD}$ is turned on and the current from $V_{DDH}$ substitutes for the current flowing through the bonding wire and the on-board supply lines of $V_{\rm DD}$. Since the noise on $V_{\rm DDH}$ does not influence $V_{\rm DD}$, the impedance of the $V_{\rm DDH}$ supply line can be large compared to that of the main $V_{\rm DD}$ line as long as $V_{\rm DDH}$ can substitute the current for $V_{\rm DD}$.

Fig. 20 Microphotograph of the fabricated noise canceller with 90-nm CMOS.

Fig. 21 Measured power supply noise with and w/o the noise canceller.

A chip microphotograph of the test chip fabricated in 1-V 90-nm CMOS is shown in Fig. 20. The noise canceller area is $0.022\,\mathrm{mm^2}$, while the noise source area is $0.21\,\mathrm{mm^2}$. Since this noise source emulates logic circuits with an NMOS switch, its current consumption is equivalent to that of $1.5\,\mathrm{mm^2}$ of 2-input NAND gates. Thus the area overhead of the noise canceller can be as small as 1.5% of the load circuit. The transient response with the proposed canceller is measured and shown in Fig. 21. Without the canceller, the worst voltage in the transient is 71 mV below the steady state IR drop, while the canceller suppresses this noise to 32%. The power overhead of the noise canceller is 2.0% relative to the load for 25 k transitions/sec.
#### 3.2 Backgate Bias Accelerators for Fast Wake-up

Reduction of static power dissipation during standby (or 'sleep') periods, i.e., when no data operation must be performed, is a major requirement for any VLSI chip today. The backgate (body) bias control reduces the leakage power. Backgate bias shows several advantages compared to power gating:

- 1) Unlike power gating, there is no data loss during standby mode, eliminating the requirement of specific storage elements.
- 2) Backgate bias can be used in active mode as well to balance process and temperature variations, and/or tune the circuit speed according to the computation requirements.

In active mode, the backgate bias generator must provide an adequate backgate bias voltage $V_{\rm BGA}$ to balance process and temperature variations. Typically, this generator can be implemented as a voltage buffer with a simple source follower or an amplifier in a feedback loop <sup>19)</sup>. It must dissipate minimum power, provided that its output impedance is low enough not to introduce additional noise onto the substrate. Considering this, a conventional backgate generator cannot charge the large backgate capacitance fast enough to sweep its voltage from the negative sleep backgate bias ($V_{\rm BGS}$) to $V_{\rm BGA}$ in a short time. To solve the problem, a Backgate Bias Accelerator (BBA) circuit that strongly accelerates the charging of the backgate to achieve a fast transition from sleep to active mode, with $V_{\rm BGA}$ tuning capability, is proposed $^{20)}$.

Let us consider the backgate bias technique in the case of NMOSFETs. Figure 22 (a) illustrates the principle of the proposed circuit to accelerate the sleep-to-active mode transition. In sleep mode, the SLEEP control signal is HIGH and the backgate is tied to $V_{\rm BGS}$ (e.g., $-1\,\rm V$). The active mode backgate bias generator is turned off and does not consume any DC bias current.
Once the SLEEP control signal goes low, a large PMOS (the raiser) is turned on and quickly charges the backgate as shown in Fig. 22 (b). The raiser is turned off once the backgate voltage has reached $V_{\rm BGA}$. This voltage is then maintained in active mode by the voltage follower. The raiser must be accurately controlled to avoid charging the backgate above (if it is turned off too late) or below (if it is turned off too early) $V_{\rm BGA}$. If the raiser is turned off only after a comparator has detected that the backgate has reached $V_{\rm BGA}$, the delays of the comparator and the long raiser buffer chain introduce an error in the final backgate voltage. Therefore, accurate timing control of the gate of the raiser is important; the details of the timing control circuits in Fig. 22 (a) are described in Ref. 20).

Fig. 22 Backgate Bias Accelerator (BBA) to accelerate the sleep-to-active modes transition. (a) Circuit schematics. (b) Timing chart.

Fig. 23 Microphotograph of the fabricated BBA with 90-nm CMOS.

A picture of the 90-nm CMOS test chip is shown in **Fig. 23**. The total area of the backgate bias accelerator represents less than 2.5% of the total area of the 40 k NAND gates. The voltage of the p-well of the 40 k NAND gates is measured by a high frequency active probe. Figure 24 shows the measured backgate bias during sleep-to-active mode transitions. $V_{BGS}$ is fixed to $-1\,\mathrm{V}$, while $V_{BGA}$ is swept between $-0.4\,\mathrm{V}$ and $0.4\,\mathrm{V}$. The BBA efficiently controls the ON time of the raiser according to the $V_{BGA}$ value, allowing on-chip tuning of both sleep and active backgate bias voltages. Without the BBA, the active mode backgate bias generator alone takes up to $1\,\mu\mathrm{s}$ to charge the backgate. With the BBA, the transition time between sleep and active modes ranges from 12 ns to 35 ns, which is more than 28 times faster.
The BBA achieves a $0.5\,\mathrm{V}$ change of backgate voltage in 12 ns and a $1.5\,\mathrm{V}$ change in $35\,\mathrm{ns}$ (i.e., $\approx 24\,\mathrm{ns/V}$).

Fig. 24 Measured backgate bias during sleep-to-active modes transitions with and w/o BBA. $V_{\rm BGS}$ is fixed to $-1\,\rm V$, while $V_{\rm BGA}$ is swept between $-0.4\,\rm V$ and $0.4\,\rm V$.

#### 4. Conclusions

This paper has presented several recent fine-grain voltage engineering techniques for low power VLSI circuit design. The space-domain fine-grain voltage engineering includes the fine-grain $V_{\rm DD}$ with the 3D-structured on-chip buck converters and the fine-grain body bias control to reduce $V_{\rm DD}$. The time-domain fine-grain voltage engineering includes the accelerators for the $V_{\rm DD}$ hopping, the $V_{\rm DD}$ noise canceller, and the backgate bias accelerators for fast wake-up.

**Acknowledgments** This work is partially supported by MEXT, STARC, and Intel Corporation. The VLSI chips were fabricated through the chip fabrication program of VDEC in collaboration with STARC.

#### References

- 1) Onizuka, K., Inagaki, K., Kawaguchi, H., Takamiya, M. and Sakurai, T.: Stacked-chip implementation of on-chip buck converter for distributed power supply system in SiPs, *IEEE Journal of Solid-State Circuits*, Vol.42, No.11, pp.2404–2410 (2007).
- 2) Ezaki, T., Kondo, K., Ozaki, H., Sasaki, N., Yonemura, H., Kitano, M., Tanaka, S. and Hirayama, T.: A 160Gb/s interface design configuration for multichip LSI, Proc. IEEE International Solid-State Circuits Conference, pp.140–141 (2004).
- 3) Kaul, H., Anders, M., Mathew, S., Hsu, S., Agarwal, A., Krishnamurthy, R. and Borkar, S.: A 320mV 56μW 411GOPS/Watt ultra-low voltage motion estimation accelerator in 65nm CMOS, Proc. IEEE International Solid-State Circuits Conference, pp.316–317 (2008).
- 4) Calhoun, B.
and Chandrakasan, A.: Ultra-dynamic voltage scaling (UDVS) using sub-threshold operation and local voltage dithering, IEEE Journal of Solid-State Circuits, Vol.41, No.1, pp.238–245 (2006).
- 5) Hanson, S., Zhai, B., Seok, M., Cline, B., Zhou, K., Singhal, M., Minuth, M., Olson, J., Nazhandali, L., Austin, T., Sylvester, D. and Blaauw, D.: Performance and variability optimization strategies in a sub-200mV, 3.5pJ/inst, 11nW subthreshold processor, Proc. IEEE Symposium on VLSI Circuits, pp.152–153 (2007).
- 6) Hwang, M., Raychowdhury, A., Kim, K. and Roy, K.: A 85mV 40nW process-tolerant subthreshold 8x8 FIR filter in 130nm technology, Proc. IEEE Symposium on VLSI Circuits, pp.154–155 (2007).
- 7) Kwong, J. and Chandrakasan, A.: Variation-driven device sizing for minimum energy sub-threshold circuits, Proc. ACM International Symposium on Low Power Electronics and Design, pp.8–13 (2006).
- 8) Chang, I., Kim, J., Park, S. and Roy, K.: A 32kb 10T subthreshold SRAM array with bit-interleaving and differential read scheme in 90nm CMOS, Proc. IEEE International Solid-State Circuits Conference, pp.388–389 (2008).
- 9) Niiyama, T., Zhe, P., Ishida, K., Murakata, M., Takamiya, M. and Sakurai, T.: Dependence of minimum operating voltage (V<sub>DDmin</sub>) on block size of 90-nm CMOS ring oscillators and its implications in low power DFM, Proc. IEEE International Symposium on Quality Electronic Design, pp.133–136 (2008).
- 10) Niiyama, T., Zhe, P., Ishida, K., Murakata, M., Takamiya, M. and Sakurai, T.: Increasing minimum operating voltage (V<sub>DDmin</sub>) with number of CMOS logic gates and experimental verification with up to 1Mega-stage ring oscillators, Proc. ACM International Symposium on Low Power Electronics and Design, pp.117–122 (2008).
- 11) Tschanz, J., Kao, J., Narendra, S., Nair, R., Antoniadis, D., Chandrakasan, A.
and De, V.: Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage, IEEE Journal of Solid-State Circuits, Vol.37, No.11, pp.1396–1402 (2002).
- 12) Kawaguchi, H., Kanda, K., Nose, K., Hattori, S., Antono, D.D., Yamada, D., Miyazaki, T., Inagaki, K., Hiramoto, T. and Sakurai, T.: A 0.5-V, 400-MHz, V<sub>DD</sub>-hopping processor with zero-V<sub>TH</sub> FD-SOI technology, Proc. IEEE International Solid-State Circuits Conference, pp.106–107 (2003).
- 13) Nakai, M., Akui, S., Seno, K., Meguro, T., Seki, T., Kondo, T., Hashiguchi, A., Kumano, H. and Shimura, M.: Dynamic voltage and frequency management for a low-power embedded microprocessor, IEEE Journal of Solid-State Circuits, Vol.40, No.1, pp.28–35 (2005).
- 14) Onizuka, K., Kawaguchi, H., Takamiya, M. and Sakurai, T.: V<sub>DD</sub>-hopping accelerators for on-chip power supply circuit to achieve nanosecond-order transient time, IEEE Journal of Solid-State Circuits, Vol.41, No.11, pp.2382–2389 (2006).
- 15) Lichtenau, C., Ringler, M.I., Pfluger, T., Geissler, S., Hilgendorf, R., Heaslip, J., Weiss, U., Sandon, P., Rohrer, N., Cohen, E. and Canada, M.: Powertune: advanced frequency and power scaling on 64b PowerPC microprocessor, Proc. IEEE International Solid-State Circuits Conference, pp.356–357 (2004).
- 16) Kanno, Y., Mizuno, H., Yasu, Y., Hirose, K., Shimazaki, Y., Hoshi, T., Miyairi, Y., Ishii, T., Yamada, T., Irita, T., Hattori, T., Yanagisawa, K. and Irie, N.: Hierarchical power distribution with 20 power domains in 90-nm low power multi-CPU processor, Proc. IEEE International Solid-State Circuits Conference, pp.540–541 (2006).
- 17) Rahal-Arabi, T., Taylor, G., Ma, M. and Webb, C.: Design & validation of the Pentium III and Pentium 4 processors power delivery, Proc. IEEE Symposium on VLSI Circuits, pp.220–223 (2002).
- 18) Nakamura, Y., Takamiya, M.
and Sakurai, T.: An on-chip noise canceller with high voltage supply lines for nanosecond-range power supply noise, Proc. IEEE Symposium on VLSI Circuits, pp.124–125 (2007).
- 19) Narendra, S., Haycock, M., Govindarajulu, V., Erraguntla, V., Wilson, H., Vangal, S., Pangal, A., Seligman, E., Nair, R., Keshavarzi, A., Bloechel, B., Dermer, G., Mooney, R., Borkar, N., Borkar, S. and De, V.: 1.1V 1GHz communications router with on-chip body bias in 150nm CMOS, Proc. IEEE International Solid-State Circuits Conference, pp.270–271 (2002).
- 20) Levacq, D., Takamiya, M. and Sakurai, T.: Backgate bias accelerator for 10ns-order sleep-to-active modes transition time, Proc. IEEE Asian Solid-State Circuits Conference, pp.296–299 (2007).

(Received October 6, 2008)
(Released February 17, 2009)
(Invited by Editor-in-Chief: Hidetoshi Onodera)

Makoto Takamiya received the B.S., M.S., and Ph.D. degrees in electronic engineering from the University of Tokyo, Japan, in 1995, 1997, and 2000, respectively. In 2000, he joined NEC Corporation, Japan, where he was engaged in the circuit design of high speed digital LSIs. In 2005, he joined the University of Tokyo, Japan, where he is an associate professor at the VLSI Design and Education Center. His research interests include the circuit design of low-power RF circuits, ultra low-voltage digital circuits, and large area electronics with organic transistors. He is a member of the technical program committee of the IEEE Symposium on VLSI Circuits and the IEEE Custom Integrated Circuits Conference (CICC).

Takayasu Sakurai received the Ph.D. degree in EE from the University of Tokyo in 1981. In 1981 he joined Toshiba Corporation, where he designed CMOS DRAM, SRAM, RISC processors, DSPs, and SoC Solutions. He has worked extensively on interconnect delay and capacitance modeling, known as the Sakurai model, and on the alpha power-law MOS model.
From 1988 through 1990, he was a visiting researcher at the University of California Berkeley, where he conducted research in the field of VLSI CAD. Since 1996, he has been a professor at the University of Tokyo, working on low-power high-speed VLSI, memory design, interconnects, ubiquitous electronics, organic IC's and large-area electronics. He has published more than 400 technical publications including 100 invited presentations and several books and filed more than 200 patents. He served as a conference chair for the Symp. on VLSI Circuits and ICICDT, a vice chair for ASPDAC, a TPC chair for the first A-SSCC and the VLSI Symp., and a program committee member for ISSCC, CICC, A-SSCC, DAC, ESSCIRC, ICCAD, ISLPED, and other international conferences. He is a recipient of the 2005 IEEE ICICDT Award, the 2004 IEEE ISSCC Takuo Sugano Award, the 2005 P&I patent of the year award, and four product awards. He has given keynote speeches at more than 50 conferences including ISSCC, ESSCIRC and ISLPED. He consults for startup and international companies. He was an elected AdCom member for the IEEE Solid-State Circuits Society and an IEEE CAS and SSCS distinguished lecturer. He is a STARC Fellow and an IEEE Fellow.