1 Introduction

Supply voltage overscaling (VOS) is one of the most efficient energy-saving techniques, since the energy dissipation of a digital circuit has a quadratic dependency on its supply voltage (\(V_{\mathrm{dd}}\)) [5, 8, 20, 25]. However, VOS is applicable only under the constraint that the worst-case delay of the digital circuit does not exceed the target clock period, which imposes a lower bound on \(V_{\mathrm{dd}}\). Since a digital circuit rarely experiences its worst-case delay, conventional design approaches that strictly enforce this timing constraint may be inefficient in terms of power consumption.

So far, various VOS techniques have been introduced to mitigate the timing constraint. The authors of [9, 18] presented techniques that correct timing errors due to VOS using time borrowing. In [10], a stretchable clock technique was developed to control timing errors caused by VOS. The authors of [22, 26] applied forward error correction techniques to correct timing errors. In [2, 6, 12, 13], approximate computing techniques were introduced, where the digital operations are approximated to reduce the critical path delay without significantly affecting the functional behavior. The algorithmic noise tolerance (ANT) framework introduced in [14] cancels timing errors at the algorithmic level. A variety of ANT techniques have been proposed in the context of digital filtering [1, 7, 14, 19, 23], image/video applications [16, 21], and digital communication applications [15, 17]. While these schemes effectively suppress the timing errors, the output still contains the residual errors that were not removed by the ANT. Hence, the ANT techniques have mainly been developed for digital signal processing applications where the system is designed to meet some algorithm-specific quality of service (QoS) margin. Because of these residual errors, the ANT techniques are not suitable for an arithmetic logic unit (ALU) of a general-purpose processor, which requires bit-exact computation. A battery-powered embedded processor, whose energy budget is limited, likewise needs an ALU that provides reliable results under VOS.

In this paper, therefore, we propose a novel error recovery system, namely a perfect timing error cancelation (PTEC) system, which not only enables VOS without degrading system throughput but also guarantees perfect reconstruction from the timing errors. Our scheme can be applied to ALUs whose output bits are computed sequentially from the least significant bit (LSB) to the most significant bit (MSB), the so-called LSB-first and MSB-last (LFML) arithmetic units. Timing errors due to VOS are allowed to occur at some designated MSBs by properly controlling \(V_{\mathrm{dd}}\). Treating these timing errors as additive noise, the proposed system exactly recovers the timing errors and cancels them from the noisy output of the arithmetic unit. The proposed design differs from the previous ANT schemes [1, 7, 14, 15, 16, 17, 19, 21, 23] in that bit-exact error cancelation is achieved by exploiting the fact that timing errors take discrete values due to LFML-based computation. In particular, it is worth comparing the proposed PTEC technique with the reduced precision redundancy (RPR) method in [23]. The RPR method employs a reduced precision replica and replaces the output of the target unit by the replica output when timing errors are detected. Hence, the desired output cannot be exactly recovered. In contrast, the proposed PTEC subtracts the exact estimate of the timing error from the output of the target unit. Recently, we found that a similar idea has been presented independently in [11]. Though both this paper and the work in [11] present schemes enabling perfect error compensation, our work differs in that we provide a specific PTEC design for arithmetic units used in general-purpose computers, while the work in [11] mostly focuses on digital signal processing (DSP) units such as FIR filters. Specifically, we provide the analytical derivation of the perfect error recovery condition for multiplier architectures, and through the PTEC design procedure based on quantization error analysis, we demonstrate the feasibility of PTEC systems for realizing energy-efficient multipliers. In addition, we present an optimized (minimum delay) hardware design for the error correction unit, which has not been presented in [11].

Using the proposed PTEC technique, a binary multiplier, which is an essential component of the ALU, is designed with the predictive technology model (PTM) [4]. Under the same latency and throughput condition, the PTEC system tolerates 200 mV of VOS without any timing error remaining at its output and consequently achieves \(38\%\) power saving compared to operation at the nominal \(V_{\mathrm{dd}}\) (= 1 V) when designed using a 22-nm CMOS library.

The rest of the paper is organized as follows. In Sect. 2, the characteristics of the timing error are investigated and a timing error model is formulated. Section 3 derives the condition for perfect recovery of the output under VOS. Based on this condition, a detailed design of the proposed PTEC system is described with an example of a carry-save multiplier. Simulation results and discussions are given in Sect. 4. Conclusions are presented in Sect. 5.

2 Modeling of the Timing Error

Fig. 1 A \(4 \times 4\) carry-save multiplier under voltage overscaling

In this section, we present a model of the timing error caused by VOS for an LFML arithmetic unit. When \(V_{\mathrm{dd}}\) is scaled below the minimum supply voltage guaranteeing correct circuit operation, the overall propagation delay may exceed the clock period (\(T_{c}\)), thereby causing logic errors in some output bits. Let us consider an arithmetic unit based on LFML computation. Timing errors then occur first at the MSB and progress toward the lower bits, since the higher-order bits have larger propagation delays. Under this circumstance, the feasible range of the timing errors can be regulated by controlling the extent of the voltage scaling. In Fig. 1, this behavior is illustrated in detail with an example of a \(4 \times 4\) carry-save multiplier, whose vector merging part is implemented as a ripple-carry adder. Let \(\tau _i\) be the worst-case propagation delay for the ith output bit of the multiplier. Since this multiplier is based on LFML computation, it clearly holds that \(\tau _i>\tau _j\) for \(i>j\). When VOS is applied, the propagation delays for all output bits increase. Then, the output bits can be divided into two groups: an MSB group satisfying \(\tau _i>T_c\) and an LSB group satisfying \(\tau _i\le T_c\). The MSB group is exposed to timing errors by VOS and is called the error-prone (EP) region, whereas the remaining bits form the safe region. As \(V_{\mathrm{dd}}\) is scaled down more aggressively, the EP region grows, and thus, more output bits experience timing errors.

Let us consider a timing error model for an LFML arithmetic unit designed to produce an N-bit unsigned output. Under the VOS scenario, the kth output sample of the arithmetic unit is decomposed into a desired signal term and a timing error term, i.e.,

$$\begin{aligned} z_{k} = y_k + e_k \end{aligned}$$
(1)

where \(z_k\) and \(y_k\) are the noisy signal and the ideal output signal of the arithmetic unit, respectively, and \(e_k\) is the timing error induced by the VOS. Since \(e_k\) may have a negative value, \((N+1)\) bits are assigned for \(e_k\) by adding a sign bit.

Denoting the bit-width of the safe region as M, the M LSB’s of the error signal \(e_k\) should be zeros. Then, we have

$$\begin{aligned} e_{k}= [\underbrace{*~*~\cdots ~*}_{(N-M+1) \text{ bits }} \;\;\;\;\; \underbrace{0~0~\cdots ~0}_{M \text{ bits }}~]. \end{aligned}$$
(2)

It can be seen that the value of \(e_k\) is a multiple of \(2^M\), i.e., \(e_{k} = 2^{M}l\) with an integer \(l \in [-2^{N-M} \cdots 2^{N-M}-1] \). This timing error model is used to design the proposed PTEC system, which can remove the timing error perfectly.
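The error model of (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original design flow: it assumes that the error-prone MSBs take arbitrary values (real timing errors depend on the circuit state and path delays), and the function name `corrupt_msbs` and the widths `N = 8`, `M = 5` are hypothetical.

```python
import random

def corrupt_msbs(y: int, n_bits: int, m_safe: int, rng: random.Random) -> int:
    """Return z whose m_safe LSBs match y and whose remaining MSBs are arbitrary.

    Assumption for illustration only: the error-prone bits are modeled as
    arbitrary values; in a real circuit they depend on the path delays.
    """
    low = y & ((1 << m_safe) - 1)             # safe region: always correct
    high = rng.getrandbits(n_bits - m_safe)   # error-prone region: may be wrong
    return (high << m_safe) | low

N, M = 8, 5                                   # hypothetical bit-widths
rng = random.Random(0)
errors = []
for _ in range(1000):
    y = rng.getrandbits(N)                    # ideal output y_k
    z = corrupt_msbs(y, N, M, rng)            # noisy output z_k
    errors.append(z - y)                      # e_k = z_k - y_k, Eq. (1)

# Every error is 2^M * l with l in [-2^(N-M), 2^(N-M) - 1].
assert all(e % (1 << M) == 0 for e in errors)
assert all(-(1 << (N - M)) <= e // (1 << M) <= (1 << (N - M)) - 1 for e in errors)
```

Because the M LSBs of \(z_k\) and \(y_k\) agree, their difference has M trailing zeros, which is exactly the structure in (2).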

3 PTEC System Design

3.1 Exact Timing Error Recovery Condition

Fig. 2 Block diagram of the proposed PTEC system

In this section, we present the PTEC system that recovers the ideal output \(y_{k}\) from the noisy output \(z_{k}\) in (1). The block diagram of the proposed PTEC system is depicted in Fig. 2. An approximate processing unit (APU) estimates the output of the target processing unit (TPU) with reduced hardware resources. The error correction unit (ECU) first subtracts the output of the APU denoted as \({\tilde{y}}_{k}\) from the noisy output \(z_{k}\) to have

$$\begin{aligned} v_{k}&= z_k - {\tilde{y}}_{k}, \end{aligned}$$
(3)

and then estimates the timing error, denoted as \({\hat{e}}_k\), using \(v_k\). Finally, the estimated timing error is subtracted from the output \(z_k\) to recover the output \({\hat{y}}_{k}\). If perfect recovery is attained, \({\hat{y}}_{k}\) should be exactly the same as the ideal output \(y_k\). From (1) and (3), we have

$$\begin{aligned} v_{k}&= e_k + (y_k - {\tilde{y}}_{k}), \end{aligned}$$
(4)

where \((y_k - {\tilde{y}}_{k})\) is called the “residual signal.” It is possible to restrict the amplitude of the residual signal \((y_k - {\tilde{y}}_{k})\) to a certain range by controlling the precision of the APU. Note that the possible values of \(e_{k}\) are spaced by \(2^M\) in an LFML arithmetic unit since errors occur only in the EP region, i.e., \(e_{k} = 2^{M}l\) with an integer \(l \in [-2^{N-M} \cdots 2^{N-M}-1]\). If the amplitude of the residual signal is bounded as \(B_l \le (y_k - {\tilde{y}}_{k})< B_h\) (or equivalently, \(e_k+ B_l \le v_{k} < e_k+B_h\)) and

$$\begin{aligned} (B_h - B_l) \le 2^{M}, \end{aligned}$$
(5)

the timing error \(e_{k}\) can be recovered without ambiguity by mapping the observation \(v_{k}\) to the point \(2^M l\) satisfying \(2^M l + B_l \le v_{k} < 2^M l + B_h \). Figure 3 illustrates the exact recovery of the timing error. Since the ranges of \(v_k\) corresponding to different realizations of \(e_{k}\) do not overlap, it is possible to find the value of \(e_{k}\) that yields the outcome \(v_{k}\). In essence, if the largest and the smallest possible values of the residual signal \((y_k -{\tilde{y}}_{k})\) differ by less than \(2^{M}\), the timing error \(e_{k}\) can be recovered from the signal \(v_k\). Note that the timing errors of the TPU can be corrected perfectly if the precision of the APU is chosen to meet the perfect recovery condition described above.
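The mapping from \(v_k\) to \(e_k\) described above can be illustrated with a short sketch. The function name `recover_error` and the bounds `B_l = -3`, `B_h = 5` are hypothetical values chosen so that \(B_h - B_l = 2^M\) with \(M = 3\); they are not taken from the paper.

```python
def recover_error(v: int, m_safe: int, b_low: int = 0) -> int:
    """Return the point 2^M * l with 2^M*l + B_l <= v < 2^M*l + B_l + 2^M."""
    step = 1 << m_safe
    # Python's floor division rounds toward -infinity, so negative v works too.
    return ((v - b_low) // step) * step

M = 3
B_l, B_h = -3, 5                      # hypothetical bounds with B_h - B_l = 2^M

# Exhaustive check: every (error, residual) pair is decoded to the right error.
for l in range(-4, 4):
    e = (1 << M) * l
    for residual in range(B_l, B_h):
        assert recover_error(e + residual, M, B_l) == e

# If the residual range were wider than 2^M, neighbouring intervals would
# overlap: a residual of B_l + 2^M with e = 0 is attributed to the next point.
wrong = recover_error(0 + (B_l + (1 << M)), M, B_l)
print(wrong)                          # 8, not the true error 0
```

The last lines show why condition (5) is needed: once the residual can span more than \(2^M\) values, two different errors can produce the same \(v_k\).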

Fig. 3 Illustration of the exact recovery of the timing error

3.2 Design of Approximate Processing Unit

In this section, we design the approximate processing unit (APU) in the proposed PTEC system so that perfect recovery of the output is attained. Basically, the APU has the same implementation as the TPU except that its precision is lower. The aim of the APU design is to reduce the area complexity, measured here by the bit-width of the unit, as much as possible in order to minimize the overall complexity of the PTEC system while meeting the perfect recovery requirement of (5). Specifically, the perfect recovery condition is met by choosing the bit-width of the APU such that the upper and lower bounds of the residual signal \((y_k -{\tilde{y}}_{k})\) differ by at most \(2^{M}\). Hence, for a given supply voltage, the APU is designed by (1) estimating the bit-width M of the safe region from path delay profiling of the target unit, (2) deriving the range of \((y_k -{\tilde{y}}_{k})\) as a function of the bit-width, and (3) choosing the smallest bit-width for the APU satisfying (5).

In this section, the \(L \times L\) unsigned carry-save multiplier is given as an example to demonstrate the design of the APU in the proposed PTEC system. Let us denote \(a_k\) and \(b_k\) as two L-bit unsigned inputs to the target multiplier. Then, the \(N(=2L)\) bit output corrupted by the timing error is expressed as

$$\begin{aligned} z_{k} = a_k \times b_k + e_k, \end{aligned}$$
(6)

where the desired output is \(y_{k} = a_k \times b_k\). If the APU takes P MSBs \((P < L)\) from the inputs of the target multiplier, we have

$$\begin{aligned} {\tilde{y}}_{k} = (T_{L-P}(a_k) \times T_{L-P}(b_k) ) , \end{aligned}$$
(7)

where \(T_{n}(\cdot )\) denotes a truncation operation that sets the n least significant bits to zero (e.g., \(T_{2}(11111) = 11100\)). Now, the goal is to find the smallest value of P which satisfies (5).

Denoting \(a_k = T_{L-P}(a_k) + \varDelta a_{k}\) and \(b_k = T_{L-P}(b_k) + \varDelta b_{k}\), the residual signal \((y_k - {\tilde{y}}_{k})\) is expressed as

$$\begin{aligned} y_k - {\tilde{y}}_{k}&= a_k \times b_k - T_{L-P}(a_k) \times T_{L-P}(b_k) \nonumber \\&= \underbrace{(T_{L-P}(a_k)+\varDelta a_{k})\varDelta b_{k}}_{2L-P \text{ bits }} + \underbrace{\varDelta a_{k} \times T_{L-P}(b_k)}_{2L-P \text{ bits }}. \end{aligned}$$
(8)

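It follows from (8) that \(B_{l}=0\) and \(B_{h} = 2^{2L-P+1}\), since each of the two terms in (8) is a nonnegative number of at most \(2L-P\) bits and \((y_k - {\tilde{y}}_{k})\) therefore lies in \([0, 2^{2L-P+1}-1]\). Substituting these bounds into (5), the exact recovery condition of the timing error is given by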

$$\begin{aligned} 2^{2L-P+1} \le 2^{M}. \end{aligned}$$
(9)

Accordingly, the minimum precision of the APU multiplier which guarantees the perfect timing error recovery is determined as

$$\begin{aligned} P = 2L - M + 1. \end{aligned}$$
(10)
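The bound on the residual signal and the resulting precision in (10) can be checked by brute force for a small multiplier. The sketch below is only illustrative: the helper names `truncate`, `residual_range`, and `min_apu_precision` are hypothetical, and \(L = 6\) is chosen merely so that the exhaustive loop stays small.

```python
def truncate(x: int, n: int) -> int:
    """T_n(x): set the n least significant bits of x to zero."""
    return (x >> n) << n

def residual_range(L: int, P: int):
    """Smallest and largest y - y_tilde over all pairs of L-bit unsigned inputs."""
    drop = L - P                                   # number of LSBs the APU ignores
    values = [a * b - truncate(a, drop) * truncate(b, drop)
              for a in range(1 << L) for b in range(1 << L)]
    return min(values), max(values)

def min_apu_precision(L: int, M: int) -> int:
    """Smallest P with 2^(2L-P+1) <= 2^M, i.e., Eq. (10)."""
    return 2 * L - M + 1

L = 6
for P in range(1, L):
    lo, hi = residual_range(L, P)
    # Residual is nonnegative and stays below B_h = 2^(2L-P+1), as used in (9).
    assert lo == 0 and hi < (1 << (2 * L - P + 1))

# Example with hypothetical numbers: a 16 x 16 multiplier with M = 24 safe bits.
print(min_apu_precision(16, 24))                   # 9
```

Note that the requirement \(P < L\) implies that (10) only yields a usable APU when \(M > L+1\).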

The proposed APU design can be applied to any LFML arithmetic unit design by simply reducing the precision of the TPU and calculating a proper bit-width for the perfect recovery. Let us consider the two’s complement signed multipliers where \(a_k\) and \(b_k\) are L-bit signed numbers and their MSBs are used as a sign bit. In this case, the residual signal \((y_{k} - {\tilde{y}}_{k})\) in (8) is not always positive, and thus, the exact recovery condition of (9) is not valid. We can make the residual signal positive by modifying (7) to

$$\begin{aligned} {\tilde{y}}_{k}&= \, \left( T_{L-P}(a_k) + \mathrm{sign}(a_k) 2^{L-P+1} \right) \nonumber \\&\qquad \times \left( T_{L-P}(b_k) + \mathrm{sign}(b_k) 2^{L-P+1} \right) , \end{aligned}$$
(11)

where \(\mathrm{sign}(x)\) produces 1 if \(x<0\) and 0 otherwise. Note that this operation is equivalent to adding the sign bits of \(a_k\) and \(b_k\) to the inputs of the APU (see Fig. 4). Then, the error correction steps described above can be employed without any modification, as if the TPU were an unsigned multiplier. Since this modification of the APU requires only P half adders, the increase in power consumption is expected to be marginal.

Fig. 4 Modified APU for application to the signed multiplier

3.3 Design of Error Correction Unit

In this section, the ECU is designed to estimate the error, \({\hat{e}}_{k}\) from \(z_{k}\) and \({\tilde{y}}_{k}\) as shown in the dotted box in Fig. 2. First, the output of the APU is subtracted from the output of the TPU to obtain \(v_{k}\) as in (3). Then, the timing error \(e_{k}\) is estimated from

$$\begin{aligned} {\hat{e}}_{k} = 2^M l \text{ such } \text{ that } \{2^M l \le v_{k} < 2^M (l +1) \}. \end{aligned}$$
(12)

Note that (12) can be simply implemented by the truncation operation as

$$\begin{aligned} {\hat{e}}_{k} = T_{M}(v_{k}). \end{aligned}$$
(13)

Finally, the estimate of the timing error \({\hat{e}}_{k}\) is subtracted from the noisy output \(z_{k}\), i.e.,

$$\begin{aligned} {\hat{y}}_{k} = z_{k} - {\hat{e}}_{k}. \end{aligned}$$
(14)
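The steps (3), (13), and (14) can be put together in a small behavioral model of the unsigned multiplier example. This is a sketch under stated assumptions, not the hardware design: the function name `ptec_output` is hypothetical, the widths \(L = 8\), \(M = 12\) are arbitrary, and the timing error is modeled by replacing the error-prone MSBs with random bits.

```python
import random

def truncate(x: int, n: int) -> int:
    """T_n(x): clear the n LSBs; for negative x this rounds toward -infinity."""
    return (x >> n) << n

def ptec_output(a: int, b: int, z: int, L: int, M: int) -> int:
    """Recover y = a*b from the noisy TPU output z (behavioral model only)."""
    P = 2 * L - M + 1                                   # APU precision, Eq. (10)
    assert 0 < P < L, "no valid APU precision for this M"
    drop = L - P
    y_tilde = truncate(a, drop) * truncate(b, drop)     # APU output, Eq. (7)
    v = z - y_tilde                                     # Eq. (3)
    e_hat = truncate(v, M)                              # Eq. (13)
    return z - e_hat                                    # Eq. (14)

L, M = 8, 12                                            # hypothetical widths
rng = random.Random(1)
mismatches = 0
for _ in range(2000):
    a, b = rng.getrandbits(L), rng.getrandbits(L)
    y = a * b
    # Assumed error model: the (2L - M) MSBs of the product are arbitrary.
    z = (rng.getrandbits(2 * L - M) << M) | (y & ((1 << M) - 1))
    mismatches += ptec_output(a, b, z, L, M) != y

print(mismatches)                                       # 0
```

In this model the truncation of a negative \(v_k\) relies on Python's arithmetic right shift, which matches the floor-type mapping in (12).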

Since the estimate of the timing error is exact, i.e., \({\hat{e}}_{k} = e_{k}\), the desired signal can be perfectly obtained without any degradation, that is, \({\hat{y}}_{k} = y_{k}\). Substituting (3) and (13) into (14), we have

$$\begin{aligned} {\hat{y}}_{k} = z_{k} -T_{M}(z_{k} - {\tilde{y}}_{k}), \end{aligned}$$
(15)

which formulates the operation of the ECU. The direct computation of (15) requires \((2L - 2P)\) bit subtraction for \(v_{k}=z_{k} - {\tilde{y}}_{k}\) and another \((2L-M)\) bit subtraction for \(z_{k} -T_{M}(v_{k})\), incurring a significant propagation delay when M is small. Such a long propagation delay is not desirable since it is added to the propagation delay needed for obtaining the safe region bits in the target multiplier, thereby limiting the extent of voltage scaling. In order to find an ECU structure with reduced propagation delay, the bits composing \({\hat{y}}_{k}\) in (15) are divided into two parts: the M LSBs denoted as \({\hat{y}}_{k}^{(LSB)}\) and the \((2L-M)\) MSBs denoted as \({\hat{y}}_{k}^{(MSB)}\). Similarly, the bits in \(z_{k}\) are divided into \(z_{k}^{(LSB)}\) and \(z_{k}^{(MSB)}\). Obviously, \({\hat{y}}_{k}^{(LSB)}\) is the same as \(z_{k}^{(LSB)}\) since the M lower bits of \((z_{k} - {\tilde{y}}_{k})\) are truncated as in (15). On the other hand, \({\hat{y}}_{k}^{(MSB)}\) is given by

$$\begin{aligned} {\hat{y}}_{k}^{(MSB)} = z_{k}^{(MSB)} - \left( (z_{k} - {\tilde{y}}_{k}) \gg M \right) \end{aligned}$$
(16)

where

$$\begin{aligned}&(z_{k} - {\tilde{y}}_{k}) \gg M \nonumber \\&= \left\{ \begin{array}{ll} z_{k}^{(MSB)} - ({\tilde{y}}_{k} \gg M) &{}\quad \text{ if } z_{k}^{(LSB)} \ge {\tilde{y}}_{k}^{(LSB)}, \\ z_{k}^{(MSB)} - ({\tilde{y}}_{k} \gg M) -1, &{}\quad \text{ if } z_{k}^{(LSB)} < {\tilde{y}}_{k}^{(LSB)}. \end{array} \right. \end{aligned}$$
(17)

From (16) and (17), \({\hat{y}}_{k}=\left[ {\hat{y}}_{k}^{(LSB)}, {\hat{y}}_{k}^{(MSB)} \right] \) can be obtained as

$$\begin{aligned} {\hat{y}}_{k}^{(LSB)}&= z_{k}^{(LSB)} \nonumber \\ {\hat{y}}_{k}^{(MSB)}&= \left\{ \begin{array}{ll} ({\tilde{y}}_{k} \gg M) , &{}\quad \text{ if } z_{k}^{(LSB)} \ge {\tilde{y}}_{k}^{(LSB)}, \\ ({\tilde{y}}_{k} \gg M) +1, &{}\quad \text{ if } z_{k}^{(LSB)} < {\tilde{y}}_{k}^{(LSB)} . \end{array} \right. \end{aligned}$$
(18)
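The equivalence between the direct form (15) and the split form (18) can be confirmed exhaustively for small word lengths. The sketch below uses hypothetical function names `ecu_direct` and `ecu_fast` and a 6-bit word with \(M = 3\); it checks only the arithmetic identity, not the recovery condition.

```python
def ecu_direct(z: int, y_tilde: int, M: int) -> int:
    """Eq. (15): y_hat = z - T_M(z - y_tilde)."""
    v = z - y_tilde
    return z - ((v >> M) << M)

def ecu_fast(z: int, y_tilde: int, M: int) -> int:
    """Eq. (18): compare the M LSBs, then select (y_tilde >> M) or that plus one."""
    mask = (1 << M) - 1
    z_lsb = z & mask
    base = y_tilde >> M
    borrow = z_lsb < (y_tilde & mask)        # borrow out of the LSB comparison
    msb = base + 1 if borrow else base       # multiplexer between the two candidates
    return (msb << M) | z_lsb                # z's MSBs are never used

M, WIDTH = 3, 6                              # hypothetical small sizes
for z in range(1 << WIDTH):
    for y_tilde in range(1 << WIDTH):
        assert ecu_fast(z, y_tilde, M) == ecu_direct(z, y_tilde, M)
```

Since `ecu_fast` never reads the upper bits of `z`, the check also reflects the observation that the error-prone bits \(z_{k}^{(MSB)}\) are not needed for the final output.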

The block diagram of the overall PTEC system with the proposed ECU is depicted in Fig. 5. The ECU requires (1) the addition of one to \(({\tilde{y}}_{k}\gg M)\), (2) the comparison of \(z_{k}^{(LSB)}\) and \({\tilde{y}}_{k}^{(LSB)}\), and (3) the selection between \(({\tilde{y}}_{k} \gg M)\) and \(({\tilde{y}}_{k} \gg M) +1\). The comparison operation can be implemented using the \((M-2L+2P)\)-bit subtractor. The carryout of the subtractor is used as a selection bit for the multiplexer to choose between \(({\tilde{y}}_{k} \gg M)\) and \(({\tilde{y}}_{k} \gg M) +1\). Since \(z_{k}^{(MSB)}\) is not used to obtain the final output, further power saving can be achieved by removing circuits regarding computation of the error-prone bits \(z_{k}^{(MSB)}\) from the TPU.

Fig. 5 Block diagram of the PTEC system with minimum delay

Now, let us compute the propagation delay incurred by (18). Note that the addition-by-one operation can be performed in parallel with the target multiplier and therefore does not affect the overall propagation delay of the PTEC system. Next, the subtraction operation can also be performed in parallel, except for the last full adder that produces the carryout bit. The multiplexer takes the carryout bit from the final full adder of the subtractor and performs the selection, so that the propagation delay of the multiplexer is added to the total computational delay for obtaining the final output \({\hat{y}}_{k}^{(MSB)}\). In conclusion, the propagation delay overhead of the ECU is the sum of the propagation delay of one full adder and the delay of the multiplexer, i.e., \(\tau _{\mathrm{FA}}+\tau _{\mathrm{mux}}\), which is a marginal increase in the total propagation delay of the PTEC system.

3.4 General PTEC Framework for Application to Other Arithmetic Units

So far, we have presented the design of the PTEC technique for the example of an \(L\times L\) carry-save multiplier. The PTEC technique can be applied to a variety of computing architectures based on LFML computation. In this subsection, we describe how the PTEC principle can be applied to general LFML-based arithmetic units. Starting from the original supply voltage, we consider reducing the supply voltage in fixed steps (e.g., step size = 0.1 V). For a given supply voltage, we analyze the worst-case path delay of each output bit using hardware simulations. Due to the LFML-based computation, MSBs have higher path delays than LSBs. Hence, by identifying the output bits whose path delays are less than the clock period by an appropriately chosen margin, we can determine the bit-width M of the safe region. Once M is determined, we need to find the smallest precision P of the APU such that the perfect error recovery condition in (5) is satisfied. Specifically, we need to ensure that the gap between the upper bound \(B_h\) and the lower bound \(B_l\) of the residual signal (i.e., the difference between the error-free output of the TPU and the output of the APU) does not exceed \(2^M\). For a given TPU architecture, we can analytically derive \(B_h\) and \(B_l\) as functions of the precision P (e.g., \(B_l = 0, B_h =2^{2L-P+1}\) for the example described in the previous subsections). Then, we find the smallest precision P subject to the constraint \((B_h - B_l) \le 2^{M}\). Once P is determined, we design the ECU, which performs the functional mapping depicted in Fig. 3. When the APU is obtained by reducing the precision of the TPU, the ECU can be implemented using the simple truncation operation in (13). Finally, we estimate the power consumption of the TPU, APU, and ECU blocks and record it for each value of the supply voltage. We repeat the above procedure until the supply voltage cannot be reduced further because no valid value of P satisfying the perfect error recovery condition exists. As a last step, we choose the optimum design for the PTEC-based arithmetic unit by picking the design achieving the minimum power consumption. We summarize the detailed procedure and guidelines for designing the PTEC system below:

  • STEP 1 Starting from the original supply voltage, lower the supply voltage by one step (e.g., step = 0.1 V).

  • STEP 2 Analyze the path delay for each output bit via hardware simulation and find the value of M corresponding to the bit-width of the safe region.

  • STEP 3 Compute the upper bound \(B_h\) and the lower bound \(B_l\) of the residual signal \((y_k - {\tilde{y}}_k)\) by analyzing the quantization effect in the APU.

  • STEP 4 Find the smallest precision P for the APU that satisfies the perfect error recovery condition in (5). If no such APU precision is feasible, go to STEP 7.

  • STEP 5 Design the hardware for ECU which performs the operation of (15).

  • STEP 6 Evaluate the overall power consumption and save it. Go back to STEP 1.

  • STEP 7 Compare the power consumptions obtained for all supply voltages tried and pick the best one.
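The steps above can be sketched as a simple search loop for the unsigned multiplier case. Everything supplied to the loop here is an assumption made for illustration: `safe_width` and `total_power` stand in for the delay profiling and power estimation that the paper performs with hardware simulations, the toy numbers are made up, and infeasibility is taken to mean that (10) gives no P with \(0< P < L\). The additional delay constraint on the APU discussed in Sect. 4 is not modeled.

```python
from typing import Callable, List, Optional, Tuple

Design = Tuple[int, int, int, float]          # (Vdd in mV, M, P, estimated power)

def explore_designs(vdd_nominal_mv: int, step_mv: int, L: int,
                    safe_width: Callable[[int], int],
                    total_power: Callable[[int, int, int], float]) -> Optional[Design]:
    """Walk down the supply voltage and return the lowest-power feasible design."""
    designs: List[Design] = []
    vdd = vdd_nominal_mv - step_mv                          # STEP 1
    while vdd > 0:
        M = safe_width(vdd)                                 # STEP 2
        P = 2 * L - M + 1                                   # STEPs 3-4 via Eq. (10)
        if not 0 < P < L:                                   # infeasible: go to STEP 7
            break
        designs.append((vdd, M, P, total_power(vdd, M, P)))  # STEPs 5-6
        vdd -= step_mv                                      # back to STEP 1
    return min(designs, key=lambda d: d[3]) if designs else None   # STEP 7

# Made-up stand-ins for the simulation results (NOT values from the paper).
toy_safe_width = lambda vdd_mv: {900: 26, 800: 22, 700: 16}[vdd_mv]
toy_power = lambda vdd_mv, M, P: (vdd_mv / 1000) ** 2 + 0.01 * P

best = explore_designs(1000, 100, 16, toy_safe_width, toy_power)
print(best[:3])                                             # (800, 22, 11)
```

For other LFML units, only the line computing `P` would change, since it encodes the bounds \(B_l\) and \(B_h\) of the particular architecture.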

Note that the design procedure described above is not specific to one example and can be extended to general arithmetic units based on LFML computation. Such arithmetic units have larger path delays for the MSBs of the output. Hence, via hardware simulations, we can determine the bit-width M of the safe region whose path delays are less than the clock period. Regardless of what arithmetic unit is used, the perfect error recovery condition in (5) is used to determine the precision P of the APU. For a given structure of the TPU/APU, we only have to conduct an appropriate quantization error analysis to derive the upper bound \(B_h\) and the lower bound \(B_l\). While the actual hardware implementation can be slightly adapted for different architectures, the basic functional mapping of the ECU in (13) is the same for general arithmetic units. In essence, the PTEC design methodology can be applied to various LFML-based arithmetic operations such as the carry-save adder, Wallace tree multiplier [24], and Baugh–Wooley multiplier [3] with slight modification of the quantization analysis for \(B_h\) and \(B_l\) and of the ECU hardware implementation.

4 Results and Discussions

In this section, the \(L \times L\) unsigned carry-save multiplier is designed with the PTM 22-nm CMOS technology library [4]. Throughout this work, it is assumed that the system shares a common \(V_{\mathrm{dd}}\), since using different \(V_{\mathrm{dd}}\) values for the TPU and the APU incurs significant power and area penalties by involving level converters and an extra power regulator. HSPICE simulation is performed to obtain the propagation delay versus the target supply voltage \(V_{\mathrm{dd}}\). First, we design the proposed PTEC system for \(L=16\). The target clock frequency is set to 2.5 GHz considering the critical path delay of the original multiplier at the nominal \(V_{\mathrm{dd}}\) (= 1 V). This includes a \(10\%\) margin to account for the setup time of the flip-flops and for variations due to within-die process variations, jitter, aging, and so on. The minimum grid of \(V_{\mathrm{dd}}\) scaling is set to 100 mV. For each \(V_{\mathrm{dd}}\), the bit-width M of the safe region is calculated using the delay models obtained by HSPICE simulations. For these calculations, the additional computation delay due to the error correction unit \(\tau _{\mathrm{additional}}(=\tau _{\mathrm{FA}}+\tau _{\mathrm{mux}})\) is included. Accounting for the fact that the setup time and the aforementioned variations tend to increase with \(V_{\mathrm{dd}}\) scaling, we set a \(15\%\) delay margin (\(= \tau _{\mathrm{margin}}\)) for lower \(V_{\mathrm{dd}}\). Then, the maximum value of M satisfying

$$\begin{aligned} \tau _{\mathrm{safe}} + \tau _{\mathrm{additional}}+ \tau _{\mathrm{margin}} < T_{c} \end{aligned}$$
(19)

is found, where \(\tau _{\mathrm{safe}}\) is the critical path delay needed to produce the safe region bits and \(T_{c}\) is the clock period. Note that this calculation requires the critical path delay of the APU and of the addition of one in (18) to be less than \(\tau _{\mathrm{safe}}\). If this condition does not hold, the critical path delay would exceed \(\tau _{\mathrm{safe}} + \tau _{\mathrm{additional}}+ \tau _{\mathrm{margin}}\) (and consequently \(T_c\)). Hence, for the successful operation of the proposed PTEC system, one more constraint is added: the propagation delay of the APU must be smaller than \(\tau _{\mathrm{safe}}\). If this constraint is met, the critical path delay of the PTEC system is given by the left-hand side of (19).

Fig. 6 Critical path delay versus supply voltage
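The search for the largest M satisfying (19) can be sketched as follows. The per-bit delays are assumed to come from a delay profile such as the one obtained with circuit simulation; the function name `max_safe_width`, the treatment of the margin as an absolute delay, and all numbers below are hypothetical.

```python
from typing import Sequence

def max_safe_width(tau: Sequence[float], tau_additional: float,
                   tau_margin: float, t_clk: float) -> int:
    """Largest M such that tau_safe + tau_additional + tau_margin < T_c.

    tau[i] is the worst-case delay of output bit i (bit 0 = LSB) and is
    assumed to be nondecreasing in i, as in LFML computation. tau_safe for
    a given M is then tau[M - 1].
    """
    M = 0
    for delay in tau:
        if delay + tau_additional + tau_margin < t_clk:
            M += 1                      # this bit still fits within the clock period
        else:
            break                       # all higher bits are error-prone
    return M

# Hypothetical delay profile in picoseconds for an 8-bit output.
tau_ps = [100, 150, 200, 250, 300, 350, 400, 450]
print(max_safe_width(tau_ps, tau_additional=20, tau_margin=30, t_clk=400))   # 5
```

The separate requirement that the APU delay stays below \(\tau _{\mathrm{safe}}\) would have to be checked in addition to this search.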

Table 1 Bit-width of the safe region and the precision of APU versus supply voltage
Table 2 Number of gates versus supply voltage

Figure 6 shows the simulation results for the critical path delays versus supply voltage \(V_{\mathrm{dd}}\). As shown in Fig. 6, the constraint is satisfied as long as \(V_{\mathrm{dd}}\) is larger than or equal to 800 mV. For supply voltages below 800 mV, however, the propagation delay of the APU becomes significantly larger due to the increased precision, which violates our requirement. This imposes a lower bound on the scaling of \(V_{\mathrm{dd}}\) for the proposed PTEC system. Note that the critical path delay of the TPU does not increase as \(V_{\mathrm{dd}}\) decreases, since the TPU only has to produce the M LSBs in the safe region (see Fig. 5). Since these M bits are those satisfying the timing constraint, the overall delay of the TPU does not exceed the clock period, as shown in Fig. 6. Table 1 shows the estimation results of M and P under this bound. Once M is determined, the precision of the APU, P, is obtained from (10). The number of necessary gates for each \(V_{\mathrm{dd}}\) is summarized in Table 2. Note that while the number of gates for the APU increases for lower \(V_{\mathrm{dd}}\), that for the TPU decreases, since the circuits for calculating the error-prone bits, which are unnecessary, grow with \(V_{\mathrm{dd}}\) scaling. The additional hardware needed for timing error correction can lead to additional loading for the overall system. For instance, as shown in Fig. 5, certain bits of the input signal are fed to both the TPU and the APU, which incurs additional loading. However, when the input drivers are placed close to the inputs of the TPU, the extra loading due to the APU has only a minor impact on the critical path delay of the target unit. This placement makes the interconnection between the input drivers and the input gates of the APU long; we model this interconnection delay in our simulations.

Fig. 7 Time-domain waveforms of a \(y_k\), b \(e_{k}\), c \({\hat{y}}_{k}\), and d \(({\hat{y}}_{k}-y_{k})\) for the proposed PTEC-based \(16\times 16\) carry-save multiplier

The functional behavior of the proposed PTEC system has also been simulated and verified. Using the logic delays obtained from HSPICE simulations, the timing errors for the \(16 \times 16\) unsigned carry-save multiplier have been obtained for 1000 randomly generated test inputs. Figure 7a–d shows the time-domain waveforms of the desired output \(y_{k}\), the timing error \(e_{k}\), the recovered signal \({\hat{y}}_{k}\), and the reconstruction error \(({\hat{y}}_{k}-y_{k})\) at \(V_{\mathrm{dd}} = 900\) mV, respectively. It can be seen from Fig. 7d that the timing error is perfectly removed by the proposed PTEC system.

Fig. 8 Normalized power consumption versus supply voltage

Figure 8 shows the power consumption of the proposed PTEC-based \(16 \times 16\) unsigned carry-save multiplier obtained from HSPICE power simulation. The results are normalized with respect to the power consumed by the conventional multiplier without voltage scaling. It is observed that the power consumption of the TPU decreases rapidly as the supply voltage \(V_{\mathrm{dd}}\) decreases. The decrease is much faster than the theoretical expectation, i.e., power \(\sim {V_{\mathrm{dd}}}^{2}\). This appears to be because active leakage and short-circuit power, which account for a substantial portion of the overall power consumption, are reduced, and because the number of gates in the TPU is reduced as \(V_{\mathrm{dd}}\) is scaled. The power consumption of the APU increases as \(V_{\mathrm{dd}}\) goes down, since the bit-width M of the safe region decreases and, thus, the precision of the APU in (10) gets larger. Nevertheless, remarkable savings in the overall power consumption are achieved by voltage overscaling. As shown in Fig. 8, the proposed design achieves 19 and 38% power savings with \(V_{\mathrm{dd}}=900\) and 800 mV, respectively, compared to the original multiplier without VOS. The ANT scheme in [14] provides marginal energy savings when the output signal-to-noise ratio is more than 20 dB. Moreover, the ANT scheme cannot achieve perfect recovery of the output signal. In contrast, the proposed PTEC system offers perfect recovery of the output sample as well as significant energy savings.
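As a point of reference for the quadratic rule quoted above, and under the idealized assumption of purely dynamic power at a fixed clock frequency with an unchanged gate count (which is not the HSPICE setup of this section), the normalized power at the two scaled voltages would be

$$\begin{aligned} \left( \frac{0.9}{1.0}\right) ^{2} = 0.81, \qquad \left( \frac{0.8}{1.0}\right) ^{2} = 0.64, \end{aligned}$$

that is, savings of 19 and 36%, respectively, before accounting for the gates removed from the TPU or the power added by the APU and the ECU.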

Table 3 Design specifications and power saving of the PTEC scheme for \(L=16\) and 24

Next, we evaluate the performance of the PTEC technique for multipliers with larger bit-widths, i.e., \(L=24\) and 32. For each value of L, the APU is redesigned such that the perfect recovery condition in (10) is satisfied, and the supply voltage is determined to meet the timing constraint in (19). Table 3 summarizes the design specifications of the PTEC technique for each value of L. The power savings achieved by the PTEC scheme are also included in the table. For all cases under consideration, we could not reduce the supply voltage below 800 mV without violating the timing constraint in (19). We observe that the proposed multiplier still maintains significant power savings when the bit-width of the TPU gets larger. Note that the PTEC technique can deliver up to 57% power savings for the \(32\times 32\) unsigned multiplier.

5 Conclusions

In this paper, we have proposed a novel system which can perfectly recover the output sample from the timing errors introduced by voltage overscaling. The inherent characteristics of the timing errors in LSB-first and MSB-last arithmetic units have been exploited to derive the condition for perfect recovery. A \(16 \times 16\) carry-save multiplier has been used to verify the proposed PTEC system. The simulation results show that a significant power saving can be obtained with no reconstruction error when the PTM 22-nm CMOS library is used. Therefore, the proposed PTEC system is expected to be very useful in the design of the ALU of a general-purpose processor or of an embedded processor where energy is limited but error-free arithmetic is imperative.