1 Introduction

Since the seminal work of Kocher et al. [7], Differential Power Analysis (DPA) has been widely regarded as a critical threat to hardware implementations of cryptographic algorithms. The idea behind DPA is that the data processed in a cryptographic device is inevitably correlated with the current drawn from the power source. By analyzing the dependencies hidden in the collected power traces, confidential information, typically the cipher key, can be extracted. FPGAs are among the most widely used platforms for cryptographic implementations, and it is shown in [10, 13] that DPA is a real and important threat to FPGA-based cryptographic implementations. Moreover, the authors of [14] show that registers on an FPGA are an important source of power leakage, and that register activity can be easily distinguished because the enable signals of the registers are managed by a control part. This makes the register the most common DPA attack target on FPGAs.

To defeat DPA, many countermeasures need to be applied at the low-level logic layers, i.e., the gate level or the layout level, because many significant power leakages originate at the physical level rather than at the higher algorithmic level. Dual-Rail Precharge Logic (DPL) is one of the typical low-level protection methods. To compensate for data-dependent power leakage, two parallel rails, the true rail and the false rail, alternate periodically between an evaluation phase and a precharge phase. However, one of the most significant drawbacks of DPL schemes is the degradation of system throughput, since the precharge phase, during which the circuit carries an invalid value, commonly occupies a large portion of the computation time. For instance, the calculation speed of WDDL (a typical DPL scheme) [18] falls to 50 % of that of the unprotected implementation. In fact, all DPL methods suffer from such a reduction in computational performance. Thus, these countermeasures are not suitable for high-performance environments (e.g., base stations in modern encryption-supported mobile communication systems).

Two possible ways to solve this problem are to shorten the precharge phase or to eliminate it. In [9], a non-regular clock is applied to BCDL in order to shorten the precharge phase. Unfortunately, the non-regular clock is more complex than a conventional clock scheme, and the speed of BCDL still falls to 75 % of that of the unprotected implementation. Unlike the DPL schemes, a Dual-Rail scheme named HDRL [16] removes the precharge phase by connecting the ground voltage current (i.e., VSS) of the complementary gate to the supply voltage current (i.e., VDD) of the original gate. However, HDRL is not applicable to FPGAs, since this VSS and VDD connection cannot be properly implemented on Look-Up-Tables (LUTs).

Our Contributions. In this paper, we follow the same general idea of eliminating the precharge phase and adopt the strategy of trading space for time in order to improve the performance of our DPA countermeasure. We propose a new DPA-hardened approach for FPGA-based implementations, namely the Quadruple-Rail Logic (QRL) architecture, which maintains a performance approximately equal to that of the unprotected system without sacrificing resistance against DPA attacks. To verify the robustness of QRL against DPA, we perform experiments on a QRL-enabled AES implementation on an FPGA. The experimental results show that the DPA robustness of QRL is as good as that of WDDL, a typical DPL scheme, and is at least 110 times stronger than that of the unprotected AES implementation.

Organization. In Sect. 2, we briefly review related work on existing DPL solutions. Section 3 discusses the reason for the performance degradation and introduces the details of our new QRL structure. In Sect. 4, we report experiments that demonstrate the strength of QRL against DPA. Section 5 concludes the paper.

2 Related Work

DPL is one of the most typical logic-level countermeasures against DPA. The operation of DPL is defined by two main concepts. First, DPL has two parallel rails, the dual rail, which consists of the original (True/T) rail and the complementary (False/F) rail as a counterpart of the T rail. Signals in DPL are therefore represented by a pair of values generated by the T rail and the F rail respectively. Second, the dual rails work simultaneously with complementary behavior. More precisely, the T rail and the F rail alternate periodically between the evaluation phase and the precharge phase. In each evaluation period, the value \(a_{(T)}\) in the T rail and the value \(a_{(F)}\) in the F rail complement each other (i.e., \([a_{(T)}:a_{(F)}]\) is always in the state [1 : 0] or [0 : 1]). In each precharge period, the pair of values \([a_{(T)}:a_{(F)}]\) is reset to a fixed state (typically [0 : 0]). As a consequence, if DPL is properly realized, the circuit theoretically consumes constant power regardless of the processed data. The theoretical foundations for the security of dual-rail logic and some efficient techniques for building such logic circuits are introduced in [6].

Many architectures have been introduced to achieve a secure and low-cost DPL realization. Wave Dynamic Differential Logic (WDDL) was proposed in [18], where a logic wave of ‘0’ values is propagated through all the gates of the combinatorial logic chain. Several techniques have been proposed to optimize and improve DPL. Some focus on the problem of identical routing, such as MDPL (Mask DPL) [11], DWDDL (Double WDDL) [19], and routing repair methods [4]. Others address the Early Propagation Effect (EPE) [15]. For instance, DRSL [2], STTL [12], BCDL [9], DPL-noEE [1], and PADPL [3] take different approaches to overcome the EPE problem.

Owing to the precharge mechanism, one of the most significant drawbacks of DPL schemes is a substantial decrease in performance. In [9], the authors show that all of the schemes they compare (WDDL, MDPL, STTL, DRSL, Seclib, IWDDL and basic BCDL) provide a calculation speed below 50 %. The performance of PADPL is also less than half of that of the unprotected implementation. In [3] and [4], two PADPL instances provide an evaluation time of 25 % and 41.7 % of each clock cycle, respectively.

To improve the performance of DPL schemes, two solutions are usually employed: the first is to shorten the precharge phase and the second is to eliminate it. The first idea can be realized with a non-regular clock, which gives the precharge phase a shorter clock cycle. In [9], the accelerated BCDL achieves a speed-up of 1.3–1.5 times over the basic BCDL. More precisely, the maximum frequency of the accelerated BCDL is 50.64 MHz, about 70 % of that of the unprotected implementation (71.88 MHz). Unlike the DPL architectures, a Dual-Rail scheme called HDRL [16] improves the calculation speed by following the second idea. HDRL is designed on the hypothesis that the VSS current drawn by a gate is indistinguishable for different inputs. Each complementary pair consists of two identical gates, where the VDD of the gate in the F rail is connected to the VSS of the gate in the T rail. With this scheme the precharge phase is unnecessary, and HDRL can achieve a calculation speed approximately equal to that of the unprotected system. HDRL is the first approach to eliminate the precharge phase, although the source current drawn by the circuit is not considered in HDRL, and the result is based on simulation tools rather than practical experiments [8].

However, compared with the unprotected implementation, both ideas have drawbacks. Although the non-regular clock shortens the precharge time, the performance degradation still exists. Another drawback of the non-regular clock is that it may be limited by the clock system of the ASIC or FPGA. Furthermore, even though HDRL is a good method to remove the precharge phase, it is not suitable for the FPGA platform, where the VSS and VDD of a LUT are restricted. Therefore, how to increase the performance of DPL-like approaches remains an open problem, and it is still the main bottleneck in applying these countermeasures in high-performance environments, especially on FPGAs.

3 Quadruple-Rail Logic

Although HDRL is not suitable for the FPGA platform, it does suggest that a DPL structure without the precharge mechanism is a promising approach to achieving higher performance. Based on this general idea, we propose a logic style named Quadruple-Rail Logic (QRL), which follows the strategy of trading space for time to substantially improve performance while still maintaining strong DPA resistance.

3.1 The Low Performance and Precharge Mechanism

Although the DPL schemes mentioned in Sect. 2 provide resistance against generic side-channel analyses, they do not provide satisfactory performance compared with the unprotected implementation. For instance, in the case of WDDL, suppose the signal sequence of \(a_{(T)}\) is “1,0,0,1,1”; the corresponding time sequence is shown in Fig. 1. WDDL takes 10 clocks to complete this operation, while only 5 clocks are required in the unprotected implementation. The main cause of this loss of efficiency is the typical two-phase protocol (precharge and evaluation). In most cases, the precharge phase has roughly the same duration as the evaluation phase, because it must last long enough for the signal ‘0’ to propagate through the longest combinatorial logic chain.

Fig. 1. The time sequence of WDDL

However, the precharge mechanism plays an important role in DPL: a precharge phase is inserted between every two adjacent evaluation phases to forcibly reset the whole system, except for the values stored in the precharge state, as shown in Fig. 1. Let CntS denote the number of bit switches between two adjacent states, and let Cnt1 denote the number of bits equal to ‘1’ in each state. All values of CntS are 1 in both phases, while Cnt1 is 1 in each evaluation phase and 0 in each precharge phase. This ensures that each complementary signal pair \([a_{(T)}: a_{(F)}]\) has a constant number of ‘1’ bits within each phase and exactly one bit switch per phase. The scheme thus effectively mitigates variations in power leakage, but the precharge phase occupies half of each cycle, leaving only half of the duty cycle for evaluation. In short, DPL provides two kinds of complement in the circuit: the value complement, achieved through the dual-rail mechanism, and the switch complement, achieved through the precharge mechanism.
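The counting argument above can be checked with a minimal Python sketch. It assumes that each WDDL clock pair consists of a precharge state [0 : 0] followed by an evaluation state, and that CntS is counted between consecutive states of the sequence; the function names `wddl_states`, `cnt1` and `cnts` are illustrative and do not come from the paper.

```python
def wddl_states(t_bits):
    """Interleave precharge and evaluation states for one WDDL signal pair.

    Assumption: every value is preceded by a precharge state [0 : 0].
    """
    states = []
    for t in t_bits:
        states.append((0, 0))        # precharge phase
        states.append((t, 1 - t))    # evaluation phase: [a_T : a_F]
    return states


def cnt1(states):
    """Number of bits equal to 1 in each state."""
    return [sum(s) for s in states]


def cnts(states):
    """Number of bit switches between consecutive states."""
    return [sum(x != y for x, y in zip(a, b))
            for a, b in zip(states, states[1:])]


seq = wddl_states([1, 0, 0, 1, 1])
print(len(seq))    # number of clocks needed by WDDL
print(cnt1(seq))   # alternates between precharge and evaluation values
print(cnts(seq))   # one switch per phase transition
```

For the five-value sequence used in Fig. 1, the sketch produces ten states, which reflects the halved throughput discussed in this section.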

3.2 QRL Architecture

As mentioned above, a DPL circuit is based on two principles, the value complement and the switch complement, and sacrifices performance because of the precharge phase. To improve performance, the precharge duty cycle must be decreased. Compared with shortening the precharge phase, removing it is the more thorough solution. Following this solution, we design a new scheme that retains the two principles of the DPL circuit. The new scheme should therefore follow three principles:

  • Value Complement. In every state, exactly half of the bits of the complementary signals are ‘1’.

  • Switch Complement. Between any two adjacent states, exactly half of the bits of the complementary signals switch. In other words, the new scheme ensures that exactly half of the bits switch in each clock cycle.

  • No precharge. The precharge phase is fully removed.

In QRL, a quadruple-rail network which consists of four rails replaces the single rail in the unprotected implementation. The four rails in QRL are defined as follows:

  • Original (True/T) rail: the value \(a_{(T)}\) in the T rail is the original value in the unprotected single rail circuit.

  • Complementary (False/F) rail: the value \(a_{(F)}\) in the F rail is the complementary value of \(a_{(T)}\) as the case of DPL, i.e., \(a_{(F)}=\overline{a_{(T)}}\).

  • Switch complementary (SC) rail: the value \(a_{(SC)}\) in the SC rail is generated from \(a_{(T)}\) and the last state of \(a_{(T)}\). When \(a_{(T)}\) switches between two adjacent states, \(a_{(SC)}\) keeps its state; otherwise, \(a_{(SC)}\) switches. In short, the switch complementary rail ensures that the signal pair \([a_{(T)}:a_{(SC)}]\) has one and only one switch in each clock cycle. Let \(a^L_{(\upsilon )}\) and \(a^P_{(\upsilon )}\) denote the last and the current state of the value \(a_{(\upsilon )}\). The value \(a^P_{(SC)}\) is described by:

    $$\begin{aligned} a^L_{(T)} \oplus a^P_{(T)}= & {} \overline{a^L_{(SC)} \oplus a^P_{(SC)}} \nonumber \\ a^P_{(SC)}= & {} a^L_{(T)} \oplus a^P_{(T)} \oplus \overline{a^L_{(SC)}}. \end{aligned}$$
    (1)
  • Double complementary (DC) rail: the value \(a_{(DC)}\) in the DC rail is the complementary value of \(a_{(SC)}\). The DC rail is not only the value complement of the SC rail, but also ensures that the signal pair \([a_{(F)}:a_{(DC)}]\) has one and only one bit switch in each clock cycle. Similarly, the value \(a^P_{(DC)}\) is described by:

    $$\begin{aligned} a^P_{(DC)}= & {} a^L_{(F)} \oplus a^P_{(F)} \oplus \overline{a^L_{(DC)}}. \end{aligned}$$
    (2)

We use the consecutive switching states of QRL over five clock cycles as an example to further clarify QRL. Suppose all signals are reset before evaluation (in this instance, to ‘0’ for \(a_{(T)}\) and \(a_{(SC)}\), and to ‘1’ for \(a_{(F)}\) and \(a_{(DC)}\)), and the state sequence of \(a_{(T)}\) is “1,0,0,1,1”. By the definition of each rail, the state sequences of \(a_{(F)}\), \(a_{(SC)}\), and \(a_{(DC)}\) are “0,1,1,0,0”, “0,0,1,1,0”, and “1,1,0,0,1” respectively, as shown in Fig. 2. Let CntS denote the number of bit switches between two adjacent states, and let Cnt1 denote the number of bits equal to ‘1’ in each state. All values of CntS and Cnt1 are 2. This means that the complementary signal quartet \([a_{(T)}: a_{(F)}:a_{(SC)}:a_{(DC)}]\) always has exactly two bits equal to ‘1’ and exactly two bits switching in each clock cycle. Therefore, QRL meets the three principles and improves the computational performance.

Fig. 2. The time sequence of QRL
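The five-cycle example can be reproduced with a short Python sketch that applies Eqs. 1 and 2 directly. It is only an illustration of the rail definitions: the function names `qrl_trace` and `counts` and the default reset values are assumptions chosen to match the example above, not part of the QRL implementation.

```python
def qrl_trace(t_bits, t0=0, sc0=0):
    """Return the per-cycle states (T, F, SC, DC) of one QRL signal quartet.

    t0 and sc0 are the reset values of the T and SC rails; the F and DC
    rails are reset to their complements.
    """
    t_last, sc_last = t0, sc0
    f_last, dc_last = 1 - t0, 1 - sc0
    trace = []
    for t in t_bits:
        f = 1 - t                              # F rail: complement of T
        sc = t_last ^ t ^ (1 - sc_last)        # Eq. 1
        dc = f_last ^ f ^ (1 - dc_last)        # Eq. 2
        trace.append((t, f, sc, dc))
        t_last, f_last, sc_last, dc_last = t, f, sc, dc
    return trace


def counts(trace, reset):
    """Return (CntS, Cnt1) lists, counting switches from the reset state on."""
    cnt_s, cnt_1, prev = [], [], reset
    for state in trace:
        cnt_s.append(sum(x != y for x, y in zip(prev, state)))
        cnt_1.append(sum(state))
        prev = state
    return cnt_s, cnt_1


trace = qrl_trace([1, 0, 0, 1, 1])
print([s[2] for s in trace])            # SC rail sequence
print([s[3] for s in trace])            # DC rail sequence
print(counts(trace, (0, 1, 0, 1)))      # CntS and Cnt1 per clock cycle
```

With the reset state [0 : 1 : 0 : 1], the sketch yields the SC and DC sequences listed above, and every entry of CntS and Cnt1 equals 2.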

Based on the aforementioned theoretical elaboration, we discuss the practical implementation of QRL. Firstly, we show how to implement the basic component, namely, the QRL-enabled compound register system. Then we explain how to complete the whole QRL system.

3.3 QRL Register Exemplar

As mentioned above, QRL employs a quadruple-rail network with four rails instead of the single rail of the unprotected implementation. Thus, a compound register system consisting of four registers replaces the original single register. Let T-reg, F-reg, SC-reg, and DC-reg be the four standard registers of the compound register system. Let \(u_{in}\) and \(u_{out}\) denote the input and output of u-reg respectively. According to Eqs. 1 and 2, the relationship between these registers is as follows:

$$\begin{aligned} SC_{out}= & {} T_{in} \oplus T_{out} \oplus \overline{SC_{in}}\nonumber \\= & {} T_{in} \oplus T_{out} \oplus DC_{in}. \end{aligned}$$
(3)
$$\begin{aligned} DC_{out}= & {} F_{in} \oplus F_{out} \oplus \overline{DC_{in}}\nonumber \\= & {} F_{in} \oplus F_{out} \oplus SC_{in}. \end{aligned}$$
(4)

The compound register system follows the three principles, and QRL is capable of protecting a single standard register by a compound register system, as shown in Fig. 3.

Fig. 3. The compound register system used in QRL

3.4 Generalized QRL

Although a register can easily be protected by QRL, the same protection cannot be applied to the LUT, the basic calculation component of an FPGA, mainly because the last state of a LUT cannot be stored. As a result, it is impossible to make the LUTs on the SC and DC rails meet the Switch Complement principle in this way. Therefore, a different strategy for implementing QRL on LUTs must be developed.

Considering the instance of QRL sequence ‘1,0,0,1,1’ as shown in Fig. 2, there is an interesting fact that \(a_{(SC)}\) is equal to \(a_{(T)}\) in the second and the fourth clock cycles, and it is equal to \(a_{(F)}\) in the first, the third, and the fifth clock cycles. Since we have \(a^L_{(F)}= \overline{a^L_{(T)}}\), and according to Eq. 3, \(a^P_{(SC)}\) can be described as follows:

$$\begin{aligned} a^P_{(SC)} = \overline{a^L_{(SC)}} \oplus a^L_{(T)} \oplus a^P_{(T)} = \left\{ \begin{array}{ll} \overline{a^P_{(T)}}, &{} \text{when } a^L_{(SC)} = a^L_{(T)},\\ a^P_{(T)}, &{} \text{when } a^L_{(SC)} = a^L_{(F)}. \end{array}\right. \end{aligned}$$
(5)
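The case split in Eq. 5 can be verified exhaustively. The following Python sketch enumerates all bit combinations and also reports, for a given sequence, which rail the SC value coincides with in each cycle. The helper names `sc_next`, `check_eq5` and `rail_followed` are illustrative only.

```python
from itertools import product


def sc_next(t_last, t_now, sc_last):
    """Current SC value from the last SC value and the T transition."""
    return (1 - sc_last) ^ t_last ^ t_now


def check_eq5():
    """Check both branches of Eq. 5 for every combination of T values."""
    for t_last, t_now in product((0, 1), repeat=2):
        f_last = 1 - t_last
        # Last SC equal to last T: current SC is the complement of current T.
        assert sc_next(t_last, t_now, t_last) == 1 - t_now
        # Last SC equal to last F: current SC equals current T.
        assert sc_next(t_last, t_now, f_last) == t_now
    return True


def rail_followed(t_bits, t0, sc0):
    """Label each cycle with the rail ('T' or 'F') whose value SC matches."""
    labels, t_last, sc_last = [], t0, sc0
    for t in t_bits:
        sc = sc_next(t_last, t, sc_last)
        labels.append("T" if sc == t else "F")
        t_last, sc_last = t, sc
    return labels


print(check_eq5())
print(rail_followed([1, 0, 0, 1, 1], t0=0, sc0=0))
print(rail_followed([1, 0, 0, 1, 1], t0=0, sc0=1))
```

The two calls to `rail_followed` differ only in the reset value of the SC rail and show that the alternation pattern is fixed by that reset value, independently of the data.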

Therefore, \(a_{(SC)}\) is determined by the reset value of QRL. Two cases need to be discussed about the reset value:

  • The complementary signal quartet \([a_{(T)}:a_{(F)}:a_{(SC)}:a_{(DC)}]\) is reset such that \(a_{(SC)}=a_{(T)}\), \(a_{(F)}=\overline{a_{(T)}}\), and \(a_{(DC)}=\overline{a_{(SC)}}\). According to Eq. 5, in this situation \(a_{(SC)}=a_{(T)}\) in each even clock cycle, and \(a_{(SC)}=a_{(F)}\) in each odd clock cycle. The relationship for \(a_{(DC)}\) is analogous.

  • The complementary signal quartet \([a_{(T)}:a_{(F)}:a_{(SC)}:a_{(DC)}]\) is reset such that \(a_{(SC)}=\overline{a_{(T)}}\), \(a_{(F)}=\overline{a_{(T)}}\), and \(a_{(DC)}=\overline{a_{(SC)}}\). As in the first case, Eq. 5 implies that \(a_{(SC)}=a_{(F)}\) in each even clock cycle, and \(a_{(SC)}=a_{(T)}\) in each odd clock cycle. The relationship for \(a_{(DC)}\) is analogous.

For the sake of simplicity, the reset value of QRL follows the second case. Thus, the functions of the LUTs on the SC and DC rails can be established with a control signal Par, which indicates the parity of the clock-cycle count. Let \(f_{\upsilon }()\) denote the function of the LUT on the \(\upsilon \) rail, and let \(a^i_{(\upsilon )}\) denote the i-th input of that LUT. The i-input functions of the LUTs on the SC and DC rails are defined as follows:

$$\begin{aligned} f_{SC}(a^1_{(SC)},a^2_{(SC)},\cdots ,a^i_{(SC)})=&f_{T}(a^1_{(T)},a^2_{(T)},\cdots ,a^i_{(T)})*Par +\nonumber \\&f_{F}(a^1_{(F)},a^2_{(F)},\cdots ,a^i_{(F)})*\overline{Par}. \end{aligned}$$
(6)
$$\begin{aligned} f_{DC}(a^1_{(DC)},a^2_{(DC)},\cdots ,a^i_{(DC)})=&f_{F}(a^1_{(F)},a^2_{(F)},\cdots ,a^i_{(F)})*Par +\nonumber \\&f_{T}(a^1_{(T)},a^2_{(T)},\cdots ,a^i_{(T)})*\overline{Par}. \end{aligned}$$
(7)

where \(*\) and \(+\) denote the bitwise AND and bitwise OR operators, respectively.

However, the inputs \(a^i_{(T)}\) and \(a^i_{(F)}\) cannot appear in the function \(f_{SC}()\) and \(f_{DC}()\), due to the relationship between the complementary signal quartet \([a_{(T)}: a_{(F)}:a_{(SC)}:a_{(DC)}]\) and the feature of LUT. Thus, Eqs. 6 and 7 are replaced by:

$$\begin{aligned} f_{SC}(a^1_{(SC)},a^2_{(SC)},\cdots ,a^i_{(SC)})=&f_{T}(a^1_{(SC)},a^2_{(SC)},\cdots ,a^i_{(SC)})*Par +\nonumber \\&f_{F}(a^1_{(SC)},a^2_{(SC)},\cdots ,a^i_{(SC)})*\overline{Par}. \end{aligned}$$
(8)
$$\begin{aligned} f_{DC}(a^1_{(DC)},a^2_{(DC)},\cdots ,a^i_{(DC)})=&f_{F}(a^1_{(DC)},a^2_{(DC)},\cdots ,a^i_{(DC)})*Par +\nonumber \\&f_{T}(a^1_{(DC)},a^2_{(DC)},\cdots ,a^i_{(DC)})*\overline{Par}. \end{aligned}$$
(9)
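Eqs. 8 and 9 describe a LUT that selects between the T-rail and F-rail functions under control of Par. The following Python sketch illustrates this for a single hypothetical 2-input gate. It assumes that the F-rail function is the complement of the T-rail function applied to complemented inputs, as is usual in dual-rail logic, and that Par = 1 marks the cycles in which the SC rail carries the T-rail values. Both conventions, and all function names, are assumptions made for illustration.

```python
from itertools import product


def f_t(a, b):
    """Hypothetical T-rail LUT function: a 2-input AND."""
    return a & b


def f_f(a, b):
    """Assumed F-rail function: complement of f_t on complemented inputs."""
    return 1 - f_t(1 - a, 1 - b)


def f_sc(a, b, par):
    """SC-rail LUT with Par as an extra input (Eq. 8)."""
    return (f_t(a, b) & par) | (f_f(a, b) & (1 - par))


def f_dc(a, b, par):
    """DC-rail LUT with Par as an extra input (Eq. 9)."""
    return (f_f(a, b) & par) | (f_t(a, b) & (1 - par))


def check_rails():
    """Check that the SC and DC outputs follow the T or F output per parity."""
    for a, b in product((0, 1), repeat=2):
        t_out = f_t(a, b)
        f_out = f_f(1 - a, 1 - b)
        assert f_out == 1 - t_out
        # Par = 1: SC inputs carry T values, DC inputs carry F values.
        assert f_sc(a, b, 1) == t_out
        assert f_dc(1 - a, 1 - b, 1) == f_out
        # Par = 0: SC inputs carry F values, DC inputs carry T values.
        assert f_sc(1 - a, 1 - b, 0) == f_out
        assert f_dc(a, b, 0) == t_out
    return True


print(check_rails())
```

Under these assumptions the SC and DC outputs stay complementary in every cycle, so the quartet relationship is preserved across a LUT stage without storing any previous state.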

To ensure all LUTs on all rails have the same number of inputs, LUTs on T and F rails must have an additional control signal En which indicates the state of the whole system (in work state or not). The i-input functions of LUTs on T and F rails are defined by:

$$\begin{aligned} f_{T}(a^1_{(T)},a^2_{(T)},\cdots ,a^i_{(T)})=&f_{T}(a^1_{(T)},a^2_{(T)},\cdots ,a^i_{(T)})*En +\nonumber \\&f_{F}(a^1_{(T)},a^2_{(T)},\cdots ,a^i_{(T)})*\overline{En}. \end{aligned}$$
(10)
$$\begin{aligned} f_{F}(a^1_{(F)},a^2_{(F)},\cdots ,a^i_{(F)})=&f_{F}(a^1_{(F)},a^2_{(F)},\cdots ,a^i_{(F)})*En +\nonumber \\&f_{T}(a^1_{(F)},a^2_{(F)},\cdots ,a^i_{(F)})*\overline{En}. \end{aligned}$$
(11)

When the whole system is in the work state, i.e. \(En=1\), the functions of the LUTs on the T and F rails remain unchanged and all rails work as expected. When the whole system is not in the work state, i.e. \(En=0\), no sensitive data is present in the system, so the control signal has no effect on the security of the whole system. As a result, the control signal is embedded inside the LUT equations by using an extra LUT input. In other words, a 6-input LUT in QRL provides the task logic of a 5-input LUT.

Taking into account that each rail consists of registers and LUTs, the compound register system should be replaced with four standard registers. The inputs of the registers are generated by the LUTs on each rail, and the outputs of the registers become the inputs of the subsequent LUTs on the corresponding rails. Therefore, the generalized QRL scheme consists of two parts. The first part is the compound register system, which stores the input data (e.g., plaintext and key) from the top module in order to generate the complementary signal quartet \([a_{(T)}: a_{(F)}:a_{(SC)}:a_{(DC)}]\). The second part is the four separate rails, which consist of registers and LUTs and complete the compensation of the quadruple-rail network. More precisely, the generalized QRL scheme can be realized by the second part alone when the input data from the top module already meets the requirement of the complementary signal quartet.

Moreover, although QRL is designed directly for FPGA scenarios, it can potentially be applied to non-FPGA platforms. The LUT, which is the characteristic component of an FPGA, can be implemented with standard CMOS cells. For instance, an ASIC platform, which offers designers more freedom than an FPGA, can realize a QRL circuit with standard CMOS cells. Consequently, QRL is not only suitable for FPGAs, but also portable to other platforms.

4 Implementation and Security Evaluation

In order to evaluate the robustness, area cost and performance of QRL, we implemented AES-128 in three different structures, namely an unprotected AES-128, a WDDL AES-128 and a QRL AES-128, on a Xilinx Virtex-5 XC5VLX50 FPGA with a clock frequency of 2 MHz.

4.1 Implementation

To implement QRL, we used a strategy based on an automated “copy-paste” and “conflict-repair” method as mentioned in [5, 17], which reduces the design complexity and provides more balanced routing on FPGAs. Firstly, we split the initial AES-128 design into two functional modules: Cont, which supplies the clock signal, controls I/O and serves as the top package, and Enc, which performs the AES algorithm. Only the security-sensitive Enc needs to be transformed into QRL style. Then, the original rail in Enc is generated with 5-input LUTs, and the other three rails are created by the “copy-paste” execution. In this case, it would hardly be possible to repair the routing conflicts between four rails with an interleaved placement, so we adopt a separate placement for each rail in QRL. Next, we insert the control signals Par and En into the corresponding LUTs and recode the Boolean functions of these LUTs. Afterwards, all wires are routed and repaired by the “re-route” execution in the FPGA editor. Finally, the netlist is exported to the FPGA editor to generate the bitstream.

4.2 Security Evaluation and Attack Results

Correlation Power Analysis (CPA), based on the Pearson correlation coefficient, is applied in the security analysis to evaluate the security level of QRL against differential power attacks. To allow a fair comparison, we launch the same kind of attack on an unprotected implementation and on a WDDL implementation. The power traces are captured with a LeCroy WaveRunner 610Zi digital oscilloscope at a sampling rate of 2.5 GS/s. The attack point is the set of registers that store the S-box outputs of each computation round; the transitions of those registers leak information about the AES sub-keys. The Hamming distance model is used in our experiments because of the power consumption property of registers on the Virtex-5 FPGA.
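The principle of a CPA attack with a Hamming distance model can be illustrated with a small, self-contained Python sketch. It is a toy simulation, not the measurement setup used in this paper: it uses the 4-bit PRESENT S-box as a stand-in for the AES S-box, synthetic single-sample traces with Gaussian noise, and a known random previous register value. The names `simulate`, `cpa` and `rank_of`, the key value and all parameters are hypothetical.

```python
import random

# 4-bit PRESENT S-box, used here only as a small stand-in for the AES S-box.
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hd(x, y):
    """Hamming distance between two small integers."""
    return bin(x ^ y).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def simulate(key, n, sigma, rng):
    """Synthetic traces: register switches from `prev` to SBOX4[p ^ key]."""
    traces = []
    for _ in range(n):
        p, prev = rng.randrange(16), rng.randrange(16)
        leak = hd(prev, SBOX4[p ^ key]) + rng.gauss(0.0, sigma)
        traces.append((p, prev, leak))
    return traces


def cpa(traces):
    """Correlation between the measured leakage and each key hypothesis."""
    leaks = [leak for _, _, leak in traces]
    scores = []
    for guess in range(16):
        model = [hd(prev, SBOX4[p ^ guess]) for p, prev, _ in traces]
        scores.append(pearson(model, leaks))
    return scores


def rank_of(key, scores):
    """Rank of the right key among all guesses (1 means revealed)."""
    order = sorted(range(16), key=lambda g: scores[g], reverse=True)
    return order.index(key) + 1


rng = random.Random(2024)
scores = cpa(simulate(key=0xA, n=1000, sigma=0.5, rng=rng))
print(rank_of(0xA, scores))
```

In such a simulation of an unprotected register the right key reaches rank 1 once enough traces are collected. A countermeasure is effective to the extent that it keeps the right key from reaching rank 1 for the available number of traces.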

The experimental results show that, for the unprotected implementation, the right hypotheses of all AES sub-keys of the last round can be distinguished from the wrong hypotheses by analyzing fewer than 900 traces in the CPA attack. In contrast, we are unable to reveal even one of these sub-keys for either the WDDL or the QRL AES when the number of power traces reaches 100,000. To illustrate the result, we take the first sub-key as an example. The evolution of the correlation coefficient of each hypothesis with the number of power traces in the unprotected case is shown in Fig. 4, and the number of measurements to disclosure (MTD) is 837. In the cases of WDDL and QRL, the right key is still not revealed when the number of traces increases to 100,000, as seen in Figs. 5 and 6. More precisely, the rank of the right key in each structure is shown in Table 1.

Fig. 4. Tendency of experimental attacks to unprotected structure

Fig. 5. Tendency of experimental attacks to WDDL

Fig. 6. Tendency of experimental attacks to QRL

Table 1. Rank of right key in unprotected, WDDL, QRL structures.

According to Table 1, both WDDL and QRL show a tendency that the right keys may be revealed by significantly increasing the number of power traces, although the right keys do not yield the highest correlation coefficient. Their ranks in the list of all key guesses are small, and the right keys in WDDL and QRL can be revealed by an analysis based on the \(3^{rd}\)-order and the \(12^{th}\)-order success rate, respectively. The primary cause is the unbalanced routing repair produced by commercial FPGA design tools, which is more severe for QRL because QRL has to keep the routing of a whole signal quartet balanced. However, even though QRL suffers from more serious unbalanced routing, which may lead to glitches, the rank of the right key in QRL is larger than in WDDL, which suggests that the security of QRL is slightly better than that of WDDL. Consequently, QRL increases the robustness against DPA by a factor of at least 110 compared with the unprotected implementation, and its DPA resistance level is comparable to that of WDDL.
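The factor of 110 is consistent with the trace counts reported in this section, assuming it is read as the ratio between the number of traces that fail against QRL and the number that suffice against the unprotected implementation:

$$\begin{aligned} \frac{100{,}000}{900} \approx 111 \ge 110, \qquad \frac{100{,}000}{837} \approx 119. \end{aligned}$$

Since the right key of QRL is still not revealed at 100,000 traces, both ratios are lower bounds on the actual gain.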

4.3 Cost and Performance Evaluation

We estimate several indicators, in terms of register occupation, LUT occupation, duty cycle of evaluation phase and the maximum throughput for each design. The results are illustrated in Table 2.

Table 2. Cost and performance of AES in unprotected, WDDL, QRL structures.

Compared with the unprotected structure, the register occupation of both QRL and WDDL increases by 768. For QRL, the increase comes from the registers that store the 128-bit round key and the 128-bit intermediate state of each round in the F, SC, and DC rails. For WDDL, the increase is due to the “Master-Slave” register system that is commonly used in DPL architectures. The total number of occupied registers in both QRL and WDDL is less than 4 times that of the unprotected implementation, because some control components, such as the control registers in Cont, do not require transformation. In terms of LUTs, both the quadruple-rail network with its four complementary rails and the restriction to 5 task inputs per LUT (the last input is used by Par or En) contribute to the increased area of QRL. However, the results show that the LUT occupation of QRL rises to only 105 % of that of WDDL (5428 LUTs for QRL compared with 5153 LUTs for WDDL). This is because WDDL is restricted to a limited set of gates, namely AND and OR logic. More precisely, we can constrain 2 compound 2-input gates to one SLICE, since there are 4 LUTs in each Virtex-5 SLICE. According to these results, the cost of QRL is slightly higher than that of WDDL.

Next, we discuss the performance of each design. Owing to the removal of the precharge phase, the duty cycle of the evaluation phase in QRL is equal to that of the unprotected implementation and twice that of WDDL (100 % in QRL compared with 50 % in WDDL). Thus, the maximum throughput of QRL is theoretically equal to that of the unprotected implementation, and it reaches 94 % of it in practical evaluations (1.78 Gbps for QRL compared with 1.89 Gbps for the unprotected case), which is much higher than that of WDDL (0.91 Gbps). Moreover, to further evaluate the performance, we compare QRL with the accelerated BCDL, which has the highest performance among the existing DPL schemes for FPGAs. According to [9], the duty cycle of the evaluation phase and the maximum throughput of the accelerated BCDL reach 75 % and 53 % of those of the unprotected implementation, respectively. QRL therefore has a higher performance than the accelerated BCDL. The result implies that eliminating the precharge phase is a more efficient way of improving performance than shortening it.
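For reference, the relative throughput figures follow directly from the measured values quoted above:

$$\begin{aligned} \frac{1.78}{1.89} \approx 0.94, \qquad \frac{0.91}{1.89} \approx 0.48, \qquad \frac{1.78}{0.91} \approx 1.96. \end{aligned}$$

That is, QRL retains about 94 % of the unprotected throughput, WDDL retains about 48 %, and QRL is roughly twice as fast as WDDL in this implementation.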

5 Conclusion

In this paper, we proposed QRL as a new countermeasure against DPA on FPGAs. Based on the strategy of trading space for time, the quadruple-rail network allows us to remove the precharge phase of previous DPL architectures and thereby achieve high performance and high DPA resistance at the same time. Owing to the elimination of the precharge phase, QRL provides a performance close to that of the unprotected system (94 % of its throughput in our evaluation), which is roughly 2 times higher than that of other DPL-based countermeasures. At the same time, the high resistance against DPA is not sacrificed. As shown in our experiments, a QRL implementation of AES-128 on an FPGA offers a DPA resistance level comparable to that of WDDL and at least 110 times stronger than that of the unprotected implementation. Thus, QRL is suitable for high-performance scenarios in which enhanced security is also desired. Techniques to minimize glitches and the unbalanced routing of the signal quartet among the rails of QRL will be the focus of future work.