1 Introduction

In recent years, with the rapid development of the Internet of Things (IoT), all kinds of IoT devices have flooded the market and greatly changed people’s lifestyles. Meanwhile, as market demand changes, the functions of IoT systems are becoming more powerful, complex and personalized. The All Programmable SoC, which combines ARM with FPGA, creates new possibilities for IoT systems, giving system architects and ARM developers a flexible platform to satisfy individual customer demands [1]. The proliferation of IoT devices brings comfort and convenience, but it also means that more sensitive data is stored on IoT devices or transmitted through the Internet. Therefore, the security of sensitive data in use and in transit in IoT raises concerns.

Cryptography is one of the most common methods to solve security problems, and IoT devices are no exception. Modern cryptographic algorithms are considered secure from a mathematical point of view. Nevertheless, weaknesses become exploitable when these algorithms are implemented in real-world devices. Attacks that extract private information from the physical implementation of cryptography, rather than from the algorithm itself, are known as “Side Channel Attacks (SCA)”. Attackers utilize characteristics such as running time [2, 3], cache behavior [4], power consumption [5] and electromagnetic radiation [6] to extract secret keys from the physical execution of encryption algorithms. Among these attacks, cache timing attacks and power/electromagnetic analysis attacks are two well-developed types that have been widely studied.

Cache timing attacks utilize the difference in access times between cache and main memory to recover secret keys from encryption time data. Kocher first proposed the concept of cache timing attacks [2]. Subsequently, Bernstein et al. performed a successful cache timing attack on the AES T-table implementation running on a PC [7]. In recent years, with the popularity of smart devices, many researchers have conducted cache timing attack experiments on ARM [8,9,10]. Power/electromagnetic analysis attacks exploit power consumption or electromagnetic radiation to extract secret keys. Over the past 20 years, a large number of researchers have studied power/electromagnetic analysis attacks, and there are plenty of published results on 8-bit microprocessors, FPGAs, ARM cores, Intel/AMD processors and so on [5, 6, 11,12,13,14].

To thwart SCA, plenty of countermeasures have been proposed, e.g. masking [15,16,17,18], adding random delays [19], shuffling the execution order of independent operations [20,21,22,23] and so on. Among these countermeasures, masking is the most common one. However, masking is costly in both software and hardware. Moreover, due to the presence of glitches, the protection offered by hardware masking may be greatly reduced. Another countermeasure is adding random delays, which incurs a large time overhead. What’s more, the noise introduced by random delays is easy to remove with a simple preprocessing program. The third countermeasure is shuffling the execution order of independent operations. It can greatly increase power/electromagnetic noise at an acceptable time overhead. More importantly, most countermeasures can withstand only a single type of side channel attack. When facing multiple types of attacks, they are usually powerless.

In addition, most countermeasures use chips of widely used architectures as implementation platforms, such as 8-bit microprocessors, FPGAs, ARM cores, Intel/AMD processors and so on. Implementing such schemes on the special architecture of the All Programmable SoC, which combines software (ARM) with hardware (FPGA), is still an unexplored research field. How to use it to create a more efficient and secure encryption implementation is an interesting and promising research topic.

In this paper, we introduce an AES implementation that combines software and hardware, is executed on an All Programmable SoC (Zynq-7000), and improves both security and performance. Our main contributions are as follows:

  • We propose a new encryption solution combining hardware and software that breaks the regularity and alignment pattern of time data and power/electromagnetic traces. By randomizing the start and end rounds of the hardware and software stages, our scheme destroys the statistical regularity that cache usage introduces into encryption time data. Meanwhile, shuffling the software execution order and randomizing the hardware start round destroys the trace alignment that power/electromagnetic analysis attacks depend on. Therefore, our implementation can resist both cache timing attacks and power/electromagnetic analysis attacks. The approach is not limited to AES and can be applied to many other encryption algorithms. It presents a new way to improve the resistance of modern cryptographic algorithms against side channel attacks.

  • To improve the data throughput of our implementation, we test the performance of the AXI-GP, AXI-HP and AXI-ACP interfaces separately on the All Programmable SoC. We choose the AXI-GP interface, the fastest for real-time and small-batch data encryption, as the data transmission channel between software and hardware. The experimental results show that our AES implementation achieves 0.86 times the data throughput of the shuffled software AES implementation. This performance loss is acceptable, especially considering that the shuffled AES implementation can only resist power/electromagnetic attacks, whereas our scheme is effective against both cache timing and power/electromagnetic attacks.

  • We utilize the Test Vector Leakage Assessment (TVLA) methodology to evaluate the side channel leakage of the encryption time data of three implementations. To the best of our knowledge, this is the first work to evaluate encryption time data with the TVLA methodology. We obtain a clear TVLA comparison of the three implementations with only 10000 samples of encryption time data each. This shows that the TVLA method is a fast and effective way to evaluate encryption time data.

This paper is organized as follows. Section 2 presents an overview of Zynq-7000 SoC, side channel attacks, countermeasures and TVLA. Section 3 describes our AES implementation with combination of hardware and software. Section 4 shows the results of cache timing and power/electromagnetic attacks and the TVLA leakages of encryption time data and power/electromagnetic traces. This paper ends with conclusions and discussion in Sect. 5.

2 Background and Related Work

In this section, we first elaborate the required preliminaries of Xilinx All Programmable SoC and AES, then discuss the related work of side channel attacks, countermeasures against side channel attacks and TVLA assessment method.

2.1 All Programmable SoC (Zynq-7000)

The Zynq-7000 family utilizes the Xilinx All Programmable SoC (AP SoC) architecture, which is a creative and attractive framework. A feature-rich single- or dual-core ARM Cortex-A9 MPCore based processing system (PS) and Xilinx programmable logic (PL) are integrated into a single device. The heart of the PS is the ARM Cortex-A9 MPCore CPUs. Beyond that, the PS also includes on-chip memory, external memory interfaces, and a rich set of I/O peripherals [24]. The Zynq-7000 family provides not only the performance, power, and usability of ASICs and ASSPs (Application Specific Standard Products), but also the flexibility and scalability of an FPGA. As a result, devices of the Zynq-7000 family can be designed more freely to meet diversified and personalized applications in IoT systems.

2.2 Software and Hardware Implementations of AES

In 2001, Rijndael, which was designed by J. Daemen and V. Rijmen, was specified as the Advanced Encryption Standard (AES) by the National Institute of Standards and Technology (NIST) [25]. It has since become one of the most popular encryption algorithms and is widely adopted for a variety of encryption needs. AES is a symmetric block cipher that transforms each 128-bit block through several rounds of processing. There are three key sizes, 128, 192 and 256 bits, which correspond to 10, 12 and 14 rounds, respectively. For simplicity and without loss of generality, we discuss the AES implementation with a key length of 128 bits, and hence 10 rounds, in this paper.

AES is an iterated algorithm: each round i takes a 16-byte intermediate value \(S^{i}=\{s^{i}_0, ..., s^{i}_{15}\}\) and a 16-byte round key \(RK^{i}=\{rk^{i}_{0}, ..., rk^{i}_{15}\}\) as inputs, and outputs a 16-byte intermediate value \(S^{i+1}=\{s^{i+1}_0, ..., s^{i+1}_{15}\}\). Each round consists of four algebraic operations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. Before the first round, the input block is computed as \(s^{1}_{j}= p_{j}\,\oplus \,rk^{0}_{j}\) where \(j\in \{0,\cdots ,15\}\), with \(p_{j}\) representing the jth plaintext byte and \(rk^{0}_{j}\) the jth initial round key byte. The last round omits MixColumns. All other rounds perform the same four steps, and each round i uses a different round key \(RK^{i}\).

Software implementations of AES usually utilize look-up tables to reduce the computational overhead. The three operations SubBytes, ShiftRows and MixColumns are combined into four look-up tables \(T_{0}, T_{1}, T_{2}, T_{3}\), each of which consists of 256 4-byte elements and maps one byte of input to four bytes of output. An encryption round of the table-based AES software implementation is carried out as:

$$\begin{aligned} \begin{array}{l} (s^{i+1}_{0},s^{i+1}_{1},s^{i+1}_{2},s^{i+1}_{3})=T_{0}[s^{i}_{0}] \oplus T_{1}[s^{i}_{5}] \oplus T_{2}[s^{i}_{10}] \oplus T_{3}[s^{i}_{15}] \oplus \{rk^{i}_{0}, rk^{i}_{1}, rk^{i}_{2}, rk^{i}_{3}\},\\ (s^{i+1}_{4},s^{i+1}_{5},s^{i+1}_{6},s^{i+1}_{7})=T_{0}[s^{i}_{4}] \oplus T_{1}[s^{i}_{9}] \oplus T_{2}[s^{i}_{14}] \oplus T_{3}[s^{i}_{3}] \oplus \{rk^{i}_{4}, rk^{i}_{5}, rk^{i}_{6}, rk^{i}_{7}\},\\ (s^{i+1}_{8},s^{i+1}_{9},s^{i+1}_{10},s^{i+1}_{11})=T_{0}[s^{i}_{8}] \oplus T_{1}[s^{i}_{13}] \oplus T_{2}[s^{i}_{2}] \oplus T_{3}[s^{i}_{7}] \oplus \{rk^{i}_{8}, rk^{i}_{9}, rk^{i}_{10}, rk^{i}_{11}\},\\ (s^{i+1}_{12},s^{i+1}_{13},s^{i+1}_{14},s^{i+1}_{15})=T_{0}[s^{i}_{12}] \oplus T_{1}[s^{i}_{1}] \oplus T_{2}[s^{i}_{6}] \oplus T_{3}[s^{i}_{11}] \oplus \{rk^{i}_{12}, rk^{i}_{13}, rk^{i}_{14}, rk^{i}_{15}\}. \end{array} \end{aligned}$$
(1)

Using table lookups and a 16-byte XOR, the round calculation in software is fast and easy to implement. However, the large look-up tables make AES highly vulnerable to cache attacks, such as cache timing attacks.
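To make the indexing pattern of Eq. 1 concrete, the following Python sketch computes one table-based round. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function name `ttable_round`, the most-significant-byte-first unpacking of each table word, and the toy tables are all hypothetical. The toy tables are deliberately not the AES tables; each one only places its input byte in one output position, so the round degenerates to ShiftRows followed by AddRoundKey and the index pattern becomes visible.

```python
def ttable_round(state, round_key, tables):
    """One table-based round in the shape of Eq. 1.

    state, round_key: sequences of 16 byte values (0..255).
    tables: four sequences T0..T3, each holding 256 32-bit words.
    Assumption: each word is unpacked most-significant byte first.
    """
    out = []
    for col in range(4):
        base = 4 * col
        word = 0
        # Table t is indexed by state byte (base + 5*t) mod 16, as in Eq. 1.
        for t in range(4):
            word ^= tables[t][state[(base + 5 * t) % 16]]
        # Split the 32-bit word into 4 bytes and add the round key bytes.
        for b in range(4):
            byte = (word >> (8 * (3 - b))) & 0xFF
            out.append(byte ^ round_key[base + b])
    return out


# Toy tables (NOT the AES tables): table t puts its input byte in position t.
toy_tables = [[x << (8 * (3 - t)) for x in range(256)] for t in range(4)]
state = list(range(16))
result = ttable_round(state, [0] * 16, toy_tables)
```

With the real AES tables in place of `toy_tables`, every table index depends on a state byte, which is the data-dependent memory access that cache attacks exploit.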

Hardware implementations of AES fall into three major types that meet different needs. The first type focuses on high data throughput with a limited number of architectural optimizations, which results in poor resource utilization. The second type pursues better utilization of FPGA resources at encryption speeds sufficient for most embedded applications. The third type aims to minimize the power consumption of AES circuits. Like AES software implementations, hardware implementations also leak side channel information and are thus vulnerable to side channel attacks.

2.3 Side Channel Attacks

Cache Timing Attacks. Between the CPU and main memory, there is a small, fast storage area called the “cache”. To reduce the latency of main memory accesses, CPUs employ caches to store the most frequently accessed memory locations. When the CPU reads a value from main memory, it stores the value in the cache, evicting an older entry if necessary. Subsequent lookups of the same memory address are then served from the cache, which is faster than main memory; such a lookup is called a “cache hit”. Because access times differ across the memory hierarchy, the execution time of a cryptographic algorithm can be exploited to recover the secret key.

Kocher demonstrated timing attacks against a variety of software public-key systems in 1996 [2] and also proposed the concept of cache-behaviour analysis in that paper. Kelsey et al. [26] later suggested exploiting information leaked through cache-memory access times as a potential attack against cryptographic implementations that employ large S-boxes. With the widespread adoption of AES, researchers have paid more attention to cache attacks against this symmetric cipher. Bernstein [7] exploited the total execution time of AES T-table implementations and showed that such an attack can be mounted remotely.

The attacks mentioned above were launched successfully on Intel or AMD CPUs. In recent years, due to the widespread usage of ARM, investigation of this type of CPU has increased. Bogdanov et al. proposed a type of cache-collision timing attack on software implementations of AES running on an ARM9 board in 2010 [8]. Two years later, Weiß et al. demonstrated a cache timing attack on an ARM Cortex-A8 processor and extracted sensitive keying material from an isolated trusted execution domain [9]. In 2013, Spreitzer investigated the applicability of Bernstein’s timing attack and the cache-collision attack by Bogdanov et al. on three mobile devices, all of which employed ARM Cortex-A CPUs [10].

Power and Electromagnetic Analysis Attacks. Power analysis attacks exploit information leaked through power consumption to recover secret keys from implementations of different cryptographic algorithms. In 1999, Kocher et al. examined Simple Power Analysis (SPA) and Differential Power Analysis (DPA) as ways to find secret keys in cryptographic devices [5]. Since then, the power analysis attack has become a well-known and thoroughly studied threat for cryptographic implementations. In 2004, Brier et al. first proposed the Correlation Power Analysis (CPA) attack, which is more efficient than the traditional DPA attack [12]. Not long after, Mangard et al. showed that unmasked and masked AES hardware implementations leak side channel information due to glitches at the output of logic gates [13].

As the name suggests, electromagnetic (EM) analysis attacks extract the secret key by exploiting data-dependent EM radiation. Gandolfi et al. described electromagnetic experiments conducted on three different CMOS chips executing three different cryptographic algorithms [6]. Agrawal et al. presented a systematic investigation of EM leakage from CMOS devices [11]. In 2015, Longo investigated the electromagnetic-based leakage of a complex ARM-Core SoC [14].

2.4 Countermeasures Against Side Channel Attacks

To thwart side channel attacks, researchers proposed many different countermeasures such as masking, the use of random delays and shuffling. Among the existing countermeasures, the most widely deployed one is masking [15,16,17,18]. Masking conceals all sensitive intermediate values of a computation with at least one random value. However, the cost of implementing masking increases significantly either in hardware or in software. What’s more, because of the presence of glitches, masked hardware implementations can still be vulnerable to first-order DPA [13, 27]. Another countermeasure is the use of random delays. Tunstall et al. proposed a manner of generating random delays, which reduced the time lost, while maintaining the increased desynchronization [19].

Shuffling the execution order of independent operations is a lightweight countermeasure which can amplify the power/EM noise. Herbst et al. described an efficient AES software implementation resistant against side channel attacks, which masked the intermediate results and shuffled the operation order at the beginning and the end of the AES execution [20]. Rivain et al. designed a new scalable scheme which combined high-order masking with shuffling [21]. Veyrat-Charvillon et al. showed a careful information theoretic and security analysis of different shuffling variants [22]. Patranabis et al. proposed a two-round version of the shuffling countermeasure, and tested its security using TVLA [23].

2.5 Test Vector Leakage Assessment (TVLA)

The huge threat of side channel attacks prompted NIST to organize the “Non-Invasive Attack Testing Workshop” in 2011 to establish a testing methodology that can reliably assess the physical security vulnerabilities of encryption devices. Existing assessment methods require evaluation labs to actually check the feasibility of state-of-the-art attacks on the device under test (DUT) [28]. However, these assessment methods are very time-consuming and demand a high level of technical expertise.

Goodwill et al. proposed a method (at the workshop mentioned above) that is more widely applicable and easier to implement, known as the Test Vector Leakage Assessment (TVLA) [29]. In 2015, Schneider and Moradi provided a more detailed treatment of the TVLA method [28]. TVLA uses a t-test to assess whether there is a significant difference in distribution between two groups of collected data. This method provides a robust test that can be applied to multiple types of data and intermediate values. TVLA was first used to determine whether the power consumption of a device depends on the data it is manipulating [29]. In fact, this method is also very effective in assessing the leakage of encryption time data, as shown in Sect. 4 of this paper.
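The t-test at the core of TVLA can be sketched in a few lines of Python. The example below is a minimal illustration, assuming Welch's form of the t-test and the threshold |t| > 4.5 that is commonly used in TVLA practice; neither the threshold nor the function names `welch_t` and `shows_leakage` are taken from this paper, and the two groups are simulated values rather than measured encryption times.

```python
import math
import random
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def shows_leakage(group_a, group_b, threshold=4.5):
    """True if |t| exceeds the threshold (4.5 is a common TVLA choice)."""
    return abs(welch_t(group_a, group_b)) > threshold


# Simulated encryption times in arbitrary units; not data from the paper.
rng = random.Random(0)
fixed_input = [rng.gauss(1003.0, 5.0) for _ in range(5000)]
random_input = [rng.gauss(1000.0, 5.0) for _ in range(5000)]
leaky = shows_leakage(fixed_input, random_input)
```

In this toy setting the two groups differ in mean by 3 units, so the statistic lands far above the threshold; two groups drawn from the same distribution would be expected to stay below it.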

3 AES Implementation with Combination of Hardware and Software

This section presents our AES implementation with combination of hardware and software on a Xilinx Zynq-7000 All Programmable SoC. This AES countermeasure aims to be robust against both cache timing attacks and power/electromagnetic analysis attacks, while keeping performance and complexity close to those of an unprotected AES design. We first describe the entire encryption data flow of our AES design in Sect. 3.1. In Sect. 3.2, we describe the software and hardware stages in detail. Finally, we introduce the communication between software and hardware in Sect. 3.3.

3.1 Encryption Data Flow

The AES implementation uses two random numbers \(R_{1}\) and \(R_{2}\) to divide the AES encryption process into three stages. Figure 1 shows the entire encryption data flow of our AES implementation with combination of hardware and software. The first and last stages run in software on the PS (ARM), and the middle stage runs in hardware on the PL (FPGA). In each round of the two software stages, the execution order of independent operations is shuffled by the two random numbers \(R_{1}\) and \(R_{2}\). Furthermore, the middle hardware stage has a random beginning (Round \(R_{1}+1\)) and a random end (Round \(R_{2}\)). The entire encryption process therefore completes in a random time controlled by \(R_{1}\) and \(R_{2}\). All 44 round key words (176 bytes) are pre-computed and given to the software and hardware.

Fig. 1. Entire encryption data flow of AES implementation with combination of hardware and software.

figure a

3.2 Software and Hardware Stages

In each round of the beginning and final software stages, the execution order of a set of sensitive operations is shuffled to amplify the noise in the device’s power/electromagnetic leakage. As described in Eq. 1, the software AES encryption round (using look-up tables) can be divided into 4 independent operations, and the order in which they run makes no difference to the final result. In our AES implementation, we utilize the two random numbers \(R_{1}\) and \(R_{2}\) to shuffle the execution order of the 4 independent operations.

We use \(s^{i}_{j, k, u, w}\) to denote the values of \(s^{i}_{j}\), \(s^{i}_{k}\), \(s^{i}_{u}\) and \(s^{i}_{w}\). The number \(R_{1} \% 4\) decides which 4-byte intermediate value is calculated first. If \(R_{1} \% 4==0\), the implementation first calculates the 4-byte values of \(s^{i}_{0,1,2,3}\). When \(R_{1} \% 4==1\), \(s^{i}_{4,5,6,7}\) is computed first. The number \(R_{2} \% 3\) selects the second operation and \((R_{2} - R_{1})\%2\) the third. For example, if \(R_{1} \% 4==2\), \(s^{i}_{8,9,10,11}\) is computed first, and the three 4-byte values \(s^{i}_{0,1,2,3}\), \(s^{i}_{4,5,6,7}\) and \(s^{i}_{12,13,14,15}\) are left. The implementation then checks the value of \(R_{2} \% 3\). If \(R_{2} \% 3==1\), \(s^{i}_{4,5,6,7}\) is computed, leaving \(s^{i}_{0,1,2,3}\) and \(s^{i}_{12,13,14,15}\). The implementation then checks the value of \((R_{2} - R_{1})\%2\). If \((R_{2} - R_{1})\%2==0\), \(s^{i}_{0,1,2,3}\) is computed before \(s^{i}_{12,13,14,15}\). Otherwise the implementation calculates \(s^{i}_{12,13,14,15}\) before \(s^{i}_{0,1,2,3}\). The remaining cases follow by analogy. The algorithm running in the beginning software stage is described in Algorithm 1.
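As an illustration only, the selection rule described above can be sketched in Python. The function name `shuffled_order` and the numbering of the four 4-byte groups as 0 to 3 (group g standing for bytes 4g to 4g+3) are assumptions made for this sketch; the sketch returns only the execution order and does not perform the table lookups.

```python
def shuffled_order(r1, r2):
    """Return the order in which the four 4-byte groups are computed.

    Group g stands for bytes 4g..4g+3 of the round output.
    Python's % is non-negative for a positive modulus even when
    r2 - r1 is negative; a C port would need to handle that case.
    """
    remaining = [0, 1, 2, 3]
    order = [remaining.pop(r1 % 4)]               # first group
    order.append(remaining.pop(r2 % 3))           # second, among the three left
    order.append(remaining.pop((r2 - r1) % 2))    # third, among the two left
    order.append(remaining.pop())                 # the last remaining group
    return order


# The worked example from the text: R1 % 4 == 2 and R2 % 3 == 1.
even_case = shuffled_order(2, 4)   # (R2 - R1) % 2 == 0
odd_case = shuffled_order(2, 1)    # (R2 - R1) % 2 == 1
```

For `even_case` the order is groups 2, 1, 0, 3, and for `odd_case` it is 2, 1, 3, 0, matching the two branches of the example in the text.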

After the beginning software stage, the 16-byte round \(R_{1}\) intermediate value \(S^{R_{1}}\) is transferred to the middle hardware stage. As Algorithm 2 shows, the middle hardware stage starts at round \(R_{1}+1\) and ends at round \(R_{1}+10\). It should be noted that the output value \(Sout^{R_{2}}\) has already been calculated at round \(R_{2}-1\); rounds \(R_{2}\) to \(R_{1}+10\) are added as dummy rounds. The dummy rounds ensure that attackers cannot infer the number of encryption rounds in the middle hardware stage from power/electromagnetic traces. When the middle hardware stage is complete, the 16-byte round \(R_{2}\) intermediate value \(S^{R_{2}}\) is sent to the final software stage as input. The 4 independent operations of each round are shuffled in the same way as in the beginning software stage; see Algorithm 3.

figure b
figure c

3.3 Communication Between Software and Hardware

On the Zynq-7000 SoC, there are three types of interfaces between PS (ARM) and PL (FPGA), which are AXI-ACP, AXI-GP and AXI-HP. AXI-GP interfaces are connected directly to the ports of the master interconnect and the slave interconnect without any additional FIFO buffering. AXI-HP interfaces provide PL bus masters with high bandwidth datapaths to the DDR and OCM memories. AXI-ACP interface provides low-latency access to programmable logic masters, with optional coherency with L1 and L2 cache [24]. In order to choose the fastest interface under conditions of real-time data encryption, we tested the performance of the three types of interfaces separately.

From the perspective of the data transmission rate between hardware and software, the AXI-HP and AXI-ACP interfaces are faster than the AXI-GP interfaces. Therefore we first tested the AXI-HP and AXI-ACP interfaces, using the AXI-DMA IP core to access them. To speed up the encryption process, we enable the cache of the ARM cores. However, this raises two problems. First, calculated data may not be written to DDR memory immediately, but held temporarily in the cache. Second, the ARM cores are not notified immediately when the AXI-DMA IP core changes data in DDR memory. To solve these two problems, we call the function \(Xil\_DCacheFlushRange\) to flush the Dcache before the AXI-DMA transfers data from software to hardware, and the function \(Xil\_DCacheInvalidateRange\) to invalidate the Dcache after the AXI-DMA moves data from hardware to software.

Table 1. Performance of three interfaces for real-time and small-batch data encryption

We then tested the performance of the AXI-GP interface and got an unexpected result. Since the structure and timing of the AXI-GP interface are simple, the transmission rate can be increased by raising the clock frequency. Moreover, because the software data comes directly from the cache of the ARM cores, the time spent on cache maintenance (\(Xil\_DCacheFlushRange\) and \(Xil\_DCacheInvalidateRange\)) is saved.

In the experiment, we found that for non-real-time bulk data encryption, transferring data over the AXI-HP and AXI-ACP interfaces is much faster than over the AXI-GP interface. However, for real-time and small-batch data encryption (128 bits at a time), AXI-GP is faster than AXI-HP and AXI-ACP. Table 1 shows the experimental results of the three interfaces for real-time and small-batch data encryption. Considering that our AES encryption implementation is mainly applied to real-time and small-batch data encryption scenarios, we choose the AXI-GP interface to transfer data between hardware and software.

4 Experimental Evaluation

To validate the security of our proposed AES countermeasure, we have implemented our AES design on the ZedBoard and applied cache timing and power/electromagnetic analysis attacks on it. Furthermore, the Test Vector Leakage Assessment (TVLA) tests [28] have been executed on encryption times and power/electromagnetic traces.

4.1 Cache Timing Attacks

In general, there are three types of cache attacks: trace driven, access driven and time driven attacks. The attacks presented in this paper belong to the class of time driven attacks, also called cache timing attacks. They need far more encryption samples than the other two types of cache attacks. However, because a time driven attack is the easiest to launch, it is a serious threat to numerous real-world applications, especially embedded and IoT systems.

In our cache timing attack experiments, we first obtain the total encryption time of each 128-bit plaintext, which is influenced by cache hits and cache misses. Then we apply two statistical methodologies (first round and final round) to extract key-related information. Finally, we give the TVLA result on encryption time data.

First Round Attacks. Modern CPUs do not store individual bytes in the cache but groups of consecutive bytes, called “lines”, from main memory. Different CPUs have different cache line sizes. The target of our attacks is the ARM Cortex-A9 MPCore of the Zynq-7000 AP SoC, which has a fixed cache line length of 32 bytes [30]. The element size of the AES tables (\(T_0\), \(T_1\), \(T_2\), and \(T_3\)) is 4 bytes. We use \(\delta \) to denote the number of table elements in one cache line. So groups of \(\delta \) \((32/4=8)\) table elements share a line in the cache on an ARM Cortex-A9 MPCore.

For any bytes s and \(s'\) which are equal ignoring the lower \(\log _2 \delta \) bits, looking up s brings both s and \(s'\) into the cache. We represent this as \(\left\langle s \right\rangle = \left\langle s' \right\rangle \). When two separate lookups s and \(s'\) satisfy \(\left\langle s \right\rangle = \left\langle s' \right\rangle \), a “cache collision” occurs. In contrast, if \(\left\langle s \right\rangle \not = \left\langle s' \right\rangle \), the access to \(s'\) may result in a cache miss. On average, the second situation takes more time because the access to \(s'\) may have to be served from main memory.
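The bracket notation can be made concrete with a short Python sketch. It assumes that each table starts on a cache-line boundary, so that the line holding a table element depends only on the upper bits of its index; the constant and function names are hypothetical.

```python
CACHE_LINE_BYTES = 32                       # Cortex-A9 line size given in the text
ENTRY_BYTES = 4                             # size of one T-table element
DELTA = CACHE_LINE_BYTES // ENTRY_BYTES     # table elements per cache line
LOW_BITS = DELTA.bit_length() - 1           # log2(DELTA), valid for powers of two


def line_of(index):
    """<s>: the table index with its lower log2(delta) bits dropped."""
    return index >> LOW_BITS


def collides(s, s_prime):
    """True if the two table indices fall into the same cache line."""
    return line_of(s) == line_of(s_prime)
```

For example, indices 0x10 and 0x17 share a line, while 0x17 and 0x18 do not.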

The first round attack utilizes cache collisions evoked in the first round of encryption. As can be seen in Eq. 1, table \(T_0\) uses the bytes \(s^1_0\), \(s^1_4\), \(s^1_8\), \(s^1_{12}\) in the first round. They make up a 4-byte “family” whose members are used to access the same table. Three other 4-byte families share the tables \(T_1\), \(T_2\), and \(T_3\) in round one. Two bytes \(s^1_k\), \(s^1_j\) in the same family cause a cache collision if \(\left\langle s^1_k \right\rangle = \left\langle s^1_j \right\rangle \). Since \(s^1_j = p_j \oplus rk^0_j\), this gives the equation \(\left\langle p_k \right\rangle \oplus \left\langle rk^0_k\right\rangle = \left\langle p_j\right\rangle \oplus \left\langle rk^0_j \right\rangle \), or after rearranging, \(\left\langle p_k\right\rangle \oplus \left\langle p_j\right\rangle = \left\langle rk^0_k\right\rangle \oplus \left\langle rk^0_j \right\rangle \).

Due to the cache collision, plaintexts satisfying \(\left\langle p_k\right\rangle \oplus \left\langle p_j\right\rangle = \left\langle rk^0_k\right\rangle \oplus \left\langle rk^0_j \right\rangle \) should have a lower average encryption time. We use the pair of bytes \(p_7\) and \(p_{15}\) in the \(T_3\) family to carry out attacks. Figure 2 shows the results of first round attacks against three different AES implementations using 1 million encryption time samples. We first attack the unprotected software AES implementation of OpenSSL and show the result in Fig. 2a. There, the 8 red lines denoting the right \(p_7 \oplus p_{15}\) show an obvious time drop compared to the other gray lines. Figure 2b shows the second successful attack, against the shuffled software AES implementation, which randomizes the execution order of each round in the same way as Algorithm 1. Figure 2c is the result for our AES implementation with combination of hardware and software. It shows that the first round attack against our implementation fails.
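The statistical step of the first round attack can be sketched as follows. This is a toy simulation, not the measurement setup of this paper: the timing model (Gaussian noise with a fixed saving on a collision), the key bytes, and the function names are all assumptions. Unlike Fig. 2, which indexes by the full byte \(p_7 \oplus p_{15}\), the sketch groups samples directly by \(\left\langle p_k\right\rangle \oplus \left\langle p_j\right\rangle \), so the 8 matching byte values fall into a single group.

```python
import random
from collections import defaultdict

LOW_BITS = 3  # log2(delta) for delta = 8


def recover_line_xor(samples):
    """Return the <p_k> xor <p_j> value with the lowest average time.

    samples: iterable of (p_k, p_j, time) tuples.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for p_k, p_j, t in samples:
        key = (p_k >> LOW_BITS) ^ (p_j >> LOW_BITS)
        total[key] += t
        count[key] += 1
    return min(total, key=lambda k: total[k] / count[k])


def simulate(rk_k, rk_j, n, rng):
    """Toy timing model (an assumption, not measured data)."""
    samples = []
    for _ in range(n):
        p_k, p_j = rng.randrange(256), rng.randrange(256)
        t = rng.gauss(1000.0, 20.0)
        # A collision between the two first-round lookups saves time.
        if ((p_k ^ rk_k) >> LOW_BITS) == ((p_j ^ rk_j) >> LOW_BITS):
            t -= 10.0
        samples.append((p_k, p_j, t))
    return samples


rng = random.Random(1)
rk7, rk15 = 0x3C, 0xA5   # hypothetical key bytes
guess = recover_line_xor(simulate(rk7, rk15, 50_000, rng))
```

In this model the recovered value equals the upper five bits of `rk7 ^ rk15`; the lower three bits of the XOR remain unknown, which is the limitation discussed next.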

The four sets of relations among key bytes (one set per family in Eq. 1) are the only information the first round attack provides. We can’t gain exact key information without considering other rounds. Furthermore, the lower \(\log _2 \delta \) bits of each key byte can’t be learned from this information. Therefore, the attacker must still guess a total of \(4 \times (8 + 3 \times \log _2 \delta ) = 68\) key bits (for \(\delta = 8\)) to recover the full key.
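One way to read this count, assuming that the attack yields three independent relations per family, each fixing the upper \(8 - \log _2 \delta \) bits of the XOR of two key bytes, is the following: a family holds \(4 \times 8 = 32\) key bits, of which the relations determine \(3 \times (8 - \log _2 \delta )\).

$$\begin{aligned} 4 \cdot 8 - 3\,(8 - \log _2 \delta ) &= 8 + 3 \log _2 \delta = 17 \quad (\delta = 8),\\ 4 \cdot 17 &= 68. \end{aligned}$$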

Fig. 2. Results of first round attacks against three different AES implementations using 1 million encryption time data. The X axis denotes the index of \(p_{7} \oplus p_{15}\), while the Y axis shows the average encryption time. Red lines are the right indices of \(\left\langle rk^0_7\right\rangle \oplus \left\langle rk^0_{15} \right\rangle \). Gray lines correspond to the wrong indices of \(\left\langle rk^0_7\right\rangle \oplus \left\langle rk^0_{15} \right\rangle \).

Final Round Attacks. We also mount final round attacks, which are faster than first round attacks and can recover the full key. As mentioned in Sect. 2.2, the final encryption round of the AES software implementation omits MixColumns. The final round using look-up tables in OpenSSL 0.9.7a is carried out as:

$$\begin{aligned} \begin{array}{l} (c_{0},c_{1},c_{2},c_{3})=T_{4}[s^{10}_{0}] \oplus T_{4}[s^{10}_{5}] \oplus T_{4}[s^{10}_{10}] \oplus T_{4}[s^{10}_{15}] \oplus \{rk^{10}_{0}, rk^{10}_{1}, rk^{10}_{2}, rk^{10}_{3}\},\\ (c_{4},c_{5},c_{6},c_{7})=T_{4}[s^{10}_{4}] \oplus T_{4}[s^{10}_{9}] \oplus T_{4}[s^{10}_{14}] \oplus T_{4}[s^{10}_{3}] \oplus \{rk^{10}_{4}, rk^{10}_{5}, rk^{10}_{6}, rk^{10}_{7}\},\\ (c_{8},c_{9},c_{10},c_{11})=T_{4}[s^{10}_{8}] \oplus T_{4}[s^{10}_{13}] \oplus T_{4}[s^{10}_{2}] \oplus T_{4}[s^{10}_{7}] \oplus \{rk^{10}_{8}, rk^{10}_{9}, rk^{10}_{10}, rk^{10}_{11}\},\\ (c_{12},c_{13},c_{14},c_{15})=T_{4}[s^{10}_{12}] \oplus T_{4}[s^{10}_{1}] \oplus T_{4}[s^{10}_{6}] \oplus T_{4}[s^{10}_{11}] \oplus \{rk^{10}_{12}, rk^{10}_{13}, rk^{10}_{14}, rk^{10}_{15}\}. \end{array} \end{aligned}$$
(2)

Moreover, the last encryption round in OpenSSL 1.1.0f is executed as:

$$\begin{aligned} \begin{array}{l} (c_{0},c_{1},c_{2},c_{3})=T_{2}[s^{10}_{0}] \oplus T_{3}[s^{10}_{5}] \oplus T_{0}[s^{10}_{10}] \oplus T_{1}[s^{10}_{15}] \oplus \{rk^{10}_{0}, rk^{10}_{1}, rk^{10}_{2}, rk^{10}_{3}\},\\ (c_{4},c_{5},c_{6},c_{7})=T_{2}[s^{10}_{4}] \oplus T_{3}[s^{10}_{9}] \oplus T_{0}[s^{10}_{14}] \oplus T_{1}[s^{10}_{3}] \oplus \{rk^{10}_{4}, rk^{10}_{5}, rk^{10}_{6}, rk^{10}_{7}\},\\ (c_{8},c_{9},c_{10},c_{11})=T_{2}[s^{10}_{8}] \oplus T_{3}[s^{10}_{13}] \oplus T_{0}[s^{10}_{2}] \oplus T_{1}[s^{10}_{7}] \oplus \{rk^{10}_{8}, rk^{10}_{9}, rk^{10}_{10}, rk^{10}_{11}\},\\ (c_{12},c_{13},c_{14},c_{15})=T_{2}[s^{10}_{12}] \oplus T_{3}[s^{10}_{1}] \oplus T_{0}[s^{10}_{6}] \oplus T_{1}[s^{10}_{11}] \oplus \{rk^{10}_{12}, rk^{10}_{13}, rk^{10}_{14}, rk^{10}_{15}\}. \end{array} \end{aligned}$$
(3)

Equation 3 reuses the T-tables \(T_0,\cdots , T_3\) in a slightly adapted way, while Eq. 2 uses a separate T-table \(T_4\). This is the only difference between the two implementations. Because the T-tables are typically the same, neither implementation can resist the final round attack. Next, we take Eq. 2 as an example to describe the details of the final round attack.

Fig. 3. Results of final round attacks against three different AES implementations using 0.3 million encryption time data. X label denotes the index of \(c_{1} \oplus c_{5}\), while Y label presents the average encryption time. Red line is the right index of \(c_{1} \oplus c_{5}\). Gray lines correspond to the wrong indices of \(c_{1} \oplus c_{5}\). (Color figure online)

For any two ciphertext bytes \(c_k\), \(c_j\), it holds that \(c_k = rk^{10}_k \oplus T_4[s^{10}_u]\) for some u and \(c_j = rk^{10}_j \oplus T_4[s^{10}_w]\) for some w. A cache collision occurs on \(T_4\) when \(s^{10}_u = s^{10}_w\), in which case \(T_4[s^{10}_u]=T_4[s^{10}_w]\). Substituting into the two expressions above gives \(c_k \oplus rk^{10}_k = c_j \oplus rk^{10}_j\), or after rearranging, \(c_k \oplus c_j = rk^{10}_k \oplus rk^{10}_j\). Therefore, a cache collision occurs in \(T_4\) when \(c_k \oplus c_j = rk^{10}_k \oplus rk^{10}_j\). When this equality does not hold, we cannot ensure that \(s^{10}_u\) and \(s^{10}_w\) fall in the same cache line and cause a cache collision. Because of the cache collision, ciphertexts satisfying \(c_k \oplus c_j = rk^{10}_k \oplus rk^{10}_j\) should have the lowest average encryption time.
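
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in our experiments: the function names, the sample format (pairs of a 16-byte ciphertext and a measured time) and the noise-free timing model in `synthetic_samples` are all hypothetical.

```python
from collections import defaultdict


def rank_xor_candidates(samples, k=1, j=5):
    """Return the observed values of c_k XOR c_j, fastest mean time first.

    samples: iterable of (ciphertext, elapsed) pairs, where ciphertext is a
    16-byte sequence and elapsed is the measured encryption time.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for ciphertext, elapsed in samples:
        delta = ciphertext[k] ^ ciphertext[j]
        total[delta] += elapsed
        count[delta] += 1
    mean_time = {delta: total[delta] / count[delta] for delta in total}
    # The candidate with the lowest average time is the best estimate
    # of rk_k XOR rk_j for the final round key.
    return sorted(mean_time, key=mean_time.get)


def synthetic_samples(secret_delta, base=1000.0, saving=20.0):
    """Hypothetical noise-free model: a collision saves `saving` cycles."""
    for a in range(256):
        for b in range(256):
            ciphertext = bytearray(16)
            ciphertext[1], ciphertext[5] = a, b
            if (a ^ b) == secret_delta:
                elapsed = base - saving
            else:
                elapsed = base
            yield bytes(ciphertext), elapsed
```

With this toy model, the first element of `rank_xor_candidates(synthetic_samples(0x3A))` is `0x3A`. Real measurements are noisy, which is why a large number of timing samples is needed before the correct value separates from the wrong ones.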

We use the byte pair \(c_1\) and \(c_5\) to mount the final round attacks. Figure 3 shows the results of final round attacks against three different AES implementations using 0.3 million encryption time data. From Fig. 3a we can see that the red line denoting the correct \(c_1 \oplus c_5\) is lower than all the gray lines. Figure 3b shows the second successful attack, against the shuffled software AES implementation. Figure 3c is the result for our AES implementation with combination of hardware and software. It shows that the final round attack against our implementation still fails.

Timing TVLA. In order to compare the timing leakage of our countermeasure with that of the unprotected and shuffled software AES implementations of OpenSSL, we use the Test Vector Leakage Assessment (TVLA) [28] methodology. We performed the non-specific TVLA test with two sets of encryption time data: one collected with randomly chosen plaintexts and the other with a fixed plaintext.

Fig. 4. Comparison of TVLA leakage from 10000 samples of encryption time data.

Figure 4 compares the TVLA leakage of three AES implementations: the unprotected software implementation, the shuffled software implementation and our proposed countermeasure with combination of hardware and software. Each set comprises 10000 samples of encryption time data for both fixed and random plaintexts. Our countermeasure shows significantly lower side channel leakage than the unprotected and shuffled software AES implementations for the same number of encryption time samples. In the power/electromagnetic side channel literature, an implementation whose TVLA leakage stays within \(\pm 4.5\) is considered very difficult to break using side channel attacks. However, to the best of our knowledge, no previous work has applied the TVLA methodology to encryption time data. Although the TVLA leakage of our scheme exceeds 4.5 with more than 1500 samples, we have reason to believe that it is very effective in resisting cache timing attacks.
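
For reference, the fixed-versus-random comparison can be sketched as follows, assuming that the TVLA leakage value is the Welch t-statistic between the two timing sets, as is usual for non-specific TVLA tests. The function names and the default threshold argument are illustrative choices, not part of our measurement setup.

```python
import math
from statistics import mean, variance


def welch_t(fixed_times, random_times):
    """Welch's t-statistic between the fixed and random plaintext sets."""
    mean_f, mean_r = mean(fixed_times), mean(random_times)
    # statistics.variance is the unbiased sample variance.
    var_f, var_r = variance(fixed_times), variance(random_times)
    stderr = math.sqrt(var_f / len(fixed_times) + var_r / len(random_times))
    return (mean_f - mean_r) / stderr


def exceeds_threshold(fixed_times, random_times, threshold=4.5):
    """True when |t| is above the customary TVLA bound of 4.5."""
    return abs(welch_t(fixed_times, random_times)) > threshold
```

Evaluating `welch_t` on growing prefixes of the two sets gives a leakage-versus-sample-count curve of the kind plotted in Fig. 4. Both sets must contain at least two samples and must not both be constant, otherwise the standard error is undefined.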

4.2 Power/Electromagnetic Analysis Attacks

Power/electromagnetic analysis attacks exploit the fact that side channel leakages are correlated with the operations performed and the data processed. At the beginning of our power/electromagnetic analysis attack experiments, we targeted both the software and hardware stages. We first tried to recover the key from the software stages using Longo’s method [14]. However, because of our rough attack tools and limited preprocessing capability, these attacks did not succeed. In Longo’s research, 46 kB data was needed to successfully attack an AES decryption implementation on an ARM core with a GPIO-based trigger. We have reason to believe that far more data would be needed to successfully attack our shuffled software stage.

In the following experiments, we compare the power/EM traces of the hardware stage with estimated power consumption/EM radiation. An appropriate model is required to estimate the leakages. To model the leakage caused by switching activity in CMOS devices, the Hamming distance (HD) model is usually used. The HD model assumes that the leakage is proportional to the number of \(0 \rightarrow 1\) and \(1 \rightarrow 0\) transitions, and that both kinds of transition produce the same amount of leakage. The HD-model leakage estimate \(w^i_j\) for the jth byte in round i, for two intermediate values \(s^i_j\) and \(s^{i+1}_j\) stored in the same register, is given below:

$$\begin{aligned} w^i_j=HD(s^i_j,s^{i+1}_j)=HW(s^i_j \oplus s^{i+1}_j), \quad j \in \{0,\cdots ,15\}. \end{aligned}$$
(4)

In Eq. 4, HD() denotes the Hamming distance function and HW() the Hamming weight function. \(W^i_j\) denotes the set of all \(w^i_j\) derived using Eq. 4 over all plaintexts. Let l(t) be the t-th sample point of one power/electromagnetic trace, and let L(t) be the set of l(t) over all power/EM traces. The Pearson correlation coefficient \(C^i_j(t)\) between the estimated leakage set \(W^i_j\) and the set L(t) is calculated as:

$$\begin{aligned} C^i_j(t)=\frac{E(W^i_j L(t))-E(W^i_j) E(L(t))}{\sqrt{Var(W^i_j) Var(L(t))}}. \end{aligned}$$
(5)

In Eq. 5, E() denotes the mean and Var() the variance. When the guessed \(rk^i_j\) is not the correct round key, the corresponding \(W^i_j\) and L(t) are only weakly correlated, so a small correlation coefficient \(C^i_j(t)\) is obtained. Conversely, if the guessed \(rk^i_j\) is the correct round key, \(C^i_j(t)\) reaches its highest value.
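
Equations 4 and 5 translate directly into code. The following Python sketch is a minimal illustration rather than our analysis tool: the trace layout (`traces[n][t]` is sample t of trace n) and the caller-supplied `predict(guess, record)` function, which is assumed to return the pair \((s^i_j, s^{i+1}_j)\) for a key guess, are hypothetical.

```python
import math


def hamming_weight(x):
    """HW(): number of set bits in x."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """HD(a, b) = HW(a XOR b), as in Eq. 4."""
    return hamming_weight(a ^ b)


def pearson(xs, ys):
    """Eq. 5 with population moments; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_trace(estimates, traces):
    """C(t) for every time point t of the recorded traces."""
    points = len(traces[0])
    return [pearson(estimates, [trace[t] for trace in traces])
            for t in range(points)]


def best_key_guess(records, traces, predict, guesses=range(256)):
    """Return the guess whose peak |C(t)| over all time points is largest."""
    def peak(guess):
        estimates = [hamming_distance(*predict(guess, record))
                     for record in records]
        return max(abs(c) for c in correlation_trace(estimates, traces))
    return max(guesses, key=peak)
```

Plotting `correlation_trace` for each of the 256 guesses gives one curve per guess, which is how the correct key is distinguished from the wrong ones in correlation analysis.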

Power Analysis Attacks. Figure 5 shows the results of correlation power analysis attacks on the \(HD(s^4_3, s^{5}_3)\) byte of two different AES implementations using 10000 power traces, together with the TVLA results using 5000 power trace samples. The first implementation runs on the programmable logic (PL) of Zynq-7000 with no protection measure. The second implementation is our countermeasure with combination of hardware and software. Both AES implementations provide a trigger signal when the hardware stage starts. For the power analysis attack on our countermeasure, we assume that the two unpredictable random numbers are \(R_{1}=1\) and \(R_{2}=9\).

Fig. 5. Power analysis attacks on the \(HD(s^4_3, s^{5}_3)\) byte of two different AES implementations using 10000 power traces and TVLA result using 5000 samples. In (a) and (b), the red curve denotes the correlation coefficient of the correct round key while gray curves represent the correlation coefficients of the wrong round keys. (Color figure online)

Fig. 6. Electromagnetic analysis attacks on the \(HD(s^4_3, s^{5}_3)\) byte of two different AES implementations using 10000 electromagnetic traces and TVLA result using 5000 samples. In (a) and (b), the red curve denotes the correlation coefficient of the correct round key while gray curves represent the correlation coefficients of the wrong round keys. (Color figure online)

As we can see from Fig. 5a, the 532nd time point has the highest correlation coefficient, which clearly shows that the power attack succeeded on the unprotected hardware AES implementation. Figure 5b shows the result of the power attack on our countermeasure. This attack failed because no time sample shows a significantly higher correlation coefficient for the correct key. We performed non-specific TVLA tests, as described in Sect. 4.1, on the 532nd time point of the two AES implementations. Figure 5c shows that the power TVLA leakage of our countermeasure is much lower than that of the unprotected hardware AES implementation.

Electromagnetic Analysis Attacks. Figure 6 shows the results of correlation electromagnetic analysis attacks on the \(HD(s^4_3, s^{5}_3)\) byte of two different AES implementations using 10000 electromagnetic traces, together with the TVLA results using 5000 electromagnetic trace samples. The two implementations are the same as in the power attack experiments. We again assume that the two unpredictable random numbers are \(R_{1}=1\) and \(R_{2}=9\) when attacking our countermeasure.

From Fig. 6a we can see that the electromagnetic attack on the unprotected hardware AES implementation succeeds at the 523rd time point. In contrast, the attack on our countermeasure fails, as shown in Fig. 6b. Figure 6c shows that, at the 523rd time point, the electromagnetic TVLA leakage of our countermeasure is much lower than that of the unprotected hardware AES implementation.

4.3 Data Throughput and FPGA Resource Requirements

We use 0.1 million encryption time data to calculate the average encryption time and data throughput of three different AES implementations. As we can see from Table 2, the AES implementation with combination of hardware and software needs 1653 clock cycles on average to complete a 128-bit encryption, while the unprotected and shuffled software AES implementations need 1050 and 1415 clock cycles respectively. We normalized the data throughput to that of the shuffled software AES implementation because the two software stages of our countermeasure are shuffled. The data throughput of our AES implementation with combination of hardware and software is degraded by 14% compared to the shuffled software AES implementation.
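
As a quick check of the 14% figure, assume that all implementations run at the same clock frequency, so that throughput is inversely proportional to the average cycle count. The throughput of our implementation relative to the shuffled software implementation is then 1415/1653, and the relative loss is:

$$\begin{aligned} 1-\frac{1415}{1653} \approx 1-0.856 = 0.144. \end{aligned}$$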

Table 2. Data throughput of three different implementations

Table 3 shows the FPGA resource requirements of four different implementations. Table 3 indicates that, when the AXI-GP interface is used for data transfer, the FPGA resource consumption of our AES implementation is similar to that of the unprotected hardware AES implementation. The main reason is that we use two random numbers as the start and end signals of the hardware encryption stage, which changes only a few registers. Compared to these two AES implementations, the implementations using the AXI-HP and AXI-ACP interfaces require far more FPGA resources because they use the AXI-DMA IP core.

Table 3. FPGA resource requirement of four different implementations

5 Conclusion

This paper presented a new AES implementation with combination of hardware and software based on All Programmable SoC. Whereas most existing countermeasures resist only a single type of attack, our proposed countermeasure can resist both cache timing and power/electromagnetic attacks. Our experiments illustrate that both the timing and power/electromagnetic leakages of our countermeasure are significantly lower than those of the other implementations, with acceptable performance loss. The idea of “combination of hardware and software” offers a new way to improve the security of modern cryptographic implementations against side channel attacks.