Abstract
Deep learning (DL) has revolutionized Side Channel Analysis (SCA) in recent years. One of the major advantages of DL in the context of SCA is that it can automatically handle masking and desynchronization countermeasures, even when both are applied simultaneously to a cryptographic implementation. However, the success of the attack strongly depends on the DL model used. Traditionally, Convolutional Neural Networks (CNNs) have been used for this purpose. This work proposes to use a Transformer Network (TN) for attacking implementations secured with masking and desynchronization. Our choice is motivated by the fact that a TN is good at capturing dependencies among distant points of interest in a power trace. Furthermore, we show that a TN can be made shift-invariant, an important property for handling desynchronized traces. Experimental validation on several public datasets establishes that our proposed TN-based model, called TransNet, outperforms the present state-of-the-art on several occasions. In particular, TransNet outperforms the other methods by a wide margin when the traces are highly desynchronized. Additionally, TransNet shows good attack performance against implementations with desynchronized traces even when it is trained only on synchronized traces.
Notes
- 1.
In profiling SCA, the adversary possesses a device similar to the device under attack and uses it to train a model; the trained model is then used to attack the target device. Profiling SCA assumes the strongest adversary and thus provides a worst-case security analysis of a cryptographic device. In this work, we consider profiling SCA only (a minimal sketch of this setting follows these notes).
- 2.
If the pool size and stride of the average pooling layer are set to 1, the model behaves as if there were no average pooling layer. Setting them to larger values makes the model computationally more efficient, at the cost of attack efficacy and shift-invariance (see the second sketch after these notes).
- 3.
- 4.
Note that the power traces of software implementations are typically on the order of \(10^5\) points long. For example, the traces of the ASCAD dataset are 100,000 points long. Thus, a desync value such as 400 is plausible for such traces.
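To make the profiling setting of Note 1 concrete, the following is a minimal sketch of a profiled attack, not the paper's TransNet pipeline: a classifier is trained on traces from the clone device, labeled with the first-round S-box output under the known key byte, and its log-probabilities are accumulated over the attack traces for every key guess. The aes_tables module, the logistic-regression profiler, and the S-box leakage model are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

from aes_tables import SBOX  # hypothetical helper holding the 256-entry AES S-box

def profile(traces, plaintexts, key_byte):
    """Profiling phase: train on the clone device, where the key is known."""
    labels = SBOX[plaintexts ^ key_byte]       # first-round S-box leakage model
    model = LogisticRegression(max_iter=1000)
    model.fit(traces, labels)                  # assumes all 256 classes occur
    return model

def attack(model, traces, plaintexts):
    """Attack phase: rank all 256 key-byte guesses by accumulated log-likelihood."""
    log_p = model.predict_log_proba(traces)    # shape (N, 256)
    scores = np.array([
        log_p[np.arange(len(traces)), SBOX[plaintexts ^ g]].sum()
        for g in range(256)
    ])
    return np.argsort(scores)[::-1]            # best guess first; the rank of the
                                               # true key byte measures attack success
```

And a minimal PyTorch illustration of Note 2 (the framework choice is ours; the paper does not prescribe one): an average pooling layer with pool size and stride 1 is an exact no-op, while larger values shorten the trace representation.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 1000)                 # (batch, channels, trace length)

identity_pool = nn.AvgPool1d(kernel_size=1, stride=1)
assert torch.equal(identity_pool(x), x)     # pool size/stride 1: a no-op

down_pool = nn.AvgPool1d(kernel_size=4, stride=4)
print(down_pool(x).shape)                   # torch.Size([1, 4, 250]); 4x shorter and
                                            # cheaper, but shifts smaller than the
                                            # stride can no longer be compensated
```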
References
Abdellatif, K.M.: Mixup data augmentation for deep learning side-channel attacks. IACR Cryptology ePrint Archive, Report 2021/328 (2021)
Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016)
Benadjila, R., Prouff, E., Strullu, R., Cagli, E., Dumas, C.: Deep learning for side-channel analysis and introduction to ASCAD database. J. Cryptogr. Eng. 10(2), 163–188 (2019). https://doi.org/10.1007/s13389-019-00220-8
Bhasin, S., Bruneau, N., Danger, J.-L., Guilley, S., Najm, Z.: Analysis and improvements of the DPA contest v4 implementation. In: Chakraborty, R.S., Matyas, V., Schaumont, P. (eds.) SPACE 2014. LNCS, vol. 8804, pp. 201–218. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12060-7_14
Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against jitter-based countermeasures. In: Fischer, W., Homma, N. (eds.) CHES 2017. LNCS, vol. 10529, pp. 45–68. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66787-4_3
Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48405-1_26
Coron, J.-S., Kizhvatov, I.: An efficient method for random delay generation in embedded software. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 156–170. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04138-9_12
Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., Salakhutdinov, R.: Transformer-XL: attentive language models beyond a fixed-length context. In: ACL, Italy, vol. 1, pp. 2978–2988. ACL (2019)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: AISTATS, USA. JMLR Proceedings, vol. 15, pp. 315–323. JMLR.org (2011)
Gohr, A., Jacob, S., Schindler, W.: Subsampling and knowledge distillation on adversarial examples: new techniques for deep learning based side channel evaluations. In: Dunkelman, O., Jacobson, Jr., M.J., O’Flynn, C. (eds.) SAC 2020. LNCS, vol. 12804, pp. 567–592. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81652-0_22
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, USA, pp. 770–778. IEEE Computer Society (2016)
Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001)
Kerkhof, M., Wu, L., Perin, G., Picek, S.: Focus is key to success: a focal loss function for deep learning-based side-channel analysis. In: Balasch, J., O’Flynn, C. (eds.) COSADE 2022. LNCS, vol. 13211, pp. 29–48. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99766-3_2
Kim, J., Picek, S., Heuser, A., Bhasin, S., Hanjalic, A.: Make some noise. Unleashing the power of convolutional neural networks for profiled side-channel analysis. TCHES 2019(3), 148–179 (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR, USA (2015)
Kocher, P., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48405-1_25
Lu, X., Zhang, C., Cao, P., Gu, D., Lu, H.: Pay attention to raw traces: a deep learning architecture for end-to-end profiling attacks. TCHES 2021(3), 235–274 (2021)
Maghrebi, H.: Deep learning based side channel attacks in practice. IACR Cryptology ePrint Archive, Report 2019/578 (2019)
Maghrebi, H., Portigliatti, T., Prouff, E.: Breaking cryptographic implementations using deep learning techniques. In: Carlet, C., Hasan, M.A., Saraswat, V. (eds.) SPACE 2016. LNCS, vol. 10076, pp. 3–26. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49445-6_1
Martinasek, Z., Hajny, J., Malina, L.: Optimization of power analysis using neural network. In: Francillon, A., Rohatgi, P. (eds.) CARDIS 2013. LNCS, vol. 8419, pp. 94–107. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08302-5_7
Martinasek, Z., Zeman, V.: Innovative method of the power analysis. Radioengineering 22(2), 586–594 (2013)
Masure, L., et al.: Deep learning side-channel analysis on large-scale traces. In: Chen, L., Li, N., Liang, K., Schneider, S. (eds.) ESORICS 2020. LNCS, vol. 12308, pp. 440–460. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58951-6_22
Paguada, S., Armendariz, I.: The forgotten hyperparameter: introducing dilated convolution for boosting CNN-based side-channel attacks. In: Zhou, J., et al. (eds.) ACNS 2020. LNCS, vol. 12418, pp. 217–236. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61638-0_13
Pereira, O., Standaert, F., Vivek, S.: Leakage-resilient authentication and encryption from symmetric cryptographic primitives. In: Ray, I., Li, N., Kruegel, C. (eds.) ACM SIGSAC, USA, pp. 96–108. ACM (2015)
Perin, G., Chmielewski, L., Picek, S.: Strength in numbers: improving generalization with ensembles in machine learning-based profiled side-channel analysis. TCHES 2020(4), 337–364 (2020)
Perin, G., Wu, L., Picek, S.: Exploring feature selection scenarios for deep learning-based side-channel analysis. IACR Cryptology ePrint Archive, Report 2021/1414 (2021)
Picek, S., Heuser, A., Jovic, A., Bhasin, S., Regazzoni, F.: The curse of class imbalance and conflicting metrics with machine learning for side-channel evaluations. TCHES 2019(1), 209–237 (2019)
Picek, S., Samiotis, I.P., Kim, J., Heuser, A., Bhasin, S., Legay, A.: On the performance of convolutional neural networks for side-channel analysis. In: Chattopadhyay, A., Rebeiro, C., Yarom, Y. (eds.) SPACE 2018. LNCS, vol. 11348, pp. 157–176. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05072-6_10
Prouff, E., Rivain, M., Bevan, R.: Statistical analysis of second order differential power analysis. IACR Cryptology ePrint Archive, Report 2010/646 (2010)
Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. In: NAACL-HLT, USA, vol. 2, pp. 464–468. ACL (2018)
Thapar, D., Alam, M., Mukhopadhyay, D.: TranSCA: cross-family profiled side-channel attacks using transfer learning on deep neural networks. IACR Cryptology ePrint Archive, Report 2020/1258 (2020)
Vaswani, A., et al.: Attention is all you need. In: NIPS, USA, pp. 5998–6008 (2017)
Won, Y., Hou, X., Jap, D., Breier, J., Bhasin, S.: Back to the basics: seamless integration of side-channel pre-processing in deep neural networks. IACR Cryptology ePrint Archive, Report 2020/1134 (2020)
Wouters, L., Arribas, V., Gierlichs, B., Preneel, B.: Revisiting a methodology for efficient CNN architectures in profiling attacks. TCHES 2020(3), 147–168 (2020)
Yarotsky, D.: Universal approximations of invariant maps by neural networks. CoRR abs/1804.10306 (2018)
Zaid, G., Bossuet, L., Dassance, F., Habrard, A., Venelli, A.: Ranking loss: maximizing the success rate in deep learning side-channel analysis. TCHES 2021(1), 25–55 (2021)
Zaid, G., Bossuet, L., Habrard, A., Venelli, A.: Methodology for efficient CNN architectures in profiling attacks. TCHES 2020(1), 1–36 (2020)
Zhang, A., Lipton, Z.C., Li, M., Smola, A.J.: Dive into Deep Learning (2020). http://d2l.ai
Zhang, L., Xing, X., Fan, J., Wang, Z., Wang, S.: Multi-label deep learning based side channel attack. In: AsianHOST, China, pp. 1–6. IEEE (2019)
Zhou, Y., Standaert, F.: Deep learning mitigates but does not annihilate the need of aligned traces and a generalized ResNet model for side-channel attacks. J. Cryptogr. Eng. 10(1), 85–95 (2020)
Appendices
A Proof of Lemma 1
The attention probabilities in the self-attention layer of \(\text {TN}_{\text {1L}}\) are calculated following Eqs. (8) and (9). If we set \(\textrm{W}_Q\), \(\textrm{W}_K\), and \(\{\textbf{r}_i\}_{i\ne l}\) to zeros of appropriate dimensions, \(\textbf{r}_l=c\sqrt{d_k}\textbf{1}\), and \(\textbf{t}=\textbf{1}\), where \(\textbf{1}\) is the vector whose first element is 1 and whose remaining elements are 0, and \(c\) is a real constant, then by Eqs. (8) and (9), \(p_{ij}\) equals \(\frac{e^c}{e^c+n-1}\) if \(j=i+l\) and \(\frac{1}{e^c+n-1}\) otherwise, for \(0\le i<n-l\). Setting \(c> \ln \left( \frac{1-\epsilon }{\epsilon }\right) +\ln (n-1)\), we get \(p_{i,i+l}>1-\epsilon \) for all \(0\le i < n-l\) and \(0<\epsilon <1\). Similarly, it is straightforward to show that \(p_{ij}=1/n\) for any \(n-l\le i< n\) and \(0\le j < n\) for the same parameter values.
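The construction can be checked numerically. The following sketch assumes that, with the above parameter choices, the pre-softmax attention score of Eqs. (8) and (9) reduces to \(c\) when \(j-i=l\) and to 0 otherwise (which is what the zeroed \(\textrm{W}_Q\), \(\textrm{W}_K\) and the choices of \(\textbf{r}_l\), \(\textbf{t}\) yield); \(n\), \(l\), and \(\epsilon\) are arbitrary illustrative values.

```python
import numpy as np

n, l, eps = 100, 5, 0.01
# Choose c just above the bound ln((1 - eps) / eps) + ln(n - 1) of Lemma 1.
c = np.log((1 - eps) / eps) + np.log(n - 1) + 0.01

# Pre-softmax scores: c where j - i == l, 0 elsewhere (see lead-in assumption).
E = np.fromfunction(lambda i, j: np.where(j - i == l, c, 0.0), (n, n))
P = np.exp(E)
P /= P.sum(axis=1, keepdims=True)

for i in range(n - l):
    assert P[i, i + l] > 1 - eps          # attention concentrates on position i + l
for i in range(n - l, n):
    assert np.allclose(P[i], 1.0 / n)     # last l rows: uniform attention
print("Lemma 1 construction verified")
```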
B Proof of Proposition 1
From Eq. (11), we have \(\textbf{U}_i=\textbf{Y}_i+\textbf{X}_i\) and \(\textbf{U}''_i=\text {FFN}(\textbf{U}_i)+\textbf{U}_i\) for \(i=0, \cdots , n-1\), where \(\textbf{Y}_0, \textbf{Y}_1, \cdots , \textbf{Y}_{n-1} = {RelPositionalSelfAttention}(\textbf{X}_0, \textbf{X}_1, \cdots , \textbf{X}_{n-1})\). The output of \(\text {TN}_{\text {1L}}\) is given by \(\text {TN}_{1\text {L}}(\textbf{X}_0,\cdots ,\textbf{X}_{n-1}) = \frac{1}{n} \sum _{i=0}^{n-1} \textbf{U}''_i \).
From Eqs. (4) and (5), we get \(\textbf{Y}_j = \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{jk}\textrm{W}_V\textbf{X}_k \right) \). Thus, we can write \(\textbf{Y}_{m_1}\) (where \(m_1\) is defined in Assumption 1) as
\(\textbf{Y}_{m_1} = \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{m_1 k}\textrm{W}_V\textbf{X}_k \right)\)   (a1)
\(\phantom{\textbf{Y}_{m_1}} = \textrm{W}_O\textrm{W}_V\textbf{X}_{m_1+l}.\)   (a2)
Equation (a2) follows since \(i=m_1\) satisfies \(P_{i,i+l}=1\) by Assumption 3. Similarly, we can write \(\textbf{Y}_{i}\) for \(0\le i< n-l, i\ne m_1\) as
\(\textbf{Y}_i = \textrm{W}_O\textrm{W}_V\textbf{X}_{i+l}.\)   (a3)
For \(n-l \le i <n\), we can write
\(\textbf{Y}_i = \frac{1}{n}\,\textrm{W}_O\textrm{W}_V\sum _{j=0}^{n-1}\textbf{X}_j,\)   (a4)
since, by Assumption 3, \(P_{ij} = 1/n\) for \(j=0, \cdots , n-1\) and \(n-l\le i<n\). Now we compute \(\textbf{U}''_i = \text {FFN}(\textbf{Y}_i+\textbf{X}_i)+\textbf{Y}_i+\textbf{X}_i\) for \(i=0, \cdots , n-1\) by substituting the expressions for \(\textbf{Y}_i\) obtained above.
Note that among all the \(\{\textbf{U}''_i\}_{0\le i<n}\), only \(\textbf{U}''_{m_1}\) and \(\{\textbf{U}''_i\}_{n-l\le i <n}\) involve both \(\textbf{X}_{m_1}\) and \(\textbf{X}_{m_1+l}\), and thus can depend on the sensitive variable Z (by Assumption 1). The remaining \(\textbf{U}''_i\)'s are independent of Z (by Assumption 2). The output of \(\text {TN}_{\text {1L}}\) can be written as
\(\text {TN}_{1\text {L}}(\textbf{X}_0,\cdots ,\textbf{X}_{n-1}) = \frac{1}{n}\Big ( \textbf{U}''_{m_1} + \sum _{i=n-l}^{n-1}\textbf{U}''_i + \sum _{0\le i<n-l,\, i\ne m_1}\textbf{U}''_i \Big ).\)
The expectation of the output conditioned on Z is given by
\(\mathbb {E}\left[ \text {TN}_{1\text {L}}\mid Z\right] = \frac{1}{n}\Big ( \mathbb {E}[\textbf{U}''_{m_1}\mid Z] + \sum _{i=n-l}^{n-1}\mathbb {E}[\textbf{U}''_i\mid Z] + \sum _{0\le i<n-l,\, i\ne m_1}\mathbb {E}[\textbf{U}''_i\mid Z] \Big )\)
\(\phantom{\mathbb {E}\left[ \text {TN}_{1\text {L}}\mid Z\right]} = \frac{1}{n}\Big ( \mathbb {E}[\textbf{U}''_{m_1}\mid Z] + \sum _{i=n-l}^{n-1}\mathbb {E}[\textbf{U}''_i\mid Z] + \sum _{0\le i<n-l,\, i\ne m_1}\mathbb {E}[\textbf{U}''_i] \Big ).\)
The second step follows because the random variables \(\{\textbf{U}_i\}_{0\le i <n-l,i\ne m_1}\) are independent of Z, and each \(\textbf{U}''_i\) is a deterministic function of \(\textbf{U}_i\). To complete the proof, we compute
From Assumption 2, we get
Thus, comparing the right-hand sides of Eq. (a8) and Eq. (a9), we have
which completes the proof.
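As an illustration of the proposition's message, that relative positional self-attention followed by mean pooling yields a shift-invariant map, the following numpy sketch checks invariance numerically. To keep the check exact, it uses circular shifts and a relative embedding that is periodic in \((j-i) \bmod n\); these are our simplifications, and the attention parametrization is a generic Shaw-style one rather than the exact Eqs. (8) and (9).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dk = 64, 8, 8
WQ, WK, WV = (rng.normal(size=(d, dk)) for _ in range(3))
WO = rng.normal(size=(dk, d))
R = rng.normal(size=(n, dk))                 # relative embedding for (j - i) mod n

def rel_attn_mean_pool(X):
    """One relative-position self-attention layer followed by mean pooling."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    E = np.array([[Q[i] @ (K[j] + R[(j - i) % n]) / np.sqrt(dk)
                   for j in range(n)] for i in range(n)])
    P = np.exp(E - E.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)        # attention probabilities
    return ((P @ V) @ WO).mean(axis=0)       # mean pooling removes position

X = rng.normal(size=(n, d))
for s in (1, 7, 33):
    shifted = np.roll(X, s, axis=0)          # relative distances are preserved
    assert np.allclose(rel_attn_mean_pool(X), rel_attn_mean_pool(shifted))
print("output is invariant to (circular) shifts of the input")
```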
C Comparison with CNN Using Global Pooling Model
The state-of-the-art CNN models use a flattening layer after all the convolutional layers to convert the two-dimensional feature representation into a one-dimensional one. However, the flattening layer reduces the shift-invariance of the CNN models, resulting in their poor performance on highly desynchronized traces (ref. Fig. 4d). This section compares TransNet to a CNN model that uses global pooling instead of the flattening layer. For this purpose, we have used the same model as EffCNN (desync400), except that the flattening layer is replaced by a global pooling layer. We refer to the resulting model as EffCNN+GlobalPooling. The results of EffCNN+GlobalPooling on the highly desynchronized ASCAD_desync0 dataset are compared with those of TransNet in Fig. 9. The results suggest that TransNet performs significantly better than EffCNN+GlobalPooling.
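The architectural difference can be sketched as follows (a PyTorch illustration under our own simplified shapes; the actual EffCNN hyperparameters of Zaid et al. are not reproduced). A global pooling head maps circularly shifted feature maps to identical logits, whereas a flattening head does not; for a real, non-circular shift the invariance of global pooling is only approximate.

```python
import torch
import torch.nn as nn

def head(kind, channels=128, n_classes=256):
    """Classification head placed after the convolutional blocks."""
    if kind == "flatten":
        return nn.Sequential(nn.Flatten(), nn.LazyLinear(n_classes))
    return nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                         nn.Linear(channels, n_classes))

feats = torch.randn(1, 128, 50)               # conv output: (batch, channels, length)
shifted = torch.roll(feats, shifts=7, dims=2) # circular shift of the feature map

gp, fl = head("global"), head("flatten")
print(torch.allclose(gp(feats), gp(shifted), atol=1e-5))   # True: shift-invariant
print(torch.allclose(fl(feats), fl(shifted), atol=1e-5))   # False in general
```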
D Sensitivity of EffCNN to Profiling Desynchronization
As with the TransNet experiments in Sect. 6.6, we verify the robustness of EffCNN training to the amount of profiling desynchronization. To that end, we trained the EffCNN models using only synchronized traces and tested them on desynchronized traces. The results are shown in Fig. 10. As the figure shows, the performance of the models degrades rapidly as the amount of desynchronization in the attack traces increases, suggesting the superiority of TransNet over EffCNN when the profiling desync is significantly smaller than the attack desync.
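A sketch of this evaluation protocol is given below. It assumes a model exposing the sklearn-style predict_log_proba interface of the earlier profiling sketch, and the circular-shift simulation of desynchronization is our simplification (real desynchronization crops or pads the traces).

```python
import numpy as np

def desynchronize(traces, max_shift, rng):
    """Shift every trace by an independent random offset in [0, max_shift]."""
    if max_shift == 0:
        return traces
    shifts = rng.integers(0, max_shift + 1, size=len(traces))
    return np.array([np.roll(t, s) for t, s in zip(traces, shifts)])

def rank_vs_desync(model, traces, plaintexts, sbox, true_key, desyncs):
    """Key rank of a model trained on synchronized traces, as the amount of
    attack-time desynchronization grows."""
    rng = np.random.default_rng(0)
    ranks = []
    for d in desyncs:                                  # e.g. [0, 100, 200, 400]
        x = desynchronize(traces, d, rng)
        log_p = model.predict_log_proba(x)             # shape (N, 256)
        scores = np.array([
            log_p[np.arange(len(x)), sbox[plaintexts ^ g]].sum()
            for g in range(256)
        ])
        order = np.argsort(scores)[::-1]
        ranks.append(int(np.where(order == true_key)[0][0]))
    return ranks   # a robust model keeps low ranks even for large desync values
```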