
TransNet: Shift Invariant Transformer Network for Side Channel Analysis

Abstract

Deep learning (DL) has revolutionized Side Channel Analysis (SCA) in recent years. One of the major advantages of DL in the context of SCA is that it can automatically handle masking and desynchronization countermeasures, even when both are applied simultaneously to a cryptographic implementation. However, the success of the attack strongly depends on the DL model used. Traditionally, Convolutional Neural Networks (CNNs) have been used for this purpose. This work proposes to use a Transformer Network (TN) for attacking implementations protected by masking and desynchronization. Our choice is motivated by the fact that a TN is good at capturing dependencies among distant points of interest in a power trace. Furthermore, we show that a TN can be made shift-invariant, an important property for handling desynchronized traces. Experimental validation on several public datasets establishes that our proposed TN-based model, called TransNet, outperforms the present state-of-the-art on several occasions. Specifically, TransNet outperforms the other methods by a wide margin when the traces are highly desynchronized. Additionally, TransNet shows good attack performance against implementations with desynchronized traces even when it is trained only on synchronized traces.


Notes

  1. In profiling SCA, the adversary possesses a device similar to the target device and uses it to train a model; the trained model is then used to attack the target device. Profiling SCA assumes the strongest adversary and thus provides a worst-case security analysis of a cryptographic device. In this work, we consider profiling SCA only.

  2. If the pool size and stride of the average pooling layer are both set to 1, the model behaves as if there were no average pooling layer. Setting them to larger values makes the model more computationally efficient, at the cost of attack efficacy and shift-invariance (see the short sketch after these notes).

  3. https://github.com/ANSSI-FR/ASCAD.git.

  4. Note that the power traces of software implementations are typically on the order of \(10^5\) samples long. For example, the traces of the ASCAD dataset are 100,000 points long. Thus, a desync value such as 400 is plausible for such traces.
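The trade-off described in Note 2 can be illustrated with a minimal sketch (Python with PyTorch assumed; the 700-sample trace length and pooling parameters are illustrative choices, not TransNet's actual configuration). It shows how the pool size and stride control the length of the sequence that the subsequent layers must process.

import torch
import torch.nn as nn

trace = torch.randn(1, 1, 700)                            # (batch, channels, trace length)

identity_pool = nn.AvgPool1d(kernel_size=1, stride=1)     # behaves like no pooling at all
coarse_pool = nn.AvgPool1d(kernel_size=4, stride=4)       # 4x shorter sequence

print(identity_pool(trace).shape)                         # torch.Size([1, 1, 700])
print(coarse_pool(trace).shape)                           # torch.Size([1, 1, 175])
# With kernel_size = stride = 1, every sample passes through unchanged. With a
# larger value, each output sample averages several neighbouring time samples,
# which reduces compute but blurs small shifts, i.e. it costs attack efficacy
# and shift-invariance, as stated in Note 2.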



Appendices

A Proof of Lemma 1

The attention probabilities in the self-attention layer of \(\text {TN}_{\text {1L}}\) are calculated following Eqs. (8) and (9). In Eqs. (8) and (9), set \(\textrm{W}_Q\), \(\textrm{W}_K\), and \(\{\textbf{r}_i\}_{i\ne l}\) to zeros of appropriate dimensions, \(\textbf{r}_l=c\sqrt{d_k}\textbf{1}\), and \(\textbf{t}=\textbf{1}\), where \(\textbf{1}\) is the vector whose first element is 1 and whose remaining elements are zero, and c is a real constant. Then \(p_{ij}\) equals \(\frac{e^c}{e^c+n-1}\) if \(j=i+l\) and \(\frac{1}{e^c+n-1}\) otherwise, for \(0\le i<n-l\). Setting \(c> \ln \left( \frac{1-\epsilon }{\epsilon }\right) +\ln (n-1)\), we get \(p_{i,i+l}>1-\epsilon \) for all \(0\le i < n-l\) and any \(0<\epsilon <1\). Similarly, it is straightforward to show that \(p_{ij}=1/n\) for all \(n-l\le i< n\) and \(0\le j < n\) under the same parameter setting.
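The construction can be checked numerically. The following minimal sketch (NumPy) builds the logit pattern induced by the parameter choice above, namely logit c at column i+l of row i for i < n-l and logit 0 everywhere else, and verifies the claimed probabilities and the bound on c. It does not re-implement Eqs. (8) and (9), and the values of n, l, and eps are illustrative.

import numpy as np

n, l, eps = 16, 3, 0.01
c = np.log((1 - eps) / eps) + np.log(n - 1) + 1e-6     # just above the stated bound

logits = np.zeros((n, n))
for i in range(n - l):
    logits[i, i + l] = c                               # rows i >= n - l stay all-zero

p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# For i < n - l: p[i, i+l] = e^c / (e^c + n - 1) and every other entry of the
# row equals 1 / (e^c + n - 1), exactly as in the proof.
assert np.all(p[np.arange(n - l), np.arange(n - l) + l] > 1 - eps)
assert np.allclose(p[n - l:], 1.0 / n)                 # last l rows are uniform
print(p[0, l], 1 - eps)                                # e.g. 0.990000... > 0.99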

B Proof of Proposition 1

From Eqs. (11), we have \(\textbf{U}_i=\textbf{Y}_i+\textbf{X}_i\) and \(\textbf{U}''_i=\text {FFN}(\textbf{U}_i)+\textbf{U}_i\) for \(i=0, \cdots , n-1\), where \(\textbf{Y}_0, \textbf{Y}_1, \cdots , \textbf{Y}_{n-1} = {RelPositionalSelfAttention}(\textbf{X}_0, \textbf{X}_1, \cdots , \textbf{X}_{n-1})\). The output of \(\text {TN}_{\text {1L}}\) is given by \(\text {TN}_{\text {1L}}(\textbf{X}_0,\cdots ,\textbf{X}_{n-1}) = \frac{1}{n} \sum _{i=0}^{n-1} \textbf{U}''_i \).

From Eqs. (4) and (5), we get \(\textbf{Y}_j = \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{jk}\textrm{W}_V\textbf{X}_k \right) \). Thus, we can write \(\textbf{Y}_{m_1}\) (where \(m_1\) is defined in Assumption 1) as

$$\begin{aligned} \textbf{Y}_{m_1}&= \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{m_1k}\textrm{W}_V\textbf{X}_k \right) = \textrm{W}_O\textrm{W}_V\textbf{X}_{m_1+l}, \text { and thus} \end{aligned}$$
(a2)
$$\begin{aligned} \textbf{U}_{m_1}&= \textrm{W}_O\textrm{W}_V\textbf{X}_{m_1+l} + \textbf{X}_{m_1} \end{aligned}$$
(a3)

Equation (a2) follows since \(i=m_1\) satisfies \(P_{i,i+l}=1\) by Assumption 3. Similarly, we can write \(\textbf{Y}_{i}\) for \(0\le i< n-l\), \(i\ne m_1\), as

$$\begin{aligned} \textbf{Y}_{i}&= \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{ik}\textrm{W}_V\textbf{X}_k \right) = \textrm{W}_O\textrm{W}_V\textbf{X}_{i+l}, \text { and thus} \end{aligned}$$
(a4)
$$\begin{aligned} \textbf{U}_i&= \textrm{W}_O\textrm{W}_V\textbf{X}_{i+l} + \textbf{X}_i \end{aligned}$$
(a5)

For \(n-l \le i <n\), we can write

$$\begin{aligned} \textbf{Y}_{i}&= \frac{1}{n}\textrm{W}_O\textrm{W}_V\sum _{k=0}^{n-1}\textbf{X}_{k}\quad \text { and, } \textbf{U}_{i} = \frac{1}{n}\textrm{W}_O\textrm{W}_V\sum _{k=0}^{n-1}\textbf{X}_{k} + \textbf{X}_i \end{aligned}$$

since, by Assumption 3, \(P_{ij} = 1/n\) for \(j=0, \cdots , n-1\) and \(n-l\le i<n\). Now we compute \(\textbf{U}''_i\) for \(i=0, \cdots , n-1\).

$$\begin{aligned} \textbf{U}''_i = {FFN}(\textbf{U}_i) + \textbf{U}_i \end{aligned}$$
(a6)

Note that among all the \(\{\textbf{U}''_i\}_{0\le i<n}\), only \(\textbf{U}''_{m_1}\) and \(\{\textbf{U}''_i\}_{n-l\le i <n}\) involve both the terms \(\textbf{X}_{m_1}\) and \(\textbf{X}_{m_1+l}\), and thus can depend on the sensitive variable Z (by Assumption 1). The remaining \(\textbf{U}''_i\) are independent of Z (by Assumption 2). The output of \(\text {TN}_{\text {1L}}\) can be written as

$$\begin{aligned} \text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})&= \frac{1}{n} \sum _{i=0}^{n-1}\textbf{U}''_i = \frac{1}{n}\textbf{U}''_{m_1} + \frac{1}{n}\sum _{0\le i<n-l, i\ne m_1} \textbf{U}''_i + \frac{1}{n}\sum _{n-l\le i<n }\textbf{U}''_i \end{aligned}$$
(a7)

The expectation of the output conditioned on Z can be given by

$$\begin{aligned} \mathbb {E}[\text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})|Z]&=\frac{1}{n}\mathbb {E}[\textbf{U}''_{m_1}|Z] +\frac{1}{n}\sum _{n-l\le i<n} \mathbb {E}[\textbf{U}''_i|Z]+\frac{1}{n}\sum _{0\le i<n-l, i\ne m_1} \mathbb {E}[\textbf{U}''_i] \end{aligned}$$
(a8)

The last equality follows because the random variables \(\{\textbf{U}''_i\}_{0\le i <n-l,i\ne m_1}\) are independent of Z. To complete the proof, we compute

$$\begin{aligned}&\mathbb {E}\left[ \text {TN}_{\text {1L}}(T^s(\textbf{X}_{-n+1+m_2}, \cdots , \textbf{X}_{n-1+m_1}))|Z\right] \\&= \mathbb {E}\left[ \text {TN}_{\text {1L}}(\textbf{X}_{-s}, \cdots , \textbf{X}_{n-1-s})|Z\right] \\&= \frac{1}{n}\mathbb {E}[\textbf{U}''_{m_1}|Z] + \frac{1}{n} \sum _{n-l-s\le i<n-s} \mathbb {E}[\textbf{U}''_i|Z] + \frac{1}{n}\sum _{-s\le i<n-l-s, i\ne m_1}\mathbb {E}\left[ \textbf{U}''_i\right] \end{aligned}$$
(a9)

From Assumption 2, we get

$$\begin{aligned} \frac{1}{n}\sum _{n-l\le i<n}\mathbb {E}\left[ \textbf{U}''_{i}|Z\right]&= \frac{1}{n} \sum _{n-l-s\le i<n-s} \mathbb {E}\left[ \textbf{U}''_i|Z\right] , \\ \text {and } \frac{1}{n}\sum _{0\le i<n-l, i\ne {m_1}} \mathbb {E}\left[ \textbf{U}''_i\right]&= \frac{1}{n}\sum _{-s\le i<n-l-s, i\ne m_1}\mathbb {E}[\textbf{U}''_i] \end{aligned}$$

Thus, comparing the right-hand sides of Eqs. (a8) and (a9), we have

$$\begin{aligned}&\mathbb {E}[\text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})|Z] = \mathbb {E}\left[ \text {TN}_{\text {1L}}(T^s(\textbf{X}_{-n+1+m_2}, \cdots , \textbf{X}_{n-1+m_1}))|Z\right] \end{aligned}$$

which completes the proof.
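The statement can also be checked by simulation. The minimal sketch below (NumPy) hard-codes the idealised attention pattern assumed in the proof (\(P_{i,i+l}=1\) for \(i<n-l\), uniform rows for the last l positions), models the two leakage samples as Boolean-masked shares of a bit Z with every other sample i.i.d. noise, and compares Monte-Carlo estimates of the conditional mean of the pooled output for an unshifted and a shifted window. All sizes, the leakage model, the random weights, and the FFN are illustrative assumptions, not the trained TransNet.

import numpy as np

rng = np.random.default_rng(0)
n, l, d, a, s, alpha = 16, 3, 4, 6, 2, 2.0       # window, offset, dim, leak position, shift, leak amplitude
Wov = rng.normal(size=(d, d)) / np.sqrt(d)       # stands in for W_O @ W_V
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def tn1l(x):
    # One-layer TN with the idealised attention pattern, followed by mean pooling.
    y = np.zeros_like(x)
    y[: n - l] = x[l:] @ Wov.T                   # row i attends only to row i + l
    y[n - l:] = x.mean(axis=0) @ Wov.T           # last l rows attend uniformly
    u = y + x                                    # residual connection
    u2 = np.maximum(u @ W1.T, 0.0) @ W2.T + u    # FFN(u) + u
    return u2.mean(axis=0)                       # global average pooling

def mean_output(z, shift, trials=20_000):
    # Monte-Carlo estimate of E[TN_1L(window) | Z = z] for a window starting at `shift`.
    out = np.zeros(d)
    for _ in range(trials):
        long_trace = rng.normal(size=(n + s, d))
        m = rng.integers(2)                                   # random mask bit
        long_trace[a, 0] += alpha * (2 * m - 1)               # share 1 (mask)
        long_trace[a + l, 0] += alpha * (2 * (z ^ m) - 1)     # share 2 (Z xor mask)
        out += tn1l(long_trace[shift: shift + n])
    return out / trials

for z in (0, 1):
    e0, es = mean_output(z, 0), mean_output(z, s)
    print(z, np.max(np.abs(e0 - es)))            # ~1e-2, i.e. Monte-Carlo error; shrinks with more trials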

C Comparison with CNN Using Global Pooling Model

The state-of-the-art CNN models use a flattening layer after the convolutional layers to convert the two-dimensional feature representation into a one-dimensional one. However, the flattening layer reduces the shift-invariance of the CNN models, resulting in poor performance on highly desynchronized traces (cf. Fig. 4d). This section compares TransNet to a CNN model that uses global pooling instead of the flattening layer. For this purpose, we use the same model as EffCNN (desync400), except that the flattening layer is replaced by a global pooling layer. We refer to the resulting model as EffCNN+GlobalPooling. The results of EffCNN+GlobalPooling on the highly desynchronized ASCAD_desync400 dataset are compared with those of TransNet in Fig. 9. The results suggest that TransNet performs significantly better than EffCNN+GlobalPooling.
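To make the modification concrete, here is a minimal sketch (PyTorch assumed; the layer sizes are illustrative and do not reproduce the actual EffCNN architecture). The only difference between the two heads is that the flattening layer is replaced by a global average pooling over the time axis.

import torch
import torch.nn as nn

conv_body = nn.Sequential(                       # shared convolutional part
    nn.Conv1d(1, 8, kernel_size=11, padding=5), nn.ReLU(), nn.AvgPool1d(2),
    nn.Conv1d(8, 16, kernel_size=11, padding=5), nn.ReLU(), nn.AvgPool1d(2),
)
# For a 700-sample trace the body outputs a (batch, 16, 175) feature map.

flatten_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 175, 256))   # position-sensitive
globalpool_head = nn.Sequential(                 # "EffCNN+GlobalPooling"-style head
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 256)
)

x = torch.randn(4, 1, 700)                       # a batch of 700-sample traces
features = conv_body(x)
print(flatten_head(features).shape)              # torch.Size([4, 256])
print(globalpool_head(features).shape)           # torch.Size([4, 256])
# The global-pooling head discards absolute time positions, so its output is far
# less sensitive to a shift of the input trace than that of the flattened head.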

Fig. 9. Comparison of TransNet with EffCNN+GlobalPooling on the ASCAD_desync400 dataset.

Fig. 10. Results of EffCNN. The models have been trained with profiling desync 0.

D Sensitivity of EffCNN to Profiling Desynchronization

As with the experiments on TransNet in Sect. 6.6, we verify the robustness of EffCNN to the amount of profiling desynchronization. To do so, we trained the EffCNN models using only synchronized traces and tested them on desynchronized traces. The results are shown in Fig. 10. The figure shows that, as the amount of desynchronization in the attack traces increases, the performance of the models degrades rapidly, suggesting the superiority of TransNet over EffCNN when the profiling desync is significantly smaller than the attack desync.
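The following minimal sketch (NumPy) illustrates the kind of evaluation described above: windows are cut from raw traces at a random offset in [0, max_desync], so a set built with desync 0 is perfectly aligned while a set built with desync 400 is heavily misaligned. The windowing convention and the helper name random_windows are illustrative assumptions, not taken from the paper's experimental code.

import numpy as np

rng = np.random.default_rng(1)

def random_windows(traces, window_len, max_desync):
    # Cut a window of `window_len` samples from each trace, starting at a random
    # offset in [0, max_desync]; max_desync = 0 keeps every window aligned.
    shifts = rng.integers(0, max_desync + 1, size=len(traces))
    return np.stack([t[s: s + window_len] for t, s in zip(traces, shifts)])

raw = rng.normal(size=(1000, 1100))              # stand-in for raw measured traces
profiling_set = random_windows(raw, 700, 0)      # synchronized (profiling desync 0)
attack_set = random_windows(raw, 700, 400)       # desync-400 attack traces
print(profiling_set.shape, attack_set.shape)     # (1000, 700) (1000, 700)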


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hajra, S., Saha, S., Alam, M., Mukhopadhyay, D. (2022). TransNet: Shift Invariant Transformer Network for Side Channel Analysis. In: Batina, L., Daemen, J. (eds) Progress in Cryptology - AFRICACRYPT 2022. AFRICACRYPT 2022. Lecture Notes in Computer Science, vol 13503. Springer, Cham. https://doi.org/10.1007/978-3-031-17433-9_16


  • DOI: https://doi.org/10.1007/978-3-031-17433-9_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17432-2

  • Online ISBN: 978-3-031-17433-9
