
TransNet: Shift Invariant Transformer Network for Side Channel Analysis

Abstract

Deep learning (DL) has revolutionized Side Channel Analysis (SCA) in recent years. One of the major advantages of DL in the context of SCA is that it can automatically handle masking and desynchronization countermeasures, even when both are applied simultaneously to a cryptographic implementation. However, the success of the attack strongly depends on the DL model used. Traditionally, Convolutional Neural Networks (CNNs) have been used for this purpose. This work proposes to use a Transformer Network (TN) for attacking implementations protected by masking and desynchronization. Our choice is motivated by the fact that a TN is good at capturing dependencies among distant points of interest in a power trace. Furthermore, we show that a TN can be made shift-invariant, an important property for handling desynchronized traces. Experimental validation on several public datasets establishes that our proposed TN-based model, called TransNet, outperforms the present state-of-the-art on several occasions. Specifically, TransNet outperforms the other methods by a wide margin when the traces are highly desynchronized. Additionally, TransNet shows good attack performance against implementations with desynchronized traces even when it is trained only on synchronized traces.


Notes

  1. In profiling SCA, the adversary possesses a device similar to the target device and uses it to train a model; the trained model is then used to attack the target device. Profiling SCA assumes the strongest adversary and thus provides a worst-case security analysis of a cryptographic device. In this work, we consider profiling SCA only.

  2. If the pool size and stride of the average pooling layer are both set to 1, the model behaves as if there were no average pooling layer. Setting them to larger values makes the model more computationally efficient, at the cost of attack efficacy and shift-invariance (see the short sketch after these notes).

  3. https://github.com/ANSSI-FR/ASCAD.git.

  4. Note that the power traces of software implementations are typically on the order of \(10^5\) samples long. For example, the traces of the ASCAD dataset are 100,000 points long. Thus, a desync value such as 400 is plausible for such traces.
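The trade-off described in Note 2 can be illustrated with a minimal sketch (Python with PyTorch assumed; the 700-sample trace length and pooling parameters are illustrative choices, not TransNet's actual configuration). It shows how the pool size and stride control the length of the sequence that the subsequent layers must process.

import torch
import torch.nn as nn

trace = torch.randn(1, 1, 700)                            # (batch, channels, trace length)

identity_pool = nn.AvgPool1d(kernel_size=1, stride=1)     # behaves like no pooling at all
coarse_pool = nn.AvgPool1d(kernel_size=4, stride=4)       # 4x shorter sequence

print(identity_pool(trace).shape)                         # torch.Size([1, 1, 700])
print(coarse_pool(trace).shape)                           # torch.Size([1, 1, 175])
# With kernel_size = stride = 1, every sample passes through unchanged. With a
# larger value, each output sample averages several neighbouring time samples,
# which reduces compute but blurs small shifts, i.e. it costs attack efficacy
# and shift-invariance, as stated in Note 2.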



Appendices

A Proof of Lemma 1

The attention probabilities in the self-attention layer of \(\text {TN}_{\text {1L}}\) are calculated following Eqs. (8) and (9). In Eqs. (8) and (9), set \(\textrm{W}_Q\), \(\textrm{W}_K\), and \(\{\textbf{r}_i\}_{i\ne l}\) to zeros of appropriate dimensions, \(\textbf{r}_l=c\sqrt{d_k}\textbf{1}\), and \(\textbf{t}=\textbf{1}\), where \(\textbf{1}\) is the vector whose first element is 1 and whose remaining elements are zero, and c is a real constant. Then \(p_{ij}\) equals \(\frac{e^c}{e^c+n-1}\) if \(j=i+l\) and \(\frac{1}{e^c+n-1}\) otherwise, for \(0\le i<n-l\). Setting \(c> \ln \left( \frac{1-\epsilon }{\epsilon }\right) +\ln (n-1)\), we get \(p_{i,i+l}>1-\epsilon \) for all \(0\le i < n-l\) and any \(0<\epsilon <1\). Similarly, it is straightforward to show that \(p_{ij}=1/n\) for all \(n-l\le i< n\) and \(0\le j < n\) under the same parameter setting.
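The construction can be checked numerically. The following minimal sketch (NumPy) builds the logit pattern induced by the parameter choice above, namely logit c at column i+l of row i for i < n-l and logit 0 everywhere else, and verifies the claimed probabilities and the bound on c. It does not re-implement Eqs. (8) and (9), and the values of n, l, and eps are illustrative.

import numpy as np

n, l, eps = 16, 3, 0.01
c = np.log((1 - eps) / eps) + np.log(n - 1) + 1e-6     # just above the stated bound

logits = np.zeros((n, n))
for i in range(n - l):
    logits[i, i + l] = c                               # rows i >= n - l stay all-zero

p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# For i < n - l: p[i, i+l] = e^c / (e^c + n - 1) and every other entry of the
# row equals 1 / (e^c + n - 1), exactly as in the proof.
assert np.all(p[np.arange(n - l), np.arange(n - l) + l] > 1 - eps)
assert np.allclose(p[n - l:], 1.0 / n)                 # last l rows are uniform
print(p[0, l], 1 - eps)                                # e.g. 0.990000... > 0.99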

B Proof of Proposition 1

From Eqs. (11), we have \(\textbf{U}_i=\textbf{Y}_i+\textbf{X}_i\) and \(\textbf{U}''_i=\text {FFN}(\textbf{U}_i)+\textbf{U}_i\) for \(i=0, \cdots , n-1\), where \(\textbf{Y}_0, \textbf{Y}_1, \cdots , \textbf{Y}_{n-1} = {RelPositionalSelfAttention}(\textbf{X}_0, \textbf{X}_1, \cdots , \textbf{X}_{n-1})\). The output of \(\text {TN}_{\text {1L}}\) is given by \(\text {TN}_{\text {1L}}(\textbf{X}_0,\cdots ,\textbf{X}_{n-1}) = \frac{1}{n} \sum _{i=0}^{n-1} \textbf{U}''_i \).

From Eqs. (4) and (5), we get \(\textbf{Y}_j = \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{jk}\textrm{W}_V\textbf{X}_k \right) \). Thus, we can write \(\textbf{Y}_{m_1}\) (where \(m_1\) is defined in Assumption 1) as

$$\begin{aligned} \textbf{Y}_{m_1}&= \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{m_1k}\textrm{W}_V\textbf{X}_k \right) = \textrm{W}_O\textrm{W}_V\textbf{X}_{m_1+l}, \text { and thus} \end{aligned}$$
(a2)
$$\begin{aligned} \textbf{U}_{m_1}&= \textrm{W}_O\textrm{W}_V\textbf{X}_{m_1+l} + \textbf{X}_{m_1} \end{aligned}$$
(a3)

Equation (a2) follows since \(i=m_1\) satisfies \(P_{i,i+l}=1\) by Assumption 3. Similarly, we can write \(\textbf{Y}_{i}\) for \(0\le i< n-l\), \(i\ne m_1\), as

$$\begin{aligned} \textbf{Y}_{i}&= \textrm{W}_O\left( \sum _{k=0}^{n-1} P_{ik}\textrm{W}_V\textbf{X}_k \right) = \textrm{W}_O\textrm{W}_V\textbf{X}_{i+l}, \text { and thus} \end{aligned}$$
(a4)
$$\begin{aligned} \textbf{U}_i&= \textrm{W}_O\textrm{W}_V\textbf{X}_{i+l} + \textbf{X}_i \end{aligned}$$
(a5)

For \(n-l \le i <n\), we can write

$$\begin{aligned} \textbf{Y}_{i}&= \frac{1}{n}\textrm{W}_O\textrm{W}_V\sum _{k=0}^{n-1}\textbf{X}_{k}\quad \text { and, } \textbf{U}_{i} = \frac{1}{n}\textrm{W}_O\textrm{W}_V\sum _{k=0}^{n-1}\textbf{X}_{k} + \textbf{X}_i \end{aligned}$$

since, by Assumption 3, \(P_{ij} = 1/n\) for \(j=0, \cdots , n-1\) and \(n-l\le i<n\). Now we compute \(\textbf{U}''_i\) for \(i=0, \cdots , n-1\).

$$\begin{aligned} \textbf{U}''_i = {FFN}(\textbf{U}_i) + \textbf{U}_i \end{aligned}$$
(a6)

Note that among all the \(\{\textbf{U}''_i\}_{0\le i<n}\), only \(\textbf{U}''_{m_1}\) and \(\{\textbf{U}''_i\}_{n-l\le i <n}\) involve both the terms \(\textbf{X}_{m_1}\) and \(\textbf{X}_{m_1+l}\), and thus can depend on the sensitive variable Z (by Assumption 1). The remaining \(\textbf{U}''_i\) are independent of Z (by Assumption 2). The output of \(\text {TN}_{\text {1L}}\) can be written as

$$\begin{aligned} \text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})&= \frac{1}{n} \sum _{i=0}^{n-1}\textbf{U}''_i = \frac{1}{n}\textbf{U}''_{m_1} + \frac{1}{n}\sum _{0\le i<n-l, i\ne m_1} \textbf{U}''_i + \frac{1}{n}\sum _{n-l\le i<n }\textbf{U}''_i \end{aligned}$$
(a7)

The expectation of the output conditioned on Z can be given by

$$\begin{aligned} \mathbb {E}[\text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})|Z]&=\frac{1}{n}\mathbb {E}[\textbf{U}''_{m_1}|Z] +\frac{1}{n}\sum _{n-l\le i<n} \mathbb {E}[\textbf{U}''_i|Z]+\frac{1}{n}\sum _{0\le i<n-l, i\ne m_1} \mathbb {E}[\textbf{U}''_i] \end{aligned}$$
(a8)

The last equality follows because the random variables \(\{\textbf{U}''_i\}_{0\le i <n-l,i\ne m_1}\) are independent of Z. To complete the proof, we compute

$$\begin{aligned}&\mathbb {E}\left[ \text {TN}_{\text {1L}}(T^s(\textbf{X}_{-n+1+m_2}, \cdots , \textbf{X}_{n-1+m_1}))|Z\right] \\&= \mathbb {E}\left[ \text {TN}_{\text {1L}}(\textbf{X}_{-s}, \cdots , \textbf{X}_{n-1-s})|Z\right] \\&= \frac{1}{n}\mathbb {E}[\textbf{U}''_{m_1}|Z] + \frac{1}{n} \sum _{n-l-s\le i<n-s} \mathbb {E}[\textbf{U}''_i|Z] + \frac{1}{n}\sum _{-s\le i<n-l-s, i\ne m_1}\mathbb {E}\left[ \textbf{U}''_i\right] \end{aligned}$$
(a9)

From Assumption 2, we get

$$\begin{aligned} \frac{1}{n}\sum _{n-l\le i<n}\mathbb {E}\left[ \textbf{U}''_{i}|Z\right]&= \frac{1}{n} \sum _{n-l-s\le i<n-s} \mathbb {E}\left[ \textbf{U}''_i|Z\right] , \\ \text {and } \frac{1}{n}\sum _{0\le i<n-l, i\ne {m_1}} \mathbb {E}\left[ \textbf{U}''_i\right]&= \frac{1}{n}\sum _{-s\le i<n-l-s, i\ne m_1}\mathbb {E}[\textbf{U}''_i] \end{aligned}$$

Thus, comparing the right-hand sides of Eqs. (a8) and (a9), we have

$$\begin{aligned}&\mathbb {E}[\text {TN}_{\text {1L}}(\textbf{X}_0, \cdots , \textbf{X}_{n-1})|Z] = \mathbb {E}\left[ \text {TN}_{\text {1L}}(T^s(\textbf{X}_{-n+1+m_2}, \cdots , \textbf{X}_{n-1+m_1}))|Z\right] \end{aligned}$$

which completes the proof.
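The statement can also be checked by simulation. The minimal sketch below (NumPy) hard-codes the idealised attention pattern assumed in the proof (\(P_{i,i+l}=1\) for \(i<n-l\), uniform rows for the last l positions), models the two leakage samples as Boolean-masked shares of a bit Z with every other sample i.i.d. noise, and compares Monte-Carlo estimates of the conditional mean of the pooled output for an unshifted and a shifted window. All sizes, the leakage model, the random weights, and the FFN are illustrative assumptions, not the trained TransNet.

import numpy as np

rng = np.random.default_rng(0)
n, l, d, a, s, alpha = 16, 3, 4, 6, 2, 2.0       # window, offset, dim, leak position, shift, leak amplitude
Wov = rng.normal(size=(d, d)) / np.sqrt(d)       # stands in for W_O @ W_V
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def tn1l(x):
    # One-layer TN with the idealised attention pattern, followed by mean pooling.
    y = np.zeros_like(x)
    y[: n - l] = x[l:] @ Wov.T                   # row i attends only to row i + l
    y[n - l:] = x.mean(axis=0) @ Wov.T           # last l rows attend uniformly
    u = y + x                                    # residual connection
    u2 = np.maximum(u @ W1.T, 0.0) @ W2.T + u    # FFN(u) + u
    return u2.mean(axis=0)                       # global average pooling

def mean_output(z, shift, trials=20_000):
    # Monte-Carlo estimate of E[TN_1L(window) | Z = z] for a window starting at `shift`.
    out = np.zeros(d)
    for _ in range(trials):
        long_trace = rng.normal(size=(n + s, d))
        m = rng.integers(2)                                   # random mask bit
        long_trace[a, 0] += alpha * (2 * m - 1)               # share 1 (mask)
        long_trace[a + l, 0] += alpha * (2 * (z ^ m) - 1)     # share 2 (Z xor mask)
        out += tn1l(long_trace[shift: shift + n])
    return out / trials

for z in (0, 1):
    e0, es = mean_output(z, 0), mean_output(z, s)
    print(z, np.max(np.abs(e0 - es)))            # ~1e-2, i.e. Monte-Carlo error; shrinks with more trials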

C Comparison with CNN Using Global Pooling Model

The state-of-the-art CNN models use a flattening layer after the convolutional layers to convert the two-dimensional feature representation into a one-dimensional one. However, the flattening layer reduces the shift-invariance of the CNN models, resulting in poor performance on highly desynchronized traces (cf. Fig. 4d). This section compares TransNet to a CNN model that uses global pooling instead of the flattening layer. For this purpose, we use the same model as EffCNN (desync400), except that the flattening layer is replaced by a global pooling layer. We refer to the resulting model as EffCNN+GlobalPooling. The results of EffCNN+GlobalPooling on the highly desynchronized ASCAD_desync400 dataset are compared with those of TransNet in Fig. 9. The results suggest that TransNet performs significantly better than EffCNN+GlobalPooling.
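To make the modification concrete, here is a minimal sketch (PyTorch assumed; the layer sizes are illustrative and do not reproduce the actual EffCNN architecture). The only difference between the two heads is that the flattening layer is replaced by a global average pooling over the time axis.

import torch
import torch.nn as nn

conv_body = nn.Sequential(                       # shared convolutional part
    nn.Conv1d(1, 8, kernel_size=11, padding=5), nn.ReLU(), nn.AvgPool1d(2),
    nn.Conv1d(8, 16, kernel_size=11, padding=5), nn.ReLU(), nn.AvgPool1d(2),
)
# For a 700-sample trace the body outputs a (batch, 16, 175) feature map.

flatten_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 175, 256))   # position-sensitive
globalpool_head = nn.Sequential(                 # "EffCNN+GlobalPooling"-style head
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 256)
)

x = torch.randn(4, 1, 700)                       # a batch of 700-sample traces
features = conv_body(x)
print(flatten_head(features).shape)              # torch.Size([4, 256])
print(globalpool_head(features).shape)           # torch.Size([4, 256])
# The global-pooling head discards absolute time positions, so its output is far
# less sensitive to a shift of the input trace than that of the flattened head.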

Fig. 9. Comparison of TransNet with EffCNN+GlobalPooling on the ASCAD_desync400 dataset.

Fig. 10. Results of EffCNN. The models have been trained with profiling desync 0.

D Sensitivity of EffCNN to Profiling Desynchronization

As with the experiments on TransNet in Sect. 6.6, we verify the robustness of EffCNN to the amount of profiling desynchronization. To do so, we trained the EffCNN models using only synchronized traces and tested them on desynchronized traces. The results are shown in Fig. 10. The figure shows that, as the amount of desynchronization in the attack traces increases, the performance of the models degrades rapidly, suggesting the superiority of TransNet over EffCNN when the profiling desync is significantly smaller than the attack desync.
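The following minimal sketch (NumPy) illustrates the kind of evaluation described above: windows are cut from raw traces at a random offset in [0, max_desync], so a set built with desync 0 is perfectly aligned while a set built with desync 400 is heavily misaligned. The windowing convention and the helper name random_windows are illustrative assumptions, not taken from the paper's experimental code.

import numpy as np

rng = np.random.default_rng(1)

def random_windows(traces, window_len, max_desync):
    # Cut a window of `window_len` samples from each trace, starting at a random
    # offset in [0, max_desync]; max_desync = 0 keeps every window aligned.
    shifts = rng.integers(0, max_desync + 1, size=len(traces))
    return np.stack([t[s: s + window_len] for t, s in zip(traces, shifts)])

raw = rng.normal(size=(1000, 1100))              # stand-in for raw measured traces
profiling_set = random_windows(raw, 700, 0)      # synchronized (profiling desync 0)
attack_set = random_windows(raw, 700, 400)       # desync-400 attack traces
print(profiling_set.shape, attack_set.shape)     # (1000, 700) (1000, 700)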


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hajra, S., Saha, S., Alam, M., Mukhopadhyay, D. (2022). TransNet: Shift Invariant Transformer Network for Side Channel Analysis. In: Batina, L., Daemen, J. (eds) Progress in Cryptology - AFRICACRYPT 2022. AFRICACRYPT 2022. Lecture Notes in Computer Science, vol 13503. Springer, Cham. https://doi.org/10.1007/978-3-031-17433-9_16


  • DOI: https://doi.org/10.1007/978-3-031-17433-9_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17432-2

  • Online ISBN: 978-3-031-17433-9
