Abstract
This paper has four main goals. First, we show how we solved the CHES 2018 AES challenge in the contest using essentially a linear classifier combined with a SAT solver and a custom error correction method. This part of the paper has previously appeared in a preprint by the current authors (e-print report 2019/094) and later as a contribution to a preprint write-up of the solutions by the winning teams (e-print report 2019/860).
Second, we develop a novel deep neural network architecture for side-channel analysis that completely breaks the AES challenge, allowing for fairly reliable key recovery with just a single trace on the unknown-device part of the CHES challenge (with an expected success rate of roughly 70% if about 100 CPU hours are allowed for the equation solving stage of the attack). This solution significantly improves upon all previously published solutions of the AES challenge, including our baseline linear solution.
Third, we consider the question of leakage attribution for both the classifier we used in the challenge and for our deep neural network. For the linear classifier, direct inspection of the weight vector yields a lot of information on the implementation. For the deep neural network, we test three other strategies (occlusion of traces; inspection of adversarial changes; knowledge distillation) and find that these can yield information on the leakage essentially equivalent to that gained by inspecting the weights of the simpler model.
Fourth, we study the properties of adversarially generated side-channel traces for our model. Partly reproducing recent computer vision work by Ilyas et al. in our application domain, we find that a linear classifier that generalizes to an unseen device much better than our linear baseline can be trained using only adversarial examples (fresh random keys, adversarially perturbed traces) for our deep neural network. This gives a new way of extracting human-usable knowledge from a deep side channel model while also yielding insights on adversarial examples in an application domain where relatively few sources of spurious correlations between data and labels exist.
The experiments described in this paper can be reproduced using code available at https://github.com/agohr/ches2018.
Notes
1. To be precise, it was known which traces in the training dataset correspond to the challenge device, but no further details about this device were revealed.
2. Shuffling of the training set is important here, since we will use the last 3000 traces of T as our validation set to monitor training progress and control learning rate drops.
3. We tried other ways to combine outputs, such as taking the median of predictions. This did not yield significantly different results.
4. Note that in order to solve the Set 6 challenge, the neural network will need to be invoked for at most two or three traces.
5. Note that the traces in Set 4 all share the same key, and that a ball of radius 150 units in \(\mathbb {R}^{65000}\) covers a volume a factor of \(2^{130000}\) smaller than a ball of radius 600 units.
6. This means that we performed Ridge regression with the same hyperparameters as for the baseline predictor to predict \(K'\) given \(T'\).
7. This is not obvious for side-channel data, as useful features come from small parts of the trace.
8. This idea regarding the source of non-robust features in images is not ours; it is taken from [17] with minor changes.
9. Sampling is easier and sample collection is even better defined e.g. in cryptanalytic problems, but these have a discrete sample space and it is less clear what adversarial perturbation would mean in this case.
10. Our low-pass filter for angular frequency \(2 \pi f\) keeps only the \(\lceil N \cdot f \rceil \) lowest-index coefficients of the DFT of the trace signal and their complex conjugates, where N is the size of the reduced traces.
11. Note however that this does not imply that those global features are useful: for instance, overfitting by memorizing particular data-label pairs is using global features.
12. Note, however, that modern neural network architectures regularly contain processing layers that cannot be naturally described in this framework, for instance layers that normalize data, reshape it, apply fixed transformations, drop some data during training, or add noise.
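The low-pass filter described in Note 10 can be illustrated with a short sketch. This is not the authors' code: the function name `low_pass` and the example signal are assumptions made for illustration. The sketch uses NumPy's real-input FFT, so zeroing everything except the first \(\lceil N \cdot f \rceil \) bins of the one-sided spectrum implicitly keeps the matching complex-conjugate coefficients of the full DFT as well.

```python
import math
import numpy as np

def low_pass(trace, f):
    """Keep only the ceil(N * f) lowest-index DFT coefficients of a real trace."""
    trace = np.asarray(trace, dtype=float)
    n = trace.shape[-1]
    keep = math.ceil(n * f)
    # rfft returns the one-sided spectrum of a real signal; the conjugate
    # half of the full DFT is implied and restored by irfft.
    spectrum = np.fft.rfft(trace, axis=-1)
    spectrum[..., keep:] = 0.0
    return np.fft.irfft(spectrum, n=n, axis=-1)

# Hypothetical example: a slow cosine (DFT bin 2) plus a fast one (DFT bin 20).
t = np.arange(64)
slow = np.cos(2 * np.pi * 2 * t / 64)
fast = np.cos(2 * np.pi * 20 * t / 64)
filtered = low_pass(slow + fast, f=0.1)  # keeps DFT bins 0..6 only
```

In this toy case the fast component is removed entirely and the slow component passes through unchanged, because each cosine occupies exactly one DFT bin.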
References
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006). ISBN 978-0387-31073-2
Gohr, A., Jacob, S., Schindler, W.: CHES 2018 Side Channel Contest CTF - Solution of the AES Challenges. IACR eprint archive report 2019/094. https://eprint.iacr.org/2019/094
Damm, T., Freud, S., Klein, D.: Dissecting the CHES 2018 AES Challenge. IACR eprint archive report 2019/783. https://eprint.iacr.org/2019/783
Hu, Y., et al.: Machine Learning and Side-Channel Analysis in a CTF Competition. IACR eprint archive report 2019/860. https://eprint.iacr.org/2019/860
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Soos, M., Nohl, K., Castelluccia, C.: Extending SAT solvers to cryptographic problems. In: 12th International Conference on Theory and Applications of Satisfiability Testing - SAT 2009 (2009)
Pycryptosat homepage. https://pypi.org/project/pycryptosat/. Accessed 08 Oct 2018
Picek, S., Samiotis, I.P., Kim, J., Heuser, A., Bhasin, S., Legay, A.: On the performance of convolutional neural networks for side-channel analysis. In: Chattopadhyay, A., Rebeiro, C., Yarom, Y. (eds.) SPACE 2018. LNCS, vol. 11348, pp. 157–176. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05072-6_10
Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against Jitter-based countermeasures. In: Fischer, W., Homma, N. (eds.) CHES 2017. LNCS, vol. 10529, pp. 45–68. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66787-4_3
Kim, J., Picek, S., Heuser, A., Bhasin, S., Hanjalic, A.: Make some noise. Unleashing the power of convolutional neural networks for profiled side-channel analysis. IACR Trans. Cryptograph. Hardw. Embedded Syst. 2019(3), 148–179. https://doi.org/10.13154/tches.v2019.i3.148-179
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016). https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf
Benadjila, R., Prouff, E., Strullu, R., Cagli, E., Dumas, C.: Study of Deep Learning Techniques for Side-Channel Analysis and Introduction to the ASCAD Database. IACR eprint report 2018/053. https://eprint.iacr.org/2018/053
Gohr, A.: Improving attacks on round-reduced Speck32/64 using deep learning. In: Boldyreva, A., Micciancio, D. (eds.) CRYPTO 2019. LNCS, vol. 11693, pp. 150–179. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26951-7_6
Kingma, D.P., Ba, J.L.: ADAM: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980 (2015)
Chollet, F.: keras, GitHub (2015). https://github.com/fchollet/keras
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., Madry, A.: Adversarial Examples Are Not Bugs, They Are Features. NeurIPS 2019. https://arxiv.org/pdf/1905.02175.pdf
Barak, B.: Puzzles of Modern Machine Learning, Windows On Theory Research Blog. https://windowsontheory.org/2019/11/15/puzzles-of-modern-machine-learning/. Accessed 19 Nov 2019
Hettwer, B., Gehrer, S., Güneysu, T.: Deep neural network attribution methods for leakage analysis and symmetric key recovery. In: Paterson, K.G., Stebila, D. (eds.) SAC 2019. LNCS, vol. 11959, pp. 645–666. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38471-5_26
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). https://www.deeplearningbook.org
Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (2015)
Sharif, M., Bhagavatula, S., Bauer, L., Reiter, M.K.: Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In: ACM Conference on Computer and Communications Security 2016, New York, pp. 1528–1540 (2016)
Bellizia, D., Bronchain, O., Cassiers, G., Momin, C., Standaert, F.-X., Udvarhelyi, B. (organizers): CHES CTF 2020 Hall of Fame. Submissions. Accessed 05 July 2021
Szegedy, C.: Intriguing Properties of Neural Networks. arXiv:1312.6199
Zhou, Y., Standaert, F.-X.: Deep learning mitigates but does not annihilate the need of aligned traces and a generalized ResNet model for side-channel attacks. J. Cryptograph. Eng. 10(1), 1–11 (2019)
Acknowledgements
We thank Friederike Laus for her careful reading of an earlier iteration of this document. In the same vein, we also thank the anonymous reviewers of SAC 2020 for their thoughtful comments on the submitted version. Both helped us a lot to improve the paper.
Appendix A Choice of Network Topology: Background and Intuitions
Motivation. Artificial neural networks are machine learning systems which are very loosely inspired by animal brains. Like brains, they consist of a large number of relatively simple computing units (neurons) that are linked together in a directed graph which organizes the information flow within the network. Also like brains, artificial neural networks can be taught to perform complex tasks, sometimes at a very high level. However, for typical neural networks, many core elements, e.g. the functioning of individual neurons or the learning algorithms employed, have no resemblance to biology.
Artificial Neural Networks. An artificial neural network is a function family \(f_w:\mathbb {R}^n \rightarrow \mathbb {R}^m\) defined by a directed graph G that connects n input nodes to m output nodes via a (possibly large) number of hidden nodes, where each node \(g \in G\) has associated to it an activation function \(\varphi _g\). A specific member of the function family is selected by specifying a weight vector \(w \in \mathbb {R}^k\). Essentially, each neuron sums up the input values it receives over its incoming connections, applies its activation function, multiplies the result by the weights of its outgoing connections, and sends the results onwards (see Note 12). Neurons are usually organized in layers.
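As a minimal sketch of this definition, the following code evaluates a small layered network. The names `forward`, `relu` and `identity`, the layer sizes, and the random weights are all illustrative assumptions. The weights are applied on the incoming side of each neuron, which is equivalent to multiplying on the outgoing side, and bias terms are included as is conventional even though the description above does not mention them.

```python
import numpy as np

def forward(x, layers):
    """Evaluate a layered feed-forward network on the input vector x.

    `layers` is a list of (weights, bias, activation) triples; taken together,
    the weight matrices and biases play the role of the weight vector w.
    """
    a = np.asarray(x, dtype=float)
    for weights, bias, activation in layers:
        # Weighted sum over incoming connections, then the activation function.
        a = activation(weights @ a + bias)
    return a

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

# Hypothetical network with n = 3 inputs, 4 hidden neurons and m = 2 outputs.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4), relu),
    (rng.normal(size=(2, 4)), np.zeros(2), identity),
]
y = forward([1.0, -2.0, 0.5], layers)
```

Choosing different weight matrices selects a different member of the function family while the graph structure stays fixed.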
Training Neural Networks. Learning a task using an artificial neural network means finding a weight vector that results in the network being able to solve the task successfully. This is usually done by optimizing the weight vector using stochastic gradient descent (or variants thereof) on some training data with respect to a suitable loss function that is (piecewise) differentiable with respect to network weights.
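To make the optimization loop concrete, here is a minimal sketch of minibatch stochastic gradient descent on a mean squared error loss. For brevity it trains a single linear layer on synthetic data rather than a deep network; the function name `sgd_linear`, the hyperparameters, and the data are assumptions chosen only for illustration.

```python
import numpy as np

def sgd_linear(x, y, lr=0.05, epochs=50, batch_size=16, seed=0):
    """Fit y = x @ w approximately by minibatch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the training set every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            residual = x[idx] @ w - y[idx]
            # Gradient of the mean squared error on this minibatch.
            grad = 2.0 * x[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Synthetic, noise-free data: the true weights are known, so convergence
# of the learned weight vector can be checked directly.
rng = np.random.default_rng(1)
x = rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = x @ true_w
w = sgd_linear(x, y)
```

For a deep network, the same loop applies, with the gradient obtained by backpropagation through all layers instead of the closed-form expression used here.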
A range of problems can prevent training from being successful, for instance:
- learning rates in stochastic gradient descent may be too low or too high to allow for meaningful convergence,
- the chosen network structure might define a function family that does not contain a good solution (e.g. the network is too shallow, too small or the range of output activations does not fit the problem),
- especially for deep networks, gradients may provide insufficient guidance towards a good solution,
- the network might get stuck in some local optimum of the loss function, far away from any good solutions,
- the network may choose to memorize the training set instead of learning anything useful.
Various responses to these problems have been put forward in the literature. This paper uses quite a few of them. Concretely:
- We use a convolutional network architecture to enforce weight sharing among different parts of the network, thereby limiting the number of parameters to be optimized and the ability of the network to memorize training examples.
- We predict a large number of sensitive variables simultaneously. This should make it hard for the network to memorize desired response vectors.
- We use deep residual networks [11] in order to counteract problems with the convergence of very deep network architectures. Intuitively, in deep residual networks each subsequent layer of the network only computes a correction term on the output of the previous layer. This means that all network layers contribute to the final output in a relatively direct way, making meaningful gradient-based weight updates somewhat easier.
- We use batch normalization to stabilize the statistical properties of hidden network layer inputs between weight updates. This is expected to improve convergence significantly on a wide range of tasks [21].
- We monitor the progress of our network during training by keeping track of its performance on a set of validation examples and use this information to control reductions of a global learning rate parameter.
- We reduce dependency on the choice of a fixed learning rate by using a variant of gradient descent (namely Adam, [14]) that keeps track of past weight updates to adapt learning speed on a per-parameter basis.
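The intuition that a residual layer only computes a correction term, and the effect of batch normalization on layer inputs, can be illustrated with a small NumPy sketch. This is a simplified stand-in and not the network used in the paper: the function names, the dense layers, and the sizes are assumptions, and the batch normalization shown here omits the learnable scale and shift parameters as well as the running statistics used at inference time.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def residual_block(x, w1, w2):
    """Return x plus a correction term computed from x."""
    h = np.maximum(batch_norm(x @ w1), 0.0)  # dense layer, batch norm, ReLU
    correction = h @ w2
    return x + correction  # skip connection: the input passes through unchanged

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))  # batch of 8 examples with 5 features each
w1 = rng.normal(size=(5, 5))

# With zero output weights the correction vanishes and the block is the identity.
out = residual_block(x, w1, np.zeros((5, 5)))
```

Because the block starts out close to the identity when its weights are small, stacking many such blocks does not by itself prevent the input signal, or the gradient, from reaching distant layers.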
Architecture of Our Main Neural Network: Motivations. The main idea behind our network architecture is that the design of a convolutional neural network should reflect symmetries known to exist in the data; previous neural network based approaches to side channel analysis have taken this to mean that low-width convolutions should pick up local features irrespective of their location and that higher network layers should combine these lower-level features to eventually predict the leakage target.
In an attack that is trying to recover all AES subkey bytes, this approach encounters the problem that subkey bytes involved in similar processing steps (e.g. subkey bytes 16 and 32) will cause similar local features, meaning that the convolutional network layers cannot distinguish between them very well; also, the convolutional layers will have trouble picking up global features of the entire trace and may fail to combine leakage on the same subkey byte coming from disparate parts of the trace. The standard way for convolutional networks to overcome these problems is to intersperse convolutional and pooling layers, followed by densely connected layers finally predicting the leakage target. In such a network, roughly speaking, local convolutions are used to detect local features and pooling layers drastically reduce the amount of data that filters through to the prediction layers; finally, a densely connected prediction head draws on the output of these layers to predict the target.
In our design, on the other hand, we assume only that prediction based on a downsampled version of the power signal should - within reason - be independent of the offset of the downsampling. Hence, we take the full trace signal, decompose it into downsampled slices according to a fixed decimation factor n, and treat each slice approximately as if predicting the target using a fully-connected deep feed-forward network operating on each slice in isolation. Therefore, in our design, each neuron sees information from the entire trace, but the information seen by each neuron differs from that seen by any other. In this architecture, the network is completely free to decide which parts of the trace are important for each feature to be predicted, while at the same time, internal weights are shared between all the subnetworks, thus greatly reducing the problem of overfitting.
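The decomposition into downsampled slices can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' architecture: the helper names `decimate`, `predict` and `shared_net` are hypothetical, the shared subnetwork is reduced to a single linear layer, and averaging the per-slice outputs is an assumed way of combining them.

```python
import numpy as np

def decimate(trace, n):
    """Split a trace into n downsampled slices, one per sampling offset."""
    trace = np.asarray(trace, dtype=float)
    usable = len(trace) - len(trace) % n  # drop a tail that cannot fill all slices
    return np.stack([trace[offset:usable:n] for offset in range(n)])

def predict(trace, n, net):
    """Apply the same network to every slice and average the predictions."""
    outputs = np.stack([net(s) for s in decimate(trace, n)])
    return outputs.mean(axis=0)  # averaging is an assumption of this sketch

# Hypothetical stand-in for the shared feed-forward network: one linear layer
# mapping a slice of 10 samples to 2 predicted values.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 10))

def shared_net(s):
    return weights @ s

trace = np.arange(40.0)
slices = decimate(trace, 4)                 # shape (4, 10)
prediction = predict(trace, 4, shared_net)  # shape (2,)
```

Every slice spans the whole trace at reduced resolution, and all slices are processed with the same weights, so the number of trainable parameters does not grow with the decimation factor.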
© 2021 Springer Nature Switzerland AG
Gohr, A., Jacob, S., Schindler, W. (2021). Subsampling and Knowledge Distillation on Adversarial Examples: New Techniques for Deep Learning Based Side Channel Evaluations. In: Dunkelman, O., Jacobson, Jr., M.J., O'Flynn, C. (eds) Selected Areas in Cryptography. SAC 2020. Lecture Notes in Computer Science, vol 12804. Springer, Cham. https://doi.org/10.1007/978-3-030-81652-0_22
DOI: https://doi.org/10.1007/978-3-030-81652-0_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81651-3
Online ISBN: 978-3-030-81652-0
eBook Packages: Computer Science, Computer Science (R0)