
Subsampling and Knowledge Distillation on Adversarial Examples: New Techniques for Deep Learning Based Side Channel Evaluations

Conference paper

Selected Areas in Cryptography (SAC 2020)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12804)


Abstract

This paper has four main goals. First, we show how we solved the CHES 2018 AES challenge in the contest using essentially a linear classifier combined with a SAT solver and a custom error correction method. This part of the paper has previously appeared in a preprint by the current authors (e-print report 2019/094) and later as a contribution to a preprint write-up of the solutions by the winning teams (e-print report 2019/860).

Second, we develop a novel deep neural network architecture for side-channel analysis that completely breaks the AES challenge, allowing for fairly reliable key recovery with just a single trace on the unknown-device part of the CHES challenge (with an expected success rate of roughly 70% if about 100 CPU hours are allowed for the equation solving stage of the attack). This solution significantly improves upon all previously published solutions of the AES challenge, including our baseline linear solution.

Third, we consider the question of leakage attribution for both the classifier we used in the challenge and for our deep neural network. Direct inspection of the weight vector of our machine learning model yields a lot of information on the implementation for our linear classifier. For the deep neural network, we test three other strategies (occlusion of traces; inspection of adversarial changes; knowledge distillation) and find that these can yield information on the leakage essentially equivalent to that gained by inspecting the weights of the simpler model.

Fourth, we study the properties of adversarially generated side-channel traces for our model. Partly reproducing recent computer vision work by Ilyas et al. in our application domain, we find that a linear classifier that generalizes to an unseen device much better than our linear baseline can be trained using only adversarial examples (fresh random keys, adversarially perturbed traces) for our deep neural network. This gives a new way of extracting human-usable knowledge from a deep side channel model while also yielding insights on adversarial examples in an application domain where relatively few sources of spurious correlations between data and labels exist.

The experiments described in this paper can be reproduced using code available at https://github.com/agohr/ches2018.


Notes

  1. To be precise, it was known which traces in the training dataset correspond to the challenge device, but no further details about this device were revealed.

  2. Shuffling of the training set is important here, since we will use the last 3000 traces of T as our validation set to monitor training progress and control learning rate drops.

  3. We tried other ways to combine outputs, such as taking the median of predictions. This did not yield significantly different results.

  4. Note that in order to solve the Set 6 challenge, the neural network will need to be invoked for at most two or three traces.

  5. Note that the traces in Set 4 all share the same key, and that a ball of radius 150 units in \(\mathbb {R}^{65000}\) covers a volume a factor of \((600/150)^{65000} = 2^{130000}\) smaller than a ball of radius 600 units, since ball volume in dimension 65000 scales as the 65000th power of the radius.

  6. This means that we performed Ridge regression with the same hyperparameters as for the baseline predictor to predict \(K'\) given \(T'\).

  7. This is not obvious for side-channel data, as useful features come from small parts of the trace.

  8. This idea regarding the source of non-robust features in images is not ours; it is taken from [17] with minor changes.

  9. Sampling is easier, and sample collection is even better defined, in e.g. cryptanalytic problems; however, these have a discrete sample space, and it is less clear what an adversarial perturbation would mean in that setting.

  10. Our low-pass filter for angular frequency \(2 \pi f\) keeps only the \(\lceil N \cdot f \rceil \) lowest-index coefficients of the DFT of the trace signal and their complex conjugates, where N is the size of the reduced traces (a small numpy sketch of such a filter is given after these notes).

  11. Note however that this does not imply that those global features are useful: for instance, overfitting by memorizing particular data-label pairs also relies on global features.

  12. Note, however, that modern neural network architectures regularly contain processing layers that cannot be naturally described in this framework, for instance layers that normalize data, reshape it, apply fixed transformations, drop some data during training, or add noise.
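As an illustration of Note 10, the following is a minimal numpy sketch of such a low-pass filter. This is an illustrative reimplementation, not the code used for the paper; the reduced trace is assumed to be a one-dimensional array of floats.

```python
import numpy as np

def lowpass(trace, f):
    """Keep only the ceil(N * f) lowest-index DFT coefficients of a real-valued
    trace (their complex conjugates are handled implicitly by rfft/irfft)."""
    N = len(trace)                      # size of the reduced trace
    k = int(np.ceil(N * f))             # number of low-frequency coefficients to keep
    spectrum = np.fft.rfft(trace)       # one-sided DFT of the real trace
    spectrum[k:] = 0                    # discard everything above the cutoff
    return np.fft.irfft(spectrum, n=N)  # back to the time domain
```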

References

  1. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006). ISBN 978-0387-31073-2


  2. Gohr, A., Jacob, S., Schindler, W.: CHES 2018 Side Channel Contest CTF - Solution of the AES Challenges. IACR eprint archive report 2019/094. https://eprint.iacr.org/2019/094

  3. Damm, T., Freud, S., Klein, D.: Dissecting the CHES 2018 AES Challenge. IACR eprint archive report 2019/783. https://eprint.iacr.org/2019/783

  4. Hu, Y., et al.: Machine Learning and Side-Channel Analysis in a CTF Competition. IACR eprint archive report 2019/860. https://eprint.iacr.org/2019/860

  5. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)


  6. Soos, M., Nohl, K., Castelluccia, C.: Extending SAT solvers to cryptographic problems. In: 12th International Conference on Theory and Applications of Satisfiability Testing - SAT 2009 (2009)


  7. Pycryptosat homepage. https://pypi.org/project/pycryptosat/. Accessed 08 Oct 2018

  8. Picek, S., Samiotis, I.P., Kim, J., Heuser, A., Bhasin, S., Legay, A.: On the performance of convolutional neural networks for side-channel analysis. In: Chattopadhyay, A., Rebeiro, C., Yarom, Y. (eds.) SPACE 2018. LNCS, vol. 11348, pp. 157–176. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05072-6_10


  9. Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against Jitter-based countermeasures. In: Fischer, W., Homma, N. (eds.) CHES 2017. LNCS, vol. 10529, pp. 45–68. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66787-4_3


  10. Kim, J., Picek, S., Heuser, A., Bhasin, S., Hanjalic, A.: Make some noise. Unleashing the power of convolutional neural networks for profiled side-channel analysis. IACR Trans. Cryptograph. Hardw. Embedded Syst. 2019(3), 148–179. https://doi.org/10.13154/tches.v2019.i3.148-179

  11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016). https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

  12. Benadjila, R., Prouff, E., Strullu, R., Cagli, E., Dumas, C.: Study of Deep Learning Techniques for Side-Channel Analysis and Introduction to the ASCAD Database. IACR eprint report 2018/053. https://eprint.iacr.org/2018/053

  13. Gohr, A.: Improving attacks on round-reduced Speck32/64 using deep learning. In: Boldyreva, A., Micciancio, D. (eds.) CRYPTO 2019. LNCS, vol. 11693, pp. 150–179. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26951-7_6


  14. Kingma, D.P., Ba, J.L.: ADAM: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980 (2015)

  15. Chollet, F.: keras, GitHub (2015). https://github.com/fchollet/keras

  16. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., Madry, A.: Adversarial Examples Are Not Bugs, They Are Features. NeurIPS 2019. https://arxiv.org/pdf/1905.02175.pdf

  17. Barak, B.: Puzzles of Modern Machine Learning, Windows On Theory Research Blog. https://windowsontheory.org/2019/11/15/puzzles-of-modern-machine-learning/. Accessed 19 Nov 2019

  18. Hettwer, B., Gehrer, S., Güneysu, T.: Deep neural network attribution methods for leakage analysis and symmetric key recovery. In: Paterson, K.G., Stebila, D. (eds.) SAC 2019. LNCS, vol. 11959, pp. 645–666. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38471-5_26


  19. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). https://www.deeplearningbook.org

  20. Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)

  21. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (2015)


  22. Sharif, M., Bhagavatula, S., Bauer, L., Reiter, M.K.: Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In: ACM Conference on Computer and Communications Security 2016, New York, pp. 1528–1540 (2016)


  23. Bellizia, D., Bronchain, O., Cassiers, G., Momin, C., Standaert, F.-X., Udvarhelyi, B. (organizers): CHES CTF 2020 Hall of Fame. Submissions. Accessed 05 July 2021


  24. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv:1312.6199 (2013)

  25. Zhou, Y., Standaert, F.-X.: Deep learning mitigates but does not annihilate the need of aligned traces and a generalized ResNet model for side-channel attacks. J. Cryptograph. Eng. 10(1), 1–11 (2019)



Acknowledgements

We thank Friederike Laus for her careful reading of an earlier iteration of this document. In the same vein, we also thank the anonymous reviewers of SAC 2020 for their thoughtful comments on the submitted version. Both helped us a lot to improve the paper.

Author information


Corresponding author

Correspondence to Aron Gohr.


Appendix A Choice of Network Topology: Background and Intuitions

Motivation. Artificial neural networks are machine learning systems which are very loosely inspired by animal brains. Like brains, they consist of a large number of relatively simple computing units (neurons) that are linked together in a directed graph which organizes the information flow within the network. Also like brains, artificial neural networks can be taught to perform complex tasks, sometimes at a very high level. However, for typical neural networks, many core elements, e.g. the functioning of individual neurons or the learning algorithms employed, have no resemblance to biology.

Artificial Neural Networks. An artificial neural network is a function family \(f_w:\mathbb {R}^n \rightarrow \mathbb {R}^m\) defined by a directed graph G that connects n input nodes to m output nodes via a (possibly large) number of hidden nodes, where each node \(g \in G\) has associated to it an activation function \(\varphi _g\). A specific member of the function family is selected by specifying a weight vector \(w \in \mathbb {R}^k\). Essentially, each neuron sums up input values from nodes in the network that it has incoming connections to, applies its activation function, multiplies the result by the weights of its outgoing connections, and sends the results onwards (see Note 12). Neurons are usually organized in layers.
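The following minimal numpy sketch illustrates this computation for a small fully-connected network; all layer sizes and weights are illustrative, not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights):
    """Forward pass of a small fully-connected network: every layer computes
    activation(W @ x + b), i.e. each neuron takes a weighted sum of its
    inputs, applies the activation function and passes the result onwards."""
    for W, b in weights[:-1]:
        x = relu(W @ x + b)
    W, b = weights[-1]
    return W @ x + b          # linear output layer

# toy network with 4 inputs, 8 hidden neurons and 2 outputs
rng = np.random.default_rng(0)
weights = [(rng.normal(size=(8, 4)), np.zeros(8)),
           (rng.normal(size=(2, 8)), np.zeros(2))]
y = forward(rng.normal(size=4), weights)
```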

Training Neural Networks. Learning a task using an artificial neural network means finding a weight vector that results in the network being able to solve the task successfully. This is usually done by optimizing the weight vector using stochastic gradient descent (or variants thereof) on some training data with respect to a suitable loss function that is (piecewise) differentiable with respect to network weights.

A range of problems can prevent training from being successful, for instance:

  • learning rates in stochastic gradient descent may be too low or too high to allow for meaningful convergence,

  • the chosen network structure might define a function family that does not contain a good solution (e.g. the network is too shallow, too small or the range of output activations does not fit the problem),

  • especially for deep networks, gradients may provide insufficient guidance towards a good solution,

  • the network might get stuck in some local optimum of the loss function, far away from any good solutions,

  • the network may choose to memorize the training set instead of learning anything useful.

Various responses to these problems have been put forward in the literature. This paper uses quite a few of them. Concretely:

  • We use a convolutional network architecture to enforce weight sharing among different parts of the network, thereby limiting the number of parameters to be optimized and the ability of the network to memorize training examples.

  • We predict a large number of sensitive variables simultaneously. This should make it hard for the network to memorize desired response vectors.

  • We use deep residual networks [11] in order to counteract problems with the convergence of very deep network architectures. Intuitively, in deep residual networks each subsequent layer of the network only computes a correction term on the output of the previous layer. This means that all network layers contribute to the final output in a relatively direct way, making meaningful gradient-based weight updates somewhat easier. A minimal sketch of a residual block, combined with several of the other measures in this list, is given after the list.

  • We use batch normalization to stabilize the statistical properties of hidden network layer inputs between weight updates. This is expected to improve convergence significantly on a wide range of tasks [21].

  • We monitor the progress of our network during training by keeping track of its performance on a set of validation examples and use this information to control reductions of a global learning rate parameter.

  • We reduce dependency on the choice of a fixed learning rate by using a variant of gradient descent (namely Adam, [14]) that keeps track of past weight updates to adapt learning speed on a per-parameter basis.
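The following Keras sketch shows how a one-dimensional residual block with batch normalization can be combined with the Adam optimizer and validation-driven learning rate drops. It only illustrates the measures listed above; the trace length, layer sizes and hyperparameters are illustrative assumptions, and this is not the network used in the paper.

```python
from tensorflow.keras import layers, models, callbacks

def residual_block(x, filters, kernel_size=3):
    # the block only learns a correction term that is added back onto its input
    y = layers.Conv1D(filters, kernel_size, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv1D(filters, kernel_size, padding='same')(y)
    y = layers.BatchNormalization()(y)
    out = layers.Add()([x, y])               # skip connection
    return layers.Activation('relu')(out)

inp = layers.Input(shape=(1000, 1))          # illustrative trace length
x = layers.Conv1D(32, 1, padding='same')(inp)
for _ in range(3):
    x = residual_block(x, 32)
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(256, activation='softmax')(x)   # e.g. one target byte

model = models.Model(inp, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# drop the learning rate when the validation loss stops improving
lr_drop = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#           epochs=50, batch_size=64, callbacks=[lr_drop])
```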

Architecture of Our Main Neural Network: Motivations. The main idea behind our network architecture is that the design of a convolutional neural network should reflect symmetries known to exist in the data; previous neural network based approaches to side channel analysis have taken this to mean that low-width convolutions should pick up local features irrespective of their location and that higher network layers should combine these lower-level features to eventually predict the leakage target.

In an attack that is trying to recover all AES subkey bytes, this approach encounters the problem that subkey bytes involved in similar processing steps (e.g. subkey bytes 16 and 32) will cause similar local features, meaning that the convolutional network layers cannot distinguish between them very well; also, the convolutional layers will have trouble picking up global features of the entire trace and may fail to combine leakage on the same subkey byte coming from disparate parts of the trace. The standard way for convolutional networks to overcome these problems is to intersperse convolutional and pooling layers, followed by densely connected layers finally predicting the leakage target. In such a network, roughly speaking, local convolutions are used to detect local features and pooling layers drastically reduce the amount of data that filters through to the prediction layers; finally, a densely connected prediction head draws on the output of these layers to predict the target.
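For concreteness, this conventional pattern might look roughly as follows; this is a generic sketch with illustrative sizes, not a network evaluated in this paper.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(65000, 1))               # raw trace
x = layers.Conv1D(16, 11, activation='relu', padding='same')(inp)
x = layers.AveragePooling1D(pool_size=10)(x)       # pooling shrinks the data...
x = layers.Conv1D(32, 11, activation='relu', padding='same')(x)
x = layers.AveragePooling1D(pool_size=10)(x)       # ...before the dense head
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)
out = layers.Dense(256, activation='softmax')(x)   # predict one target byte
model = models.Model(inp, out)
```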

In our design, on the other hand, we assume only that prediction based on a downsampled version of the power signal should - within reason - be independent of the offset of the downsampling. Hence, we take the full trace signal, decompose it into downsampled slices according to a fixed decimation factor n, and treat each slice approximately as if predicting the target using a fully-connected deep feed-forward network operating on each slice in isolation. Therefore, in our design, each neuron sees information from the entire trace, but the information seen by each neuron differs from that seen by any other. In this architecture, the network is completely free to decide which parts of the trace are important for each feature to be predicted, while at the same time, internal weights are shared between all the subnetworks, thus greatly reducing the problem of overfitting.
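A minimal Keras sketch of this subsampling idea is shown below. It only illustrates the principle: the decimation factor, layer sizes, output dimension and the way per-slice predictions are combined (here, simple averaging) are assumptions made for the example, not the exact architecture used in the paper.

```python
from tensorflow.keras import layers, models

trace_len, n = 65000, 100                  # decimation factor n (illustrative)
inp = layers.Input(shape=(trace_len,))

# reshape and transpose so that row j holds samples j, j+n, j+2n, ...
# (one downsampled slice per row)
x = layers.Reshape((trace_len // n, n))(inp)
x = layers.Permute((2, 1))(x)              # shape: (n slices, trace_len // n samples)

# Dense layers on a 3-D tensor act on the last axis with the same weights for
# every slice, i.e. one shared fully-connected subnetwork sees each slice
for units in (256, 256):
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

x = layers.Dense(256, activation='sigmoid')(x)     # per-slice predictions, e.g. target bits
out = layers.GlobalAveragePooling1D()(x)           # average over the n slices

model = models.Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```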


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Gohr, A., Jacob, S., Schindler, W. (2021). Subsampling and Knowledge Distillation on Adversarial Examples: New Techniques for Deep Learning Based Side Channel Evaluations. In: Dunkelman, O., Jacobson, Jr., M.J., O'Flynn, C. (eds) Selected Areas in Cryptography. SAC 2020. Lecture Notes in Computer Science, vol. 12804. Springer, Cham. https://doi.org/10.1007/978-3-030-81652-0_22


  • DOI: https://doi.org/10.1007/978-3-030-81652-0_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81651-3

  • Online ISBN: 978-3-030-81652-0

