Abstract
An Intrusion Detection System (IDS) is a core element for securing critical systems. An IDS can use signatures of known attacks or an anomaly detection model for detecting unknown attacks. Attacking an IDS is often the entry point of an attack against a critical system; consequently, the security of IDSs themselves is imperative. To secure model-based IDSs, we propose a method to authenticate the anomaly detection model. The anomaly detection model is an autoencoder for which we only have access to input-output pairs. Inputs consist of time windows of values from sensors and actuators of an Industrial Control System. Our method is based on a multipath Neural Network (NN) classifier, a newly proposed deep learning technique for which we provide an in-depth description. The idea is to characterize errors of an IDS’s autoencoder by using a multipath NN’s confidence measure c. We use the Wilcoxon-Mann-Whitney (WMW) test to detect a change in the distribution of the summary variable c, indicating that the autoencoder is not working properly. We compare our method to two baselines, which consist of using other summary variables for the WMW test. We assess the performance of these three methods using simulated data. Among other findings, our analysis shows that: 1) both baselines are oblivious to some autoencoder spoofing attacks, while 2) the WMW test on a multipath NN’s confidence measure eventually enables the detection of any autoencoder spoofing attack.
References
Alain G, Bengio Y (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644
Andrej K (2016) Convolutional neural networks. Figshare. https://cs231n.github.io/convolutional-networks/. Accessed 30 Aug 2022
Axelsson S (1998) Research in intrusion-detection systems: A survey. Technical report 98–17, Department of Computer Engineering, Chalmers
Berman DS, Buczak AL, Chavis JS, Corbett CL (2019) A survey of deep learning methods for cyber security. Information 10(4):122
Brand A, Allen L, Altman M, Hlava M, Scott J (2015) Beyond authorship: attribution, contribution, collaboration, and credit. Learn Publ 28(2):151–155
Chollet F, et al (2015) Keras
Corona I, Giacinto G, Roli F (2013) Adversarial attacks against intrusion detection systems: Taxonomy, solutions and open issues. Inform Sci 239:201–225
Denning D, Neumann PG (1985) Requirements and model for IDES-a real-time intrusion-detection expert system, vol.8. SRI International Menlo Park
Garcia-Teodoro P, Diaz-Verdejo J, Maciá-Fernández G, Vázquez E (2009) Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput Secur 28(1–2):18–28
Gardiner J, Nagaraja S (2016) On the security of machine learning in malware c&c detection: A survey. ACM Comput Surv (CSUR) 49(3):1–39
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, pp 249–256
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1314–1324
Hu Y, Yang A, Li H, Sun Y, Sun L (2018) A survey of intrusion detection on industrial control systems. Int J Distrib Sens Netw 14(8):1550147718794615
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Larsen R (2021) Hyper-Neurons and confidence through path validation. Preprint at: https://gitlab.imt-atlantique.fr/chaire-cyber-cni-public/host4paper/-/raw/c501b6940030b48c5589f5380930b4f26e77d256/papers/Hyper-NeuronsandCTPV.pdf. Accessed 30 Aug 2022
LeCun Y, Cortes C, Burges CJ (2010) MNIST handwritten digit database. AT&T Labs
Luo Y, Xiao Y, Cheng L, Peng G, Yao D (2021) Deep learning-based anomaly detection in cyber-physical systems: Progress and opportunities. ACM Comput Surv (CSUR) 54(5):1–36
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60
Mitchell R, Chen IR (2014) A survey of intrusion detection techniques for cyber-physical systems. ACM Comput Surv (CSUR) 46(4):1–29
Ptacek TH, Newsham TN (1998) Insertion, evasion, and denial of service: Eluding network intrusion detection. Tech. rep, Secure Networks inc Calgary Alberta
Larsen RMJI, Pahl MO, Coatrieux G (2021) Authenticating IDS autoencoders using multipath neural networks. In: CSNet 5th Cyber Security in Networking Conference. IEEE
Wickramasinghe CS, Marino DL, Amarasinghe K, Manic M (2018) Generalization of deep learning for cyber-physical system security: A survey. In: IECON 2018-44th annual conference of the IEEE industrial electronics society, IEEE, pp 745–751
Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics, Springer, pp 196–202
Wressnegger C, Kellner A, Rieck K (2018) Zoe: Content-based anomaly detection for industrial control systems. In: 2018 48th annual IEEE/IFIP international conference on dependable systems and networks (DSN), IEEE, pp 127–138
Acknowledgements
Many thanks to Yana Hasson; discussions with her allowed a better presentation of the multipath NN. We would like to thank the anonymous reviewers for their helpful comments and suggestions. Raphaël Larsen is very grateful to Simon Foley, whose invaluable writing advice during the first part of his doctoral studies helped him with this manuscript.
Funding
Cyber CNI Chair.
Author information
Authors and Affiliations
Contributions
With the taxonomy from [5]:
1. R. Larsen’s contributions are: 1) Conceptualization (raised the article’s problem: autoencoder authentication, conceived the idea of using the multipath Neural Network to treat this problem) 2) Methodology 3) Investigation 4) Formal Analysis 5) Software 6) Visualization 7) Validation 8) Writing (original draft & editing).
2. M.-O. Pahl’s contributions are: 1) Validation 2) Supervision (gave precise guidelines for the first draft presentation that is best for the topic) 3) Writing (review & editing).
3. G. Coatrieux’s contributions are: 1) Conceptualization (conceived the idea of applying some digital forensics principles to authentication in general) 2) Supervision (helped define the notion of traceability, which does not appear in the present paper but which led to the idea of autoencoder authentication) 3) Funding acquisition (helped Raphaël Larsen get hired).
Appendices
Appendix A: The regularization of Diss-Layers
A Diss-Layer (Eq. 1) itself is quite simple, yet one needs to add some activity regularization to the loss function in order to achieve proper convergence. A first flaw that needs to be averted is that a Diss-Layer may not learn anything. For example, in an autoencoder with two Diss-Layers, which have the identity as activation function, a possible solution to the equation \(\mathbf {y^{(1)}}+\mathbf {y^{(2)}}=\mathbf {h}\), with \(\mathbf {y^{(k)}}\) the Diss-Layer outputs, is to have, for \(k=1,2\), arbitrary \(\mathbf {w^{(k)}}\), and then \(\textbf{u}^{\textbf{(k)}}=0_{\mathbb {R}^{2}}\), \(b^{(k)}=0\) and \(\textbf{v}^{\textbf{(1)}}+\textbf{v}^{\textbf{(2)}}=1_{\mathbb {R}^{2}}\). A second issue, which appears when several Diss-Layers share the same input \(\mathbf {h}\), is that without any regularization most of the information can go through some Diss-Layers while other Diss-Layers would learn few patterns and would remain, most of the time, incidental to the output of the NN.
To simplify the notation, \(l_{k}(\mathbf {h})\) will denote the output of Diss-Layers (instead of \(f_{k}(\mathbf {h},P_{k}(\mathbf {h}))\)) that are at the same depth in the network (even if they compose Hyper-Neurons, used by the NN, that do not share Diss-Layers) and take the same input \(\mathbf {h}\). Moreover, the sums’ indices are denoted by letters ranging from 1 to their capital counterparts: I is the size of the mini-batch (\(I\ge 2\), and \(\mathbf {h^{(i)}}\) is the i-th input from the mini-batch given to the Diss-Layers \(l_{k}\)), J the dimension of the Diss-Layer outputs (and inputs), and K the number of Diss-Layers within the same layer.
The quantity S is related to the mean of the inverse of the element-wise relative standard deviations of the Diss-Layer outputs over the mini-batch \((\mathbf {h^{(i)}})_{i}\); we consider S instead of that mean because S is a smoother function with respect to h:
The quantity A (Eq. 6) must be small to mitigate the first issue, namely that Diss-Layers may not learn anything because of trivial solutions to the problem enforced by the main objective: S (Eq. 5) forces the output features of each Diss-Layer to span sufficiently large subspaces of their codomain, in other words to vary sufficiently, and M (Eq. 7) prevents Diss-Layers’ output features from having very different amplitudes. The size I of the mini-batch must be large enough for A to be useful.
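Since Eqs. 5–7 are not reproduced in this extract, the following numpy sketch only illustrates the raw quantity S smooths: the mean over features of the inverse relative standard deviation of a Diss-Layer’s outputs on a mini-batch. The function name, the eps constant and the absence of smoothing are assumptions for illustration, not the paper’s exact S.

```python
import numpy as np

def inv_rel_std_penalty(outputs, eps=1e-6):
    """Mean of the inverse element-wise relative standard deviations
    of one Diss-Layer's outputs over a mini-batch.

    outputs: array of shape (I, J) -- I mini-batch samples, J features.
    The penalty is large when features barely vary across the batch,
    which is exactly the degenerate regime S is meant to avoid.
    """
    mean = np.abs(outputs.mean(axis=0))           # per-feature mean, shape (J,)
    std = outputs.std(axis=0)                     # per-feature std, shape (J,)
    rel_std = std / (mean + eps)                  # relative standard deviation
    return float(np.mean(1.0 / (rel_std + eps)))  # large for near-constant features
```

A near-constant output (the trivial-solution regime described above) yields a much larger penalty than an output that varies across the mini-batch.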
To overcome the second issue, i.e., that some Diss-Layers may capture no information from an input, we use a trick that consists in treating the softmax of the Diss-Layer outputs as meaningful probabilities. Let \(\sigma\) denote the softmax function: \({\sigma :z \rightarrow ({\exp (z_{i})}/{\sum _{j}\exp (z_{j})})_{i}}^{\intercal }\). We first randomly choose two different Diss-Layers \(k_{1}\) and \(k_{2}\) and:
The function \(D_{KL}\) is the Kullback–Leibler divergence, i.e., \({D_{KL}(p\mid \mid q)=\sum _{j}p_{j}\log (\frac{p_{j}}{q_{j}})}\), and is applied to the clipped softmax outputs \(p_{k_{1}}^{(i)}\) and \(p_{k_{2}}^{(i)}\). Taking the same Diss-Layers \(k_{1}\) and \(k_{2}\) for the whole batch is mostly a matter of practicality, but it may be necessary to obtain a stable error gradient with very small batches. To balance these conflicting objectives, they are merged into C (Eq. 9).
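The equation defining this KL term is not reproduced in this extract, so the numpy sketch below only demonstrates the mechanics just described: pick two Diss-Layers at random (the same pair for the whole mini-batch), take the softmax of their outputs, clip, and compute the Kullback–Leibler divergence. The tensor shapes and the clipping threshold are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-7):
    # clip the softmax outputs so the logarithm stays finite
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

rng = np.random.default_rng(0)
K, I, J = 4, 16, 8                        # Diss-Layers, mini-batch size, features
outputs = rng.normal(size=(K, I, J))      # stand-in for the K Diss-Layer outputs
k1, k2 = rng.choice(K, size=2, replace=False)  # same random pair for the whole batch
penalty = kl_div(softmax(outputs[k1]), softmax(outputs[k2])).mean()
```

Maximizing such a divergence pushes the two chosen Diss-Layers to produce distinguishable output distributions, so that neither can remain uninformative.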
Finally, we want the Diss-Layers to have different pre-potential functions \((P_{k})_{k}\) so that they capture different patterns. This is done thanks to:
It is useless for the Diss-Layers to have different pre-potentials as long as they have not learned anything yet, so we want the quantity \(\Lambda\) (Eq. 10) to influence the training only when the former objective is nearly achieved. The final quantity to minimize along with the main loss is then:
where [C] means that the resulting gradient is not backpropagated. This can be done in keras [6] with “stop_gradient”: [C] is written \({\text {stop}\_\text {gradient}(C)}\). Therefore, \({1/(1+[C])}\) can be seen as a dynamic learning rate for \(\Lambda\). Note that the size of the mini-batch has to be large enough for R to play its part. The constants in the denominators of these equations can be changed, but in our experience the ones given have always enabled proper convergence of the NN.
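The exact composition of R (Eq. 11) is not reproduced in this extract, so the TensorFlow sketch below only demonstrates the stop_gradient mechanism, under the assumed illustrative form \(C + \Lambda /(1+[C])\): the current value of C rescales \(\Lambda\)’s contribution, but no gradient flows back through that occurrence of C.

```python
import tensorflow as tf

def regularizer_with_dynamic_rate(C, Lam):
    # [C] = tf.stop_gradient(C): C's value scales Lam's contribution,
    # acting as a dynamic learning rate 1/(1+[C]) for Lam, yet this
    # occurrence of C contributes nothing to C's own gradient.
    return C + Lam / (1.0 + tf.stop_gradient(C))
```

Differentiating confirms the behavior: the gradient with respect to C is exactly 1 (the \(1/(1+[C])\) factor is treated as a constant), while the gradient with respect to \(\Lambda\) is \(1/(1+C)\), which shrinks as C grows.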
In our experiment concerning Hyper-Neuron autoencoders, \(I=16\), the number of epochs is 100, the activation function of each Diss-Layer is the sigmoid (input data are min-max normalized), and the activation function of the resulting Hyper-Neuron is the identity. Moreover, with \(\mathcal {L}\) the mini-batch’s average MSLE and \({\mathcal {L} + \lambda . R}\) the total loss of a Hyper-Neuron autoencoder, we have \(\lambda =0.001\) or 0.01 (depending on the inputs’ size). (The multipath NN’s total loss is given by Eq. 4.) In practice, \(\lambda\) is chosen so that the total loss converges towards a small enough limit; this was sufficient for the Hyper-Neuron autoencoder to separate two separable clusters without supervision. For every NN in this paper, we use the Adam optimizer [14] of keras with the default gradient descent parameters.
In general, we remind the reader that the regularization R is needed only when the Diss-Layers share the same input h; otherwise, only the quantity C has to be minimized for each of the Diss-Layers taking different inputs, and only A if there is a single Diss-Layer at a certain depth in the NN. In brief: thanks to A (Eq. 6), some Diss-Layers capture useful information; thanks to C (Eq. 9), each Diss-Layer captures useful information; thanks to R (Eq. 11), the Diss-Layers capture different patterns.
Appendix B: Initialization of Dissector Layers
For the autoencoders and the multipath NN, we initialize the K Diss-Layers biases \((b^{(k)})_{k}\) at zero and the weights \((w^{(k)})_{k},(u^{(k)})_{k},(v^{(k)})_{k}\) from Eq. 1 by adapting the normalized initialization of [11] (glorot\(\_\)uniform in keras) for our use of Diss-Layers: \(\forall j,k\)
with J the size of inputs of the Diss-Layers. In our experiments, NNs use the sum of the K Diss-Layers (to reconstruct an input in the case of a Hyper-Neuron autoencoder and for random inputs in the case of the multipath NN), hence the normalization constant. In other contexts, different initializations might be needed.
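Since the adapted initialization formula is not reproduced in this extract, the numpy sketch below only illustrates the principle stated above: a glorot_uniform bound rescaled by the number K of Diss-Layers whose outputs are summed, so that the sum keeps the scale a single layer would have. The exact normalization constant (here 1/K) is an assumption.

```python
import numpy as np

def summed_glorot_uniform(fan_in, fan_out, K, rng):
    """Glorot/Xavier uniform bound sqrt(6/(fan_in+fan_out)), divided by K
    because the NN sums the outputs of K Diss-Layers; the 1/K factor is
    an illustrative assumption about the paper's adaptation."""
    limit = np.sqrt(6.0 / (fan_in + fan_out)) / K
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = summed_glorot_uniform(fan_in=8, fan_out=8, K=4, rng=rng)
```

As the last sentence above notes, a different normalization may be preferable when the Diss-Layer outputs are combined by something other than a plain sum.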
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Larsen, R.M.J.I., Pahl, MO. & Coatrieux, G. Multipath neural networks for anomaly detection in cyber-physical systems. Ann. Telecommun. 78, 149–167 (2023). https://doi.org/10.1007/s12243-022-00922-x