
Multipath neural networks for anomaly detection in cyber-physical systems


Abstract

An Intrusion Detection System (IDS) is a core element for securing critical systems. An IDS can use signatures of known attacks or an anomaly detection model for detecting unknown attacks. Attacking an IDS is often the entry point of an attack against a critical system; consequently, the security of IDSs themselves is imperative. To secure model-based IDSs, we propose a method to authenticate the anomaly detection model. The anomaly detection model is an autoencoder for which we only have access to input-output pairs. Inputs consist of time windows of values from sensors and actuators of an Industrial Control System. Our method is based on a multipath Neural Network (NN) classifier, a newly proposed deep learning technique for which we provide an in-depth description. The idea is to characterize errors of an IDS’s autoencoder using a multipath NN’s confidence measure c. We use the Wilcoxon-Mann-Whitney (WMW) test to detect a change in the distribution of the summary variable c, indicating that the autoencoder is not working properly. We compare our method to two baselines, which consist of using other summary variables for the WMW test. We assess the performance of these three methods using simulated data. Among other findings, our analysis shows that: 1) both baselines are oblivious to some autoencoder spoofing attacks, while 2) the WMW test on a multipath NN’s confidence measure eventually detects any autoencoder spoofing attack.
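
To illustrate the detection principle summarized above, here is a minimal sketch of a Wilcoxon-Mann-Whitney test applied to two samples of a summary variable such as the confidence measure c. The sample data and the significance threshold are placeholders for illustration only, not values from the paper.

```python
from scipy.stats import mannwhitneyu
import numpy as np

# Placeholder samples of the summary variable c: a reference sample collected
# while the autoencoder is known to behave normally, and a sample observed online.
rng = np.random.default_rng(0)
c_reference = rng.normal(loc=0.8, scale=0.05, size=200)
c_observed = rng.normal(loc=0.6, scale=0.05, size=200)   # shifted distribution

# Two-sided WMW test: a small p-value indicates a change in the distribution
# of c, i.e. the autoencoder may not be working properly (e.g. it is spoofed).
stat, p_value = mannwhitneyu(c_reference, c_observed, alternative="two-sided")
if p_value < 0.01:   # placeholder significance level
    print("Distribution change detected: autoencoder authentication fails")
```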


References

  1. Alain G, Bengio Y (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

  2. Karpathy A (2016) Convolutional neural networks. CS231n course notes. https://cs231n.github.io/convolutional-networks/. Accessed 30 Aug 2022

  3. Axelsson S (1998) Research in intrusion-detection systems: A survey. Technical report 98–17, Department of Computer Engineering, Chalmers University of Technology

  4. Berman DS, Buczak AL, Chavis JS, Corbett CL (2019) A survey of deep learning methods for cyber security. Information 10(4):122


  5. Brand A, Allen L, Altman M, Hlava M, Scott J (2015) Beyond authorship: attribution, contribution, collaboration, and credit. Learn Publ 28(2):151–155


  6. Chollet F, et al (2015) Keras

  7. Corona I, Giacinto G, Roli F (2013) Adversarial attacks against intrusion detection systems: Taxonomy, solutions and open issues. Inform Sci 239:201–225


  8. Denning D, Neumann PG (1985) Requirements and model for IDES: a real-time intrusion-detection expert system, vol 8. SRI International, Menlo Park

  9. Garcia-Teodoro P, Diaz-Verdejo J, Maciá-Fernández G, Vázquez E (2009) Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput Secur 28(1–2):18–28


  10. Gardiner J, Nagaraja S (2016) On the security of machine learning in malware c&c detection: A survey. ACM Comput Surv (CSUR) 49(3):1–39


  11. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, pp 249–256

  12. Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1314–1324

  13. Hu Y, Yang A, Li H, Sun Y, Sun L (2018) A survey of intrusion detection on industrial control systems. Int J Distrib Sens Netw 14(8):1550147718794615

  14. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  15. Larsen R (2021) Hyper-Neurons and confidence through path validation. Preprint at: https://gitlab.imt-atlantique.fr/chaire-cyber-cni-public/host4paper/-/raw/c501b6940030b48c5589f5380930b4f26e77d256/papers/Hyper-NeuronsandCTPV.pdf. Accessed 30 Aug 2022

  16. LeCun Y, Cortes C, Burges CJ (2010) MNIST handwritten digit database. AT&T Labs

  17. Luo Y, Xiao Y, Cheng L, Peng G, Yao D (2021) Deep learning-based anomaly detection in cyber-physical systems: Progress and opportunities. ACM Comput Surv (CSUR) 54(5):1–36


  18. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60

  19. Mitchell R, Chen IR (2014) A survey of intrusion detection techniques for cyber-physical systems. ACM Comput Surv (CSUR) 46(4):1–29


  20. Ptacek TH, Newsham TN (1998) Insertion, evasion, and denial of service: Eluding network intrusion detection. Tech. rep., Secure Networks Inc., Calgary, Alberta

  21. Larsen RMJI, Pahl MO, Coatrieux G (2021) Authenticating IDS autoencoders using multipath neural networks. In: 2021 5th Cyber Security in Networking Conference (CSNet). IEEE

  22. Wickramasinghe CS, Marino DL, Amarasinghe K, Manic M (2018) Generalization of deep learning for cyber-physical system security: A survey. In: IECON 2018-44th annual conference of the IEEE industrial electronics society, IEEE, pp 745–751

  23. Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics, Springer, pp 196–202

  24. Wressnegger C, Kellner A, Rieck K (2018) Zoe: Content-based anomaly detection for industrial control systems. In: 2018 48th annual IEEE/IFIP international conference on dependable systems and networks (DSN), IEEE, pp 127–138


Acknowledgements

Many thanks to Yana Hasson; discussions with her led to a better presentation of the multipath NN. We would like to thank the anonymous reviewers for their helpful comments and suggestions. Raphaël Larsen is very grateful to Simon Foley, whose invaluable writing advice during the first part of his doctoral studies helped him with this manuscript.

Funding

Cyber CNI Chair.

Author information


Contributions

With the taxonomy from [5]:

1. R. Larsen’s contributions are: 1) Conceptualization (raised the article’s problem of autoencoder authentication; conceived the idea of using the multipath Neural Network to address this problem) 2) Methodology 3) Investigation 4) Formal Analysis 5) Software 6) Visualization 7) Validation 8) Writing (original draft & editing).

2. M.-O. Pahl’s contributions are: 1) Validation 2) Supervision (gave precise guidelines on the draft presentation best suited to the topic) 3) Writing (review & editing).

3. G. Coatrieux’s contributions are: 1) Conceptualization (conceived the idea of applying some digital forensics principles to authentication in general) 2) Supervision (helped define the notion of traceability, which does not appear in the present paper but which led to the idea of autoencoder authentication) 3) Funding acquisition (helped Raphaël Larsen get hired).

Appendices

Appendix A: The regularization of Diss-Layers

A Diss-Layer (Eq. 1) itself is quite simple, yet one needs to add some activity regularization to the loss function in order to achieve proper convergence. A first flaw that needs to be averted is that a Diss-Layer may not learn anything. For example, in an autoencoder with two Diss-Layers whose activation function is the identity, a possible solution to the equation \(\mathbf {y^{(1)}}+\mathbf {y^{(2)}}=\mathbf {h}\), with \(\mathbf {y^{(k)}}\) the Diss-Layer outputs, is to have, for \(k=1,2\), arbitrary \(\mathbf {w^{(k)}}\), and then \(\textbf{u}^{\textbf{(k)}}=0_{\mathbb {R}^{2}}\), \(b^{(k)}=0\) and \(\textbf{v}^{\textbf{(1)}}+\textbf{v}^{\textbf{(2)}}=1_{\mathbb {R}^{2}}\). A second issue, which appears when several Diss-Layers share the same input \(\mathbf {h}\), is that without regularization most of the information can flow through some Diss-Layers while the others learn few patterns and remain, most of the time, incidental to the output of the NN.

To simplify the notation, \(l_{k}(\mathbf {h})\) will denote the output of Diss-Layers (instead of \(f_{k}(\mathbf {h},P_{k}(\mathbf {h}))\)) that are at the same depth in the network (even if they compose Hyper-Neurons, used by the NN, that do not share Diss-Layers) and take the same input \(\mathbf {h}\). Moreover, the sums’ indices are denoted by letters that range from 1 to their capital counterparts: I is the size of the mini-batch (\(I\ge 2\), and \(\mathbf {h^{(i)}}\) is the i-th input of the mini-batch given to the Diss-Layers \(l_{k}\)), J is the dimension of the Diss-Layers’ outputs (and inputs), and K is the number of Diss-Layers within the same layer.

$$\begin{aligned} S = \frac{1}{J. K}\sum _{j,k} \frac{\frac{1}{I. K}\sum _{i,k^{\prime }}l_{k^{\prime }}(\mathbf {h^{(i)}})_{j}}{0.1+\frac{1}{I}\sum _{i}(l_{k}(\mathbf {h^{(i)}})_{j} - \frac{1}{I}\sum _{i^{\prime }}l_{k}(\mathbf {h^{(i^{\prime })}})_{j})^{2}} \end{aligned}$$
(5)

The quantity S is related to the mean of the inverses of the element-wise relative standard deviations of the Diss-Layer outputs on the mini-batch \((\mathbf {h^{(i)}})_{i}\); we consider S instead because it is a smoother function with respect to h than its counterpart:

$$\begin{aligned} \frac{1}{J. K}\sum _{j,k} \frac{\frac{1}{I}\sum _{i}l_{k}(\mathbf {h^{(i)}})_{j}}{\sqrt{\frac{1}{I}\sum _{i}(l_{k}(\mathbf {h^{(i)}})_{j}-\frac{1}{I}\sum _{i^{\prime }}l_{k}(\mathbf {h^{(i^{\prime })}})_{j})^{2}}} \end{aligned}$$
The quantity
$$\begin{aligned} A = S + M \end{aligned}$$
(6)
with
$$\begin{aligned} M = \sum _{k}(\frac{1}{I. J}\sum _{i,j}l_{k}(\mathbf {h^{(i)}})_{j} - \frac{1}{I. J. K}\sum _{i,j,k^{\prime }}l_{k^{\prime }}(\mathbf {h^{(i)}})_{j})^{2} \end{aligned}$$
(7)

must be small to mitigate the first issue, namely that Diss-Layers may not learn anything because of trivial solutions to the problem enforced by the main objective: S (Eq. 5) forces the output features of each Diss-Layer to span sufficiently large subspaces of their codomain, in other words to vary sufficiently, and M (Eq. 7) prevents the Diss-Layers’ output features from having very different amplitudes. The size I of the mini-batch must be large enough for A to be useful.
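
As an illustration of Eqs. 5–7, a minimal Keras/TensorFlow sketch of the quantity A is given below. It is not the authors’ implementation: the function name and the tensor layout (the K Diss-Layer outputs on a mini-batch stored as a list of K tensors of shape (I, J)) are assumptions made for the example.

```python
import tensorflow as tf

def quantity_A(outputs):
    """A = S + M (Eqs. 5-7) for K Diss-Layers sharing the same input.

    `outputs`: list of K tensors of shape (I, J), the mini-batch outputs
    l_k(h^(i)) of the K Diss-Layers.
    """
    y = tf.stack(outputs, axis=0)                 # shape (K, I, J)
    # S (Eq. 5): numerator = mean over the batch and over all K layers, per feature j;
    # denominator = 0.1 + per-layer, per-feature variance over the batch.
    num = tf.reduce_mean(y, axis=[0, 1])          # shape (J,)
    var = tf.math.reduce_variance(y, axis=1)      # shape (K, J)
    S = tf.reduce_mean(num / (0.1 + var))         # mean over j and k
    # M (Eq. 7): squared deviation of each layer's overall mean from the global mean.
    per_layer_mean = tf.reduce_mean(y, axis=[1, 2])   # shape (K,)
    M = tf.reduce_sum(tf.square(per_layer_mean - tf.reduce_mean(y)))
    return S + M
```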

To overcome the second issue, i.e., that some Diss-Layers may capture no information from an input, we use a trick that consists of treating the softmax of the Diss-Layer outputs as meaningful probabilities. Let \(\sigma\) denote the softmax function: \({\sigma :z \rightarrow ({\exp (z_{i})}/{\sum _{j}\exp (z_{j})})_{i}}^{\intercal }\). We first randomly choose two different Diss-Layers \(k_{1}\) and \(k_{2}\) and compute:

$$\begin{aligned} B = \frac{1}{I}\sum _{i}(D_{KL}(p^{(i)}_{k_{1}}\mid \mid p^{(i)}_{k_{2}})+D_{KL}(p^{(i)}_{k_{2}}\mid \mid p^{(i)}_{k_{1}})) \end{aligned}$$
(8)
$$\begin{aligned} \text {with } p^{(i)}_{k} = (\min (1,\max (10^{-7},\sigma (l_{k}(\mathbf {h^{(i)}}))_{j})))_{1\le j\le J}. \end{aligned}$$

The function \(D_{KL}\) is the Kullback–Leibler divergence, i.e., \({D_{KL}(p\mid \mid q)=\sum _{j}p_{j}\log (\frac{p_{j}}{q_{j}})}\), applied to the clipped softmax outputs \(p_{k_{1}}^{(i)}\) and \(p_{k_{2}}^{(i)}\). Taking the same Diss-Layers \(k_{1}\) and \(k_{2}\) for the whole batch is mostly a matter of practicality, but it may be needed to obtain a stable error gradient with very small batches. To balance these conflicting objectives, they are merged into C (Eq. 9).

$$\begin{aligned} C = \frac{A}{1+B}+\frac{B}{1+A} \end{aligned}$$
(9)
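
Under the same assumptions as the previous sketch (a list of K mini-batch output tensors of shape (I, J)), a possible computation of B (Eq. 8) and C (Eq. 9) is given below; the pair \((k_{1},k_{2})\) is drawn once per mini-batch, as described above. The function name is an assumption.

```python
import tensorflow as tf

def quantity_C(outputs, A):
    """B (Eq. 8) and C (Eq. 9) for a randomly chosen pair of Diss-Layers."""
    k1, k2 = tf.unstack(tf.random.shuffle(tf.range(len(outputs)))[:2])
    y = tf.stack(outputs, axis=0)                                   # (K, I, J)
    # Clipped softmax outputs p_k (values kept in [1e-7, 1]).
    p1 = tf.clip_by_value(tf.nn.softmax(tf.gather(y, k1), axis=-1), 1e-7, 1.0)
    p2 = tf.clip_by_value(tf.nn.softmax(tf.gather(y, k2), axis=-1), 1e-7, 1.0)
    # Symmetric Kullback-Leibler divergence, averaged over the mini-batch.
    kl12 = tf.reduce_sum(p1 * tf.math.log(p1 / p2), axis=-1)
    kl21 = tf.reduce_sum(p2 * tf.math.log(p2 / p1), axis=-1)
    B = tf.reduce_mean(kl12 + kl21)
    return A / (1.0 + B) + B / (1.0 + A)
```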

Finally, we want the Diss-Layers to have different pre-potential functions \((P_{k})_{k}\) so that they capture different patterns. This is done thanks to:

$$\begin{aligned} \Lambda = \frac{1}{I}\sum _{i}\frac{1}{0.01+\frac{1}{K}\sum _{k}(P_{k}(\mathbf {h^{(i)}}) - \frac{1}{K}\sum _{k^{\prime }}P_{k^{\prime }}(\mathbf {h^{(i)}}))^{2}} \end{aligned}$$
(10)

It is useless to have different pre-potentials as long as the Diss-Layers have not learned anything yet, so we want the quantity \(\Lambda\) (Eq. 10) to influence the training only once the former objective is nearly achieved. The final quantity to minimize along with the main loss is then:

$$\begin{aligned} R=C+\frac{\Lambda }{1+[C]} \end{aligned}$$
(11)

where [C] means that the resulting gradient is not backpropagated. This can be done in Keras [6] with “stop_gradient”: [C] is written \({\text {stop}\_\text {gradient}(C)}\). Therefore, \({1/(1+[C])}\) can be seen as a dynamic learning rate for \(\Lambda\). Note that the size of the mini-batch has to be large enough for R to play its part. The constants in the denominators of these equations can be changed, but in our experience these values have always enabled a proper convergence of the NN.
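
A sketch of \(\Lambda\) (Eq. 10) and of the final quantity R (Eq. 11) follows. The representation of the pre-potentials (a list of K tensors of shape (I,) holding the scalar values \(P_{k}(\mathbf {h^{(i)}})\)) and the function name are assumptions; tf.stop_gradient plays the role of the bracket [C].

```python
import tensorflow as tf

def quantity_R(pre_potentials, C):
    """Lambda (Eq. 10) and R (Eq. 11).

    `pre_potentials`: list of K tensors of shape (I,), the scalar
    pre-potentials P_k(h^(i)) of the K Diss-Layers on the mini-batch.
    """
    P = tf.stack(pre_potentials, axis=0)          # shape (K, I)
    # Per-example variance of the pre-potentials across the K Diss-Layers.
    var_k = tf.math.reduce_variance(P, axis=0)    # shape (I,)
    lam = tf.reduce_mean(1.0 / (0.01 + var_k))    # Lambda (Eq. 10)
    # [C] = stop_gradient(C): 1/(1+[C]) acts as a dynamic learning rate
    # for Lambda without sending any extra gradient through C.
    return C + lam / (1.0 + tf.stop_gradient(C))
```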

In our experiments concerning Hyper-Neuron autoencoders, \(I=16\), the number of epochs is 100, the activation function of each Diss-Layer is the sigmoid (input data are min-max normalized), and the activation function of the resulting Hyper-Neuron is the identity. Moreover, with \(\mathcal {L}\) the mini-batch’s average MSLE and \({\mathcal {L} + \lambda . R}\) the total loss of a Hyper-Neuron autoencoder, we have \(\lambda =0.001\) or 0.01 (depending on the inputs’ size). (The multipath NN’s total loss is given by Eq. 4.) In practice, \(\lambda\) is chosen so that the total loss converges towards a small enough limit; this was enough for the Hyper-Neuron autoencoder to separate two separable clusters in an unsupervised manner. For every NN in this paper, we use the Adam optimizer [14] of Keras with the default gradient descent parameters.
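
A minimal sketch of this training setup, reusing the helper functions sketched above, is given below. The builder function `build_hyper_neuron_autoencoder` and the way the Diss-Layer outputs and pre-potentials are exposed are hypothetical; only the batch size, number of epochs, MSLE loss, \(\lambda\) and the Adam optimizer come from the text.

```python
import tensorflow as tf

# Hypothetical builder: returns the Keras model together with the symbolic
# Diss-Layer outputs and pre-potentials needed by the regularizers above.
model, diss_outputs, pre_potentials = build_hyper_neuron_autoencoder()

LAMBDA = 0.001  # or 0.01, depending on the inputs' size

A = quantity_A(diss_outputs)
C = quantity_C(diss_outputs, A)
R = quantity_R(pre_potentials, C)
model.add_loss(LAMBDA * R)  # regularization added on top of the main loss

model.compile(optimizer=tf.keras.optimizers.Adam(),  # default parameters [14]
              loss=tf.keras.losses.MeanSquaredLogarithmicError())
# model.fit(x_train, x_train, batch_size=16, epochs=100)
```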

In general, let us remind the reader that the regularization R is needed only when the Diss-Layers share the same input h; otherwise, only the quantity C has to be minimized for each of the Diss-Layers taking different inputs, and only A if there is only one Diss-Layer at a given depth in the NN. In brief: thanks to A (Eq. 6), some Diss-Layers capture useful information; thanks to C (Eq. 9), each Diss-Layer captures useful information; thanks to R (Eq. 11), the Diss-Layers capture different patterns.

Appendix B: Initialization of Dissector Layers

For the autoencoders and the multipath NN, we initialize the K Diss-Layer biases \((b^{(k)})_{k}\) at zero and the weights \((w^{(k)})_{k},(u^{(k)})_{k},(v^{(k)})_{k}\) from Eq. 1 by adapting the normalized initialization of [11] (glorot\(\_\)uniform in Keras) to our use of Diss-Layers: \(\forall j,k\)

$$\begin{aligned} w_{j}^{(k)}, u_{j}^{(k)}, v_{j}^{(k)} \sim \mathcal {U}[-\frac{\sqrt{6}}{\sqrt{J+K}},\frac{\sqrt{6}}{\sqrt{J+K}}] \end{aligned}$$

with J the size of the inputs of the Diss-Layers. In our experiments, NNs use the sum of the K Diss-Layers (to reconstruct an input in the case of a Hyper-Neuron autoencoder, and for random inputs in the case of the multipath NN), hence the normalization constant. In other contexts, different initializations might be needed.
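
For illustration, this initialization could be reproduced in Keras with a standard RandomUniform initializer, as sketched below; the function name and the example values of J and K are assumptions, and only the bound \(\sqrt{6}/\sqrt{J+K}\) comes from the text.

```python
import math
import tensorflow as tf

def diss_layer_initializer(J, K):
    """Uniform initializer U[-sqrt(6)/sqrt(J+K), sqrt(6)/sqrt(J+K)]:
    glorot_uniform [11] adapted by replacing the fan-out with the number K
    of Diss-Layers whose outputs are summed."""
    limit = math.sqrt(6.0 / (J + K))
    return tf.keras.initializers.RandomUniform(minval=-limit, maxval=limit)

# Example: J = 32 input features and K = 4 Diss-Layers at the same depth;
# biases b^(k) are left at their zero initialization.
init = diss_layer_initializer(J=32, K=4)
weights = init(shape=(32,))   # one weight vector, e.g. w^(k)
```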

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.



Cite this article

Larsen, R.M.J.I., Pahl, MO. & Coatrieux, G. Multipath neural networks for anomaly detection in cyber-physical systems. Ann. Telecommun. 78, 149–167 (2023). https://doi.org/10.1007/s12243-022-00922-x
