
Iterative Selection of Categorical Variables for Log Data Anomaly Detection

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12972)

Abstract

Log data is a well-known source for anomaly detection in cyber security. Accordingly, a large number of approaches based on self-learning algorithms have been proposed in the past. Most of these approaches focus on numeric features extracted from logs, since these variables are convenient to use with commonly known machine learning techniques. However, system log data frequently involves multiple categorical features that provide further insights into the state of a computer system and thus have the potential to improve detection accuracy. Unfortunately, it is non-trivial to derive useful correlation rules from the vast number of possible values of all available categorical variables. Therefore, we propose the Variable Correlation Detector (VCD) that employs a sequence of selection constraints to efficiently disclose pairs of variables with correlating values. The approach also comprises an online mode that continuously updates the identified variable correlations to account for system evolution and applies statistical tests on conditional occurrence probabilities for anomaly detection. Our evaluations show that the VCD is well adjustable to fit properties of the data at hand and discloses associated variables with high accuracy. Our experiments with real log data indicate that the VCD is capable of detecting attacks such as scans and brute-force intrusions with higher accuracy than existing detectors.


Notes

  1. https://tools.kali.org/password-attacks/hydra, accessed: 2021-04-21.

  2. https://cirt.net/Nikto2, accessed: 2021-04-21.


Acknowledgements

This work was partly funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU H2020 project GUARD (833456).

Author information

Correspondence to Max Landauer.

A Appendix

A.1 Threshold Parameter Selection

The filtering steps for correlations between variables and values presented in Sect. 4 make use of threshold parameters \(\theta _{1}\)-\(\theta _{8}\) to narrow down the search space and select only those correlations that are likely to positively contribute to the detection of anomalies. This section investigates the influence of these threshold parameters on the resulting correlations and thereby supports the manual parameter selection process, in particular, by relating each parameter to specific properties of the data at hand. In the following, we first explain the generation of synthetic data for this evaluation and then describe our experiments.

Data. To measure the influence of thresholds on the correlation selection, it is necessary to control properties of the input data. Therefore, we generate synthetic data for our experiments. We use three variables \(V_1\), \(V_2\), and \(V_3\), of which only \(V_1\) and \(V_2\) correlate with varying strength, and monitor the correlations found by the VCD for different threshold settings. We use values \(\mathcal {V}_i = \left\{ 0, 1, ..., x \right\} , x \in \mathbb {N}\) for each variable and compute their occurrence probabilities as normalized geometric series. Equation 15 shows how the probabilities for values in \(V_1\) and \(V_3\) are computed, where \(p_i = 1\) means that all values are equally likely to occur, and lower values mean that one or more values dominate the probability distribution. Equation 16 shows how the conditional probabilities of values in \(V_2\) given values from \(V_1\) are computed. Here, \(\rho \) specifies the correlation strength, i.e., larger values for \(\rho \) indicate that the same values co-occur more frequently with each other, and \(\zeta \) is a damping factor that reduces the correlation strength for larger \(v_{i, j}\), i.e., higher values for \(\zeta \) cause more co-occurrences between different values.

$$\begin{aligned} P\left( v_{i, j} \right) = \frac{p_i^j}{\sum _{j' = 0}^{\left| \mathcal {V}_i \right| } p_i^{j'}} \end{aligned}$$
(15)
$$\begin{aligned} P\left( v_{k, l} \mid v_{i, j} \right) = \frac{\left( 1 - \rho \right) ^{\left| j - l \right| } + \zeta ^{\left| \left| \mathcal {V}_i \right| - j \right| }}{\sum _{l' = 0}^{\left| \mathcal {V}_k \right| } \left( 1 - \rho \right) ^{\left| j - l' \right| } + \zeta ^{\left| \left| \mathcal {V}_i \right| - j \right| }} \end{aligned}$$
(16)
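To make the data generation concrete, the following Python sketch draws synthetic events according to Eqs. 15 and 16. It is a minimal illustration rather than the original experiment code: the function and parameter names are chosen here for readability, and the normalization is carried out over all values of the respective variable.

```python
import numpy as np

def value_probs(x, p):
    """Eq. 15: occurrence probabilities of values {0, ..., x} as a normalized
    geometric series; p = 1 yields a uniform distribution, smaller p lets a
    few values dominate."""
    weights = np.array([p ** j for j in range(x + 1)], dtype=float)
    return weights / weights.sum()

def conditional_probs(x, j, rho, zeta):
    """Eq. 16: probabilities of values of V2 given value j of V1.
    rho steers the correlation strength, zeta the damping for larger j."""
    n = x + 1  # |V_i|
    weights = np.array(
        [(1 - rho) ** abs(j - l) + zeta ** abs(n - j) for l in range(n)],
        dtype=float,
    )
    return weights / weights.sum()

def generate_events(num_events=10000, x=9, p1=0.7, p3=0.7,
                    rho=0.9, zeta=0.4, seed=0):
    """Draw events (v1, v2, v3) where only V1 and V2 correlate."""
    rng = np.random.default_rng(seed)
    p_v1 = value_probs(x, p1)
    p_v3 = value_probs(x, p3)
    cond = [conditional_probs(x, j, rho, zeta) for j in range(x + 1)]
    events = []
    for _ in range(num_events):
        v1 = rng.choice(x + 1, p=p_v1)
        v2 = rng.choice(x + 1, p=cond[v1])  # correlated with V1
        v3 = rng.choice(x + 1, p=p_v3)      # independent of V1 and V2
        events.append((int(v1), int(v2), int(v3)))
    return events
```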

Figure 4 shows the co-occurrences of values from \(V_1\) and \(V_2\) for a sample configuration of \(x = 9\), \(p_1 = 0.7\), \(\rho = 0.9\), and \(\zeta = 0.4\). Due to the relatively strong correlation factor, most values in \(V_1\) occur with the same value of \(V_2\). The figure also shows that higher values of \(V_1\) co-occur with more values of \(V_2\) due to the damping factor, e.g., while \(v_{1, 1}\) only occurs with four different values of \(V_2\), \(v_{1, 9}\) occurs with each value of \(V_2\) at least once.

Fig. 4. Value co-occurrences of damped correlation.

To evaluate the accuracy of the correlation selection procedure, we generate a ground truth of expected value correlations that contains all \(v_{1, j} \leadsto v_{2, l}\) and \(v_{2, l} \leadsto v_{1, j}\) that occur at least once in the data. We count correlations selected by the VCD and present in the ground truth as true positives (TP), correlations selected by the VCD but not present in the ground truth as false positives (FP), correlations present in the ground truth but missed by the VCD as false negatives (FN), and all other correlations as true negatives (TN). We use the F-score \(F_1 = TP / \left( TP + 0.5 \cdot \left( FP + FN \right) \right) \) to measure the accuracy in the next section.
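A minimal sketch of this evaluation step is given below; it assumes, for simplicity, that both the ground truth and the correlations returned by the VCD are represented as directed (variable, value, variable, value) tuples, which is an illustrative convention and not necessarily the paper's internal representation. The events are the (v1, v2, v3) tuples from the data-generation sketch above.

```python
def f_score(detected, events):
    """Compute F1 = TP / (TP + 0.5 * (FP + FN)) for the selected value
    correlations; detected is a set of directed correlation tuples.
    The ground truth contains all value pairs of V1 and V2 that co-occur
    at least once in the data, in both directions."""
    ground_truth = set()
    for v1, v2, _ in events:
        ground_truth.add(("V1", v1, "V2", v2))  # v_{1,j} ~> v_{2,l}
        ground_truth.add(("V2", v2, "V1", v1))  # v_{2,l} ~> v_{1,j}
    tp = len(detected & ground_truth)   # selected and expected
    fp = len(detected - ground_truth)   # selected but not expected
    fn = len(ground_truth - detected)   # expected but missed
    # If nothing is expected or selected, report a perfect score.
    return tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) > 0 else 1.0
```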

Fig. 5. Influence of thresholds on accuracy of correlation selection.

Results. We first experiment with \(\theta _{7}\), which is essential for selecting correlations that represent actual dependencies between the values and do not spuriously emerge from skewed value probability distributions. To analyze the relationship between \(\theta _{7}\) and the correlation strength, we increase \(\theta _{7}\) in steps of 0.05 and \(\rho \) in steps of 0.1 in the range \(\left[ 0, 1 \right] \) while leaving \(p_1 = 0.7, p_3 = 0.7, \zeta =0.4\) constant, generate 10 data samples of 10,000 events each as outlined in the previous section, and then compute the average F-score of these simulation runs. The results visualized in Fig. 5a show that weaker correlation strengths require \(\theta _{7}\) to be sufficiently low to select all correct correlations and achieve the highest possible F-score of 1. However, setting \(\theta _{7}\) to 0 causes a decrease in the F-score independent of the correlation strength. The reason for this is that correlations involving \(V_3\) are not checked for dependency and are thus incorrectly selected, which increases the number of FP. We therefore conclude that \(\theta _{7}\) should be set to a low, but non-zero value, e.g., 0.05. Note that the selection of \(\theta _{7}\) is not affected by \(\zeta \), since additional value co-occurrences have only little influence on the sum of variances as long as they do not dominate the distribution.
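A possible driver for this parameter sweep, reusing the generate_events and f_score sketches from above, could look as follows. The callable run_vcd stands in for the actual correlation selection of the VCD and is a hypothetical placeholder, not part of the published implementation.

```python
import numpy as np

def sweep_theta7(run_vcd, repetitions=10, num_events=10000):
    """Average F-score over 10 synthetic samples for each combination of
    theta_7 (steps of 0.05) and rho (steps of 0.1) in [0, 1], with
    p1 = p3 = 0.7 and zeta = 0.4 held constant."""
    results = {}
    for theta7 in np.arange(0.0, 1.0 + 1e-9, 0.05):
        for rho in np.arange(0.0, 1.0 + 1e-9, 0.1):
            scores = []
            for seed in range(repetitions):
                events = generate_events(num_events=num_events, p1=0.7,
                                          p3=0.7, rho=rho, zeta=0.4,
                                          seed=seed)
                detected = run_vcd(events, theta7=theta7)  # hypothetical VCD call
                scores.append(f_score(detected, events))
            results[(round(theta7, 2), round(rho, 1))] = float(np.mean(scores))
    return results
```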

Threshold \(\theta _{5}\), on the other hand, relies on the total number of co-occurrences for a given value and is thus influenced by \(\zeta \) in addition to \(\rho \). Figure 5b shows the F-score for various combinations of \(\theta _{5}\) and \(\zeta \), while \(\rho = 1\) is fixed. As expected, increasing values for \(\zeta \) yield lower F-scores for a given \(\theta _{5}\), because the number of distinct co-occurring values for any given value increases quickly (cf. Fig. 4). Accordingly, it is necessary to set \(\theta _{5} \ge 1\) for \(\zeta > 0.5\) to select any correlations. For \(\zeta \le 0.5\), \(\theta _{5}\) effectively steers the allowed number of distinct co-occurrences, e.g., for \(\theta _{5} = 0.5\) at most 5 co-occurring values are allowed since \(\left| \mathcal {V}_i \right| = 10, \forall i\).

We argue that the influence of the other thresholds is trivial and therefore omit the plots for brevity. Table 4 shows a summary of all thresholds and the data properties with the highest influence on their selection. Note that \(\theta _{8}\) is most influenced by \(\theta _{5}\) and \(\theta _{6}\) rather than a property of the input data, because these thresholds regulate the generation of value correlations that affect the selection criterion involving \(\theta _{8}\). The table also provides default values that we identified as useful during our experiments and that are used in the evaluations in Sect. 5.

These results indicate that the large number of parameters does not impede practical application of the VCD, since the thresholds are mostly independent of each other and allow configuring the correlation selection constraints specifically to counteract otherwise problematic properties of the data. For example, a high number of correlations involving many distinct values (i.e., \(\left| \mathcal {V} \right| \) is large) or weakly correlated variables (i.e., \(\rho \) is low) should be addressed by adjusting \(\theta _{1}\) and \(\theta _{7}\) accordingly to reduce the total number of correlations that are considered for anomaly detection, as shown in Sect. 5.1.

Table 4. Dependencies and default values of thresholds.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Landauer, M., Höld, G., Wurzenberger, M., Skopik, F., Rauber, A. (2021). Iterative Selection of Categorical Variables for Log Data Anomaly Detection. In: Bertino, E., Shulman, H., Waidner, M. (eds) Computer Security – ESORICS 2021. ESORICS 2021. Lecture Notes in Computer Science, vol 12972. Springer, Cham. https://doi.org/10.1007/978-3-030-88418-5_36

  • DOI: https://doi.org/10.1007/978-3-030-88418-5_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88417-8

  • Online ISBN: 978-3-030-88418-5

