Abstract
In real-world applications, the same data can be expressed in different units or scales: for example, weight in kilograms or pounds, and fuel efficiency in km/l or l/100 km. One unit can be a linear or a non-linear scaling of another. The variation in measurements caused by non-linear scaling makes Anomaly Detection (AD) challenging. Most existing AD algorithms rely on distance- or density-based functions, so their results depend on how the data are expressed; that is, they are representation dependent. To avoid this problem, we introduce a new anomaly detection method, which we call ‘usfAD: Unsupervised Stochastic Forest-based Anomaly Detector’. Our empirical evaluation on synthetic and real-world cybersecurity datasets (spam detection, malicious URL detection and intrusion detection) shows that our approach is more robust to variation in the units/scales used to express data. It produces more consistent and better results than five state-of-the-art AD methods: local outlier factor, one-class support vector machine, isolation forest, nearest neighbor in a random subsample of data, and a simple histogram-based probabilistic method.
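The representation dependence described above can be illustrated with a minimal Python sketch. It scores each value by the distance to its nearest neighbour, a simple distance-based detector chosen here purely for illustration; it is not usfAD, and the fuel-efficiency values, the function names and the linear factor are all assumptions made up for this example.

```python
def nn_distance_scores(values):
    """Anomaly score of each value = distance to its nearest other value."""
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]


def top_anomaly(values):
    """Index of the value with the highest nearest-neighbour distance."""
    scores = nn_distance_scores(values)
    return max(range(len(values)), key=scores.__getitem__)


# Hypothetical fuel-efficiency readings in km/l (illustrative data only).
km_per_l = [5.0, 10.0, 10.5, 11.0, 20.0, 21.0, 22.0, 40.0]

# Non-linear rescaling: the same cars expressed in l/100 km.
l_per_100km = [100.0 / v for v in km_per_l]

# Linear rescaling with an arbitrary positive factor (assumed for illustration).
linearly_scaled = [2.5 * v for v in km_per_l]

print("top anomaly (km/l):     index", top_anomaly(km_per_l))
print("top anomaly (l/100 km): index", top_anomaly(l_per_100km))
print("top anomaly (linear):   index", top_anomaly(linearly_scaled))
```

Under these assumptions, the highest-scored point is the 40 km/l reading when the data are in km/l, but the 5 km/l reading (20 l/100 km) when the same data are in l/100 km. The linear rescaling leaves the top-ranked point unchanged, since it multiplies every distance by the same factor.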
References
Aggarwal CC (2017) Outlier analysis. Springer, Berlin
Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 589–601
Aryal S, Baniya AA, Santosh K (2019) Improved histogram-based anomaly detector with the extended principal component features. arXiv. https://arxiv.org/abs/1909.12702
Aryal S, Ting KM, Haffari G (2016) Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection. In: Proceedings of the 11th Pacific Asia Workshop on Intelligence and Security Informatics, pp 73–86
Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
Aryal S, Ting KM, Washio T, Haffari G (2020) A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min Knowl Disc 34(1):124–162. https://doi.org/10.1007/s10618-019-00660-0
Aryal S, Ting KM, Wells JR, Washio T (2014) Improving iForest with Relative Mass. In: Proceedings of the 18th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp 510–521
Bakshi BR (1999) Multiscale analysis and modelling using wavelets. J Chemom 13(1):415–434
Bandaragoda T, Ting KM, Albrecht D, Liu F, Wells J (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In: Proceedings of the IEEE international conference on data mining workshops, pp 698–705
Baniya AA, Aryal S, Santosh KC (2019) A novel data pre-processing technique: making data mining robust to different units and scales of measurement. In: Proceedings of the 26th international conference on neural information processing (ICONIP) of the Asia-Pacific Neural Network Society, (p. Accepted)
Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the ninth ACM SIGKDD conference on knowledge discovery and data mining, pp 29–38
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the eighth SIAM international conference on data mining, pp 243–254
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD conference on management of data, pp 93–104
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58
Cheng T, Li Z (2006) A multiscale approach for spatio-temporal outlier detection. Trans GIS 10(2):253–263
Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 35(3):124–129
Fernando TL, Webb GI (2017) SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286
Gao Z, Guo L, Ma C, Ma X, Sun K, Xiang H, Liu X et al (2019) AMAD: adversarial multiscale anomaly detection on high-dimensional and time-evolving categorical data. In: Proceedings of the 1st international workshop on deep learning practice for high-dimensional sparse data (DLP-KDD ’19), pp 1–8
Goldstein M, Dengel A (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: Proceedings of the 35th German Conference on Artificial Intelligence, pp 59–63
Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
Hawkins DM (1980) Identification of outliers. Chapman and Hall, London
Jiang H, Wang H, Hu W, Kakde D, Chaudhuri A (2017) Fast incremental SVDD learning algorithm with the Gaussian Kernel. In: Proceedings of the Thirty-Third AAAI conference on artificial intelligence (AAAI), pp 3991–3998
Joiner BL (1981) Lurking variables: some examples. Am Stat 35(4):227–233
Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the Eighth IEEE international conference on data mining, pp 413–422
Liu Q, Klucik R, Chen C, Grant G, Gallaher D, Lv Q, Shang L (2017) Unsupervised detection of contextual anomaly in remotely sensed data. Remote Sens Environ 202(1):75–87
Lord FM (1953) On the statistical treatment of football numbers. Am Psychol 8(12):750–751
Mamun MS, Rathore MA, Lashkari AH, Stakhanova N (2016) Detecting malicious URLs using lexical analysis. In: Proceedings of the international conference on network and system security (NSS 2016), pp 467–482
Pang G, Cao L, Chen L, Liu H (2018) Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2041–2050
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Duchesnay E et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Rekha AG (2015) A fast support vector data description system for anomaly detection using big data. In: Proceedings of the 30th Annual ACM symposium on applied computing (SAC), pp 931–932
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138
Siddiqui S, Khan MS, Ferens K (2017) Multiscale Hebbian neural network for cyber threat detection. In: Proceedings of the international joint conference on neural networks (IJCNN), pp 1427–1434
Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680
Sugiyama M, Borgwardt KM (2013) Rapid distance-based outlier detection via sampling. In: Proceedings of the 27th annual conference on neural information processing systems, pp 467–475
Tax D, Duin R (2004) Support vector data description. Mach Learn 54(1):45–66
Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91
Townsend JT, Ashby FG (1984) Measurement scales and statistics: the misconception misconceived. Psychol Bull 96(2):394–401
Velleman PF, Wilkinson L (1993) Nominal, ordinal, interval, and ratio typologies are misleading. Am Stat 47(1):65–72
Weinan E (2011) Principles of multiscale modeling (Vol 6). Cambridge University Press, Cambridge
Zhong G, Wang L-N, Ling X, Dong J (2016) An overview on data representation learning: from traditional feature learning to recent deep learning. J Financ Data Sci 2(4):265–278
Acknowledgements
This paper is an extension of a conference paper published in the Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2018 [2]. The authors would like to thank Mr Arbind Agrahari Baniya for his help in running some of the experiments in this extended version of the paper. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-20-1-4005.
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
Cite this article
Aryal, S., Santosh, K. & Dazeley, R. usfAD: a robust anomaly detector based on unsupervised stochastic forest. Int. J. Mach. Learn. & Cyber. 12, 1137–1150 (2021). https://doi.org/10.1007/s13042-020-01225-0
DOI: https://doi.org/10.1007/s13042-020-01225-0