FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream

Li, Bin; Wang, Yi-jie; Yang, Dong-sheng; Li, Yong-mou; Ma, Xing-kong

doi:10.1631/FITEE.1800038

Published: 19 April 2019

Volume 20, pages 388–404, (2019)

Frontiers of Information Technology & Electronic Engineering

Bin Li ORCID: orcid.org/0000-0003-0876-2694¹,
Yi-jie Wang¹,
Dong-sheng Yang²,
Yong-mou Li¹ &
…
Xing-kong Ma¹

279 Accesses
11 Citations

Abstract

Sequence anomaly detection is widely used in many fields, where the sequence data are usually multi-dimensional and arrive as a data stream. Designing an anomaly detection method for a multi-dimensional sequence over a data stream that is both accurate and fast is challenging for two reasons: (1) redundant dimensions in the sequence data and a large state space weaken sequence modeling; (2) anomaly detection cannot keep up with the high speed of the data stream, especially when concept drift occurs, which reduces the detection rate. Most existing sequence anomaly detection methods focus on single-dimensional sequences, and the studies that do address multi-dimensional sequences concentrate mainly on static databases rather than data streams. To improve anomaly detection for a multi-dimensional sequence over a data stream, we propose a novel unsupervised fast and accurate anomaly detection (FAAD) method that comprises three algorithms. First, a method called “information calculation and minimum spanning tree cluster” reduces redundant dimensions. Second, to speed up model construction while preserving the detection rate for sequences over the data stream, we propose a method called “random sampling and subsequence partitioning based on the index probabilistic suffix tree.” Last, a method called “anomaly buffer based on model dynamic adjustment” dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect anomalies in multi-dimensional log audit data. Compared with existing anomaly detection methods, FAAD performs well in both detection rate and speed and is not affected by concept drift.
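
To make the first step more concrete, the following is a minimal sketch of how redundant dimensions might be grouped with an information measure and a minimum spanning tree. The abstract does not specify the measure or the cutting rule, so several things here are assumptions: symmetric uncertainty is used as the information measure, the distance between two dimensions is taken as 1 minus their symmetric uncertainty, and the threshold `cut` and all function names are hypothetical. It should not be read as the authors' algorithm.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]. Assumed measure."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # both columns are constant, so treat them as redundant
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 2 * mutual_info / (hx + hy)

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns (weight, u, v)."""
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    best = [(dist[0][v], 0) for v in range(n)]  # cheapest link into the tree
    edges = []
    for _ in range(n - 1):
        v = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        edges.append((best[v][0], best[v][1], v))
        in_tree[v] = True
        for w in range(n):
            if not in_tree[w] and dist[v][w] < best[w][0]:
                best[w] = (dist[v][w], v)
    return edges

def cluster_dimensions(columns, cut=0.5):
    """Group dimensions joined by MST edges of distance <= cut (hypothetical)."""
    n = len(columns)
    dist = [[1 - symmetric_uncertainty(columns[i], columns[j])
             for j in range(n)] for i in range(n)]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Keep only short MST edges; long edges are cut, splitting the clusters.
    for weight, u, v in mst_edges(dist):
        if weight <= cut:
            parent[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy data: dimension 1 is a relabeling of dimension 0; dimension 2 is independent.
a = [0, 1, 0, 1, 0, 1, 0, 1]
b = ["x", "y", "x", "y", "x", "y", "x", "y"]
c = [0, 0, 1, 1, 0, 0, 1, 1]
clusters = cluster_dimensions([a, b, c])
```

In this toy input, dimensions 0 and 1 carry the same information and fall into one group, while dimension 2 stays on its own. One plausible use is to keep a single representative dimension per group before building the sequence model, although how FAAD selects representatives is not stated in the abstract.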

Author information

Authors and Affiliations

Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, 410073, China
Bin Li, Yi-jie Wang, Yong-mou Li & Xing-kong Ma
Block Chain Research Institute of LianLian Pay, Hangzhou, 310000, China
Dong-sheng Yang

Corresponding author

Correspondence to Yi-jie Wang.

Additional information

Project supported by the National Key R&D Program of China (No. 2016YFB1000101), the National Natural Science Foundation of China (Nos. 61379052 and 61502513), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province, China (No. 14JJ1026), and the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20124307110015)

About this article

Cite this article

Li, B., Wang, Yj., Yang, Ds. et al. FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream. Frontiers Inf Technol Electronic Eng 20, 388–404 (2019). https://doi.org/10.1631/FITEE.1800038

Received: 15 January 2018
Accepted: 13 May 2018
Published: 19 April 2019
Issue Date: March 2019
DOI: https://doi.org/10.1631/FITEE.1800038

Key words

CLC number

TP391.4
