Abstract
Anomaly detection for symbolic sequence data is an important area of research with relevance to many application domains. While several techniques have been proposed within different domains, understanding of their relative strengths and weaknesses is limited. The main reason is that the nature of sequence data varies significantly across domains; a technique that performs well in its original domain is therefore not guaranteed to perform well in a different one. In this paper, we aim to establish this understanding for a wide variety of anomaly detection techniques for symbolic sequences. We present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques, while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data. The analysis presented in this paper allows relative comparison of the different anomaly detection techniques and highlights their strengths and weaknesses. We extend the reference based analysis (RBA) framework, which was originally proposed to analyze multivariate categorical data, to analyze symbolic sequence data sets. We visualize the symbolic sequences using the characteristics provided by the RBA framework and use the visualization to understand various aspects of the sequence data. We then use the characterization done by RBA to understand the performance of the different techniques. Using the RBA framework, we propose two anomaly detection techniques for symbolic sequences, which show consistently superior performance over the existing techniques across the different data sets.
Notes
This framework, which characterizes a given data set with respect to a base or reference data set, was originally proposed for characterizing categorical data (Chandola et al. 2009).
References
Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL (2000) The Pfam protein families database. Nucleic Acids Res 28:263–266
Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41(1):164–171
Budalakoti S, Srivastava A, Otey M (2007) Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol 37
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection—a survey. ACM Comput Surv 41(3):1–58
Chandola V, Banerjee A, Kumar V (2012) Anomaly detection for discrete sequences: a survey. IEEE Trans Knowl Data Eng 24:823–839
Chandola V, Boriah S, Kumar V (2009) A framework for exploring categorical data. In: Proceedings of the ninth SIAM International Conference on Data Mining
Chandola V, Boriah S, Kumar V (2010) A reference based analysis framework for analyzing system call traces. In: CSIIRW ’10: Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, New York, NY, USA, ACM
Chandola V, Mithal V, Kumar V (2008) A comparative evaluation of anomaly detection techniques for sequence data. In: Proceedings of International Conference on Data Mining
Chandola V, Mithal V, Kumar V (2008) Comparing anomaly detection techniques for sequence data. Technical Report 08–021, University of Minnesota, Computer Science Department, July 2008
Cohen WW (1995) Fast effective rule induction. In: Prieditis A, Russell S (eds) Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, Tahoe City, pp 115–123
Eskin E, Lee W, Stolfo S (2001) Modeling system call for intrusion detection using dynamic window sizes. In: Proceedings of DISCEX
Forney GD Jr (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
Forrest S, Hofmeyr SA, Somayaji A, Longstaff TA (1996) A sense of self for Unix processes. In: Proceedings of the ISRSP96, pp 120–128
Forrest S, Warrender C, Pearlmutter B (1999) Detecting intrusions using system calls: alternate data models. In: Proceedings of the 1999 IEEE ISRSP, Washington, DC, USA, pp 133–145. IEEE Computer Society
Gao B, Ma H-Y, Yang Y-H (2002) HMMs (hidden Markov models) based on anomaly intrusion detection method. In: Proceedings of International Conference on Machine Learning and Cybernetics, pp 381–385. IEEE
Gonzalez FA, Dasgupta D (2003) Anomaly detection using real-valued negative selection. Genet Program Evolvable Mach 4(4):383–403
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Hofmeyr SA, Forrest S, Somayaji A (1998) Intrusion detection using sequences of system calls. J Comput Secur 6(3):151–180
Lazarevic A, Ertoz L, Kumar V, Ozgur A, Srivastava J (2003) A comparative study of anomaly detection schemes in network intrusion detection. In: Proceedings of SIAM International Conference on Data Mining. SIAM, May 2003
Lee W, Stolfo S (1998) Data mining approaches for intrusion detection. In: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX
Lee W, Stolfo S, Chan P (1997) Learning patterns from Unix process execution traces for intrusion detection. In: Proceedings of the AAAI 97 workshop on AI methods in Fraud and risk management
Lippmann RP, et al. (2000) Evaluating intrusion detection systems—the 1998 DARPA off-line intrusion detection evaluation. In: DARPA Information Survivability Conference and Exposition (DISCEX) vol 2, pp 12–26. IEEE Computer Society Press
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Cam LM, Neyman J (eds) Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297
Michael CC, Ghosh A (2000) Two state-based approaches to program-based anomaly detection. In: Proceedings of the 16th Annual Computer Security Applications Conference, pp 21. IEEE Computer Society
Qiao Y, Xin XW, Bin Y, Ge S (2002) Anomaly intrusion detection method based on HMM. Electron Lett 38(13):663–664
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the ACM SIGMOD international conference on Management of data, ACM
Ray A (2004) Symbolic dynamic analysis of complex systems for anomaly detection. Signal Process 84(7):1115–1130
Shalizi CR, Klinkner KL (2004) Blind construction of optimal nonlinear recursive predictors for discrete sequences. In: Chickering M, Halpern JY (eds) Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004). AUAI Press, Arlington, Virginia, pp 504–511
Srivastava AN (2005) Discovering system health anomalies using data mining techniques. In: Proceedings of 2005 Joint Army Navy NASA Airforce Conference on Propulsion
Sun P, Chawla S, Arunasalam B (2006) Mining for outliers in sequential databases. In: SIAM International Conference on Data Mining
Acknowledgments
This work was supported by NASA under award NNX08AC36A, NSF Grant CNS-0551551 and NSF Grant IIS-0713227. Access to computing facilities was provided by the Digital Technology Consortium.
Additional information
Responsible editor: Eamonn Keogh.
About this article
Cite this article
Chandola, V., Mithal, V. & Kumar, V. A reference based analysis framework for understanding anomaly detection techniques for symbolic sequences. Data Min Knowl Disc 28, 702–735 (2014). https://doi.org/10.1007/s10618-013-0315-0
DOI: https://doi.org/10.1007/s10618-013-0315-0