A reference based analysis framework for understanding anomaly detection techniques for symbolic sequences

Data Mining and Knowledge Discovery

Abstract

Anomaly detection for symbolic sequence data is an important area of research with applications in many domains. While several techniques have been proposed within different domains, their relative strengths and weaknesses are poorly understood. The main reason is that the nature of sequence data varies significantly across domains, so a technique that performs well in its original domain is not guaranteed to perform well in a different one. In this paper, we aim to establish this understanding for a wide variety of anomaly detection techniques for symbolic sequences. We present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques, while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data. The analysis presented in this paper allows relative comparison of the different anomaly detection techniques and highlights their strengths and weaknesses. We extend the reference based analysis (RBA) framework, which was originally proposed to analyze multivariate categorical data, to symbolic sequence data sets. We visualize the symbolic sequences using the characteristics provided by the RBA framework and use the visualization to understand various aspects of the sequence data. We then use the characterization provided by RBA to understand the performance of the different techniques. Using the RBA framework, we propose two anomaly detection techniques for symbolic sequences, which show consistently superior performance over the existing techniques across the different data sets.

Notes

  1. http://www.cs.umn.edu/~chandola/ICDM2008

  2. http://www.cs.unm.edu/~immsec/systemcalls.htm

  3. This framework, which characterizes a given data set with respect to a base or reference data set, was originally proposed for characterizing categorical data (Chandola et al. 2009).

References

  • Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL (2000) The Pfam protein families database. Nucleic Acids Res 28:263–266

  • Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41(1):164–171

  • Budalakoti S, Srivastava A, Otey M (2007) Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol 37

  • Chandola V, Banerjee A, Kumar V (2009) Anomaly detection—a survey. ACM Comput Surv 41(3):1–58

  • Chandola V, Banerjee A, Kumar V (2012) Anomaly detection for discrete sequences: a survey. IEEE Trans Knowl Data Eng 24:823–839

  • Chandola V, Boriah S, Kumar V (2009) A framework for exploring categorical data. In: Proceedings of the ninth SIAM International Conference on Data Mining

  • Chandola V, Boriah S, Kumar V (2010) A reference based analysis framework for analyzing system call traces. In: CSIIRW ’10: Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, New York, NY, USA, ACM

  • Chandola V, Mithal V, Kumar V (2008) A comparative evaluation of anomaly detection techniques for sequence data. In: Proceedings of International Conference on Data Mining

  • Chandola V, Mithal V, Kumar V (2008) Comparing anomaly detection techniques for sequence data. Technical Report 08–021, University of Minnesota, Computer Science Department, July 2008

  • Cohen WW (1995) Fast effective rule induction. In: Prieditis A, Russell S (eds) Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, Tahoe City, pp 115–123

  • Eskin E, Lee W, Stolfo S (2001) Modeling system call for intrusion detection using dynamic window sizes. In: Proceedings of DISCEX

  • Forney GD Jr (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278

  • Forrest S, Hofmeyr SA, Somayaji A, Longstaff TA (1996) A sense of self for Unix processes. In: Proceedings of the ISRSP96, pp 120–128

  • Forrest S, Warrender C, Pearlmutter B (1999) Detecting intrusions using system calls: Alternate data models. In: Proceedings of the 1999 IEEE ISRSP, Washington, DC, USA, pp 133–145. IEEE Computer Society

  • Gao B, Ma H-Y, Yang Y-H (2002) HMMs (hidden Markov models) based on anomaly intrusion detection method. In: Proceedings of International Conference on Machine Learning and Cybernetics, pp 381–385. IEEE

  • Gonzalez FA, Dasgupta D (2003) Anomaly detection using real-valued negative selection. Genet Program Evolvable Mach 4(4):383–403

  • Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126

  • Hofmeyr SA, Forrest S, Somayaji A (1998) Intrusion detection using sequences of system calls. J Comput Secur 6(3):151–180

  • Lazarevic A, Ertoz L, Kumar V, Ozgur A, Srivastava J (2003) A comparative study of anomaly detection schemes in network intrusion detection. In: Proceedings of SIAM International Conference on Data Mining. SIAM, May 2003

  • Lee W, Stolfo S (1998) Data mining approaches for intrusion detection. In: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX

  • Lee W, Stolfo S, Chan P (1997) Learning patterns from Unix process execution traces for intrusion detection. In: Proceedings of the AAAI 97 workshop on AI methods in Fraud and risk management

  • Lippmann RP, et al. (2000) Evaluating intrusion detection systems—the 1998 DARPA off-line intrusion detection evaluation. In: DARPA Information Survivability Conference and Exposition (DISCEX) vol 2, pp 12–26. IEEE Computer Society Press

  • MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Cam LM, Neyman J (eds) Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297

  • Michael CC, Ghosh A (2000) Two state-based approaches to program-based anomaly detection. In: Proceedings of the 16th Annual Computer Security Applications Conference, p 21. IEEE Computer Society

  • Qiao Y, Xin XW, Bin Y, Ge S (2002) Anomaly intrusion detection method based on HMM. Electron Lett 38(13):663–664

  • Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the ACM SIGMOD international conference on Management of data, ACM

  • Ray A (2004) Symbolic dynamic analysis of complex systems for anomaly detection. Signal Process 84(7):1115–1130

  • Shalizi CR, Klinkner KL (2004) Blind construction of optimal nonlinear recursive predictors for discrete sequences. In: Chickering M, Halpern JY (eds) Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004). AUAI Press, Arlington, Virginia, pp 504–511

  • Srivastava AN (2005) Discovering system health anomalies using data mining techniques. In: Proceedings of 2005 Joint Army Navy NASA Airforce Conference on Propulsion

  • Sun P, Chawla S, Arunasalam B (2006) Mining for outliers in sequential databases. In: SIAM International Conference on Data Mining

Acknowledgments

This work was supported by NASA under award NNX08AC36A, NSF Grant CNS-0551551 and NSF Grant IIS-0713227. Access to computing facilities was provided by the Digital Technology Consortium.

Author information

Corresponding author

Correspondence to Varun Chandola.

Additional information

Responsible editor: Eamonn Keogh.

About this article

Cite this article

Chandola, V., Mithal, V. & Kumar, V. A reference based analysis framework for understanding anomaly detection techniques for symbolic sequences. Data Min Knowl Disc 28, 702–735 (2014). https://doi.org/10.1007/s10618-013-0315-0

  • DOI: https://doi.org/10.1007/s10618-013-0315-0
