
Bridging the Semantic Gap: Human Factors in Anomaly-Based Intrusion Detection Systems

Network Science and Cybersecurity

Part of the book series: Advances in Information Security ((ADIS,volume 55))

Abstract

Anomaly-based intrusion detection has been pursued as an alternative to standard signature-based methods since the seminal work of Denning in 1987. Despite how long it has been studied, the high level of activity in this area, and the remarkable success of machine learning techniques in other areas, anomaly-based IDSs remain rarely used in practice, and none appear to enjoy the widespread popularity of common misuse detectors such as Bro and Snort. We examine a potential cause of this observation, the “semantic gap” identified by Sommer and Paxson in 2010, in some detail, with reference to several common building blocks for anomaly-based intrusion detection systems. Finally, we revisit tree-based structures for rule construction similar to those first discussed by Vaccaro and Liepins in 1989 in light of modern results in ensemble learning, and suggest how such constructions could be used to generate anomaly-based intrusion detection systems that retain acceptable performance while producing output that is more actionable for human analysts.


Notes

  1.

    It is also worth noting that [6] use and provide access to a “KDD-like” set of data—that is, data aggregated on flow-like structures containing labels—that contains real data gathered from honeypots between 2006 and 2009 rather than synthetic attacks that predate 1999; this may provide a more useful and realistic alternative to the KDD’99 set.

  2.

    A possibly instructive exercise for the reader: obscure the labels and ask a knowledgeable colleague to attempt to divine which two packets are ‘normal’ and which is ‘anomalous’, and why.

  3.

    Note another critical feature: if adversarial actors are capable of crafting their traffic to approximate \( f_{0} \), such that \( \left| 1 - \frac{f_{1}(x)}{f_{0}(x)} \right| \le \epsilon \) for some small \( \epsilon > 0 \), and can control the rate of malicious traffic they send and hence \( P(I) \), then they may craft their traffic such that the defenders have no \( x^{\star} \) that satisfies the above relationship and so cannot perform cost-effective anomaly detection. We do not discuss this problem in detail, but reserve it for future work.
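    A toy numerical illustration of this mimicry condition (a sketch only: Gaussian class-conditional densities, the particular means, and the base rate are all illustrative assumptions, not taken from the chapter):

```python
import math

def gauss_pdf(x, mu, sigma):
    # Gaussian density; stands in for the defender's models f0 (normal) and f1 (intrusion)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_intrusion(x, p_intrusion, f0, f1):
    # P(I | x) via Bayes' rule from the class-conditional densities
    num = f1(x) * p_intrusion
    return num / (num + f0(x) * (1 - p_intrusion))

f0 = lambda x: gauss_pdf(x, 0.0, 1.0)           # benign traffic model
f1_distinct = lambda x: gauss_pdf(x, 3.0, 1.0)  # attacker traffic far from f0
f1_mimic = lambda x: gauss_pdf(x, 0.1, 1.0)     # attacker crafted so f1 is close to f0

p_i = 1e-4  # assumed base rate of intrusion

# When f1 is well separated from f0, some observation x* yields a usefully large
# posterior P(I | x*); when |1 - f1(x)/f0(x)| is small everywhere, the likelihood
# ratio is near 1 and P(I | x) stays pinned near the tiny prior P(I) for every x.
```

The point of the sketch is that the defender's posterior depends entirely on the likelihood ratio \( f_{1}(x)/f_{0}(x) \); an adversary who forces that ratio toward 1 leaves no observation on which an alert is cost-effective.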

  4.

    In this case, any outgoing traffic to a relatively high destination port was deemed by an analyst to be unusual, but “certainly not a red flag”; the fact that it was non-TCP and did not originate from the lower end of the range of registered ports suggested a UDP streaming protocol, and such protocols often communicate across ephemeral ports. The analyst volunteered the suggestion that if it were in fact UDP it would likely not warrant further analysis. When the same analyst was presented with the outputs given in Fig. 1 through Fig. 3, they were of the opinion that the output was not terribly useful, and that it did not provide them with any guidance as to why the traffic appeared suspicious: the semantic gap in action.

  5.

    Due to the large size of the test set, it was not loaded into memory all at once, and instead was read sequentially from disk. Total time elapsed was 1023.3 s, of which profiling indicated that roughly 88% was consumed by disk I/O operations. As our interest was in offhand comparison and not production use, we did not attempt to optimize this further.
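    The sequential-read approach described in this note can be sketched as follows (a minimal illustration; the field indices shown are an assumption for the three extracted KDD fields, not taken from the chapter):

```python
import csv

def stream_records(path, field_indices):
    # Yield one feature vector at a time so the full test set never resides
    # in memory; with a multi-gigabyte file such as kddcup.data.corrected,
    # disk I/O dominates the total runtime, as the profiling above indicates.
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            yield [row[i] for i in field_indices]

# Hypothetical usage: pull three fields per record and classify as a stream.
# for features in stream_records("kddcup.data.corrected", [22, 29, 5]):
#     model.predict([features])
```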

References

  1. R. Sommer, V. Paxson, Outside the closed world: on using machine learning for network intrusion detection, in 2010 IEEE Symposium on Security and Privacy (SP), 2010

  2. P. Laskov, P. Düssel, C. Schäfer, K. Rieck, Learning intrusion detection: supervised or unsupervised?, ed. by F. Roli, S. Vitulano (Springer, Berlin, 2005), pp. 50–57

  3. M. Roesch, Snort – lightweight intrusion detection for networks, in Proceedings of the 13th USENIX Conference on System Administration, 1999, pp. 229–238

  4. V. Paxson, Bro: a system for detecting network intruders in real time. Comput. Netw. 31(23–24), 2435–2463 (1999)

  5. J. Long, D. Schwartz, S. Stoecklin, Distinguishing false from true alerts in Snort by data mining patterns of alerts, in Proceedings of 2006 SPIE Defense and Security Symposium, 2006

  6. M. Sato, H. Yamaki, H. Takakura, Unknown attacks detection using feature extraction from anomaly-based IDS alerts, in 2012 IEEE/IPSJ 12th International Symposium on Applications and the Internet (SAINT), 2012

  7. Y. Song, M.E. Locasto, A. Stavrou, A.D. Keromytis, S.J. Stolfo, On the infeasibility of modeling polymorphic shellcode – Re-thinking…, in Mach. Learn., 2009

  8. H. Debar, M. Dacier, A. Wespi, Towards a taxonomy of intrusion-detection systems. Comput. Netw. 31(8), 805–822 (1999)

  9. O. Depren, M. Topallar, E. Anarim, M.K. Ciliz, An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks. Expert Syst. Appl. 29(4), 713–722 (2005)

  10. J. Zhang, M. Zulkernine, A. Haque, Random-forests-based network intrusion detection systems. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 38(5), 649–659 (2008)

  11. N. Abe, B. Zadrozny, J. Langford, Outlier detection by active learning, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2006

  12. S. Axelsson, The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 3(3), 186–205 (2000)

  13. A. Koufakou, E.G. Ortiz, M. Georgiopoulos, G.C. Anagnostopoulos, K.M. Reynolds, A scalable and efficient outlier detection strategy for categorical data, in 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007

  14. M.E. Otey, A. Ghoting, S. Parthasarathy, Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov. 12(2–3), 203–228 (2006)

  15. X. Song, M. Wu, C. Jermaine, S. Ranka, Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19(5), 631–645 (2007)

  16. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  17. K. Wang, S. Stolfo, One-class training for masquerade detection, in Workshop on Data Mining for Computer Security, 2003

  18. R. Perdisci, G. Gu, W. Lee, Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems, in Sixth International Conference on Data Mining (ICDM), 2006

  19. S. Mukkamala, G. Janoski, A. Sung, Intrusion detection using neural networks and support vector machines, in Proceedings of the 2002 International Joint Conference on Neural Networks, 2002

  20. J. Weston, C. Watkins, Multi-class support vector machines, Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998

  21. R. Chen, K. Cheng, Y. Chen, C. Hsieh, Using rough set and support vector machine for network intrusion detection system, in First Asian Conference on Intelligent Information and Database Systems, 2009

  22. T. Shon, Y. Kim, C. Lee, J. Moon, A machine learning framework for network anomaly detection using SVM and GA, in Proceedings of the Sixth Annual IEEE SMC Information Assurance Workshop (IAW), 2005

  23. K. Wang, S. Stolfo, Anomalous payload-based network intrusion detection, in Recent Advances in Intrusion Detection, 2004

  24. B. Sangster, T. O’Connor, T. Cook, R. Fanelli, E. Dean, J. Adams, C. Morrell, G. Conti, Toward instrumenting network warfare competitions to generate labeled datasets, in USENIX Workshop on Cyber Security Experimentation and Test (CSET), 2009

  25. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  26. P. Biondi, Scapy, a powerful interactive packet manipulation program, 2011, http://www.secdev.org/projects/scapy/

  27. V. Frias-Martinez, J. Sherrick, S.J. Stolfo, A.D. Keromytis, A network access control mechanism based on behavior profiles, in Annual Computer Security Applications Conference (ACSAC), 2009

  28. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)

  29. A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Microsoft Technical Report, 2011

  30. D.S. Kim, S.M. Lee, J.S. Park, Building lightweight intrusion detection system based on random forest, ed. by J. Wang, Z. Yi, J.M. Zurada, B. Lu, H. Yin (Springer, Berlin, 2006), pp. 224–230

  31. F.T. Liu, K.M. Ting, Z.-H. Zhou, Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data 6(1), 3:1–3:39 (2012)

  32. S.C. Tan, K.M. Ting, T.F. Liu, Fast anomaly detection for streaming data, in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, vol. 2, 2011

  33. H.S. Vaccaro, G.E. Liepins, Detection of anomalous computer session activity, in Proceedings of the 1989 IEEE Symposium on Security and Privacy, 1989

  34. D.E. Denning, An intrusion-detection model. IEEE Trans. Softw. Eng. 13(2), 222–232 (1987)

  35. M. Mahoney, P. Chan, An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection, in Recent Advances in Intrusion Detection, 2003

  36. J. McHugh, Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Inf. Syst. Secur. 3(4), 262–294 (2000)

  37. T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, C. Jalali, P. Neumann, H. Javitz, A. Valdes, T. Garvey, A real-time intrusion-detection expert system (IDES), SRI International, Computer Science Laboratory, 1992

  38. M. Molina, I. Paredes-Oliva, W. Routly, P. Barlet-Ros, Operational experiences with anomaly detection in backbone networks. Comput. Secur. 31(3), 273–285 (2012)

  39. K.M. Tan, R.A. Maxion, “Why 6?” Defining the operational limits of stide, an anomaly-based intrusion detector, in Proceedings of the IEEE Symposium on Security and Privacy, 2001

  40. L. Sassaman, M.L. Patterson, S. Bratus, A. Shubina, The halting problems of network stack insecurity, in USENIX, 2011

  41. Z. Zhou, Ensemble Methods: Foundations and Algorithms (Chapman & Hall, 2012)



Corresponding author

Correspondence to Richard Harang.


Appendix A


Random decision tree classification of KDD’99 data was performed using the Scikit-learn [25] package under Python 2.7.2 on a commodity desktop workstation. Training used the file kddcup.data_10_percent_corrected (494,021 records) and testing used the file kddcup.data.corrected (4,898,431 records). The three fields “Count”, “diff_srv_rate”, and “dst_bytes” were extracted along with the label field in both data sets; all other data was discarded. The random decision forest was trained with the following parameters:

  • Classification threshold: simple majority

  • No bootstrapping used

  • Features per node: 2

  • Node splitting by information gain

  • Minimum leaf samples: 1

  • Minimum samples to split: 2

  • Max tree depth: 9

  • Number of trees: 11
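The parameter list above maps directly onto a scikit-learn estimator; the following is a minimal sketch (with small synthetic data standing in for the KDD’99 fields, since the data files are not reproduced here; note that scikit-learn combines trees by averaging per-tree class probabilities, which for hard per-tree votes reduces to the simple-majority rule listed above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Parameters mirror the list above: 11 trees, depth <= 9, 2 candidate features
# per split, information-gain (entropy) splitting, and no bootstrap resampling.
clf = RandomForestClassifier(
    n_estimators=11,
    max_depth=9,
    max_features=2,
    criterion="entropy",
    bootstrap=False,
    min_samples_leaf=1,
    min_samples_split=2,
)

# Toy stand-in for (Count, diff_srv_rate, dst_bytes) feature vectors:
# two well-separated clusters playing the roles of normal and attack traffic.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)  # 0 = normal, 1 = attack
clf.fit(X, y)
```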

Training the classifier required 4.4 s on a single processor; testing required approximately 122.8 s (see Note 5). The following confusion matrix was produced (note that we have omitted correct classifications on the diagonal for compactness, and have also omitted rows corresponding to predictions that the classifier never produced).

 

Predicted class (rows A–G) versus true class (columns); only the non-zero off-diagonal counts are listed:

  • A) Normal: Guess_passwd 53, Nmap 2315, Loadmodule 8, Rootkit 10, Warezclient 1020, Smurf 971, Pod 264, Neptune 5537, Spy 2, ftp_write 8, Phf 4, Portsweep 9558, Teardrop 752, Buffer_overflow 28, Land 20, Imap 12, Warezmaster 5, Perl 3, Multihop 7, Back 2197, Ipsweep 12480, Satan 950

  • B) Loadmodule: Normal 1

  • C) Smurf: Normal 2210, Nmap 1, Neptune 9672, Portsweep 1, Teardrop 119, Buffer_overflow 1, Satan 95

  • D) Neptune: Normal 402, Smurf 40, Portsweep 42, Teardrop 108, Buffer_overflow 1, Land 1, Ipsweep 1, Satan 970

  • E) Portsweep: Satan 3

  • F) Warezmaster: Normal 3, Portsweep 3

  • G) Satan: Normal 6

Total false negatives: \( 36204/4898431 \approx 0.007 \)

Total false positives: \( 2622/4898431 \approx 0.0005 \)
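These totals follow directly from the confusion matrix; a quick check (counts transcribed from the matrix, with rows taken as predicted classes):

```python
# Off-diagonal entries of the predicted-Normal row:
# attack records labelled normal, i.e. false negatives.
fn_counts = [53, 2315, 8, 10, 1020, 971, 264, 5537, 2, 8, 4, 9558,
             752, 28, 20, 12, 5, 3, 7, 2197, 12480, 950]

# True-Normal entries in the non-Normal rows:
# normal records flagged as attacks, i.e. false positives.
fp_counts = [1, 2210, 402, 0, 3, 6]

total_records = 4898431
fn_rate = sum(fn_counts) / total_records  # 36204 / 4898431, approx. 0.007
fp_rate = sum(fp_counts) / total_records  # 2622 / 4898431, approx. 0.0005
```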

The most common errors were misclassification of the IPsweep attack as normal traffic, and classification of flows corresponding to the Neptune attack as the Smurf attack. Random inspection of the IPsweep misclassifications suggests that each “attack” comprised several records; while many individual records were not correctly labeled, every instance examined by hand had at least one record within the overall attack correctly classified. As the Smurf and Neptune attacks are both denial of service attacks, some confusion between the two is to be expected.

While these results certainly demonstrate that random decision forests are accurate and efficient classifiers, the alternative explanation, that the KDD’99 data is simply not a representative data set for IDS research, should not be excluded.


© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Harang, R. (2014). Bridging the Semantic Gap: Human Factors in Anomaly-Based Intrusion Detection Systems. In: Pino, R. (eds) Network Science and Cybersecurity. Advances in Information Security, vol 55. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7597-2_2


  • Print ISBN: 978-1-4614-7596-5

  • Online ISBN: 978-1-4614-7597-2
