Abstract
Anomaly-based intrusion detection has been pursued as an alternative to standard signature-based methods since the seminal work of Denning in 1987. Despite the length of time for which it has been studied, the high level of activity in this area, and the remarkable success of machine learning techniques in other areas, anomaly-based IDSs remain rarely used in practice, and none appear to have the same widespread popularity as more common misuse detectors such as Bro and Snort. We examine a potential cause of this observation, the “semantic gap” identified by Sommer and Paxson in 2010, in some detail, with reference to several common building blocks for anomaly-based intrusion detection systems. Finally, we revisit tree-based structures for rule construction similar to those first discussed by Vaccaro and Liepins in 1989 in light of modern results in ensemble learning, and suggest how such constructions could be used to generate anomaly-based intrusion detection systems that retain acceptable performance while producing output that is more actionable for human analysts.
Notes
- 1.
It is also worth noting that [6] use and provide access to a “KDD-like” set of data—that is, data aggregated on flow-like structures containing labels—that contains real data gathered from honeypots between 2006 and 2009 rather than synthetic attacks that predate 1999; this may provide a more useful and realistic alternative to the KDD’99 set.
- 2.
A possibly instructive exercise for the reader: obscure the labels and ask a knowledgeable colleague to attempt to divine which two packets are ‘normal’ and which is ‘anomalous’, and why.
- 3.
Note another critical feature: if adversarial actors are capable of crafting their traffic to approximate \( f_{0} \), such that \( \left| 1 - \frac{f_{1}(x)}{f_{0}(x)} \right| \le \epsilon \) for some small \( \epsilon > 0 \), and can control the rate of malicious traffic they send and hence \( P(I) \), then they may craft their traffic such that the defenders have no \( x^{\star} \) that satisfies the above relationship and so cannot perform cost-effective anomaly detection. We do not discuss this problem in detail, but reserve it for future work.
- 4.
In this case, any outgoing traffic to a relatively high destination port was deemed by an analyst to be unusual, but “certainly not a red flag”; the fact that it was non-TCP and did not originate from the lower end of the range of registered ports suggested a UDP streaming protocol, and such protocols often communicate across ephemeral ports. The analyst volunteered that if the traffic were in fact UDP it would likely not warrant further analysis. When the same analyst was presented with the outputs given in Fig. 1 through Fig. 3, they were of the opinion that the output was not terribly useful, and that it did not provide them with any guidance as to why the traffic appeared suspicious; the semantic gap in action.
- 5.
Due to the large size of the test set, it was not loaded into memory all at once, but instead was read sequentially from disk. Total time elapsed was 1023.3 s, of which profiling indicated that roughly 88% was consumed by disk I/O operations. As our interest was in offhand comparison and not production use, we did not attempt to optimize this further.
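The mimicry condition in Note 3 can be illustrated numerically. The following sketch is our own illustration, not from the chapter; the prior \( P(I) \) and the bound \( \epsilon \) are arbitrary values chosen for the example. It shows that when an adversary pins the likelihood ratio \( f_{1}(x)/f_{0}(x) \) within \( 1 \pm \epsilon \), the posterior probability of intrusion can never move appreciably away from the prior, so no alarm threshold \( x^{\star} \) is cost-effective.

```python
def posterior(prior, ratio):
    # Bayes' rule written in terms of the likelihood ratio f1(x)/f0(x):
    # P(I|x) = P(I)*f1 / (P(I)*f1 + (1-P(I))*f0)
    return prior * ratio / (prior * ratio + (1 - prior))

prior = 1e-4   # P(I): rate of malicious traffic, here attacker-controlled
eps = 0.01     # mimicry bound on |1 - f1(x)/f0(x)|

# At both extremes of the allowed likelihood ratio, the posterior stays
# within a factor of roughly (1 + eps) of the prior: alarms remain
# dominated by the base rate of normal traffic.
for ratio in (1 - eps, 1 + eps):
    p = posterior(prior, ratio)
    assert abs(p - prior) / prior < 2 * eps
```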
References
R. Sommer, V. Paxson, Outside the closed world: on using machine learning for network intrusion detection, in 2010 IEEE Symposium on Security and Privacy (SP), 2010
P. Laskov, P. Düssel, C. Schäfer, K. Rieck, Learning intrusion detection: supervised or unsupervised?, ed. by F. Roli, S. Vitulano (Springer, Berlin, 2005), pp. 50–57
M. Roesch, Snort – lightweight intrusion detection for networks, in Proceedings of the 13th USENIX Conference on System Administration, 1999, pp. 229–238
V. Paxson, Bro: a system for detecting network intruders in real time. Comput. Netw. 31(23–24), 2435–2463 (1999)
J. Long, D. Schwartz, S. Stoecklin, Distinguishing false from true alerts in Snort by data mining patterns of alerts, in Proceedings of 2006 SPIE Defense and Security Symposium, 2006
M. Sato, H. Yamaki, H. Takakura, Unknown attacks detection using feature extraction from anomaly-based IDS alerts, in 2012 IEEE/IPSJ 12th International Symposium on Applications and the Internet (SAINT), 2012
Y. Song, M.E. Locasto, A. Stavrou, A.D. Keromytis, S.J. Stolfo, On the infeasibility of modeling polymorphic shellcode – Re-thinking…, Mach. Learn., 2009
H. Debar, M. Dacier, A. Wespi, Towards a taxonomy of intrusion-detection systems. Comput. Netw. 31(8), 805–822 (1999)
O. Depren, M. Topallar, E. Anarim, M.K. Ciliz, An intelligent intrusion detection system (IDS) for anomaly and misuse detection in computer networks. Expert Syst. Appl. 29(4), 713–722 (2005)
J. Zhang, M. Zulkernine, A. Haque, Random-forests-based network intrusion detection systems. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 38(5), 649–659 (2008)
N. Abe, B. Zadrozny, J. Langford, Outlier detection by active learning, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2006
S. Axelsson, The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 3(3), 186–205 (2000)
A. Koufakou, E.G. Ortiz, M. Georgiopoulos, G.C. Anagnostopoulos, K.M. Reynolds, A scalable and efficient outlier detection strategy for categorical data, in 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), 2007
M.E. Otey, A. Ghoting, S. Parthasarathy, Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov. 12(2–3), 203–228 (2006)
X. Song, M. Wu, C. Jermaine, S. Ranka, Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19(5), 631–645 (2007)
C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
K. Wang, S. Stolfo, One-class training for Masquerade detection, in Workshop on Data Mining for Computer Security, 2003
R. Perdisci, G. Gu, W. Lee, Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems, in Sixth International Conference on Data Mining (ICDM’06), 2006
S. Mukkamala, G. Janoski, A. Sung, Intrusion detection using neural networks and support vector machines, in Proceedings of the 2002 International Joint Conference on Neural Networks, 2002
J. Weston, C. Watkins, Multi-class support vector machines, Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998
R. Chen, K. Cheng, Y. Chen, C. Hsieh, Using rough set and support vector machine for network intrusion detection system, in First Asian Conference on Intelligent Information and Database Systems, 2009
T. Shon, Y. Kim, C. Lee, J. Moon, A machine learning framework for network anomaly detection using SVM and GA, in Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop (IAW’05), 2005
K. Wang, S. Stolfo, Anomalous payload-based network intrusion detection, in Recent Advances in Intrusion Detection, 2004
B. Sangster, T. O’Connor, T. Cook, R. Fanelli, E. Dean, J. Adams, C. Morrell, G. Conti, Toward instrumenting network warfare competitions to generate labeled datasets, in USENIX Security’s Workshop on Cyber Security Experimentation and Test (CSET), 2009
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
P. Biondi, Scapy: a powerful interactive packet manipulation program, 2011, http://www.secdev.org/projects/scapy/
V. Frias-Martinez, J. Sherrick, S.J. Stolfo, A.D. Keromytis, A network access control mechanism based on behavior profiles, in Annual Computer Security Applications Conference (ACSAC’09), 2009
L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Microsoft Technical Report, 2011
D.S. Kim, S.M. Lee, J.S. Park, Building Lightweight Intrusion Detection System Based on Random Forest, ed. by J. Wang, Z. Yi, J.M. Zurada, B. Lu, H. Yin (Springer, Berlin, 2006), pp. 224–230
F.T. Liu, K.M. Ting, Z.-H. Zhou, Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data 6(1), 3:1–3:39 (2012)
S.C. Tan, K.M. Ting, T.F. Liu, Fast anomaly detection for streaming data, in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, vol. 2, 2011
H.S. Vaccaro, G.E. Liepins, Detection of anomalous computer session activity, in Proceedings of 1989 IEEE Symposium on Security and Privacy, 1989
D.E. Denning, An intrusion-detection model. IEEE Trans. Softw. Eng. 13(2), 222–232 (1987)
M. Mahoney, P. Chan, An analysis of the 1999 DARPA/Lincoln laboratory evaluation data for network anomaly detection, in Recent Advances in Intrusion Detection, 2003
J. McHugh, Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Inf. Syst. Secur. 3(4), 262–294 (2000)
T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, C. Jalali, P. Neumann, H. Javitz, A. Valdes, T. Garvey, A real-time intrusion-detection expert system (IDES), SRI International, Computer Science Laboratory, 1992
M. Molina, I. Paredes-Oliva, W. Routly, P. Barlet-Ros, Operational experiences with anomaly detection in backbone networks. Comput. Secur. 31(3), 273–285 (2012)
K.M. Tan, R.A. Maxion, “Why 6?” Defining the operational limits of stide, an anomaly-based intrusion detector, in Proceedings of the IEEE Symposium on Security and Privacy, 2001
L. Sassaman, M.L. Patterson, S. Bratus, A. Shubina, The Halting problems of network stack insecurity, in USENIX, 2011
Z. Zhou, Ensemble Methods: Foundations and Algorithms (Chapman & Hall, 2012)
Appendix A
Random decision tree classification of KDD’99 data was performed using the Scikit-learn [25] package under Python 2.7.2 on a commodity desktop workstation. Training was performed using the file kddcup.data_10_percent_corrected, and testing was done on the file kddcup.data.corrected: 494,021 training records and 4,898,431 test records. The three fields “Count”, “diff_srv_rate”, and “dst_bytes” were extracted along with the label field in both data sets; all other data was discarded. The random decision forest was trained with the following parameters:
- Classification threshold: simple majority
- No bootstrapping used
- Features per node: 2
- Node splitting by information gain
- Minimum leaf samples: 1
- Minimum samples to split: 2
- Max tree depth: 9
- Number of trees: 11
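These parameters map directly onto scikit-learn’s RandomForestClassifier. The following is a minimal sketch of the configuration, ours rather than the chapter’s original code: it uses the modern scikit-learn API rather than the 2011-era release cited, and the synthetic data merely stands in for the three extracted KDD’99 fields.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in for the KDD'99 features: 200 "normal" and 200 "attack"
# records over 3 fields, drawn from well-separated Gaussians.
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (200, 3))])
y = np.array([0] * 200 + [1] * 200)

clf = RandomForestClassifier(
    n_estimators=11,       # number of trees: 11
    max_depth=9,           # max tree depth: 9
    max_features=2,        # features considered per node: 2
    criterion="entropy",   # node splitting by information gain
    bootstrap=False,       # no bootstrapping used
    min_samples_leaf=1,    # minimum leaf samples: 1
    min_samples_split=2,   # minimum samples to split: 2
    random_state=0,
)
clf.fit(X, y)
# Simple-majority voting across the 11 trees is the default behavior
# of predict()/score() in scikit-learn's forest implementation.
print(clf.score(X, y))
```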
Training the classifier required 4.4 s using a single processor; testing required approximately 122.8 s (see Note 5). The following confusion matrix was produced (note that we have omitted correct classifications on the diagonal for compactness, and have also omitted rows corresponding to predictions that the classifier never produced).
Rows give the predicted class; columns give the actual class.

| Predicted \ Actual | Normal | Guess_passwd | Nmap | Loadmodule | Rootkit | Warezclient | Smurf | Pod | Neptune | Spy | ftp_write | Phf | Portsweep | Teardrop | Buffer_overflow | Land | Imap | Warezmaster | Perl | Multihop | Back | Ipsweep | Satan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Normal | 0 | 53 | 2315 | 8 | 10 | 1020 | 971 | 264 | 5537 | 2 | 8 | 4 | 9558 | 752 | 28 | 20 | 12 | 5 | 3 | 7 | 2197 | 12480 | 950 |
| Loadmodule | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Smurf | 2210 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 9672 | 0 | 0 | 0 | 1 | 119 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 95 |
| Neptune | 402 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 42 | 108 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 970 |
| Portsweep | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| Warezmaster | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Satan | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Total false negatives: \( 36204/4898431 \approx 0.007 \)
Total false positives: \( 2622/4898431 \approx 0.0005 \)
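The aggregate rates quoted above follow directly from the error totals and the test set size; a quick arithmetic check (illustrative only):

```python
# Recompute the quoted false-negative and false-positive rates
# from the totals given in the appendix.
test_records = 4_898_431
false_negatives = 36_204
false_positives = 2_622

fn_rate = false_negatives / test_records
fp_rate = false_positives / test_records
print(round(fn_rate, 3), round(fp_rate, 4))  # → 0.007 0.0005
```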
The most common errors were misclassification of the IPsweep attack as normal traffic, and classification of flows corresponding to the Neptune attack as the Smurf attack. Random inspection of the IPsweep misclassifications suggests that each “attack” comprised several records; while many individual records were not correctly labeled, every instance that was examined by hand had at least one record in the total attack correctly classified. As the Smurf and Neptune attacks are both denial-of-service attacks, some confusion between the two is to be expected.
While these results certainly demonstrate that random decision forests are accurate and efficient classifiers, the alternative that the KDD’99 data is simply not a terribly representative data set for IDS research should not be excluded.
© 2014 Springer Science+Business Media New York
Harang, R. (2014). Bridging the Semantic Gap: Human Factors in Anomaly-Based Intrusion Detection Systems. In: Pino, R. (ed.) Network Science and Cybersecurity. Advances in Information Security, vol 55. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7597-2_2
Print ISBN: 978-1-4614-7596-5
Online ISBN: 978-1-4614-7597-2