Evaluation in the absence of absolute ground truth: toward reliable evaluation methodology for scan detectors

Abstract

Although network reconnaissance through scanning has been well explored in the literature, new scan detection proposals with various detection features and capabilities continue to appear. To our knowledge, however, there is little discussion of reliable methodologies to evaluate network scanning detectors. In this paper, we show that establishing ground truth labels of scanning activity on non-synthetic network traces is a more difficult problem relative to labeling conventional intrusions. The main problem stems from lack of absolute ground truth (AGT). We identify the specific types of errors this admits. For real-world network traffic, typically many events can be equally interpreted as legitimate or intrusions, and therefore, establishing AGT is infeasible since it depends on unknowable intent. We explore how an estimated ground truth based on discrete classification criteria can be misleading since typical detection accuracy measures are strongly dependent on the chosen criteria. We also present a methodology for evaluating and comparing scan detection algorithms. The methodology classifies remote addresses based on continuous scores designed to provide a more accurate reference for evaluation. The challenge of conducting a reliable evaluation in the absence of AGT applies to other areas in network intrusion detection, and corresponding requirements and guidelines apply.


References

  1. Allman, M., Paxson, V., Terrell, J.: A brief history of scanning. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (2007)

  2. Axelsson, S.: The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. (TISSEC). 3(3), 186–205 (2000)

  3. Bro intrusion detection system. http://bro-ids.org/. Accessed May 2010

  4. Casado, M., Freedman, M.J.: Peering through the shroud: the effect of edge opacity on IP-based client identification. In: 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI’07) (2007)

  5. Coull, S.E., Wright, C.V., Monrose, F., Collins, M.P., Reiter, M.K.: Playing devil’s advocate: inferring sensitive information from anonymized network traces. In: NDSS (2007)

  6. Floyd, S., Paxson, V.: Difficulties in simulating the internet. IEEE/ACM Trans. Netw. 9, 392–403 (2001)

  7. Gates, C.: Co-ordinated port scans: a model, a detector and an evaluation methodology. PhD thesis, Dalhousie University (2006)

  8. Gates, C., McNutt, J.J., Kadane, J.B., Kellner, M.: Scan detection on very large networks using logistic regression modeling. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC’06) (2006)

  9. Heberlein, L.T., Dias, G.V., Levitt, K.N., Mukherjee, B., Wood, J., Wolber, D.: A network security monitor. In: IEEE Symposium on Security and Privacy, p. 296 (1990)

  10. Jin, R., Ghahramani, Z.: Learning with multiple labels. Adv. Neural Inf. Process. Syst. 15, 897–904 (2002)

  11. Jung, J.: Real-time detection of malicious network activity using stochastic models. PhD thesis, Massachusetts Institute of Technology (2006)

  12. Jung, J., Paxson, V., Berger, A.W., Balakrishnan, H.: Fast portscan detection using sequential hypothesis testing. In: IEEE Symposium on Security and Privacy (2004)

  13. Kang, M.G., Caballero, J., Song, D.: Distributed evasive scan techniques and countermeasures. In: Proceedings of the Conference on Detection of Intrusions and Malware and Vulnerability Assessment (2007)

  14. Kato, N., Nitou, H., Ohta, K., Mansfield, G., Nemoto, Y.: A real-time intrusion detection system (IDS) for large scale networks and its evaluations. IEICE Trans. Commun. E82–B(11), 1817–1825 (1999)

  15. KDD cup data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. Accessed Jan 2010

  16. Kim, H., Kim, S., Kouritzin, M.A., Sun, W.: Detecting network portscans through anomaly detection. In: Proceedings of SPIE: Signal Processing, Sensor Fusion, and Target Recognition XIII, vol. 5429, p. 254 (2004)

  17. Lam, L., Suen, S.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 27(5), 553–568 (1997)

  18. Leckie, C., Kotagiri, R.: A probabilistic approach to detecting network scans. In: Proceedings of the Eighth IEEE Network Operations and Management Symposium (NOMS’02) (2002)

  19. Li, Z., Goyal, A., Chen, Y.: Honeynet-based botnet scan traffic analysis. In: Botnet detection: countering the largest security threat. Advances in information security, vol. 36, pp. 25–44 (2008)

  20. Li, Z., Goyal, A., Chen, Y., Paxson, V.: Automating analysis of large-scale botnet probing events. In: ASIACCS (2009)

  21. Lippmann, R., Haines, J., Fried, D., Korba, J., Das, K.: The 1999 DARPA off-line intrusion detection evaluation. Comput. Netw. 34(4), 579–595 (2000)

  22. Lippmann, R.P., Cunningham, R.K., Fried, D.J., Graf, I., Kendall, K.R., Webster, S.E., Zissman, M.A.: Results of the DARPA 1998 offline intrusion detection evaluation. In: Proceedings of the Symposium on Recent Advances in Intrusion Detection (RAID’99) (1999)

  23. Mahoney, M.V., Chan, P.K.: An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In: Proceedings of the Sixth International Symposium on Recent Advances in Intrusion Detection (RAID’03) (2003)

  24. Mason, J., Small, S., Monrose, F., MacManus, G.: English shellcode. In: Proceedings of the 16th ACM Conference on Computer and Communications Security (2009)

  25. McHugh, J.: Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln laboratory. ACM Trans. Inf. Syst. Secur. (TISSEC) 3, 262–294 (2000)

  26. Ptacek, T., Newsham, T., Simpson, H.J.: Insertion, evasion, and denial of service: eluding network intrusion detection. Technical report, Secure Networks, Inc., January (1998)

  27. Ringberg, H., Roughan, M., Rexford, J.: The need for simulation in evaluating anomaly detectors. SIGCOMM Comput. Commun. Rev. 38, 55–59 (2008)

  28. Ringberg, H., Soule, A., Rexford, J.: WebClass: adding rigor to manual labeling of traffic anomalies. SIGCOMM Comput. Commun. Rev. 38, 35–38 (2008)

  29. Roelker, D., Norton, M., Hewlett, J.: sfPortscan. http://projects.cs.luc.edu/comp412/dredd/docs/software/readmes/sfsportscan. Accessed Jan 2010

  30. Roesch, M.: Snort: lightweight intrusion detection for networks. In: Proceedings of the 13th Systems Administration Conference (LISA’99) (1999)

  31. Sheng, V., Provost, F., Ipeirotis, P.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the Conference on Knowledge Discovery and Data Mining (2008)

  32. Simon, G., Xiong, H., Eilertson, E., Kumar, V.: Scan detection: a data mining approach. In: Proceedings of the International Conference on Data Mining (SIAM’06) (2006)

  33. Solar Designer: Designing and attacking port scan detection tools. Phrack Magazine 8(53), article 13, July 8, 1998. http://www.phrack.org/issues.html?issue=53&id=13#article

  34. Sommer, R., Paxson, V.: Outside the closed world: on using machine learning for network intrusion detection. In: IEEE Symposium on Security and Privacy, May (2010)

  35. Staniford, S., Hoagland, J.A., McAlerney, J.M.: Practical automated detection of stealthy portscans. J. Comput. Secur. 10(1/2), 105–136 (2002)

  36. Tcpdpriv. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.html. Accessed July 2010

  37. Thabtah, F., Cowling, P., Peng, Y.: Multiple labels associative classification. Knowl. Inf. Syst. 9(1), 109–129 (2006)

  38. Vigna, G.: Network intrusion detection: dead or alive? In: Proceedings of the 26th Annual Computer Security Applications Conference (ACSAC’10) (2010)

  39. Weaver, N., Staniford, S., Paxson, V.: Very fast containment of scanning worms, revisited. In: Christodorescu, M., Jha, S., Maughan, D., Song, D., Wang, C. (eds.) Malware Detection. Advances in information security vol. 27, chap. 6. Springer, pp. 113–145 (2007)

  40. Zhang, Y., Fang, B.: A novel approach to scan detection on the backbone. In: Sixth International Conference on Information Technology: New Generations (ITNG’09), April (2009)

Appendices

A: Obtaining network traffic for evaluation

This appendix provides background related to Sect. 2. Ideally, non-synthetic data sets labeled to identify the target classes of intrusions or anomalies would be used to evaluate detector results. However, for scan detectors, not only are there no publicly available non-synthetic labeled data sets, there are also no synthetic labeled data sets. The problem is not simply that no such labeled data sets are currently available, but that for scan detection it is unclear whether such labeling can be done reliably at all, given the impossibility of determining the intent of a connection attempt (see Sect. 3.1). The following subsections summarize known mechanisms used in the literature to obtain network traffic for evaluation.

A.1 Simulated data sets

In simulation approaches, network traffic is generated by custom network simulation software that models the real network behavior of a particular configuration. Network simulation helps in verifying the correctness of a new or existing network intrusion detector and in predicting its performance, especially in complex network settings. In a simulation, both the background network traffic and the broader scanning campaigns are generated synthetically. In addition to avoiding the legal and privacy issues of using real-world traffic, simulation enables testing a scan detector under various network configurations while controlling the characteristics of the scans.

On the other hand, simulation has known shortcomings. First, given that network traffic is diverse and variability is expected in both network-level (including background traffic) and application-level protocols [34], simulated traffic may not realistically represent real-world traffic. It is important to test detectors on traces from operational networks that resemble those where the detectors will be deployed, yet Internet traffic is hard to simulate or model due to the heterogeneity, scale, and rapidly changing dynamics of the Internet [6]. Second, the generated intrusions or reconnaissance activity (i.e., scan events) might represent only a subset of the patterns existing in the wild. The inadequate understanding of many anomalies and of reconnaissance activity makes it hard to characterize them precisely [28].

Simulating network scanning activity is even more challenging because: (i) depending on the remote host’s intent, an inbound connection attempt can be considered either scanning activity or benign; and (ii) network traffic originating from a scanner does not necessarily contain or accompany malicious payloads.

A.2 Emulated data sets

In network emulation, a subset of the studied traffic is real-world data. For example, starting with a real-world data set, scanning events can be simulated against the target IP address range and then injected into the data set. Emulation appears preferable to simulation because the network traces can be collected from an environment representative of where the detector will be deployed. However, the starting data set may itself include scanning traffic, which affects the evaluation of the detector; filtering out such scanning traffic requires an accurate scan detector in the first place, and hence full knowledge of ground truth. Also, the feedback effects of the interaction between the injected scanning events and the real-world network traffic are not captured. In addition, as with simulation, the generated scanning traffic might not reflect the variety of scanning patterns in the wild. Both simulation and emulation are a good starting point for testing a detector’s ability to identify certain types of intrusions, but neither suffices on its own to evaluate a detector’s accuracy and capabilities [27].

A.3 Real-world traffic data sets

Given the drawbacks of simulation and emulation, and the lack of publicly available non-synthetic accurately labeled data sets, researchers usually evaluate detectors using their own network traffic. Such gathered data sets provide real-world data of both network traffic and intrusions (e.g., scanning events). While anonymization of these data sets (e.g., see [36]) helps substantially in removing private and other sensitive information, information leakage is still possible [5], and thus, for ethical, privacy, and legal reasons, such data sets often stay private.

Such collected data sets are subject to the following issues: (i) the network environment of the data set may exhibit artifacts that, until identified and explained, can significantly affect detector performance; (ii) the nature and size of the network might not represent a realistic target deployment environment; (iii) without control over the characteristics of the scan events, it may be hard to determine the conditions under which a detector will perform well; and (iv) obtaining a reliable ground truth of the intrusions the detector is designed to detect is challenging and might be infeasible for some types of intrusions.

B: Example heuristic for establishing a continuous GTR

In establishing a GTR of scanners for a given network data set, time and computational resource constraints are less stringent than those for a real-time detector. As discussed in Sect. 3.4, remote hosts’ traffic over the entire capture period of the data set can be examined to establish a GTR of scanners. While, to avoid a circular argument, the detection features used to obtain a GTR should differ from those used in the evaluated scan detection algorithm, we argue that monitoring network traffic over a relatively long period of time and/or over a large IP address space breaks such circularity; that is, we argue that time duration is a distinct dimension in detection. Nevertheless, it remains desirable that the GTR include detection features that are not used in the evaluated online detector. The requirements for using a GTR to evaluate a detector or to compare two or more detectors are discussed in Sect. 2. In this section, we present one detection heuristic as an illustrative example of how to build a continuous GTR of remote scanners, assigning each remote host a score according to its observed network traffic, as described in Sect. 4.3.

Several heuristics have been proposed in the literature for detecting known scanning behaviors rarely observed in benign traffic. These heuristics are often developed for real-time scan detectors and thus oriented toward fast detection with as few false positives as possible. The following are commonly used detection heuristics: (i) contacting non-existing local hosts or network services [9, 33]; (ii) making unsuccessful connection attempts, since, unlike legitimate network traffic, most connection attempts that are part of reconnaissance activities are expected to fail, given that active network services are unknown prior to scanning [8, 12, 14]; (iii) exchanging a low volume of data with local hosts [8, 40]; (iv) contacting many local hosts or ports in a small time window [3, 9, 29, 30, 33]; and (v) contacting rarely accessed local hosts or ports [16, 18, 35].

Given that some scanning events might look like rare normal traffic if analyzed in isolation, repeated occurrences of what might individually be considered abnormal, together with the absence of normal traffic, provide more confidence of malicious intent. Among the commonly used scan detection heuristics discussed above, to obtain a GTR of scanners we employ two heuristics of abnormal traffic that seem hard for scanners to evade: (i) failed connection attempts initiated by the remote; and (ii) lack of data exchange, particularly outbound traffic to the remote. In addition, we employ two heuristics as signs of normal traffic: (iii) successful connections initiated by the remote; and (iv) connection attempts initiated by local hosts to the remote (whether successful or unsuccessful). While we combine these four heuristics into one main heuristic that captures many known scanning patterns, the combination can be extended with further detection heuristics.

For any connection attempt {remote IP address, local IP address, destination port} in a data set, including both inbound and outbound traffic, only the first event involving this tuple is taken into account. Let \(n_i\) be the number of local hosts with port \(i\) open; the probability of a scanner making a successful connection to port \(i\) is \(1/n_i\) (assuming random scanning). Therefore, a successful connection to port \(i\) is assigned a weight of \(1 - (1/n_i)\), whereas a failed connection attempt to port \(i\) is always assigned a weight of \(1\). We combine the connection state heuristics (i.e., (i), (iii), and (iv)) into one ratio, GTR1, calculated for each remote \(R\) as follows (see Table 4 for notation):

$$\text{GTR1}_{R} = \frac{F_R}{F_R + \sum_{j=1}^{S_R} W_{R}^{j} + O_R}$$
(3)

The closer the \(\text{GTR1}_{R}\) score is to one, the higher the probability that \(R\) is a scanner. Similarly, the closer the score of \(R\) is to zero, the higher the probability that \(R\) is benign. That is, failed inbound connection attempts push \(\text{GTR1}_{R}\) toward the scanner end, while successful inbound connections and outbound connection attempts initiated by local hosts (whether successful or unsuccessful) push \(\text{GTR1}_{R}\) toward the benign end.
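
For concreteness, the following Python sketch shows one way Eq. (3) could be computed from per-connection records. It is an illustration only, not the authors’ implementation; since Table 4 is not reproduced here, the reading of the notation is an assumption based on the surrounding text (\(F_R\) counts failed inbound connection attempts from \(R\), each with weight 1; \(W_R^j\) is the weight \(1 - 1/n_i\) of the \(j\)-th successful inbound connection from \(R\) to port \(i\); \(O_R\) counts connection attempts initiated by local hosts toward \(R\)), and the event layout and names are hypothetical.

from collections import defaultdict

# Illustrative sketch only (not the authors' code): compute GTR1 per remote
# host from first-occurrence connection events, following Eq. (3).
def gtr1_scores(events, open_hosts_per_port):
    """events: iterable of dicts with keys 'remote', 'port', 'inbound' (bool),
    'success' (bool), already reduced to the first event per
    {remote IP, local IP, destination port} tuple.
    open_hosts_per_port[i] = n_i, the number of local hosts with port i open."""
    failed = defaultdict(float)     # F_R: failed inbound attempts, weight 1 each
    success_w = defaultdict(float)  # sum of W_R^j over successful inbound connections
    outbound = defaultdict(float)   # O_R: attempts initiated by local hosts toward R
    for e in events:
        r = e['remote']
        if not e['inbound']:
            outbound[r] += 1.0
        elif e['success']:
            n_i = open_hosts_per_port.get(e['port'], 1)
            success_w[r] += 1.0 - 1.0 / n_i   # weight 1 - 1/n_i
        else:
            failed[r] += 1.0                  # weight 1
    scores = {}
    for r in set(failed) | set(success_w) | set(outbound):
        denom = failed[r] + success_w[r] + outbound[r]
        scores[r] = failed[r] / denom if denom > 0 else 0.0
    return scores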

Table 4 Notation for Eqs. (3) and (4)

Unlike previous approaches, for the data exchange feature [i.e., (ii)] we count the successful inbound connections initiated by a remote that contain fewer than \(k\) outbound packets with only the ACK flag set (we call this count \(X_{R}\)). The number of outbound ACK-only packets in a connection indicates how much traffic local hosts sent to the remote: the higher this number, the more data packets were sent. The higher the ratio of such low-data-exchange connections to all successful inbound connections initiated by the remote, the stronger the evidence of malicious intent. Since the majority of successful TCP connections have many outbound packets with only the ACK flag set, a small threshold, say \(k < 5\), suffices to indicate low data exchange. Note that for failed inbound connection attempts, there will be no outbound packets with only the ACK flag set. We argue that this feature is particularly valuable for identifying fortuitous scanners that have probed some active services, since the data exchange with these services is expected to be minimal. The data exchange score (\(\text{GTR2}\)) is calculated as follows:

$$\text{GTR2}_{R} = \frac{X_{R}}{S_R}$$
(4)

The closer the \(\text{GTR2}_{R}\) score is to one, the higher the probability that \(R\) is a scanner, since this indicates low data exchange in most connection attempts initiated by \(R\). Similarly, the closer the \(\text{GTR2}_{R}\) score is to zero, the higher the probability that \(R\) is benign. The total score of \(R\) is derived by taking the maximum of the scores obtained from the detection features used in the evaluation:

$$\text{GTR}_R = \max(\text{GTR1}_{R}, \text{GTR2}_{R})$$
(5)

Note that since several scan detection heuristics capture different known scanning patterns, the absence of one pattern is not evidence that \(R\) is not a scanner. Therefore, \(\text{GTR}_{R}\) should reflect the scan detection heuristic with the highest value. To evaluate a TOE \(D\), for each \(R\) in an evaluation data set, \(\text{GTR}_{R}\) is compared with \(D_{R}\) as explained in Sect. 4.3. While we argue that computing \(\text{GTR1}_{R}\) and \(\text{GTR2}_{R}\) over a relatively long period of time and/or over a large IP address space suffices to identify the majority of scanners, further detection features can be added to Eq. (5) in a similar way, according to the TOE(s) in question.
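
To round out the illustration, a minimal sketch of Eqs. (4) and (5) follows, again as an assumption-laden example rather than the authors’ implementation: the per-connection counts of outbound ACK-only packets, the threshold \(k\), and the final comparison against a hypothetical detector’s per-remote scores \(D_R\) are all illustrative.

# Illustrative sketch only: Eq. (4), Eq. (5), and a hypothetical comparison
# against a detector's per-remote scores.
def gtr2_scores(ack_only_counts, k=5):
    """ack_only_counts[R] = list with one entry per successful inbound connection
    initiated by R (so its length is S_R), each entry being the number of outbound
    packets with only the ACK flag set; fewer than k such packets = low data exchange."""
    scores = {}
    for r, counts in ack_only_counts.items():
        if counts:
            x_r = sum(1 for c in counts if c < k)   # X_R
            scores[r] = x_r / len(counts)           # Eq. (4)
    return scores

def gtr_scores(gtr1, gtr2):
    """Eq. (5): keep the highest heuristic score for each remote."""
    return {r: max(gtr1.get(r, 0.0), gtr2.get(r, 0.0)) for r in set(gtr1) | set(gtr2)}

# Hypothetical usage: flag remotes where a detector's score D_R disagrees
# strongly with the GTR score (the 0.5 threshold is arbitrary, for illustration).
def disagreements(gtr, detector_scores, delta=0.5):
    return {r for r, s in gtr.items() if abs(s - detector_scores.get(r, 0.0)) > delta}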

Cite this article

Alsaleh, M., van Oorschot, P.C. Evaluation in the absence of absolute ground truth: toward reliable evaluation methodology for scan detectors. Int. J. Inf. Secur. 12, 97–110 (2013). https://doi.org/10.1007/s10207-012-0178-1
