Abstract
Multiple Instance Learning (MIL) aims at extracting patterns from a collection of samples, where individual samples (called bags) are represented by a group of multiple feature vectors (called instances) instead of a single feature vector. Grouping instances into bags not only helps to formulate some learning problems more naturally, it also significantly reduces label acquisition costs as only the labels for bags are needed, not for the inner instances. However, in application domains where inference transparency is demanded, such as in network security, the sample attribution requirements are often asymmetric with respect to the training/application phase. While in the training phase it is very convenient to supply labels only for bags, in the application phase it is generally not enough to just provide decisions on the bag-level because the inferred verdicts need to be explained on the level of individual instances. Unfortunately, the majority of recent MIL classifiers does not focus on this real-world need. In this paper, we address this problem and propose a new tree-based MIL classifier able to identify instances responsible for positive bag predictions. Results from an empirical evaluation on a large-scale network security dataset also show that the classifier achieves superior performance when compared with prior art methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
For example, a seemingly legitimate request to google.com might be in reality related to malicious activity when it is issued by malware checking Internet connection. Similarly, requesting ad servers in low volumes is considered as a legitimate behavior, but higher numbers might indicate Click-fraud infection.
- 2.
Term extremely in Extremely Randomized Trees [11] corresponds to setting \(T=1\).
- 3.
We used implementation from https://github.com/komartom/BLRT.jl.
- 4.
We used implementation from https://github.com/CTUAvastLab/Mill.jl.
- 5.
MI-SVM is trained with Algorithm 1 for complete feature space (\(\mathbf {s}\) is vector of ones).
- 6.
36 virtual Intel Xeon CPUs @ 2.9 GHz and 60 Gb of memory.
- 7.
- 8.
While precision answers to the question: “With how big percentage of false alarms the network administrators will have to deal with?”, false positive rate gives answer to: “How big percentage of clean users will be bothered?”.
- 9.
This way of identifying malicious communications is not so effective in production, since new threats are not on the deny list yet and need to be first discovered.
- 10.
Datasets are accessible at https://doi.org/10.6084/m9.figshare.6633983.v1.
- 11.
AUC is agnostic to class imbalance and classifier’s decision threshold value.
- 12.
The best model is assigned the lowest rank (i.e. one).
- 13.
The performance of any two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference, which is (for 12 datasets, four methods and \(\alpha =0.05\)) approximately 1.35.
- 14.
Source codes are available at https://github.com/komartom/ISRT.jl.
References
Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif. Intell. 201, 81–105 (2013). http://dx.doi.org/10.1016/j.artint.2013.06.003
Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp. 577–584. NIPS 2002. MIT Press, Cambridge, MA, USA (2002). http://dl.acm.org/citation.cfm?id=2968618.2968690
Brabec, J., Komárek, T., Franc, V., Machlica, L.: On model evaluation under non-constant class imbalance. In: Krzhizhanovskaya, V.V., et al. (eds.) ICCS 2020. LNCS, vol. 12140, pp. 74–87. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50423-6_6
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). http://dx.doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Carbonneau, M.A., Cheplygina, V., Granger, E., Gagnon, G.: Multiple instance learning: a survey of problem characteristics and applications. Patt. Recogn. 77, 329–353 (2018). https://www.sciencedirect.com/science/article/pii/S0031320317304065
Cheplygina, V., Tax, D.M.J.: Characterizing multiple instance datasets. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 15–27. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_2
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006). http://dl.acm.org/citation.cfm?id=1248547.1248548
Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1), 31–71 (1997). http://www.sciencedirect.com/science/article/pii/S0004370296000343
Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Bifet, A., et al. (eds.) ECML PKDD 2015. LNCS (LNAI), vol. 9286, pp. 85–99. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23461-8_6
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1
Ho, T.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998)
Kohout, J., Komárek, T., Čech, P., Bodnár, J., Lokoč, J.: Learning communication patterns for malware discovery in https data. Expert Syst. Appl. 101, 129–142 (2018). http://www.sciencedirect.com/science/article/pii/S0957417418300794
Komárek, T., Somol, P.: Multiple instance learning with bag-level randomized trees. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds.) ECML PKDD 2018. LNCS (LNAI), vol. 11051, pp. 259–272. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10925-7_16
Li, K., Chen, R., Gu, L., Liu, C., Yin, J.: A method based on statistical characteristics for detection malware requests in network traffic. In: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 527–532 (2018). https://doi.org/10.1109/DSC.2018.00084
Machlica, L., Bartos, K., Sofka, M.: Learning detectors of malicious web requests for intrusion detection in network traffic (2017)
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: 28th USENIX Security Symposium (USENIX Security 19), pp. 729–746. USENIX Association, Santa Clara, CA, August 2019. https://www.usenix.org/conference/usenixsecurity19/presentation/pendlebury
Pevny, T., Somol, P.: Discriminative models for multi-instance problems with tree structure. In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, pp. 83–91. AISec 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2996758.2996761
Pevný, T., Somol, P.: Using neural network formalism to solve multiple-instance problems. In: Cong, F., Leung, A., Wei, Q. (eds.) ISNN 2017. LNCS, vol. 10261, pp. 135–142. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59072-1_17
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). http://dx.doi.org/10.1023/A:1022643204877
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814. ICML 2007. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1273496.1273598
Stiborek, J., Pevný, T., Rehák, M.: Multiple instance learning for malware classification. Expert Syst. Appl. 93, 346–357 (2018). http://www.sciencedirect.com/science/article/pii/S0957417417307170
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Komárek, T., Brabec, J., Somol, P. (2021). Explainable Multiple Instance Learning with Instance Selection Randomized Trees. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science(), vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_44
Download citation
DOI: https://doi.org/10.1007/978-3-030-86520-7_44
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86519-1
Online ISBN: 978-3-030-86520-7
eBook Packages: Computer ScienceComputer Science (R0)