Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques

Yang, Wenzhuo; Lam, Kwok-Yan

doi:10.1007/978-3-030-86890-1_23


Wenzhuo Yang¹² &
Kwok-Yan Lam¹²

Conference paper
First Online: 17 September 2021


Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12918)

Abstract

Intrusion detection systems (IDS) play an important role in security monitoring by identifying anomalous or suspicious activities. Traditional IDS are either signature-based (rule-based) or anomaly-based (analytics-based). Because they aim to detect zero-day attacks, analytics-based IDS have attracted great interest from the cybersecurity community, and machine learning (ML) techniques have been extensively explored to advance them. Many ML techniques have been studied to improve the efficiency of intrusion detection, and some have shown good performance. However, traditional supervised learning algorithms need strong supervision information, namely fully correctly labeled (FCL) data, to train an accurate model. Meanwhile, with the rapid development of network and communication technologies, the volume of network traffic and system logs has increased drastically in recent years, especially with the introduction of Next Generation Broadband Network (NGBN) and 5G networks. This puts huge pressure on analytics-based IDS because, for ML to train predictive models, security-relevant data need to be labeled manually, which creates practical barriers to achieving effective IDS. To avoid being overly dependent on strong supervision information, cybersecurity researchers have studied weakly supervised learning techniques, which utilize incomplete, inexact, or possibly inaccurate labels, because such weak supervision information is easier and cheaper to obtain than FCL data. This research explores the feasibility of weakly supervised learning techniques in IDS tasks so as to reduce the reliance on massive amounts of strong supervision information, an amount that will only continue to grow in the big data society. We also investigate the detection stability of the proposed scheme when inaccurate weak supervision information is provided.
In this article, we propose an IDS model training scheme based on a weakly supervised learning algorithm that requires only unlabeled data. Experiments were performed on three publicly available IDS evaluation datasets. The results show that the proposed scheme performs well, even outperforming some supervised learning-based IDS (SL-IDS) models. They also indicate that the weakly supervised learning-based IDS model is robust and can be applied in real-world situations. In addition, we examined the detection performance of the proposed method on class-imbalanced training data, and the experimental results show that it performs better than the compared methods.
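
The abstract does not spell out which weakly supervised algorithm is used, so the following is only an illustrative sketch of one known way to train a binary classifier from unlabeled data alone: risk estimation from two unlabeled sets that contain different fractions of anomalies. Everything here is an assumption made for illustration, including the function names (uu_risk, supervised_risk), the sigmoid loss, the toy one-dimensional scorer, and the premise that the anomaly fractions theta_a and theta_b and the test-time prior are known. It should not be read as the scheme proposed in the paper.

```python
import math


def sigmoid_loss(margin):
    """Sigmoid loss l(z) = 1 / (1 + exp(z)); small when the margin z is large."""
    return 1.0 / (1.0 + math.exp(margin))


def mean(values):
    return sum(values) / len(values)


def uu_risk(score, set_a, set_b, theta_a, theta_b, prior):
    """Estimate the classification risk of `score` from two unlabeled sets.

    set_a, set_b: unlabeled samples whose anomaly fractions are assumed to be
    theta_a and theta_b (hypothetical, assumed known), with theta_a > theta_b.
    prior: assumed fraction of anomalies at test time.
    """
    if not theta_a > theta_b:
        raise ValueError("theta_a must be larger than theta_b")
    gap = theta_a - theta_b
    # Average loss on each set if every sample were an anomaly (+1) ...
    pos_a = mean([sigmoid_loss(score(x)) for x in set_a])
    pos_b = mean([sigmoid_loss(score(x)) for x in set_b])
    # ... and if every sample were normal (-1).
    neg_a = mean([sigmoid_loss(-score(x)) for x in set_a])
    neg_b = mean([sigmoid_loss(-score(x)) for x in set_b])
    # Each set is a mixture of the two classes; inverting the two mixtures
    # recovers the expected loss on anomalies and on normal samples.
    risk_on_anomalies = ((1 - theta_b) * pos_a - (1 - theta_a) * pos_b) / gap
    risk_on_normals = (theta_a * neg_b - theta_b * neg_a) / gap
    return prior * risk_on_anomalies + (1 - prior) * risk_on_normals


def supervised_risk(score, anomalies, normals, prior):
    """Reference risk computed with true labels, used only for comparison."""
    risk_on_anomalies = mean([sigmoid_loss(score(x)) for x in anomalies])
    risk_on_normals = mean([sigmoid_loss(-score(x)) for x in normals])
    return prior * risk_on_anomalies + (1 - prior) * risk_on_normals


# Toy data: one feature per sample; higher values look more anomalous.
anomalies = [2.0, 3.0]
normals = [-1.0, 0.5]
set_a = anomalies * 3 + normals      # 75% anomalies, labels discarded
set_b = anomalies + normals * 3      # 25% anomalies, labels discarded


def toy_score(x):
    return x - 1.0                   # hypothetical linear scorer


risk_from_unlabeled = uu_risk(toy_score, set_a, set_b, 0.75, 0.25, prior=0.1)
risk_from_labels = supervised_risk(toy_score, anomalies, normals, prior=0.1)
```

In this toy setup the two unlabeled sets contain exactly the stated fractions of anomalies, so the estimate from unlabeled data matches the labeled risk up to floating-point error. With real traffic the fractions would have to be estimated, and the match would hold only in expectation.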



Acknowledgments

This research is supported by the Cyber Security Agency of Singapore (CSA), under its repertoire of initiatives leveraging on research institutes and think-tanks to contribute to the international community “towards a secure and trusted IoT ecosystem”.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanyang Technological University, Singapore, Republic of Singapore
Wenzhuo Yang & Kwok-Yan Lam

Authors

Wenzhuo Yang
Kwok-Yan Lam

Corresponding authors

Correspondence to Wenzhuo Yang or Kwok-Yan Lam.

Editor information

Editors and Affiliations

Singapore Management University, Singapore, Singapore
Debin Gao
Tsinghua University, Beijing, China
Qi Li
Xi'an Jiaotong University, Xi'an, China
Xiaohong Guan
Chongqing University, Chongqing, China
Xiaofeng Liao


About this paper

Cite this paper

Yang, W., Lam, KY. (2021). Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques. In: Gao, D., Li, Q., Guan, X., Liao, X. (eds) Information and Communications Security. ICICS 2021. Lecture Notes in Computer Science, vol 12918. Springer, Cham. https://doi.org/10.1007/978-3-030-86890-1_23


DOI: https://doi.org/10.1007/978-3-030-86890-1_23
Published: 17 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86889-5
Online ISBN: 978-3-030-86890-1
eBook Packages: Computer Science, Computer Science (R0)
