Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12918)

Abstract

Intrusion detection systems (IDS) play an important role in security monitoring by identifying anomalous or suspicious activities. Traditional IDS are either signature-based (rule-based) or anomaly-based (analytics-based). Because they aim to detect zero-day attacks, analytics-based IDS have attracted great interest from the cybersecurity community, and machine learning (ML) techniques have been extensively explored to advance them. Many ML techniques have been studied to improve the efficiency of intrusion detection, and some have shown good performance. However, traditional supervised learning algorithms need strong supervision information, that is, fully correctly labeled (FCL) data, to train an accurate model. Meanwhile, with the rapid development of network and communication technologies, the volume of network traffic and system logs has increased drastically in recent years, especially with the introduction of Next Generation Broadband Network (NGBN) and 5G networks. This puts huge pressure on analytics-based IDS: for ML to train predictive models, security-relevant data must be labeled manually, which creates a practical barrier to achieving effective IDS. To avoid depending heavily on strong supervision information, cybersecurity researchers have studied weakly supervised learning techniques, which use incomplete, inexact, or possibly inaccurate labels, because such weak supervision information is easier and cheaper to obtain than FCL data. This research explores the feasibility of weakly supervised learning techniques in IDS tasks so as to reduce the reliance on massive amounts of strong supervision information, a demand that will only continue to grow in the big data society.
In this article, we propose an IDS model training scheme based on a weakly supervised learning algorithm that requires only unlabeled data. Experiments were performed on three publicly available IDS evaluation datasets. The results show that the proposed scheme performs well and even outperforms some supervised learning-based IDS (SL-IDS) models. We also investigated the detection stability of the proposed scheme when inaccurate weak supervision information is provided. The experimental results indicate that the weakly supervised learning-based IDS model is robust and can be applied in real-world situations. In addition, we examined the detection performance of the proposed method on class-imbalanced training data, and the results show that it performs better than the compared methods.

Acknowledgments

This research is supported by the Cyber Security Agency of Singapore (CSA), under its repertoire of initiatives leveraging on research institutes and think-tanks to contribute to the international community “towards a secure and trusted IoT ecosystem”.

Author information

Corresponding authors

Correspondence to Wenzhuo Yang or Kwok-Yan Lam.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, W., Lam, KY. (2021). Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques. In: Gao, D., Li, Q., Guan, X., Liao, X. (eds) Information and Communications Security. ICICS 2021. Lecture Notes in Computer Science, vol 12918. Springer, Cham. https://doi.org/10.1007/978-3-030-86890-1_23

  • DOI: https://doi.org/10.1007/978-3-030-86890-1_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86889-5

  • Online ISBN: 978-3-030-86890-1

  • eBook Packages: Computer Science, Computer Science (R0)
