Elsevier

Applied Soft Computing

Volume 47, October 2016, Pages 179-190

Evolving meta-ensemble of classifiers for handling incomplete and unbalanced datasets in the cyber security domain

https://doi.org/10.1016/j.asoc.2016.05.044

Abstract

Cyber security classification algorithms usually operate on datasets with many missing features and strongly unbalanced classes. To cope with these issues, we designed a distributed genetic programming (GP) framework, named CAGE-MetaCombiner, which adopts a meta-ensemble model to operate efficiently with missing data. Each ensemble evolves a function for combining the classifiers, which does not need any extra phase of training on the original data. Therefore, in the case of changes in the data, the function can be recomputed in an incremental way, with a moderate computational effort; this aspect, together with the advantages of running on parallel/distributed architectures, makes the algorithm suitable for operating under the real time constraints typical of a cyber security problem. In addition, an important cyber security problem that concerns the classification of the users or the employers of an e-payment system is illustrated, in order to show the relevance of the case in which entire sources of data or groups of features are missing. Finally, the capacity of the approach to handle groups of missing features and unbalanced datasets is validated on many artificial datasets and on two real datasets, and it is compared with some similar approaches.

Introduction

In the last few years, as a consequence of our interconnected society, the interest in cyber security problems has been increasing, and cyber crime seriously threatens national governments and the economy of many industries [1]. Indeed, computer and network technologies have intrinsic security vulnerabilities, e.g., protocol and operating system weaknesses. Therefore, potential threats and the related vulnerabilities need to be identified and addressed to minimize the risks. In addition, computer network activities, human actions, etc. generate large amounts of data, and this aspect must be seriously taken into account.

Data mining techniques could be used to fight cybercriminals efficiently, to alleviate the effect of their actions or to prevent them, especially in the presence of large datasets. In particular, classification can be used efficiently for many cyber security applications, e.g., classification of user behavior, risk and attack analysis, and intrusion detection systems. However, in this particular domain, datasets often have different numbers of features, and each attribute could have a different importance and cost. Furthermore, the entire system must also work if some features are missing and/or the classes are unbalanced. Therefore, a single classification algorithm performing well on all the datasets would be really unlikely, especially in the presence of changes and with constraints of real time and scalability.

In the ensemble learning paradigm [2], [3], multiple classification models are trained by a predictive algorithm, and then their predictions are combined to classify new tuples. This paradigm presents a number of advantages with regard to using a single model, e.g., it reduces the variance of the error, the bias, and the dependence on a single dataset, and it works well in the case of unbalanced classes; furthermore, the ensemble can be built in an incremental way and can be easily implemented in a distributed environment. If we consider a stream of data, the ensemble needs to be re-trained to take into account changes in the data. This process could be computationally expensive, especially if it is necessary to retrain the models or to generate new models on the new data.
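As a minimal sketch of the "train several models, then combine their predictions" idea described above, the following Python fragment combines three base classifiers by plain majority vote. The classifiers, the labels `"attack"`/`"normal"` and the thresholds are hypothetical placeholders chosen for illustration; majority voting is only one simple combiner and is not claimed to be the one used in the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, x):
    """Ask every base classifier for a label and combine the answers."""
    return majority_vote([clf(x) for clf in classifiers])

# Three hypothetical base classifiers (simple threshold rules on a 2-feature tuple).
classifiers = [
    lambda x: "attack" if x[0] > 0.5 else "normal",
    lambda x: "attack" if x[1] > 0.5 else "normal",
    lambda x: "attack" if x[0] + x[1] > 1.2 else "normal",
]

label = ensemble_predict(classifiers, (0.9, 0.2))  # two of three vote "normal"
```

Because the combiner only sees the base classifiers' outputs, a new base model can be appended to `classifiers` without touching the others, which is the incremental property mentioned in the text.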

Therefore, in order to classify large datasets in the field of cyber security, which usually have the above-cited issues of unbalanced classes and missing features, a new framework, named CAGE-MetaCombiner, is proposed. The framework extends a well-known implementation of distributed GP (the CellulAr GEnetic programming (CAGE) environment) and adopts a meta-ensemble model in order to cope with missing data, while the GP system, which evolves the combiner function of the ensemble, makes it possible to handle unbalanced classes thanks to a weighted fitness function. In practice, an ensemble is built for each group of likely missing features, as explained in the following, and the different ensembles perform a weighted vote in order to decide the correct class. Each ensemble evolves a function for combining the classifiers, which can be trained on only a portion of the training set and does not need any extra phase of training on the original data. In fact, in the case of changes in the data, the function can be recomputed in an incremental way, with a moderate computational effort. In addition, all the phases of the algorithm are distributed and can exploit the advantages of running on parallel/distributed architectures to cope with real time constraints.
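The weighted vote among ensembles can be sketched as follows. This is an illustrative reading only: it assumes that each ensemble is tied to a set of features it requires and simply abstains when any of them is missing, which the text above does not state explicitly. The feature names, weights and decision rules are all hypothetical.

```python
from collections import defaultdict

def meta_ensemble_vote(ensembles, sample):
    """Weighted vote over ensembles.

    ensembles: list of (required_features, weight, predict_fn) tuples.
    sample: dict mapping feature name -> value, with None (or an absent key)
    marking a missing feature. Returns the winning label, or None if every
    ensemble abstains.
    """
    scores = defaultdict(float)
    for required, weight, predict in ensembles:
        # Assumption: an ensemble abstains if any feature it relies on is missing.
        if any(sample.get(f) is None for f in required):
            continue
        scores[predict(sample)] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)

# Hypothetical ensembles for an e-payment user profile.
ensembles = [
    ({"logins", "amount"}, 0.5, lambda s: "risky" if s["amount"] > 1000 else "safe"),
    ({"logins"}, 0.3, lambda s: "risky" if s["logins"] > 20 else "safe"),
    ({"device"}, 0.4, lambda s: "risky" if s["device"] == "unknown" else "safe"),
]

complete = {"logins": 30, "amount": 5000, "device": "known"}
incomplete = {"logins": 30, "amount": None, "device": "known"}
```

With the complete sample all three ensembles vote; with `amount` missing the first ensemble drops out and the remaining two decide, so the system still produces a class when a whole group of features is unavailable.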

The rest of the paper is structured as follows: Section 2 presents some related work; in Section 3, a real scenario in the field of cyber security is illustrated; Section 4 is devoted to some background information concerning the problem of missing data and incomplete datasets and the ensemble of classifiers; in Section 5, the framework and its software architecture are illustrated; Section 6 shows a number of experiments conducted to verify the effectiveness of the approach and to compare it with other similar approaches; finally, Section 7 concludes the work.

Section snippets

Related works

Evolutionary algorithms have been used mainly to evolve and select the base classifiers composing the ensemble [5], [6] or, by adopting some time-expensive algorithms, to combine the ensemble [7]; however, only a limited number of papers concern the evolution of the combining function of the ensemble by using GP.

In the following, we analyze two groups of approaches. The first group comprises GP-based ensembles used to evolve the combination function. Most of the analyzed approaches employ a high number

A real scenario: classification of user profiles in e-payment systems

The inspiration for the approach taken in this paper comes from a project on cyber security for e-payment systems, in which one of the main tasks consists in dividing the users of an e-payment system into homogeneous groups on the basis of their weaknesses or vulnerabilities from the cyber security point of view. In this way, the provider of an e-payment system can conduct a different information and prevention campaign for each class of users, with obvious advantages in terms of time and cost

Background

In this section, we give some background information useful for understanding our approach: the main methods for coping with missing data and incomplete datasets, a general schema for combining an ensemble of classifiers, and the concept of “non-trainable functions”, which can be used to combine an ensemble of classifiers without the need for a further phase of training.
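To make the idea of a non-trainable combining function concrete, the sketch below applies fixed rules (average, median, maximum) to the per-class scores returned by the base classifiers. The score vectors are made-up numbers, and the set of rules shown is a common textbook selection, not necessarily the set used in the paper.

```python
from statistics import mean, median

def combine(prob_vectors, rule):
    """Apply `rule` per class across classifiers.

    prob_vectors: one list of per-class scores for each base classifier.
    rule: a function from a list of floats to a float (e.g. mean, median, max).
    Returns (index of the winning class, combined per-class scores).
    """
    scores = [rule(list(column)) for column in zip(*prob_vectors)]
    return scores.index(max(scores)), scores

# Hypothetical outputs of three classifiers on a two-class problem.
outputs = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.05, 0.95],
]

winner_by_median, _ = combine(outputs, median)  # class 0
winner_by_mean, _ = combine(outputs, mean)      # class 1
winner_by_max, _ = combine(outputs, max)        # class 1
```

None of these rules has parameters to fit, so they can be applied or swapped without revisiting the training data; the example also shows that different rules may disagree on the same inputs, which is what leaves room for searching over combinations of them.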

A distributed tool for evolving combining functions

In this section, we illustrate the software architecture and detail the pseudo-code of the meta-ensemble approach; then, we show how the distributed GP framework used to evolve the combining function of the ensemble works, including the nodes, the terminals and the fitness function employed.
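The exact fitness function is not given in this excerpt. As a rough illustration of how a weighted fitness can counter class imbalance, the sketch below weights every misclassification by the inverse frequency of its true class, so that errors on a rare class cost as much in total as errors on a frequent one. The weighting scheme, the function name and the labels are assumptions made for this example.

```python
from collections import Counter

def weighted_fitness(y_true, y_pred):
    """Class-weighted error in [0, 1]; lower is better.

    Each mistake on class c costs 1 / (number_of_classes * count(c)),
    so every class contributes at most 1 / number_of_classes to the total.
    """
    counts = Counter(y_true)
    n_classes = len(counts)
    error = 0.0
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            error += 1.0 / (n_classes * counts[truth])
    return error

# Hypothetical unbalanced sample: 8 normal tuples, 2 attacks.
y_true = ["normal"] * 8 + ["attack"] * 2
always_normal = ["normal"] * 10

score = weighted_fitness(y_true, always_normal)
```

A combiner that always answers "normal" has a plain error rate of 0.2 on this sample, but its weighted error is 0.5, so an evolutionary search guided by this measure is not rewarded for ignoring the minority class.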

Experimental results

In this section, the experiments conducted to analyze the capacity of our approach to cope with unbalanced datasets and to handle missing features are described, together with the main parameters and the main characteristics of the datasets used. In addition to a number of well-known benchmark datasets, two real and hard datasets were used to validate the approach: the Unix dataset and KDD 99. The first was used to test the performance of the algorithm for the case of missing features, while the

Conclusions and future work

A meta-ensemble-based GP framework for classifying datasets in the cyber security domain and a real scenario concerning the segmentation of the users of an e-payment system, which illustrates the real applicability of the approach, are presented. The GP system is used to evolve the combiner function of the ensemble and makes it possible to handle unbalanced classes thanks to a weighted fitness function, while the ensembles are specialized to handle the different groups of likely missing features.

Acknowledgment

This work has been partially supported by MIUR-PON under project PON03PE_00032_2 within the framework of the Technological District on Cyber Security.

References (27)

  • R. Polikar et al., Learn++.MF: a random subspace approach for the missing feature problem, Pattern Recognit. (2010)
  • Y. Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
  • CERT Australia, Cyber Crime and Security Survey Report, Tech. Rep. (2012)
  • L. Breiman, Bagging predictors, Mach. Learn. (1996)
  • Y. Freund et al., Experiments with a new boosting algorithm
  • G. Folino et al., A scalable cellular implementation of parallel genetic programming, IEEE Trans. Evol. Comput. (2003)
  • D.F. de Oliveira et al., Use of multi-objective genetic algorithms to investigate the diversity/accuracy dilemma in heterogeneous ensembles
  • G. Folino et al., Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification, IEEE Trans. Evol. Comput. (2008)
  • C.D. Stefano et al., Using Bayesian networks for selecting classifiers in GP ensembles, Inf. Sci. (2014)
  • J. Sylvester et al., Evolutionary ensembles: combining learning agents using genetic algorithms
  • N. Chawla et al., Exploiting diversity in ensembles: improving the performance on unbalanced datasets
  • J. Sylvester et al., Evolutionary ensemble creation and thinning
  • N. Acosta-Mendoza et al., Learning to assemble classifiers via genetic programming, IJPRAI (2014)