Evolving meta-ensemble of classifiers for handling incomplete and unbalanced datasets in the cyber security domain

doi:10.1016/j.asoc.2016.05.044

Applied Soft Computing

Volume 47, October 2016, Pages 179-190

https://doi.org/10.1016/j.asoc.2016.05.044

Abstract

Cyber security classification algorithms usually operate on datasets with many missing features and strongly unbalanced classes. To cope with these issues, we designed a distributed genetic programming (GP) framework, named CAGE-MetaCombiner, which adopts a meta-ensemble model to operate efficiently with missing data. Each ensemble evolves a function for combining the classifiers, which does not need any extra phase of training on the original data. Therefore, when the data change, the function can be recomputed incrementally with moderate computational effort; this aspect, together with the advantages of running on parallel/distributed architectures, makes the algorithm suitable for operating under the real-time constraints typical of a cyber security problem. In addition, an important cyber security problem, concerning the classification of the users or the employers of an e-payment system, is illustrated in order to show the relevance of the case in which entire sources of data or groups of features are missing. Finally, the capacity of the approach to handle groups of missing features and unbalanced datasets is validated on many artificial datasets and on two real datasets, and it is compared with some similar approaches.

Introduction

In the last few years, as a consequence of our interconnected society, interest in cyber security problems has grown steadily, and cyber crime seriously threatens national governments and the economy of many industries [1]. Indeed, computer and network technologies have intrinsic security vulnerabilities, e.g., protocol and operating system weaknesses. Therefore, potential threats and the related vulnerabilities need to be identified and addressed to minimize the risks. In addition, computer network activities, human actions, etc. generate large amounts of data, and this aspect must be taken seriously into account.

Data mining techniques can be used to fight cybercriminals efficiently, to alleviate the effect of their actions, or to prevent them, especially in the presence of large datasets. In particular, classification can be used efficiently for many cyber security applications, e.g., classification of user behavior, risk and attack analysis, and intrusion detection systems. However, in this particular domain, datasets often have different numbers of features, and each attribute can have a different importance and cost. Furthermore, the entire system must also work if some features are missing and/or the classes are unbalanced. Therefore, it is very unlikely that a single classification algorithm performs well on all the datasets, especially in the presence of changes and under real-time and scalability constraints.

In the ensemble learning paradigm [2], [3], multiple classification models are trained by a predictive algorithm, and then their predictions are combined to classify new tuples. This paradigm presents a number of advantages over using a single model: it reduces the variance of the error, the bias, and the dependence on a single dataset, and it works well in the case of unbalanced classes; furthermore, the ensemble can be built in an incremental way and can be easily implemented in a distributed environment. If we consider a stream of data, the ensemble needs to be re-trained to take into account changes in the data. This process can be computationally expensive, especially if it is necessary to retrain the models or to generate new models on the new data.

Therefore, in order to classify large datasets in the field of cyber security, which usually present the above-cited issues of unbalanced classes and missing features, a new framework, named CAGE-MetaCombiner, is proposed. The framework extends a well-known implementation of distributed GP (the CellulAr GEnetic programming (CAGE) environment) and adopts a meta-ensemble model in order to cope with missing data, while the GP system, which evolves the combiner function of the ensemble, makes it possible to handle unbalanced classes thanks to a weighted fitness function. In practice, an ensemble is built for each group of likely missing features, as explained in the following, and the different ensembles perform a weighted vote in order to decide the correct class. Each ensemble evolves a function for combining the classifiers, which can be trained on only a portion of the training set and does not need any extra phase of training on the original data. In fact, when the data change, the function can be recomputed incrementally with moderate computational effort. In addition, all the phases of the algorithm are distributed and can exploit the advantages of running on parallel/distributed architectures to cope with real-time constraints.
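
The weighted vote among ensembles described above can be illustrated with a minimal Python sketch. It is not the CAGE-MetaCombiner implementation: the function name `meta_ensemble_vote`, the representation of each ensemble as a triple (required features, weight, predictor), and the rule that an ensemble abstains when any of its required features is missing are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Tuple

# Hypothetical types: a record maps feature names to values; None marks a missing value.
Record = Dict[str, Optional[float]]
# Hypothetical ensemble description: (features it needs, voting weight, predictor).
Ensemble = Tuple[Sequence[str], float, Callable[[Record], Hashable]]


def meta_ensemble_vote(record: Record, ensembles: Sequence[Ensemble]) -> Optional[Hashable]:
    """Weighted vote among the ensembles whose required features are all present.

    Returns the class with the largest total weight, or None if every
    ensemble has to abstain because of missing features.
    """
    scores: Dict[Hashable, float] = {}
    for features, weight, predict in ensembles:
        # Assumption: an ensemble abstains if any feature it relies on is missing.
        if any(record.get(name) is None for name in features):
            continue
        label = predict(record)
        scores[label] = scores.get(label, 0.0) + weight
    if not scores:
        return None
    return max(scores, key=lambda label: scores[label])


if __name__ == "__main__":
    # Toy predictors standing in for evolved combiner functions.
    toy_ensembles = [
        (("a", "b"), 0.6, lambda r: 1 if r["a"] + r["b"] > 1.0 else 0),
        (("c",), 0.3, lambda r: 1 if r["c"] > 0.5 else 0),
        (("a",), 0.2, lambda r: 0),
    ]
    # Feature "b" is missing, so only the second and third ensembles vote.
    print(meta_ensemble_vote({"a": 0.9, "b": None, "c": 0.8}, toy_ensembles))
```

In this sketch, a record with a missing group of features is still classified as long as at least one ensemble does not depend on that group, which is the property the meta-ensemble model is meant to provide.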

The rest of the paper is structured as follows: Section 2 presents some related work; in Section 3, a real scenario in the field of cyber security is illustrated; Section 4 is devoted to some background information concerning the problem of missing data and incomplete datasets and the ensemble of classifiers; in Section 5, the framework and its software architecture are illustrated; Section 6 shows a number of experiments conducted to verify the effectiveness of the approach and to compare it with other similar approaches; finally, Section 7 concludes the work.

Section snippets

Related works

Evolutionary algorithms have been used mainly to evolve and select the base classifiers composing the ensemble [5], [6] or to combine the ensemble by means of time-expensive algorithms [7]; however, only a limited number of papers concern the evolution of the combining function of the ensemble by using GP.

In the following, we analyze two groups of approaches. The first group comprises GP-based ensembles used to evolve the combination function. Most of the analyzed approaches employ a high number

A real scenario: classification of user profiles in e-payment systems

The inspiration for the approach taken in this paper comes from a project on cyber security for e-payment systems, in which one of the main tasks consists in dividing the users of an e-payment system into homogeneous groups on the basis of their weaknesses or vulnerabilities from the cyber security point of view. In this way, the provider of an e-payment system can conduct a different information and prevention campaign for each class of users, with obvious advantages in terms of time and cost

Background

In this section, we give some background information useful for understanding our approach: the main methods to cope with missing data and incomplete datasets, a general schema for combining an ensemble of classifiers, and the concept of “non-trainable functions”, which can be used to combine an ensemble of classifiers without the need for a further phase of training.
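
To make the idea of non-trainable combining functions concrete, the following Python sketch shows three common examples (majority vote, average, and maximum) applied to the outputs of already-trained classifiers. Which functions the paper actually uses is not stated in this excerpt; the function names and the input layout (one class-probability vector per classifier) are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the label predicted by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]


def combine_average(probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors of the classifiers, class by class."""
    return [sum(column) / len(probs) for column in zip(*probs)]


def combine_max(probs: Sequence[Sequence[float]]) -> List[float]:
    """Take, for each class, the highest probability assigned by any classifier."""
    return [max(column) for column in zip(*probs)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    # Three hypothetical classifiers, two classes each.
    outputs = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
    print(majority_vote([argmax(p) for p in outputs]))
    print(argmax(combine_average(outputs)))
    print(argmax(combine_max(outputs)))
```

None of these functions has parameters to fit, so they can be applied, or recomposed into larger expressions, without another pass over the original training data.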

A distributed tool for evolving combining functions

In this section, we illustrate the software architecture and detail the pseudo-code of the meta-ensemble approach; then, we show how the distributed GP framework used to evolve the combining function of the ensemble works, including the nodes, the terminals and the fitness function employed.
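
The exact fitness function is not given in this excerpt. As a hedged illustration of how a weighted fitness can counter class imbalance, the sketch below computes a class-balanced error in which every class contributes equally regardless of its size; the name `weighted_fitness` and this particular weighting are assumptions, not the paper's definition.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_fitness(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Class-balanced error in [0, 1]; lower is better.

    Assumption: each class's error rate is weighted equally, so mistakes on a
    rare class cost as much as mistakes on a frequent one.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    counts = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return sum(errors[c] / counts[c] for c in counts) / len(counts)


if __name__ == "__main__":
    # Nine tuples of class 0 and one of class 1; the predictor always answers 0.
    truth = [0] * 9 + [1]
    always_zero = [0] * 10
    print(weighted_fitness(truth, always_zero))
```

With plain accuracy, a combiner that always predicts the majority class in this toy example would look 90% correct; under the balanced error it scores no better than chance, which is the kind of pressure a weighted fitness puts on the evolved function.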

Experimental results

In this section, the experiments conducted to analyze the capacity of our approach to cope with unbalanced datasets and to handle missing features are described, together with the main parameters and the main characteristics of the datasets used. In addition to a number of well-known benchmark datasets, two real and hard datasets were used to validate the approach: the Unix dataset and KDD 99. The first was used to test the performance of the algorithm for the case of missing features, while the

Conclusions and future work

A meta-ensemble-based GP framework for classifying datasets in the cyber security domain is presented, together with a real scenario concerning the segmentation of the users of an e-payment system, which illustrates the real applicability of the approach. The GP system is used to evolve the combiner function of the ensemble and makes it possible to handle unbalanced classes thanks to a weighted fitness function, while the ensembles are specialized to handle the different groups of likely missing features.

Acknowledgment

This work has been partially supported by MIUR-PON under project PON03PE_00032_2 within the framework of the Technological District on Cyber Security.

References (27)

R. Polikar et al., Learn++.MF: a random subspace approach for the missing feature problem, Pattern Recognit. (2010)
Y. Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
CERT Australia, Cyber Crime and Security Survey Report, Tech. Rep. (2012)
L. Breiman, Bagging predictors, Mach. Learn. (1996)
Y. Freund et al., Experiments with a new boosting algorithm
G. Folino et al., A scalable cellular implementation of parallel genetic programming, IEEE Trans. Evol. Comput. (2003)
D.F. de Oliveira et al., Use of multi-objective genetic algorithms to investigate the diversity/accuracy dilemma in heterogeneous ensembles
G. Folino et al., Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification, IEEE Trans. Evol. Comput. (2008)
C.D. Stefano et al., Using Bayesian networks for selecting classifiers in GP ensembles, Inf. Sci. (2014)
J. Sylvester et al., Evolutionary ensembles: combining learning agents using genetic algorithms
N. Chawla et al., Exploiting diversity in ensembles: improving the performance on unbalanced datasets
J. Sylvester et al., Evolutionary ensemble creation and thinning
N. Acosta-Mendoza et al., Learning to assemble classifiers via genetic programming, IJPRAI (2014)

