A cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams

https://doi.org/10.1016/j.eswa.2021.115419

Highlights

  • A cellular-based evolutionary fuzzy system for the extraction of emerging patterns in high-speed, massive data streams is proposed.

  • Smart triggering of the learning method which updates the model only when required.

  • A reinitialisation and filtering strategy for reducing the extraction of redundant patterns is also defined.

  • The quality of knowledge extracted outperforms state-of-the-art methods.

  • The proposed method is able to process batches of data up to 750,000 instances.

Abstract

Today, the number of existing devices generates immense amounts of data on a continuous basis that must be processed by new distributed data stream mining approaches. In this paper we present a new approach for extracting descriptive emerging patterns in massive data streams from different sources through Apache Kafka and Apache Spark Streaming, whose objective is to monitor the state of the system with respect to a variable of interest. For this purpose, the proposed algorithm is a cellular-based multi-objective evolutionary fuzzy system that uses an informed strategy for efficient data processing and a re-initialisation and filtering mechanism to eliminate redundant and low-reliability patterns. The experimental study carried out demonstrates an interpretability improvement of 25% in the extraction of high-interest knowledge by the proposed algorithm, which would make it easier for experts to analyse the problem. Finally, the proposed algorithm is up to five times faster than another proposal on the processing of the same amount of data. In this experimental study, up to 750,000 instances have been processed in approximately four seconds.

Introduction

The amount of information generated is growing exponentially nowadays. For example, the experiments at the Large Hadron Collider at CERN could generate about 90 petabytes of data per year and process one petabyte of data a day (CERN, 2021). According to a Cisco report, it is expected that the amount of information generated worldwide by 2021 will be approximately 850 zettabytes (Cisco, 2021). These data are mainly produced by the large amount of information transmitted between devices and the explosion of Internet of Things devices (Sezer et al., 2017, Nord et al., 2019). From this huge amount of data, the information generated but not stored is two orders of magnitude higher than the amount of information finally stored. This means that the majority of the generated data are considered interesting only at their creation. However, they are usually neither stored nor analysed after that. In the worst case, these data are stored, but they are never analysed, creating the so-called data tombs. Nevertheless, these data can contain interesting insights about their application domain. This information could be relevant for companies in order to improve their services and productivity, together with many other applications that require a quick response while computational resources are efficiently employed.

It is undeniable that we live surrounded by data. These huge amounts of data are commonly known as big data. Big data can be characterised by the 5 V’s model (volume, velocity, variety, veracity and value) (Mayer-Schonberger & Cukier, 2013), which describes the massive volume of data, their fast generation, their diverse nature and their usefulness for the experts. In many cases the short life of generated data forces us to process them as soon as they arrive in the system in order to provide a fast, reliable response. These data that continuously arrive in the system at an undetermined speed are known in the literature as a data stream (Gama, 2010). In this scenario, the learning model must be continuously updated and adapted to the incoming data. However, the volume, velocity and variety of data could be so huge nowadays that classical approaches are not suitable to handle them. Thus, a distributed data stream processing approach becomes compulsory. The interest in the analysis of this kind of data is evidenced by the number of distributed, large-scale processing frameworks that have been developed to date for this purpose. In particular, Apache Spark (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010), Kafka (Garg, 2013), Storm (Foundation, 2021), or Flink (Carbone et al., 2015), amongst others, are gaining special attention for both their distributed real-time performance and their fault-tolerant processing within iterative methods such as machine learning algorithms.

In the last few years, the literature on data streams has focused on concept drift analysis (Brzeziński, 2015, Ramírez-Gallego et al., 2017, Khamassi et al., 2018). These works, amongst others, have greatly improved the development of the area. However, many other aspects within data stream mining are not completely developed, so real-world applications are still challenging. For example, it is very difficult to find completely labelled data within high-speed, massive data streams. In these cases, it would be more realistic to develop supervised-learning-based, descriptive models with good interpretability for system monitoring. In this way, supervised learning techniques can be employed in order to monitor the behaviour of data with respect to a naturally-generated property of interest, which can be employed as the class label.

Emerging Pattern Mining (EPM) (Dong and Li, 1999, García-Vico et al., 2018) is a data mining task within the Supervised Descriptive Rule Discovery (SDRD) framework (Kralj-Novak, Lavrac, & Webb, 2009). The aim of this task is the description of the discriminative relationships in data with respect to a property of interest. In particular, it tries to describe the characteristic differences between the values of a property of interest or the description of emerging behaviour in data. In this way, experts can obtain an easy-to-understand pattern model which describes the underlying phenomena in data. Hence, EPM can be useful within data stream mining as the purpose is to monitor the behaviour of the stream using a simple, readable, reliable model. EPM has been successfully applied in many different fields such as disease management (Piao et al., 2009, Park et al., 2010, Tzanis et al., 2011, Poezevara et al., 2017), toxicology (Sherhod et al., 2012, Sherhod et al., 2013), renewable energies (García-Vico, Montes, Aguilera, Carmona, & del Jesus, 2016), management (Li, Law, Vu, Rong, & Zhao, 2015) and social networks (Peng et al., 2018), amongst others. In addition, approaches based on Evolutionary Fuzzy Systems (EFSs) have been recently proposed in García-Vico, Carmona, González, and del Jesus (2018) which surpass the descriptive capacities of the classical methods. Nevertheless, the development of EPM algorithms within massive data stream mining is still challenging. This is mainly due to the computational complexity of the mining methods (Wang, Zhao, Dong, & Li, 2004) and the difficulty of developing fast, distributed strategies. This makes their application to massive data stream environments unfeasible, as these environments require an almost real-time response. In addition, one of the main drawbacks of EPM methods in data stream mining is that they require a finite dataset in order to compute the required quality measures for the extraction of the patterns. A first approach to solve this issue is a multi-objective EFS following a block-based learning approach for data stream mining, proposed in García-Vico, Carmona, González, and del Jesus (2020). Although the quality of the knowledge extracted is good, its learning method is executed continuously in order to adapt to the stream. Moreover, it does not provide any distributed mechanism to efficiently scale up the mining process. Therefore, its application within high-speed, massive data stream environments could be a problem, as many unnecessary executions of the learning method are carried out without any data distribution mechanism.
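
The dependence of EPM quality measures on a finite dataset can be illustrated with the growth rate of Dong and Li (1999), i.e. the ratio between the support of a pattern in the class of interest and its support in the remaining data. The following is only a minimal sketch: the toy batch, the attribute names and the helper functions `covers`, `support` and `growth_rate` are hypothetical, and crisp (non-fuzzy) matching is assumed for simplicity, so it does not reproduce the fuzzy evaluation of the methods discussed above.

```python
from math import inf


def covers(pattern, instance):
    """True if the instance matches every attribute-value pair of the pattern."""
    return all(instance.get(attr) == value for attr, value in pattern.items())


def support(pattern, instances):
    """Fraction of instances covered by the pattern (0.0 for an empty set)."""
    if not instances:
        return 0.0
    return sum(covers(pattern, x) for x in instances) / len(instances)


def growth_rate(pattern, data, target_class):
    """Support in the target class divided by support in the other classes."""
    pos = [x for x in data if x["class"] == target_class]
    neg = [x for x in data if x["class"] != target_class]
    supp_pos, supp_neg = support(pattern, pos), support(pattern, neg)
    if supp_neg == 0.0:
        # Pattern never appears outside the target class.
        return inf if supp_pos > 0.0 else 0.0
    return supp_pos / supp_neg


# Hypothetical finite batch taken from a stream (assumption for illustration).
batch = [
    {"temp": "high", "load": "high", "class": "alarm"},
    {"temp": "high", "load": "low", "class": "alarm"},
    {"temp": "high", "load": "high", "class": "normal"},
    {"temp": "low", "load": "low", "class": "normal"},
    {"temp": "low", "load": "high", "class": "normal"},
    {"temp": "low", "load": "low", "class": "normal"},
]
pattern = {"temp": "high"}
print(growth_rate(pattern, batch, "alarm"))  # 1.0 / 0.25 = 4.0
```

Both supports are counted over a fixed collection of instances, which is why a block-based scheme that accumulates the stream into batches is a natural way to evaluate patterns in this setting.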

In this paper, a Cellular-based Evolutionary approach for the Extraction of Emerging Patterns in Massive Data Streams (CE3P-MDS) is proposed. The main contributions of this paper are as follows:

  1. Learning method inspired by a cellular-based, multi-objective evolutionary algorithm (Nebro, Durillo, Luna, Dorronsoro, & Alba, 2009) which improves the diversity-exploitation trade-off, together with a reinitialisation method based on the odds ratio measure which removes those redundant patterns with the highest complexity.

  2. Smart triggering of the learning method, which updates and adapts the current pattern model with respect to the state of the data stream only when it is necessary, according to the user requirements.

  3. Scalable approach for the processing of massive, high-speed data streams from several sources thanks to the employment of Apache Kafka and Apache Spark.

  4. Comprehensive experimental evaluation of the proposed method.
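
The idea behind the smart triggering in the second contribution can be sketched as follows. This is only an illustration under stated assumptions: the trigger criterion (mean quality of the current patterns on the incoming batch falling below a user-defined threshold), the names `should_relearn`, `process_stream` and `min_quality`, and the toy `evaluate` and `learn` functions are all hypothetical and are not taken from the CE3P-MDS implementation.

```python
from collections import Counter
from typing import Callable, Iterable, List, Sequence, Tuple


def should_relearn(qualities: Sequence[float], min_quality: float) -> bool:
    """Trigger learning if there is no model yet or its mean quality is too low."""
    if not qualities:
        return True
    return sum(qualities) / len(qualities) < min_quality


def process_stream(
    batches: Iterable[Sequence[str]],
    evaluate: Callable[[str, Sequence[str]], float],
    learn: Callable[[Sequence[str], List[str]], List[str]],
    min_quality: float,
) -> Tuple[List[str], int]:
    """Evaluate the current model on each batch; relearn only when required."""
    model: List[str] = []
    learning_runs = 0
    for batch in batches:
        qualities = [evaluate(p, batch) for p in model]
        if should_relearn(qualities, min_quality):
            model = learn(batch, model)
            learning_runs += 1
    return model, learning_runs


# Toy stand-ins (assumptions): a "pattern" is a symbol, its quality is its
# relative frequency in the batch, and learning keeps the most frequent symbol.
def toy_evaluate(pattern: str, batch: Sequence[str]) -> float:
    return sum(item == pattern for item in batch) / len(batch)


def toy_learn(batch: Sequence[str], model: List[str]) -> List[str]:
    return [Counter(batch).most_common(1)[0][0]]


stream = [list("aaab"), list("aaba"), list("bbba")]
final_model, learning_runs = process_stream(
    stream, toy_evaluate, toy_learn, min_quality=0.5
)
print(final_model, learning_runs)  # ['b'] 2
```

In this toy run the learning step is executed for the first batch (no model yet) and for the third batch (the stream changes), while the second batch is only evaluated, which is the kind of saving that motivates updating the model only when necessary.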

This paper is organised as follows: firstly, the main concepts related to big data analysis, data stream mining and EPM are presented in Section 2. Next, the main components of CE3P-MDS and its working scheme are shown in Section 3. After that, the experimental study, the results obtained and their discussion are presented in Section 4. Finally, the conclusions of this work are presented in Section 5.

Section snippets

Related work

In this section, the main concepts related to this paper are presented: big data analysis (Section 2.1), data stream mining (Section 2.2) and EPM (Section 2.3).

CE3P-MDS: cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams

Fuzzy Rule-Based Systems (FRBSs) (Mamdani & Assilian, 1975) are knowledge systems composed of a set of IF-THEN rules where both antecedent and consequent can contain fuzzy sets. There are two main components within FRBSs: the knowledge base (KB), which contains the fuzzy rules, and the data base (DB), which contains the fuzzy set definitions. Throughout the literature, EFSs have been widely used for learning the KB, the DB, or both from scratch; or for tuning its elements as a posteriori

Experimental study

In this section an experimental study is carried out to determine the quality of CE3P-MDS. In Section 4.1 the experimental framework is shown. Next, the study has three main objectives that are analysed: firstly, the quality of the knowledge extracted is shown and compared against another evolutionary approach for the extraction of EPs in data streams, the FEPDS algorithm (García-Vico et al., 2020) in Section 4.2. This comparison is made because, to the best of our knowledge, the

Conclusion

In this paper, an EFS for the extraction of EPs in massive data streams has been presented. To the best of our knowledge, the CE3P-MDS algorithm is the first EPM method focused on the processing of massive, heterogeneous, high-speed data streams. The main aim of the proposed method is to describe or monitor the current state of the data stream with respect to a variable of interest. This is carried out by an informed strategy based on a change monitoring system for finding a good trade-off

CRediT authorship contribution statement

Ángel M. García-Vico: Writing - original draft, Conceptualization, Methodology, Software. Cristóbal Carmona: Methodology, Writing - review & editing, Supervision. Pedro González: Writing - review & editing, Supervision. María J. del Jesus: Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Spanish Ministry of Economy and Competitiveness under the project PID2019-107793GB-I00 and by the Regional Government of Andalusia, program “Personal Investigador Doctor”, reference DOC_00235.

References (94)

  • G. Li et al. (2015). Identifying emerging hotel preferences using emerging pattern mining technique. Tourism Management.
  • J. Li et al. (2014). Discovering statistically non-redundant subgroups. Knowledge-Based Systems.
  • H. Li et al. (2018). Probabilistic frequent itemset mining over uncertain data streams. Expert Systems with Applications.
  • E.H. Mamdani et al. (1975). An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies.
  • J.H. Nord et al. (2019). The internet of things: Review and theoretical framework. Expert Systems with Applications.
  • S. Ramírez-Gallego et al. (2018). Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion.
  • S. Ramírez-Gallego et al. (2017). A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing.
  • E. Ruiz et al. (2018). Adaptive fuzzy partitions for evolving association rules in big data stream. International Journal of Approximate Reasoning.
  • I. Škrjanc et al. (2019). Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: A survey. Information Sciences.
  • G. Tzanis et al. (2011). Polya-iep: A data mining method for the effective prediction of polyadenylation sites. Expert Systems with Applications.
  • E. Alba et al. (2005). The exploration/exploitation tradeoff in dynamic cellular genetic algorithms. IEEE Transactions on Evolutionary Computation.
  • A. Bifet et al. (2010). MOA: massive online analysis. Journal of Machine Learning Research.
  • Brzeziński, D. (2015). Block-based and online ensembles for concept-drifting data streams (Ph.D. thesis). Poznan...
  • P. Carbone et al. (2015). Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
  • C.J. Carmona et al. (2010). NMEEF-SD: Non-dominated multi-objective evolutionary algorithm for extracting fuzzy rules in subgroup discovery. IEEE Transactions on Fuzzy Systems.
  • CERN (2021). Storage at CERN. URL: https://home.cern/science/computing/storage. Accessed:...
  • J. Cheng et al. (2008). Maintaining frequent closed itemsets over a sliding window. Journal of Intelligent Information Systems.
  • Cisco (2021). Cisco annual internet report (2018-2023) white paper. URL:...
  • J. Dean et al. MapReduce: Simplified data processing on large clusters.
  • K. Deb et al. (2014). An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, Part I: Solving problems with box constraints. IEEE Transactions on Evolutionary Computation.
  • K. Deb et al. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation.
  • Dheeru, D., & Karra Taniskidou, E. (2017). UCI machine learning repository. URL:...
  • G. Dong et al. Efficient mining of emerging patterns: Discovering trends and differences.
  • U.M. Fayyad et al. From data mining to knowledge discovery: an overview.
  • A. Fernández et al. (2019). Evolutionary fuzzy systems for explainable artificial intelligence: Why, when, what for, and where to? IEEE Computational Intelligence Magazine.
  • A. Fernández et al. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce and Programming Frameworks. WIREs Data Mining and Knowledge Discovery.
  • Foundation, A. S. (2021). Apache Storm. URL: https://storm.apache.org/. Accessed:...
  • J. Gama (2010). Knowledge discovery from data streams.
  • J. Gama et al. (2014). A survey on concept drift adaptation. ACM Computing Surveys.
  • D. Gamberger et al. (2002). Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research.
  • L.E. García-Hernández et al. Multi-objective configuration of a secured distributed cloud data storage.
  • A.M. García-Vico et al. (2018). MOEA-EFEP: Multi-objective evolutionary algorithm for extracting fuzzy emerging patterns. IEEE Transactions on Fuzzy Systems.
  • A. García-Vico et al. (2020). FEPDS: A proposal for the extraction of fuzzy emerging patterns in data streams. IEEE Transactions on Fuzzy Systems.
  • A.M. García-Vico et al. (2018). An overview of emerging pattern mining in supervised descriptive rule discovery: Taxonomy, empirical study, trends and prospects. WIREs: Data Mining and Knowledge Discovery.
  • A.M. García-Vico et al. Analysing Concentrating Photovoltaics Technology through the use of Emerging Pattern Mining.
  • N. Garg (2013). Apache Kafka.
  • R. Hernández Gómez et al. Improved metaheuristic based on the R2 indicator for many-objective optimization.
