A cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams
Introduction
The amount of information generated is currently growing exponentially. For example, the experiments at the Large Hadron Collider at CERN could generate about 90 petabytes of data per year and process one petabyte of data a day (CERN, 2021). According to a Cisco report, the amount of information generated worldwide by 2021 is expected to be approximately 850 zettabytes (Cisco, 2021). These data are mainly produced by the large amount of information transmitted between devices and the explosion of Internet of Things devices (Sezer et al., 2017, Nord et al., 2019). Of this huge amount of data, the volume generated but not stored is two orders of magnitude larger than the volume finally stored. This means that the majority of the generated data are considered interesting only at their creation, and they are usually neither stored nor analysed after that. In the worst case, these data are stored but never analysed, creating the so-called data tombs. Nevertheless, these data can contain interesting insights about their application domain. This information could be relevant for companies seeking to improve their services and productivity, as well as for many other applications that require a quick response while employing computational resources efficiently.
It is undeniable that we live surrounded by data. These huge amounts of data are commonly known as big data. Big data can be characterised by the 5 V's model (volume, velocity, variety, veracity and value) (Mayer-Schonberger & Cukier, 2013), which describes the massive volume of data, their fast generation, their diverse nature and their usefulness for the experts. In many cases, the short life of generated data forces us to process them as soon as they arrive in the system in order to provide a fast, reliable response. Data that continuously arrive in the system at an undetermined speed are known in the literature as a data stream (Gama, 2010). In this scenario, the learning model must be continuously updated and adapted to the incoming data. However, the volume, velocity and variety of data can now be so large that classical approaches are not suitable to handle them. Thus, a distributed data stream processing approach becomes compulsory. The interest in the analysis of this kind of data is evidenced by the number of distributed, large-scale processing frameworks that have been developed to date for this purpose. In particular, Apache Spark (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010), Kafka (Garg, 2013), Storm (Foundation, 2021), or Flink (Carbone et al., 2015), amongst others, are gaining special attention for both their distributed real-time performance and their fault-tolerant processing within iterative methods such as machine learning algorithms.
In the last few years, the literature on data streams has focused on concept drift analysis (Brzeziński, 2015, Ramírez-Gallego et al., 2017, Khamassi et al., 2018). These works, amongst others, have greatly advanced the area. However, many other aspects of data stream mining are not completely developed, so real-world applications remain challenging. For example, it is very difficult to find completely labelled data within high-speed, massive data streams. In these cases, it would be more realistic to develop supervised-learning-based descriptive models with good interpretability for system monitoring. In this way, supervised learning techniques can be employed to monitor the behaviour of data with respect to a naturally generated property of interest, which can be used as the class label.
Emerging Pattern Mining (EPM) (Dong and Li, 1999, García-Vico et al., 2018) is a data mining task within the Supervised Descriptive Rule Discovery (SDRD) framework (Kralj-Novak, Lavrac, & Webb, 2009). The aim of this task is the description of the discriminative relationships in data with respect to a property of interest. In particular, it tries to describe the characteristic differences between the values of a property of interest, or the emerging behaviour in data. In this way, experts can obtain an easy-to-understand pattern model which describes the underlying phenomena in data. Hence, EPM can be useful within data stream mining, as the purpose is to monitor the behaviour of the stream using a simple, readable, reliable model. EPM has been successfully applied in many different fields such as disease management (Piao et al., 2009, Park et al., 2010, Tzanis et al., 2011, Poezevara et al., 2017), toxicology (Sherhod et al., 2012, Sherhod et al., 2013), renewable energies (García-Vico, Montes, Aguilera, Carmona, & del Jesus, 2016), management (Li, Law, Vu, Rong, & Zhao, 2015) and social networks (Peng et al., 2018), amongst others. In addition, approaches based on Evolutionary Fuzzy Systems (EFSs), which surpass the descriptive capacities of the classical methods, have recently been proposed in García-Vico, Carmona, González, and del Jesus (2018). Nevertheless, the development of EPM algorithms within massive data stream mining is still challenging. This is mainly due to the computational complexity of the mining methods (Wang, Zhao, Dong, & Li, 2004) and the difficulty of developing fast, distributed strategies. This makes their application to massive data stream environments unfeasible, as such environments require an almost real-time response. In addition, one of the main drawbacks of EPM methods in data stream mining is that they require a finite dataset in order to compute the quality measures needed for the extraction of the patterns.
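To make the notion of an emerging pattern more concrete, the following minimal sketch computes the growth rate introduced by Dong and Li (1999), i.e. the ratio between the support of a pattern in one class and its support in the remaining data. The toy records, the attribute names and the helper functions `support` and `growth_rate` are illustrative assumptions and are not part of the method proposed in this paper.

```python
from typing import Dict, List

Record = Dict[str, str]


def support(pattern: Record, records: List[Record]) -> float:
    """Fraction of records that match every attribute-value pair in pattern."""
    if not records:
        return 0.0
    hits = sum(all(r.get(k) == v for k, v in pattern.items()) for r in records)
    return hits / len(records)


def growth_rate(pattern: Record, target: List[Record], rest: List[Record]) -> float:
    """Support in the target class divided by support in the remaining data."""
    s_target = support(pattern, target)
    s_rest = support(pattern, rest)
    if s_rest == 0.0:
        # Pattern absent from the other class: infinite if it covers any
        # target record, and 0 if it covers nothing at all.
        return float("inf") if s_target > 0.0 else 0.0
    return s_target / s_rest


# Hypothetical toy data: readings labelled by a property of interest.
faulty = [
    {"temp": "high", "load": "high"},
    {"temp": "high", "load": "low"},
    {"temp": "high", "load": "high"},
    {"temp": "low", "load": "high"},
]
normal = [
    {"temp": "low", "load": "low"},
    {"temp": "high", "load": "low"},
    {"temp": "low", "load": "high"},
    {"temp": "low", "load": "low"},
]

pattern = {"temp": "high"}
gr = growth_rate(pattern, faulty, normal)
print(f"GR = {gr:.2f}")
```

Both supports are computed over a finite set of records, which illustrates why measures of this kind cannot be evaluated directly on an unbounded stream.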
A first approach to solving this issue is a multi-objective EFS following a block-based learning approach for data stream mining, proposed in García-Vico, Carmona, González, and del Jesus (2020). Although the quality of the knowledge extracted is good, its learning method is executed continuously in order to adapt to the stream. Moreover, it does not provide any distributed mechanism to scale up the mining process efficiently. Therefore, its application within high-speed, massive data stream environments could be problematic, as many unnecessary executions of the learning method are carried out without any data distribution mechanism.
In this paper, a Cellular-based Evolutionary approach for the Extraction of Emerging Patterns in Massive Data Streams (CE3P-MDS) is proposed. The main contributions of this paper are as follows:
1. A learning method inspired by a cellular-based, multi-objective evolutionary algorithm (Nebro, Durillo, Luna, Dorronsoro, & Alba, 2009) which improves the diversity-exploitation trade-off, together with a reinitialisation method based on the odds ratio measure which removes the redundant patterns with the highest complexity.
2. Smart triggering of the learning method, which updates and adapts the current pattern model with respect to the state of the data stream only when necessary, according to the user requirements.
3. A scalable approach for the processing of massive, high-speed data streams from several sources thanks to the use of Apache Kafka and Apache Spark.
4. A comprehensive experimental evaluation of the proposed method.
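The idea of triggering the learning method only when it is needed can be sketched as follows. This is a minimal illustration under stated assumptions, not the triggering rule of CE3P-MDS: the function `monitor_stream`, the `min_quality` threshold and the mean-based stand-ins for the model and its quality are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence, TypeVar

Model = TypeVar("Model")
Batch = Sequence[float]


def monitor_stream(
    batches: Iterable[Batch],
    learn: Callable[[Batch], Model],
    quality: Callable[[Model, Batch], float],
    min_quality: float,
) -> List[int]:
    """Re-learn the model only on batches where its quality is too low.

    Returns the indices of the batches that triggered the learning method.
    """
    model: Optional[Model] = None
    triggered: List[int] = []
    for i, batch in enumerate(batches):
        # Learn on the first batch, or when the current model no longer
        # describes the incoming data well enough (assumed criterion).
        if model is None or quality(model, batch) < min_quality:
            model = learn(batch)
            triggered.append(i)
    return triggered


# Hypothetical stand-ins: the "model" is the batch mean, and its quality
# decays with the distance to the mean of the new batch.
def learn_mean(batch: Batch) -> float:
    return sum(batch) / len(batch)


def mean_quality(model: float, batch: Batch) -> float:
    return 1.0 / (1.0 + abs(model - learn_mean(batch)))


stream = [[1.0, 1.2, 0.8], [1.1, 0.9, 1.0], [5.0, 5.2, 4.8], [5.1, 4.9, 5.0]]
triggered = monitor_stream(stream, learn_mean, mean_quality, min_quality=0.5)
print(triggered)
```

In this toy stream the learning method runs on the first batch and again when the data shift, while the two batches that the current model still describes well do not trigger any learning.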
This paper is organised as follows: firstly, the main concepts related to big data analysis, data stream mining and EPM are presented in Section 2. Next, the main components of CE3P-MDS and its working scheme are described in Section 3. After that, the experimental study, the results obtained and their discussion are presented in Section 4. Finally, the conclusions of this work are presented in Section 5.
Related work
In this section, the main concepts related to this paper are presented: big data analysis (Section 2.1), data stream mining (Section 2.2) and EPM (Section 2.3).
CE3P-MDS: cellular-based evolutionary approach for the extraction of emerging patterns in massive data streams
Fuzzy Rule-Based Systems (FRBSs) (Mamdani & Assilian, 1975) are knowledge systems composed of a set of IF-THEN rules where both antecedent and consequent can contain fuzzy sets. There are two main components within FRBSs: the knowledge base (KB), which contains the fuzzy rules, and the data base (DB), which contains the fuzzy set definitions. Throughout the literature, EFSs have been widely used for learning the KB, the DB, or both from scratch; or for tuning its elements as a posteriori
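As a minimal illustration of the two FRBS components named above, the sketch below stores fuzzy set definitions in a data base and evaluates how strongly one IF-THEN rule fires for a given example. The triangular membership shape, the linguistic labels, the variable names and the use of the minimum to combine antecedent conditions are assumptions chosen for illustration, not details taken from CE3P-MDS.

```python
from typing import Dict, Tuple

Triangle = Tuple[float, float, float]


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set with feet a, c and peak b.

    Assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Data base (DB): fuzzy set definitions for each variable (hypothetical).
DATA_BASE: Dict[str, Dict[str, Triangle]] = {
    "temperature": {
        "low": (-20.0, 0.0, 20.0),
        "medium": (0.0, 20.0, 40.0),
        "high": (20.0, 40.0, 60.0),
    },
    "humidity": {
        "low": (-50.0, 0.0, 50.0),
        "medium": (0.0, 50.0, 100.0),
        "high": (50.0, 100.0, 150.0),
    },
}

# Knowledge base (KB) entry: IF temperature is high AND humidity is medium
# THEN class is "alarm" (hypothetical rule).
RULE = ({"temperature": "high", "humidity": "medium"}, "alarm")


def firing_degree(antecedent: Dict[str, str], example: Dict[str, float]) -> float:
    """Degree to which an example matches the antecedent (minimum of memberships)."""
    return min(
        triangular(example[var], *DATA_BASE[var][label])
        for var, label in antecedent.items()
    )


example = {"temperature": 30.0, "humidity": 60.0}
degree = firing_degree(RULE[0], example)
print(f"Rule for class '{RULE[1]}' fires with degree {degree:.2f}")
```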
Experimental study
In this section, an experimental study is carried out to determine the quality of CE3P-MDS. In Section 4.1 the experimental framework is shown. Next, the three main objectives of the study are analysed: firstly, the quality of the knowledge extracted is shown and compared against another evolutionary approach for the extraction of EPs in data streams, the FEPDS algorithm (García-Vico et al., 2020), in Section 4.2. This comparison is made because, to the best of our knowledge, the
Conclusion
In this paper, an EFS for the extraction of EPs in massive data streams has been presented. To the best of our knowledge, the CE3P-MDS algorithm is the first EPM method focused on the processing of massive, heterogeneous, high-speed data streams. The main aim of the proposed method is to describe or monitor the current state of the data stream with respect to a variable of interest. This is carried out by an informed strategy based on a change monitoring system for finding a good trade-off
CRediT authorship contribution statement
Ángel M. García-Vico: Writing - original draft, Conceptualization, Methodology, Software. Cristóbal Carmona: Methodology, Writing - review & editing, Supervision. Pedro González: Writing - review & editing, Supervision. María J. del Jesus: Writing - review & editing, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Spanish Ministry of Economy and Competitiveness under the project PID2019-107793GB-I00 and by the Regional Government of Andalusia, program “Personal Investigador Doctor”, reference DOC_00235.