Elsevier

Applied Soft Computing

Volume 8, Issue 4, September 2008, Pages 1283-1294

Info-fuzzy algorithms for mining dynamic data streams

https://doi.org/10.1016/j.asoc.2007.11.003

Abstract

Most data-mining algorithms assume that the incoming data are static. In the real world, however, most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases drastically. The change in the underlying concept, known as concept drift, causes the data-mining model generated from past examples to become less accurate and less relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in the classification accuracy. On the other hand, it suffers from a major drawback: the high computational cost of generating new models. The problem worsens as concept drift is detected more frequently; hence, a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called “Info-Fuzzy Network” (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.

Introduction

Data mining is known as the core stage of Knowledge Discovery in Databases (KDD), which is defined by Fayyad et al. [12] as: “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. In recent years, there has been an ongoing demand for systems that are capable of mining massive and continuous streams of real-world data. Such systems can be used in fields such as temperature monitoring, precision agriculture, urban traffic control, stock market analysis, and network security. The complex nature of real-world data has increased the difficulties and the challenges of data mining in terms of data processing, data storage, and model storage requirements [20]. One of the main difficulties in mining dynamic continuous data streams is coping with the changing data concept. The fundamental processes generating most real-world data streams may change over years, months, or even seconds, at times drastically. In the case of the classification task, this change, also known as concept drift [15], causes the data-mining model generated from past data to become less accurate in the classification of new records. Therefore, the most important characteristic of such a system is its ability to deal with the noise, uncertainty, and asynchrony of real-world data [8].

Batch classification algorithms like CART [2], ID3 [28], C4.5 [29], and IFN [25] are not suitable for mining continuous data streams. The main problem of these algorithms is that they need to store and process the entire set of training data. The continuous arrival of training data increases their storage and processing effort, which eventually results in insufficient memory or prohibitively long computation times. In addition, when a data-mining algorithm considers all past training examples, the induced patterns may not be valid and relevant to the new data because of changes in the dynamic process that generates the data. In practical terms, this means an increasing error rate in classifying new records with the existing model.
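The effect described above can be illustrated with a minimal sketch. It is a toy example, not taken from the paper: the stream, the abrupt label flip at position 200, and the helper names `make_stream`, `train_batch`, and `error_rate` are all assumptions made for illustration, and the "learner" is a plain majority-class lookup table rather than CART, ID3, C4.5, or IFN.

```python
from collections import Counter, defaultdict

def make_stream(n, drift_at):
    """Hypothetical toy stream: one binary feature whose label rule flips at `drift_at`."""
    stream = []
    for i in range(n):
        x = i % 2                           # feature alternates 0, 1, 0, 1, ...
        y = x if i < drift_at else 1 - x    # concept before vs. after the drift
        stream.append((x, y))
    return stream

def train_batch(examples):
    """Stand-in batch learner: majority class per feature value."""
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

def error_rate(model, examples):
    """Fraction of examples the model misclassifies."""
    wrong = sum(1 for x, y in examples if model.get(x) != y)
    return wrong / len(examples)

stream = make_stream(n=400, drift_at=200)

# Model trained once on early data, then never updated.
static_model = train_batch(stream[:100])
err_before = error_rate(static_model, stream[100:200])   # same concept
err_after = error_rate(static_model, stream[200:400])    # after the drift

# Retraining on ALL past data: old examples still outvote the new concept.
full_history_model = train_batch(stream[:300])
err_full_history = error_rate(full_history_model, stream[300:400])

# Retraining only on examples seen since the drift.
recent_model = train_batch(stream[200:300])
err_recent = error_rate(recent_model, stream[300:400])

print(err_before, err_after, err_full_history, err_recent)
```

In this deliberately extreme setting, the model built on the full history stays wrong after the drift because the 200 pre-drift examples outnumber the 100 post-drift ones, whereas a model built only on recent examples recovers. Real drifts are rarely this clean, so the numbers should be read as an illustration of the mechanism only.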

Algorithms and methods that extract patterns from continuous and potentially dynamic data streams are known as incremental (online) learning methods. According to [14], a learning task is defined as incremental if the training examples used to solve it become available over time, usually one at a time. The basic approach of pure incremental algorithms is to induce patterns in an incremental manner based on every new incoming instance. This means that instead of building a new model, an incremental learning algorithm updates the current model. This approach saves a significant amount of computer resources such as processing time and memory. In the area of incremental learning with decision-tree classification algorithms, there are several methods, such as VFDT [9], CVFDT [17], and OLIN [22], which in general are able to process continuous data streams.
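The idea of updating the current model with each incoming instance, rather than rebuilding it, can be sketched as follows. This is a hypothetical illustration and not VFDT, CVFDT, OLIN, or any algorithm from the paper: the class `IncrementalCounts`, its methods, and the traffic-style example records are assumptions. The model keeps only per-value class counts, so no training example has to be stored.

```python
from collections import Counter, defaultdict

class IncrementalCounts:
    """Hypothetical incremental classifier: majority class per feature value."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # feature value -> class counts
        self.seen = 0                       # number of examples processed

    def update(self, x, y):
        """Fold one labeled example into the model; the example is then discarded."""
        self.counts[x][y] += 1
        self.seen += 1

    def predict(self, x):
        """Return the most frequent class for x, or None if x was never seen."""
        if x not in self.counts:
            return None
        return self.counts[x].most_common(1)[0][0]

# Made-up records arriving one at a time: (weather, traffic state).
examples = [
    ("rain", "jam"),
    ("rain", "jam"),
    ("sun", "free"),
    ("rain", "free"),
    ("sun", "free"),
]

model = IncrementalCounts()
predictions = []
for x, y in examples:
    predictions.append(model.predict(x))  # predict first, using the current model
    model.update(x, y)                    # then update the model with the true label

print(predictions)
print(model.predict("rain"), model.seen)
```

Each update costs a constant amount of work and memory grows with the number of distinct feature values rather than with the length of the stream, which is the resource saving the paragraph above refers to. A count-based model like this one never forgets, so on its own it does not handle concept drift.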

In this paper, we present a series of novel incremental algorithms that produce more accurate classification models than the batch algorithms in the presence of concept drift and are computationally cheaper than existing incremental methods (OLIN and CVFDT). Our classification models are “oblivious” decision trees generated by the IFN (Info-Fuzzy Network) algorithm introduced by Maimon and Last in [25]. The proposed incremental algorithms are evaluated on real-world streams of traffic and intrusion-detection data. The algorithms are also compared to CVFDT (Concept-adapting Very Fast Decision Tree) [17], a leading incremental approach to mining dynamic data streams, and the results show that our incremental methods outperform CVFDT in terms of run time while maintaining nearly the same level of predictive accuracy.

The rest of this paper is organized as follows. Section 2 presents the related work in the areas of incremental learning and real-time data mining. Section 3 presents a brief overview of the IFN and OLIN algorithms, which are the basis for this work. Section 4 presents the proposed incremental algorithms, and Section 5 evaluates them. In Section 6, we conclude the paper with a discussion of the experimental results and proposals for future work.

Section snippets

Incremental learning

Pratt and Tschapek [27] claim that the change in outcome distribution (concept drift) may occur in two ways. First, an existing predictive rule may keep its accuracy level, but the rule may be invoked more or less often due to a change in the frequency of occurrence of its feature values. Second, the accuracy of a certain rule may decrease because its underlying features have become irrelevant. In this case, those rules should be discarded and replaced with new rules that depend on newly

The batch learning algorithm (IN)

Many batch and online learning methods use information theory to induce classification models. One of the batch information-theoretic methods, developed by Last and Maimon [24], [25], is the Info-Fuzzy Network algorithm (also known as Information Network, or IN). IN is an oblivious decision-tree classification model designed to minimize the total number of predicting attributes. The underlying principle of the IN-based methodology is to construct a multi-layered network in order to maximize the

Incremental information network algorithms

The OLIN online classification algorithm [22] deals with the potential concept drift in a non-stationary data stream by simply generating a new model for every new sliding window. On one hand, this regenerative approach ensures accurate and relevant models over time and therefore an increase in the classification accuracy. On the other hand, OLIN's major shortcoming is the high computational cost of generating new models. In this section, we present four incremental learning algorithms, which

Evaluation

In this section, the proposed incremental algorithms are evaluated against the original Regenerative OLIN algorithm [22] on several real-world streams of dynamic data. In addition, the IN-based methods are compared to the CVFDT incremental decision-tree learner [17] available as part of the VFML toolkit [19]. We have examined two aspects of all incremental algorithms: first, we have evaluated their predictive accuracy on incoming examples, and second, we have compared their processing time per the

Conclusions and future work

This paper has presented a comprehensive evaluation of a series of novel real-time data-mining algorithms aimed at optimizing classification performance on dynamic incoming data. Unlike existing techniques for mining continuous data streams, the real-time algorithms adapt themselves automatically to the rate of data change (“concept drift”). The Learning Module of the proposed real-time data-mining methods is based on an advanced decision-tree induction algorithm called Info-Fuzzy

Acknowledgements

We would like to thank the Traffic Control Center of Jerusalem for granting us permission to use their traffic database. This work was partially supported under a research contract from the Israel Ministry of Defense and by the National Institute for Systems Test and Productivity at the University of South Florida under the USA Space and Naval Warfare Systems Command Grant No. N00039-01-1-2248.

References (33)

  • P. Domingos et al. Mining high-speed data streams.
  • P. Domingos et al. A general framework for mining massive data streams. J. Comput. Graphical Stat. (2003)
  • U. Fayyad et al. From data mining to knowledge discovery: an overview.
  • J. Gama et al. Accurate Decision Trees for mining high-speed Data Streams.
  • C. Giraud-Carrier. A note on the utility of incremental learning. AI Commun. (2000)
  • D.P. Helmbold et al. Tracking drifting concepts by minimizing disagreements. Machine Learn. (1994)

    This paper is partially based on the following non-archival publication: L. Cohen, G. Avrahami, M. Last, A. Kandel, and O. Kipersztok, “Incremental Classification of Nonstationary Data Streams”, Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, pp. 117–124, October 7, 2005, Porto, Portugal.
