Elsevier

Pattern Recognition

Volume 40, Issue 2, February 2007, Pages 492-503

Mining evolving data streams for frequent patterns

https://doi.org/10.1016/j.patcog.2006.03.006

Abstract

A data stream is a potentially uninterrupted flow of data. Mining this flow makes it necessary to cope with uncertainty, as only a part of the stream can be stored. In this paper, we evaluate a statistical technique which biases the estimation of the support of patterns, so as to maximize either the precision or the recall, as chosen by the user, and limit the degradation of the other criterion. Theoretical results show that the technique is not far from the optimum, from the statistical standpoint. Experiments performed tend to demonstrate its potential, as it remains robust even under significant distribution drifts.

Introduction

A growing body of work arising from Databases and Data Mining deals with data arriving in the form of continuous, potentially infinite streams, i.e. ordered sequences of item occurrences that arrive in a timely manner. Data streams have brought forward crucial problems that were previously less pressing for databases, such as the accurate retrieval of information from a data flow that cannot be stored exactly and whose content may evolve through time. Emerging and real applications generate data streams: trend analysis, fraud detection, intrusion detection, and click streams, among many others. Trend analysis is an important problem for commercial applications: it consists in detecting significant trends, emerging buzz, and unusually high or low activity in the data stream [1]. In fraud detection, data miners try to detect suspicious changes in user behavior [2]. Finally, intrusion detection is a critical approach to protecting systems, given the growing importance of network security and the sensitivity of the information stored and manipulated online [3].

A crucial issue in Data Mining that has recently attracted significant attention [3], [4], [5], [6], [7], [8] is to build the set of the most frequent patterns encountered in the data stream. Though it is straightforward to formulate, addressing this issue faces two non-trivial problems. The first is the statistical approximation of the true supports by observed supports. The second concerns the drifts that the data stream may face through time.

The rest of this paper is organized as follows. Section 2 states the problem precisely. Our theoretical approach is presented and discussed in Section 3. Section 4 is experimental: it presents and discusses results obtained on readily generable data streams. Section 5 compares our approach with related ones. Finally, Section 6 concludes the paper with avenues for future research. To avoid overloading the paper, the proof of a theorem is deferred to an Appendix at the end.

Section snippets

Problem statement

We define items as the units of information, itemsets as sets of items [9], and sequential patterns as sequences of items [10]. We use the word pattern as shorthand for both settings, without loss of generality. A pattern is θ-frequent if its support, i.e. the fraction of the data stream in which it occurs, is at least θ, where θ is a user-specified parameter.
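The definition of support above can be illustrated with a minimal sketch. It assumes the itemset setting only (sequential patterns would need an order-aware containment test), and the names `support`, `is_frequent`, and `stream_sample` are hypothetical, introduced here for illustration rather than taken from the paper.

```python
from typing import FrozenSet, Sequence


def support(pattern: FrozenSet[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `pattern`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def is_frequent(pattern: FrozenSet[str],
                transactions: Sequence[FrozenSet[str]],
                theta: float) -> bool:
    """A pattern is theta-frequent if its support is at least theta."""
    return support(pattern, transactions) >= theta


# Hypothetical stored portion of a stream: four small itemsets.
stream_sample = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bd"),
]

ab_support = support(frozenset("ab"), stream_sample)
```

Here `{a, b}` is contained in two of the four stored itemsets, so it is θ-frequent for θ = 0.5 but not for θ = 0.6. Note that this is the observed support on the stored part of the stream, not the true support over the whole stream.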

Basically, our problem is motivated by the fact that the data we store catches a glimpse of a data stream, and the information we mine should

Our approach

Our approach relies on the following model of the data stream. The stream is assumed to be obtained by repeatedly sampling a potentially huge domain X which contains all possible data sequences, see Fig. 1(a). Obviously, X is unknown, but we have access to its elements through an unknown distribution D, see Fig. 1(b). We make absolutely no assumption on D, except, for the moment, that it does not change through time (this assumption is relaxed later). Now, the user specifies a real 0<θ<1

Experiments

Two kinds of experiments were performed. First, we evaluate how our statistical supports are helpful to mine frequent patterns. Second, we analyze the behavior of our approach according to distribution drifts.
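To make the precision/recall trade-off mentioned in the abstract concrete, the following is a rough sketch, not the paper's method. It assumes a one-sided Hoeffding deviation term for a single pattern, which may differ from the bound the paper actually derives, and all names and numbers (`hoeffding_epsilon`, `true_support`, `observed`, m = 1000, δ = 0.05) are made up for illustration.

```python
import math


def hoeffding_epsilon(m: int, delta: float) -> float:
    """One-sided Hoeffding deviation for m samples at confidence 1 - delta.

    Assumption: a stand-in for the paper's own concentration bound.
    """
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))


def select(supports: dict, threshold: float) -> set:
    """Patterns whose support reaches the threshold."""
    return {p for p, s in supports.items() if s >= threshold}


def precision_recall(selected: set, truly_frequent: set) -> tuple:
    """Precision and recall of `selected` against the true frequent set."""
    tp = len(selected & truly_frequent)
    precision = tp / len(selected) if selected else 1.0
    recall = tp / len(truly_frequent) if truly_frequent else 1.0
    return precision, recall


theta = 0.30
eps = hoeffding_epsilon(m=1000, delta=0.05)

# Hypothetical true supports (unknown in practice) and observed supports.
true_support = {"A": 0.35, "B": 0.31, "C": 0.29, "D": 0.20}
observed = {"A": 0.36, "B": 0.29, "C": 0.32, "D": 0.21}

truly_frequent = select(true_support, theta)
plain = select(observed, theta)         # unbiased threshold
strict = select(observed, theta + eps)  # biased towards precision
loose = select(observed, theta - eps)   # biased towards recall

for name, chosen in [("plain", plain), ("strict", strict), ("loose", loose)]:
    print(name, sorted(chosen), precision_recall(chosen, truly_frequent))
```

On this toy data, raising the threshold by ε keeps only patterns that are truly frequent (full precision, lower recall), while lowering it by ε recovers every truly frequent pattern at the cost of one false positive.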

Related works

A significant body of previous work has addressed the accurate storage of the data stream history. This storage problem consists in finding compact data structures that reduce the size of the data kept from the stream, while guaranteeing with high probability that the items observed as frequent in the stream are still observed as frequent inside the data structure [11], [13], [5]. The first approach was proposed in [7], which defines the first single-pass algorithm. Li et al. [4] use a

Conclusion

There are five main contributions in this paper. First, we discuss the replacement of the conventional minimal support requirement for finding frequent patterns by a statistical support, in cases where storing the entire data is impossible (such as for data streams), so as to keep some convenient properties over the data kept. Then, we provide a method to compute this statistical support, while keeping those relevant properties. The method exploits concentration inequalities for random

References (32)

  • G. Manku et al.

    Approximate frequency counts over data streams

  • W.-G. Teng et al.

    A regression-based temporal patterns mining schema for data streams

  • S. Gollapudi et al.

    Framework and algorithms for trend analysis in massive temporal datasets

  • W. Fan et al.

    Active mining of data streams

  • L. Golab et al.

    Issues in data stream management

    ACM SIGMOD Record

    (2003)
  • H.-F. Li, S.Y. Lee, M.-K. Shan, An efficient algorithm for mining frequent itemsets over the entire history of data...
  • C. Jin et al.

    Dynamically maintaining frequent items over a data stream

  • E. Demaine et al.

    Frequency estimation of internet packet streams with limited space

  • R.-M. Karp et al.

    A simple algorithm for finding elements in streams and bags

    ACM Trans. Database Systems

    (2003)
  • R. Agrawal et al.

    Mining association rules between sets of items in large databases

  • R. Agrawal et al.

    Mining sequential patterns

  • M. Charikar et al.

    Finding frequent items in data streams

  • D. Cheung et al.

    Maintenance of discovered association rules in large databases: an incremental updating technique

  • G. Cormode et al.

    What's hot and what's not: tracking most frequent items dynamically

  • A. Veloso et al.

    Mining frequent itemsets in evolving databases

  • V. Vapnik

    Statistical Learning Theory

    (1998)

Cited by (8)

    • Patient clustering using dynamic partitioning on correlated and uncertain biomedical data

      2020, Computer Methods and Programs in Biomedicine
      Citation Excerpt:

      Patients in assisted living systems are continuously monitored using a multitude of sensors [2]. These biomedical sensors generate a stream of physiological data [3] that contain a wealth of information about patients [4,5]. Vital sign data such as heart rate (HR), blood pressure (BP), respiratory rate (RR) and blood oxygen saturation (SPO2) are highly correlated and even a small change in these data over time can cause emergency clinical events [6].

    • Experimental study on fighters behaviors mining

      2011, Expert Systems with Applications
      Citation Excerpt:

      Zhong combined the online spherical k-means (OSKM) algorithm with the scalable clustering strategy to achieve fast and adaptive clustering of text streams (Zhong, 2005). Laur et al. evaluated a statistical technique which biased the estimation of the Support of patterns to maximize either the precision or the recall (Laur, Nock, Symphor, & Poncelet, 2007). Lin et al. designed some algorithms to reduce the number of distinct query results by grouping a set of different queries into a cluster and to minimize the total query processing costs (Lin et al., 2006).

    • Stock Data Clustering and Multiscale Trend Detection

      2012, Methodology and Computing in Applied Probability
    • Reconfigurable manufacturing services for web-based collaboration systems

      2011, 21st International Conference on Production Research: Innovation in Product and Production, ICPR 2011 - Conference Proceedings