Mining evolving data streams for frequent patterns

doi:10.1016/j.patcog.2006.03.006

Pattern Recognition

Volume 40, Issue 2, February 2007, Pages 492-503

Abstract

A data stream is a potentially uninterrupted flow of data. Mining this flow makes it necessary to cope with uncertainty, as only a part of the stream can be stored. In this paper, we evaluate a statistical technique which biases the estimation of the support of patterns, so as to maximize either the precision or the recall, as chosen by the user, and limit the degradation of the other criterion. Theoretical results show that the technique is not far from the optimum, from the statistical standpoint. Experiments performed tend to demonstrate its potential, as it remains robust even under significant distribution drifts.

Introduction

A growing body of work arising from Databases and Data Mining deals with data arriving in the form of continuous, potentially infinite streams, i.e. ordered sequences of item occurrences that arrive in a timely manner. Data streams have brought to the fore crucial problems that were previously less pressing for databases, such as the accurate retrieval of information from a data flow that cannot be stored exactly, and whose information may evolve through time. Many emerging and real applications generate data streams: trend analysis, fraud detection, intrusion detection, and click streams, among many others. Trend analysis is an important problem for commercial applications: it consists in detecting significant trends, emerging buzz, and unusually high or low activity in the data stream [1]. In fraud detection, data miners try to detect suspicious changes in user behavior [2]. Finally, intrusion detection is a critical approach to help protect systems, given the growing importance of network systems security and the sensitivity of the information stored and manipulated online [3].

A crucial issue in Data Mining that has recently attracted significant attention [3], [4], [5], [6], [7], [8] is to build the set of the most frequent patterns encountered in the data stream. Though it is straightforward to formulate, addressing this issue faces two non-trivial problems. The first is the statistical approximation of the true supports by observed supports. The second concerns the drifts that the data stream may face through time.

The rest of this paper is organized as follows. Section 2 states the problem precisely. Our theoretical approach is presented and discussed in Section 3. Section 4 is experimental: it presents and discusses some results that were obtained on readily generated data streams. In Section 5 we make some comparisons with related approaches. Finally, Section 6 concludes the paper with future avenues for research. To avoid overloading the paper, the proof of a theorem is deferred to an Appendix at the end of the paper.

Section snippets

Problem statement

We define items as the unit of information, itemsets as sets of items [9], and sequential patterns as sequences of items [10]. We use the word pattern as shorthand for both settings, without loss of generality. The support of a pattern is the fraction of the data stream in which it occurs; a pattern is $θ$-frequent if its support is at least $θ$, where $θ$ is a user-specified parameter.
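The definition of support can be made concrete with a small sketch. It is not taken from the paper: it assumes the stored part of the stream is a list of transactions (sets of items), treats only itemsets up to a hypothetical `max_size`, and uses illustrative function names.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(transactions, max_size=2):
    """Map each itemset (up to max_size items) to its observed support,
    i.e. the fraction of stored transactions that contain it."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {itemset: count / n for itemset, count in counts.items()}


def frequent_patterns(transactions, theta, max_size=2):
    """Keep the itemsets whose observed support is at least theta."""
    supports = itemset_supports(transactions, max_size)
    return {s: sup for s, sup in supports.items() if sup >= theta}


# Illustrative data: four stored transactions over items a, b, c.
stored = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = frequent_patterns(stored, theta=0.5)
```

Here the support is computed on the stored data only, so it is an observed support; the difficulty raised in the paper is that it may differ from the true support over the whole stream.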

Basically, our problem is motivated by the fact that the data we store catches a glimpse of a data stream, and the information we mine should

Our approach

Our approach relies on the following model of the data stream. The stream is assumed to be obtained by repeatedly sampling a potentially huge domain $X$ which contains all possible data sequences, see Fig. 1(a). Obviously, $X$ is unknown, but we have access to its elements through an unknown distribution $D$, see Fig. 1(b). We make absolutely no assumption on $D$, except, for the moment, that it does not change through time (this assumption is relaxed later). Now, the user specifies a real $0 < θ < 1$
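To illustrate how a threshold can be biased under this stationary sampling model, here is a minimal sketch. It is an assumption, not the paper's formula: it uses the standard one-sided Hoeffding inequality with a union bound over a hypothetical number of candidate patterns `n_patterns`, a stored sample of `m` independent elements, and a risk `delta`; all names are illustrative.

```python
import math


def hoeffding_margin(m, n_patterns, delta):
    """Margin eps such that, with probability at least 1 - delta, the observed
    support of every one of n_patterns patterns exceeds its true support by
    less than eps (one-sided Hoeffding bound plus a union bound), assuming the
    m stored elements are independent draws from a fixed distribution."""
    return math.sqrt(math.log(n_patterns / delta) / (2.0 * m))


def biased_thresholds(theta, m, n_patterns, delta):
    """Return (precision_threshold, recall_threshold).

    Thresholding observed supports at theta + eps favors precision: patterns
    kept are truly theta-frequent with high probability. Thresholding at
    theta - eps favors recall: truly theta-frequent patterns are kept with
    high probability."""
    eps = hoeffding_margin(m, n_patterns, delta)
    return min(1.0, theta + eps), max(0.0, theta - eps)


# Illustrative values only.
upper, lower = biased_thresholds(theta=0.1, m=10_000, n_patterns=100, delta=0.05)
```

The margin shrinks as more of the stream is stored and grows with the number of candidate patterns, which is the trade-off a biased threshold has to manage.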

Experiments

Two kinds of experiments were performed. First, we evaluate how our statistical supports are helpful to mine frequent patterns. Second, we analyze the behavior of our approach according to distribution drifts.
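The abstract frames the quality of the mined patterns in terms of precision and recall. The snippet does not say exactly how the experiments measure them, so the sketch below assumes the standard definitions, with a hypothetical convention that an empty set scores 1.0.

```python
def precision_recall(mined, truly_frequent):
    """Precision: share of mined patterns that are truly frequent.
    Recall: share of truly frequent patterns that were mined.
    An empty denominator is scored as 1.0 (a convention assumed here)."""
    mined, truly_frequent = set(mined), set(truly_frequent)
    hits = len(mined & truly_frequent)
    precision = hits / len(mined) if mined else 1.0
    recall = hits / len(truly_frequent) if truly_frequent else 1.0
    return precision, recall


# Illustrative pattern names: two of the three mined patterns are correct,
# and two of the four truly frequent patterns are found.
precision, recall = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
```

Raising the threshold on observed supports tends to remove false positives (higher precision) at the cost of missing true patterns (lower recall), and lowering it does the opposite.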

Related works

A significant body of previous work has addressed the accurate storage of the data stream history. This storage problem consists in finding compact data structures that reduce the size of the data retained from the stream, while guaranteeing with high probability that the items observed as frequent in the stream are still observed as frequent inside the data structure [11], [13], [5]. The first approach was proposed in [7], which defines the first single-pass algorithm. Li et al. [4] use a

Conclusion

There are five main contributions in this paper. First, we discuss the replacement of the conventional minimal support requirement for finding frequent patterns by a statistical support, in cases where storing the entire data is impossible (such as for data streams), so as to keep some convenient properties over the data kept. Then, we provide a method to compute this statistical support, while keeping those relevant properties. The method exploits concentration inequalities for random

References (32)

G. Manku et al., Approximate frequency counts over data streams
W.-G. Teng et al., A regression-based temporal patterns mining schema for data streams
S. Gollapudi et al., Framework and algorithms for trend analysis in massive temporal datasets
W. Fan et al., Active mining of data streams
L. Golab et al., Issues in data stream management, ACM SIGMOD Records (2003)
H.-F. Li, S.Y. Lee, M.-K. Shan, An efficient algorithm for mining frequent itemsets over the entire history of data...
C. Jin et al., Dynamically maintaining frequent items over a data stream
E. Demaine et al., Frequency estimation of internet packet streams with limited space
R.-M. Karp et al., A simple algorithm for finding elements in streams and bags, ACM Trans. Database Systems (2003)
R. Agrawal et al., Mining association rules between sets of items in large databases
R. Agrawal et al., Mining sequential patterns
M. Charikar et al., Finding frequent items in data streams
D. Cheung et al., Maintenance of discovered association rules in large databases: an incremental updating technique
G. Cormode et al., What's hot and what's not: tracking most frequent items dynamically
A. Veloso et al., Mining frequent itemsets in evolving databases
V. Vapnik, Statistical Learning Theory (1998)

Cited by (8)

Patient clustering using dynamic partitioning on correlated and uncertain biomedical data
2020, Computer Methods and Programs in Biomedicine
Citation Excerpt:
Patients in assisted living systems are continuously monitored using a multitude of sensors [2]. These biomedical sensors generate a stream of physiological data [3] that contain a wealth of information about patients [4,5]. Vital sign data such as heart rate (HR), blood pressure (BP), respiratory rate (RR) and blood oxygen saturation (SPO2) are highly correlated and even a small change in these data over time can cause emergency clinical events [6].
Background and objectives: Health professionals look for specific patterns by correlating multiple physiological data in the process of deciding treatments to remedy clinical abnormalities. Biomedical data exhibit some common patterns in the event of identical clinical illnesses. The primary interest of this work is automatic discovery of such patterns in vital sign data (e.g. heart rate, blood pressure) using unsupervised learning and utilising them to identify patients with similar clinical conditions.
Methods: A patient clustering method is developed that efficiently isolates patients into multiple groups by discovering dynamic patterns in multi-dimensional vital sign data. A dynamic partitioning algorithm and a patient clustering approach are proposed by introducing a measure, namely aggregated instance-wise uncertainty (AIU), computed from multi-dimensional physiological time-series data.
Results: The developed model is evaluated qualitatively using principal component analysis and silhouette value, and quantitatively in terms of its ability to cluster patients associated with different clinical situations. Experiments are conducted using real-world biomedical data of patients having various clinical conditions. The observed accuracy was 82.85% and 91.17% on two experimental datasets comprising data from 35 and 34 patients respectively. The comparisons show that the proposed approach outperformed other state-of-the-art methods.
Conclusions: The experimental outcomes demonstrate the effectiveness of the proposed approach in discovering distinct patterns with predictive significance.
Experimental study on fighters behaviors mining
2011, Expert Systems with Applications
Citation Excerpt:
Zhong combined the online spherical k-means (OSKM) algorithm with the scalable clustering strategy to achieve fast and adaptive clustering of text streams (Zhong, 2005). Laur et al. evaluated a statistical technique which biased the estimation of the Support of patterns to maximize either the precision or the recall (Laur, Nock, Symphor, & Poncelet, 2007). Lin et al. designed some algorithms to reduce the number of distinct query results by grouping a set of different queries into a cluster and to minimize the total query processing costs (Lin et al., 2006).
Effective prediction for fighters behaviors is crucial for air-combats as well as for many other game fields. In this paper, we present three patterns to predict the behaviors of fighters that are the ActionStreams pattern, the Owner_Actions pattern and the Time_Owner_Actions pattern, where: (1) ActionStreams pattern is a coarse granular for describing the fighter’s behaviors with action identifier whereas without distinguishing the time and the executor/owner; (2) Owner_Actions pattern is a finer granular for describing the fighter’s behaviors with the action identifier and the executor whereas without distinguishing the time; and (3) Time_Owner_Actions pattern encapsulates the action identifier, the time, and also the executor. Based on such fighters’ behaviors patterns, we explore the data structures used to store and the satisfied properties used to mine; and further, by designing and implementing the relevant mining/processing algorithms and systems, we have discovered some experience patterns of the fighters’ behaviors and have conducted certain valid predictions for the fighters’ behaviors. We also present the experimental results conducted on the simulation platform of the air to air combats. The results show that our method is effective.
Multivariable stream data classification using motifs and their temporal relations
2009, Information Sciences
Multivariable stream data is becoming increasingly common as diverse types of sensor devices and networks are deployed. Building accurate classification models for such data has attracted a lot of attention from the research community. Most of the previous works, however, relied on features extracted from individual streams, and did not take into account the dependency relations among the features within and across the streams. In this work, we propose new classification models that exploit temporal relations among features. We showed that consideration of such dependencies does significantly improve the classification accuracy. Another benefit of employing temporal relations is the improved interpretability of the resulting classification models, as the set of temporal relations can be easily translated to a rule using a sequence of inter-dependent events characterizing the class. We evaluated the proposed scheme using different classification models including the Naive Bayesian, TFIDF, and vector distance models. We showed that the proposed model can be a useful addition to the set of existing stream classification algorithms.
A lower bound on the sample size needed to perform a significant frequent pattern mining task
2009, Pattern Recognition Letters
During the past few years, the problem of assessing the statistical significance of frequent patterns extracted from a given set S of data has received much attention. Considering that S always consists of a sample drawn from an unknown underlying distribution, two types of risks can arise during a frequent pattern mining process: accepting a false frequent pattern or rejecting a true one. In this context, many approaches presented in the literature assume that the dataset size is an application-dependent parameter. In this case, there is a trade-off between both errors leading to solutions that only control one risk to the detriment of the other one. On the other hand, many sampling-based methods have attempted to determine the optimal size of S ensuring a good approximation of the original (potentially infinite) database from which S is drawn. However, these approaches often resort to Chernoff bounds that do not allow the independent control of the two risks. In this paper, we overcome the mentioned drawbacks by providing a lower bound on the sample size required to control both risks and achieve a significant frequent pattern mining task.
Stock Data Clustering and Multiscale Trend Detection
2012, Methodology and Computing in Applied Probability
Reconfigurable manufacturing services for web-based collaboration systems
2011, 21st International Conference on Production Research: Innovation in Product and Production, ICPR 2011 - Conference Proceedings
