A segment-based framework for modeling and mining data streams

Aggarwal, Charu C.

doi:10.1007/s10115-010-0366-0

Regular Paper
Published: 23 November 2010

Volume 30, pages 1–29, (2012)

Knowledge and Information Systems

Abstract

Data streams have become ubiquitous in recent years because advances in hardware technology have enabled the automated recording of large amounts of data. The primary constraint in the effective mining of streams is the large volume of data that must be processed in real time. In many cases, it is desirable to store a summary of the data stream segments in order to perform data mining tasks. Since density estimation provides a comprehensive overview of the probabilistic data distribution of a stream segment, it is a natural choice for this purpose. A direct use of density distributions can, however, turn out to be an inefficient storage and processing mechanism in practice. In this paper, we introduce the concept of cluster histograms, which provide an efficient way to estimate and summarize the most important data distribution profiles over different stream segments. These profiles can be constructed in a supervised or unsupervised way, depending on the nature of the underlying application. The profiles can also be used for change detection, anomaly detection, segmental nearest neighbor search, or supervised stream segment classification. Furthermore, these techniques can be used for modeling other kinds of data, such as text and categorical data. The flexibility of the tasks that can be performed with the cluster histogram framework follows from its generality in storing the historical density profile of the data stream. As a result, this method provides a holistic framework for density-based mining of data streams. We discuss and test the application of the cluster histogram framework to a variety of interesting data mining applications.
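
The abstract does not define cluster histograms precisely, so the following is only a minimal sketch of one plausible reading: each stream segment is summarized by the fraction of its points that fall nearest to each of a fixed set of cluster centroids, and two segments are compared by the L1 distance between their summaries. The function names, the fixed centroids, the nearest-centroid assignment, and the L1 comparison are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def cluster_histogram(segment, centroids):
    """Summarize a segment as the fraction of its points per centroid.

    Assumption: this relative-frequency vector stands in for the
    segment's density profile; the paper may define it differently.
    """
    if not segment:
        raise ValueError("segment must contain at least one point")
    counts = Counter(nearest(p, centroids) for p in segment)
    return [counts[i] / len(segment) for i in range(len(centroids))]


def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Hypothetical centroids and two small stream segments.
centroids = [(0.0, 0.0), (10.0, 10.0)]
seg_a = [(0.1, 0.2), (0.3, -0.1), (9.8, 10.1), (0.0, 0.4)]
seg_b = [(10.2, 9.9), (9.7, 10.3), (0.2, 0.1), (10.0, 10.0)]

hist_a = cluster_histogram(seg_a, centroids)
hist_b = cluster_histogram(seg_b, centroids)

# A large distance between consecutive segments hints at a change
# in the underlying distribution.
change_score = l1_distance(hist_a, hist_b)
```

Under these assumptions, the same distance could serve as the similarity measure for segmental nearest neighbor search, and a segment whose histogram is far from all stored histograms could be flagged as anomalous.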

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532, USA
Charu C. Aggarwal

Corresponding author

Correspondence to Charu C. Aggarwal.

About this article

Cite this article

Aggarwal, C.C. A segment-based framework for modeling and mining data streams. Knowl Inf Syst 30, 1–29 (2012). https://doi.org/10.1007/s10115-010-0366-0

Received: 03 February 2010
Revised: 06 July 2010
Accepted: 04 November 2010
Published: 23 November 2010
Issue Date: January 2012
DOI: https://doi.org/10.1007/s10115-010-0366-0
