On Clustering Massive Data Streams: A Summarization Paradigm

Aggarwal, Charu C.; Han, Jiawei; Wang, Jianyong; Yu, Philip S.

doi:10.1007/978-0-387-47534-9_2

Part of the book series: Advances in Database Systems (ADBS, volume 31)

Abstract

In recent years, data streams have become ubiquitous because a large number of applications generate huge volumes of data automatically. Many existing data mining methods cannot be applied directly to data streams because the data must be mined in one pass. Furthermore, data streams show considerable temporal locality, so a direct application of existing methods may lead to misleading results. In this paper, we develop an efficient and effective approach for mining fast-evolving data streams that integrates the micro-clustering technique with the high-level data mining process and also discovers regularities in data evolution. Our analysis and experiments demonstrate that two important data mining problems, namely stream clustering and stream classification, can be performed effectively using this approach, with high-quality mining results. We discuss the use of micro-clustering as a general summarization technology for solving data mining problems on streams. Our discussion illustrates the importance of our approach for a variety of mining problems in the data stream domain.
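As a rough illustration of why micro-clustering suits one-pass summarization, the sketch below keeps only additive statistics (a count, per-dimension sums and sums of squares, and sums of timestamps), so each arriving point updates the summary in place and two summaries can be merged by addition. The class name, field names, and methods are assumptions made for this example and are not taken from the chapter.

```python
from math import sqrt
from typing import List, Sequence


class MicroCluster:
    """Additive summary of the points absorbed so far (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.n = 0                                    # number of points absorbed
        self.linear_sum: List[float] = [0.0] * dim    # per-dimension sum of values
        self.squared_sum: List[float] = [0.0] * dim   # per-dimension sum of squares
        self.time_sum = 0.0                           # sum of timestamps
        self.time_squared_sum = 0.0                   # sum of squared timestamps

    def absorb(self, point: Sequence[float], timestamp: float) -> None:
        """Fold one stream point into the summary; the point is then discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x
        self.time_sum += timestamp
        self.time_squared_sum += timestamp * timestamp

    def merge(self, other: "MicroCluster") -> None:
        """Combine two summaries; every statistic is additive."""
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.squared_sum[i] += other.squared_sum[i]
        self.time_sum += other.time_sum
        self.time_squared_sum += other.time_squared_sum

    def centroid(self) -> List[float]:
        """Mean of the absorbed points, recovered from the sums."""
        return [s / self.n for s in self.linear_sum]

    def rms_deviation(self) -> float:
        """Root of the summed per-dimension variances, E[x^2] - E[x]^2."""
        total = 0.0
        for s, q in zip(self.linear_sum, self.squared_sum):
            mean = s / self.n
            total += max(0.0, q / self.n - mean * mean)  # clamp rounding noise
        return sqrt(total)


if __name__ == "__main__":
    mc = MicroCluster(dim=2)
    mc.absorb([1.0, 2.0], timestamp=1.0)
    mc.absorb([3.0, 4.0], timestamp=2.0)
    print(mc.n, mc.centroid(), mc.rms_deviation())
```

Because the statistics are sums, the memory used per micro-cluster stays constant no matter how many points it has absorbed, which is the property a one-pass setting needs.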

Author information

Authors and Affiliations

IBM, T. J. Watson Research Center, Hawthorne, NY, 10532
Charu C. Aggarwal & Philip S. Yu
University of Illinois at Urbana-Champaign, Urbana, IL
Jiawei Han & Jianyong Wang

Editor information

Editors and Affiliations

IBM, Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532
Charu C. Aggarwal

Cite this chapter

Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2007). On Clustering Massive Data Streams: A Summarization Paradigm. In: Aggarwal, C.C. (eds) Data Streams. Advances in Database Systems, vol 31. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-47534-9_2

DOI: https://doi.org/10.1007/978-0-387-47534-9_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28759-1
Online ISBN: 978-0-387-47534-9
eBook Packages: Computer Science, Computer Science (R0)