
On High Dimensional Projected Clustering of Data Streams

Data Mining and Knowledge Discovery

Abstract

The data stream problem has been studied extensively in recent years because stream data has become so easy to collect. The nature of stream data makes it essential to use algorithms that require only one pass over the data. Recently, single-scan stream analysis methods have been proposed in this context. However, much stream data is high-dimensional in nature, and high-dimensional data is inherently more complex to cluster, classify, and search by similarity. Recent research discusses methods for projected clustering over high-dimensional data sets. These methods are, however, difficult to generalize to data streams because of their complexity and the large volume of the data streams.

In this paper, we propose a new, high-dimensional, projected data stream clustering method, called HPStream. The method incorporates a fading cluster structure and a projection-based clustering methodology. It is incrementally updatable and highly scalable in both the number of dimensions and the size of the data streams, and it achieves better clustering quality than previous stream clustering methods. Our performance study with both real and synthetic data sets demonstrates the efficiency and effectiveness of our proposed framework and implementation methods.
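
The abstract does not spell out how a fading cluster structure or the choice of projected dimensions works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an exponential fading function of the form 2^(-lambda * dt), per-dimension weighted sums as the cluster summary, and a simplified rule that keeps the dimensions with the smallest spread within a single cluster. The class name `FadingCluster` and all of its fields and methods are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FadingCluster:
    """Illustrative fading cluster summary (hypothetical names, not from the paper)."""
    dims: int
    decay_rate: float          # assumed lambda in the fading factor 2 ** (-lambda * dt)
    weight: float = 0.0        # faded number of points absorbed so far
    last_update: float = 0.0   # timestamp of the most recent update
    linear_sum: List[float] = field(default_factory=list)
    square_sum: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.linear_sum = [0.0] * self.dims
        self.square_sum = [0.0] * self.dims

    def _fade_to(self, now: float) -> None:
        # Scale every statistic by the fading factor for the elapsed time.
        # Timestamps are assumed to be non-decreasing.
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [v * factor for v in self.linear_sum]
        self.square_sum = [v * factor for v in self.square_sum]
        self.last_update = now

    def add(self, point: Sequence[float], now: float) -> None:
        # Incremental update: fade the old statistics, then absorb the new point.
        self._fade_to(now)
        self.weight += 1.0
        for j, x in enumerate(point):
            self.linear_sum[j] += x
            self.square_sum[j] += x * x

    def centroid(self) -> List[float]:
        return [s / self.weight for s in self.linear_sum]

    def radii(self) -> List[float]:
        # Faded standard deviation along each dimension.
        out = []
        for s1, s2 in zip(self.linear_sum, self.square_sum):
            mean = s1 / self.weight
            out.append(math.sqrt(max(s2 / self.weight - mean * mean, 0.0)))
        return out

    def projected_dims(self, k: int) -> List[int]:
        # Simplified assumption: keep the k dimensions with the smallest spread.
        r = self.radii()
        return sorted(sorted(range(self.dims), key=lambda j: r[j])[:k])


cluster = FadingCluster(dims=3, decay_rate=1.0)
cluster.add([1.0, 10.0, 5.0], now=0.0)
cluster.add([1.0, 0.0, 5.5], now=1.0)
print(cluster.projected_dims(2))  # [0, 2]
```

Because every statistic is additive and fades by the same factor, each arriving point costs time linear in the number of dimensions and the stream never has to be revisited, which is the property a one-pass method needs.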



Author information

Corresponding author

Correspondence to Charu C. Aggarwal.

Additional information

Charu C. Aggarwal received his B.Tech. degree in Computer Science from the Indian Institute of Technology (1993) and his Ph.D. degree in Operations Research from the Massachusetts Institute of Technology (1996). He has been a Research Staff Member at the IBM T. J. Watson Research Center since June 1996. He has applied for or been granted over 50 US patents, and has published over 75 papers in numerous international conferences and journals. He has twice been designated Master Inventor at IBM Research in 2000 and 2003 for the commercial value of his patents. His contributions to the Epispire project on real time attack detection were awarded the IBM Corporate Award for Environmental Excellence in 2003. He has been a program chair of the DMKD 2003, chair for all workshops organized in conjunction with ACM KDD 2003, and is also an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal. His current research interests include algorithms, data mining, privacy, and information retrieval.

Jiawei Han is a Professor in the Department of Computer Science at the University of Illinois at Urbana–Champaign. He has been working on research into data mining, data warehousing, stream and RFID data mining, spatiotemporal and multimedia data mining, biological data mining, social network analysis, text and Web mining, and software bug mining, with over 300 conference and journal publications. He has chaired or served on many program committees of international conferences and workshops, including ACM SIGKDD Conferences (2001 best paper award chair, 1996 PC co-chair), SIAM-Data Mining Conferences (2001 and 2002 PC co-chair), ACM SIGMOD Conferences (2000 exhibit program chair), International Conferences on Data Engineering (2004 and 2002 PC vice-chair), and International Conferences on Data Mining (2005 PC co-chair). He has served or is serving on the editorial boards for Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Computer Science and Technology, and Journal of Intelligent Information Systems. He is currently serving on the Board of Directors for the Executive Committee of ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Jiawei has received three IBM Faculty Awards, the Outstanding Contribution Award at the 2002 International Conference on Data Mining, ACM Service Award (1999) and ACM SIGKDD Innovation Award (2004). He is an ACM Fellow (since 2003). He is the first author of the textbook “Data Mining: Concepts and Techniques” (Morgan Kaufmann, 2001).

Jianyong Wang received the Ph.D. degree in computer science in 1999 from the Institute of Computing Technology, the Chinese Academy of Sciences. Since then, he has worked as an assistant professor in the Department of Computer Science and Technology, Peking (Beijing) University in the areas of distributed systems and Web search engines (May 1999–May 2001), and visited the School of Computing Science at Simon Fraser University (June 2001–December 2001), the Department of Computer Science at the University of Illinois at Urbana-Champaign (December 2001–July 2003), and the Digital Technology Center and Department of Computer Science and Engineering at the University of Minnesota (July 2003–November 2004), mainly working in the area of data mining. He is currently an associate professor in the Department of Computer Science and Technology, Tsinghua University, Beijing, China.

Philip S. Yu is the manager of the Software Tools and Techniques group at the IBM Thomas J. Watson Research Center. The current focuses of the project include the development of advanced algorithms and optimization techniques for data mining, anomaly detection and personalization, and the enabling of Web technologies to facilitate E-commerce and pervasive computing. Dr. Yu's research interests include data mining, Internet applications and technologies, database systems, multimedia systems, parallel and distributed processing, disk arrays, computer architecture, performance modeling and workload analysis. Dr. Yu has published more than 340 papers in refereed journals and conferences. He holds or has applied for more than 200 US patents. Dr. Yu is an IBM Master Inventor.

Dr. Yu is a Fellow of the ACM and a Fellow of the IEEE. He will become the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering in Jan. 2001. He is an associate editor of ACM Transactions on Internet Technology and of the Knowledge and Information Systems journal. He is a member of the IEEE Data Engineering steering committee. He also serves on the steering committee of the IEEE Intl. Conference on Data Mining. He received an IEEE Region 1 Award for “promoting and perpetuating numerous new electrical engineering concepts”. Philip S. Yu received the B.S. degree in E.E. from National Taiwan University, Taipei, Taiwan, the M.S. and Ph.D. degrees in E.E. from Stanford University, and the M.B.A. degree from New York University.


About this article

Cite this article

Aggarwal, C.C., Han, J., Wang, J. et al. On High Dimensional Projected Clustering of Data Streams. Data Min Knowl Disc 10, 251–273 (2005). https://doi.org/10.1007/s10618-005-0645-7


  • DOI: https://doi.org/10.1007/s10618-005-0645-7
