
Efficient Computation of k-Medians over Data Streams Under Memory Constraints

  • Database and Knowledge-Based Systems
Journal of Computer Science and Technology

Abstract

In this paper, we study the problem of efficiently computing k-medians over high-dimensional, high-speed data streams. Our focus is on minimizing CPU time so that high-speed streams can be handled, in addition to meeting the requirements of high accuracy and small memory. Our work is motivated by the following observation: existing algorithms exhibit similar approximation behavior in practice, even though they offer noticeably different worst-case theoretical guarantees. The underlying reason is that, to achieve a high approximation level with the smallest possible memory, they rely on rather complex techniques to maintain a sketch along the time dimension, built with existing off-line clustering algorithms. Those clustering algorithms cannot guarantee an optimal clustering of each data segment in the stream, and their errors accumulate across segments, so most algorithms reach much the same approximation level in practice. We propose a new grid-based approach that divides the entire data set into cells (rather than along the time dimension). It achieves a high approximation level based on a novel concept called (1−ε)-dominant. We further extend the method to the data stream context by leveraging a density-based heuristic and frequent-item mining techniques over data streams. An existing clustering algorithm needs to be applied only once, on demand, to compute the k-medians, which reduces CPU time significantly. We conducted extensive experimental studies, which show that our approaches outperform other well-known approaches.
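
As a rough illustration of the grid idea described in the abstract, the Python sketch below summarizes a stream into grid cells in one pass and then clusters the weighted cell representatives once, on demand. It is not the authors' algorithm: the (1−ε)-dominant concept, the density-based heuristic, and the frequent-item mining step are not modeled, and a brute-force search stands in for "an existing clustering algorithm". The names `summarize`, `k_medians_on_demand`, and `cell_width` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations
import math


def summarize(stream, cell_width):
    """One pass over the stream: keep a count and coordinate sums per grid cell.

    Assumption: cells are axis-aligned hypercubes of side `cell_width`.
    """
    cells = defaultdict(lambda: [0, None])  # cell key -> [count, coordinate sums]
    for p in stream:
        key = tuple(math.floor(x / cell_width) for x in p)
        entry = cells[key]
        entry[0] += 1
        if entry[1] is None:
            entry[1] = list(p)
        else:
            entry[1] = [s + x for s, x in zip(entry[1], p)]
    # Each cell is represented by the mean of its points, weighted by its count.
    return [(tuple(s / n for s in sums), n) for n, sums in cells.values()]


def k_medians_on_demand(summary, k):
    """Pick k representatives minimizing the weighted sum of distances.

    Brute force over all k-subsets: a stand-in for any off-line clustering
    algorithm, practical only for a small number of cells.
    """
    reps = [rep for rep, _ in summary]

    def cost(medians):
        return sum(w * min(math.dist(rep, m) for m in medians)
                   for rep, w in summary)

    return min(combinations(reps, k), key=cost)


if __name__ == "__main__":
    points = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (9.1, 9.2), (9.3, 9.1)]
    summary = summarize(points, cell_width=1.0)
    print(k_medians_on_demand(summary, k=2))
```

The point of the design is that per-point work is limited to a cell lookup and an update, while the comparatively expensive clustering step runs only when medians are requested.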



Author information

Corresponding author

Correspondence to Zhi-Hong Chong.

Additional information

This work is supported by the National High Technology Development 863 Program of China under Grant No. 2002AA413310 and by ARC Discovery Grants, Australia, under Grant Nos. DP0346004 and DP0345710.

Zhi-Hong Chong received his B.S. degree in computer science from Nanjing Meteorological Institute and his M.S. degree in economics from the Institute of Fiscal Studies, Ministry of Finance, P.R. China, in 1991 and 1999, respectively. He is currently a Ph.D. candidate at Fudan University, China. His research interests cover data mining and data streams. He has published several research papers in these areas in major international conferences and reputable journals.

Jeffrey Xu Yu received his B.E., M.E. and Ph.D. degrees in computer science from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively. He was a research fellow (Apr. 1990–Mar. 1991) and a faculty member (Apr. 1991–July 1992) in the Institute of Information Sciences and Electronics, University of Tsukuba. From July 1992 to June 2000, he was a lecturer in the Department of Computer Science, The Australian National University. Currently, he is an associate professor in the Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong. He is a member of the ACM and of the IEEE Computer Society. His interests cover XML databases, data mining, data warehousing and on-line analytical processing, web technology, query processing and query optimization, bioinformatics, wireless information systems, and the design and implementation of database management systems.

Zhen-Jie Zhang received his B.S. degree from the Department of Computer Science and Engineering, Fudan University, in 2004. Currently he is a Ph.D. candidate in the School of Computing, National University of Singapore. His current research interests include skyline queries, clustering, and ranking in databases.

Xue-Min Lin is an associate professor (reader) in the School of Computer Science and Engineering, the University of New South Wales. Currently, he is the head of the database research group. Dr. Lin received his Ph.D. degree in computer science from the University of Queensland (Australia) in 1992 and his B.Sc. degree in applied math from Fudan University (China) in 1984. During 1984–1988, he studied for a Ph.D. in applied math at Fudan University. Dr. Lin’s principal research areas cover databases and graph visualisation.

Wei Wang is currently a lecturer in the School of Computer Science and Engineering, The University of New South Wales, Australia. His current research interests include query processing and optimization for XML, integration of database and information retrieval technologies, data warehousing and OLAP, approximate query processing, and data mining. He has published over twenty research papers in these areas in major international conferences. He received his Ph.D. degree in computer science from The Hong Kong University of Science and Technology, Hong Kong, China, in 2004, and his B.Eng. degree in computer science and engineering from Shanghai Jiao Tong University, Shanghai, China, in 1999.

Ao-Ying Zhou received his B.Sc. and M.Sc. degrees from the Computer Department of Sichuan University in 1985 and 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He was the vice director of the Computer Science Department of Fudan University from 1996 to 1999 and its director from 1999 to 2002. Prof. Zhou is engaged in research on data mining and business intelligence, web data management, and peer-to-peer computing.


About this article

Cite this article

Chong, ZH., Yu, J.X., Zhang, ZJ. et al. Efficient Computation of k-Medians over Data Streams Under Memory Constraints. J Comput Sci Technol 21, 284–296 (2006). https://doi.org/10.1007/s11390-006-0284-5

  • DOI: https://doi.org/10.1007/s11390-006-0284-5
