
Clustering in very large databases based on distance and density

  • Correspondence
Journal of Computer Science and Technology

Abstract

Clustering in very large databases or data warehouses, which has many applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a demanding task that challenges data mining researchers. Current clustering methods commonly suffer from the following problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., of an R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined through repeated trial and error; 3) they handle clusters of arbitrary shape inefficiently on very large data sets. In this paper, we present a new hybrid clustering algorithm to solve these problems. The algorithm combines distance and density strategies and can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to reduce the time complexity greatly while keeping good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database, comparing this method with other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and that its speedup grows as the data size scales up.
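To make the density side of the abstract concrete, the following is a minimal Python sketch of a generic DBSCAN-style procedure, one of the baselines named above. It is not the hybrid algorithm of this paper, whose details are not given in the abstract. The names `dbscan`, `region_query`, `eps` and `min_pts` are illustrative assumptions, and the brute-force neighbor search is chosen for brevity rather than efficiency.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Illustrative sketch only: every neighbor query scans all points,
    so the whole procedure is quadratic in the number of points.
    """
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels
```

Because clusters grow through chains of dense neighborhoods rather than around a centroid, a procedure of this kind can follow arbitrary shapes and leaves isolated points labeled as noise. It also shows why a full scan per query becomes costly on very large data sets, which is the cost the abstract refers to.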


References

  1. Sheikholeslami G et al. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 24th Int. Conf. Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.428–438.

  2. Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, Jagadish H V, Mumick I S (eds.), Quebec: ACM Press, 1996, pp.103–114.

  3. Guha S et al. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.73–84.

  4. Kaufman L et al. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.

  5. Ng R T, Han J. Efficient and effective clustering methods for spatial data mining. In Proc. the 20th Int. Conf. Very Large Data Bases (VLDB'94), Bocca J B, Jarke M, Zaniolo C (eds.), Santiago de Chile, Chile: Morgan Kaufmann, 1994, pp.144–155.

  6. Jain Anil K. Algorithms for Clustering Data. Prentice Hall, 1988.

  7. Ester M et al. A density-based algorithm for discovering clusters in large spatial databases with noises. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Simoudis E, Han J, Fayyad U M (eds.), AAAI Press, 1996, pp.226–231.

  8. Ankerst M et al. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM SIGMOD International Conference on Management of Data, Delis A, Faloutsos C, Ghandeharizadeh S (eds.), Philadelphia: ACM Press, 1999, pp.49–60.

  9. Agrawal R, Gehrke J, Gunopulos D et al. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.94–105.

  10. Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In Proc. 23rd International Conference on Very Large Data Bases, Jarke M, Carey M J, Dittrich M A, Lochovsky F H, Loucopoulos P, Jeusfeld M A (eds.), Athens, Greece: Morgan Kaufmann, 1997, pp.186–195.

  11. Gibson D, Kleinberg J M, Raghavan P. Clustering categorical data: An approach based on dynamical systems. In Proc. 24th International Conference on Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.311–322.

  12. Boley D, Gini M, Gross R et al. Partitioning-based clustering for web document categorization. Decision Support System Journal, 1999, 27(3): 329–341.

Author information

Corresponding author

Correspondence to Qian Weining.

Additional information

This work is supported by the National Grand Fundamental Research ‘973’ Program of China under Grant No.C1998030414 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038. The first author is partially supported by a Microsoft Research Fellowship.

QIAN WeiNing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include clustering, data mining and Web mining.

GONG XueQing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include Web data management, data mining and data management over P2P systems.

ZHOU AoYing received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science and Engineering, Fudan University. His main research interests include Web/XML data management, data mining and streaming data analysis, and peer-to-peer computing systems and their applications.


About this article

Cite this article

Qian, W., Gong, X. & Zhou, A. Clustering in very large databases based on distance and density. J. Comput. Sci. & Technol. 18, 67–76 (2003). https://doi.org/10.1007/BF02946652


  • DOI: https://doi.org/10.1007/BF02946652
