Abstract
Clustering in very large databases or data warehouses, with applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a demanding task for data mining research. Current clustering methods commonly suffer from three problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., of an R*-tree); 2) the parameter k must be specified in advance even though it is uncertain, so the clustering can only be refined by repeated trial and error; 3) they handle clusters of arbitrary shape inefficiently on very large data sets. In this paper, we present a new hybrid clustering algorithm to solve these problems. The algorithm combines distance and density strategies and can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to greatly reduce the time complexity while keeping good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database with this method and other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and that its speedup grows as the data size scales up.
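The abstract does not spell out the algorithm itself, so the following is only an illustrative sketch of how a distance phase and a density phase might be combined. It is not the authors' method. The function names `micro_clusters` and `merge_dense` and the parameters `radius`, `link_dist` and `min_count` are hypothetical, and 2-D points are assumed. The distance phase scans the data once and keeps only summary statistics (count and coordinate sums) per group; the density phase treats sparse groups as noise and links dense groups whose centroids lie close together, so no cluster count k is given in advance.

```python
import math


def micro_clusters(points, radius):
    """Distance phase (single scan): each point joins the first summary whose
    centroid is within `radius`, otherwise it starts a new summary.
    Each summary is [count, sum_x, sum_y]."""
    summaries = []
    for x, y in points:
        for s in summaries:
            cx, cy = s[1] / s[0], s[2] / s[0]
            if math.hypot(x - cx, y - cy) <= radius:
                s[0] += 1
                s[1] += x
                s[2] += y
                break
        else:
            summaries.append([1, x, y])
    return summaries


def merge_dense(summaries, link_dist, min_count):
    """Density phase: summaries with fewer than `min_count` points are returned
    as noise; the remaining ones are linked (union-find) whenever their
    centroids are within `link_dist`, which lets chains of small groups
    trace out clusters of arbitrary shape."""
    dense = [s for s in summaries if s[0] >= min_count]
    noise = [s for s in summaries if s[0] < min_count]
    parent = list(range(len(dense)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    centroids = [(s[1] / s[0], s[2] / s[0]) for s in dense]
    for i in range(len(dense)):
        for j in range(i + 1, len(dense)):
            dx = centroids[i][0] - centroids[j][0]
            dy = centroids[i][1] - centroids[j][1]
            if math.hypot(dx, dy) <= link_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(dense):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values()), noise


if __name__ == "__main__":
    pts = [(0, 0), (0.1, 0), (0, 0.1), (1, 0), (1.1, 0), (1, 0.1),
           (10, 10), (10.1, 10), (10, 10.1), (50, 50)]
    found, outliers = merge_dense(micro_clusters(pts, 0.5), 1.5, 2)
    print(len(found), "clusters,", len(outliers), "noise summaries")
```

Because only the summaries are kept after the scan, the pairwise work in the second phase depends on the number of summaries rather than the number of points, which is one plausible reading of how statistical information can reduce cost on large data sets.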
Additional information
This work is supported by the National Grand Fundamental Research ‘973’ Program of China under Grant No. C1998030414 and by the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 99038. The first author is partially supported by a Microsoft Research Fellowship.
QIAN WeiNing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include clustering, data mining and Web mining.
GONG XueQing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include Web data management, data mining and data management over P2P systems.
ZHOU AoYing received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science and Engineering, Fudan University. His main research interests include Web/XML data management, data mining and streaming data analysis, and Peer-to-Peer computing systems and their application.
About this article
Cite this article
Qian, W., Gong, X. & Zhou, A. Clustering in very large databases based on distance and density. J. Comput. Sci. & Technol. 18, 67–76 (2003). https://doi.org/10.1007/BF02946652
DOI: https://doi.org/10.1007/BF02946652