Clustering in very large databases based on distance and density

Qian, Weining; Gong, XueQing; Zhou, AoYing

doi:10.1007/BF02946652

Correspondence
Published: January 2003

Volume 18, pages 67–76 (2003)
Journal of Computer Science and Technology

Abstract

Clustering in very large databases or data warehouses, with applications in areas such as spatial computation, web information collection, pattern recognition, and economic analysis, is a demanding task that challenges data mining researchers. Current clustering methods commonly suffer from three problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined by repeated trial and error; 3) they lack efficiency in handling arbitrarily shaped clusters in very large data sets. In this paper, we first present a new hybrid clustering algorithm to solve these problems. The algorithm combines distance and density strategies and can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to reduce the time complexity greatly while keeping good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation on a spatial database compares this method with other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and achieves even greater speedup as the data size scales up.
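
The abstract does not describe the hybrid algorithm in detail, so the following is not the authors' method. It is only a minimal sketch of the density strategy, written as a plain DBSCAN-style procedure (one of the baselines named above), to illustrate how a density-based approach can find clusters without a pre-specified k and can mark outliers as noise. The names `dbscan`, `region_query`, `eps`, and `min_pts`, the brute-force neighbor search, and the sample points are all assumptions made for illustration.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or NOISE.

    Brute-force neighbor search: O(n^2) distance computations, so this is
    only meant for small illustrative inputs.
    """
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point, keep expanding
        cluster_id += 1
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this sketch the number of clusters falls out of the density parameters rather than being supplied up front, and the isolated point receives the `NOISE` label. The quadratic neighbor search is the kind of cost that matters at very large data sizes.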

References

Sheikholeslami G et al. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 24th Int. Conf. Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.428–438.
Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, Jagadish H V, Mumick I S (eds.), Quebec: ACM Press, 1996, pp.103–114.
Guha S et al. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.73–84.
Kaufman L et al. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
Ng R T, Han J. Efficient and effective clustering methods for spatial data mining. In Proc. the 20th Int. Conf. Very Large Data Bases (VLDB'94), Bocca J B, Jarke M, Zaniolo C (eds.), Santiago de Chile, Chile: Morgan Kaufmann, 1994, pp.144–155.
Jain Anil K. Algorithms for Clustering Data. Prentice Hall, 1988.
Ester M et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Simoudis E, Han J, Fayyad U M (eds.), AAAI Press, 1996, pp.226–231.
Ankerst M et al. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM SIGMOD International Conference on Management of Data, Delis A, Faloutsos C, Ghandeharizadeh S (eds.), Philadelphia: ACM Press, 1999, pp.49–60.
Agrawal R, Gehrke J, Gunopulos D et al. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.94–105.
Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In Proc. 23rd International Conference on Very Large Data Bases, Jarke M, Carey M J, Dittrich K R, Lochovsky F H, Loucopoulos P, Jeusfeld M A (eds.), Athens, Greece: Morgan Kaufmann, 1997, pp.186–195.
Gibson D, Kleinberg J M, Raghavan P. Clustering categorical data: An approach based on dynamical systems. In Proc. 24th International Conference on Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.311–322.
Boley D, Gini M, Gross R et al. Partitioning-based clustering for web document categorization. Decision Support System Journal, 1999, 27(3): 329–341.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Laboratory for Intelligent Information Processing, Fudan University, 200433, Shanghai, P.R. China
Qian Weining, Gong XueQing & Zhou AoYing

Corresponding author

Correspondence to Qian Weining.

Additional information

This work is supported by the National Grand Fundamental Research ‘973’ Program of China under Grant No.C1998030414; the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038. The first author is partially supported by Microsoft Research Fellowship.

QIAN WeiNing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include clustering, data mining, and Web mining.

GONG XueQing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is databases and knowledge bases. His research interests include Web data management, data mining, and data management over P2P systems.

ZHOU AoYing received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science and Engineering, Fudan University. His main research interests include Web/XML data management, data mining and streaming data analysis, and Peer-to-Peer computing systems and their application.

About this article

Cite this article

Qian, W., Gong, X. & Zhou, A. Clustering in very large databases based on distance and density. J. Comput. Sci. & Technol. 18, 67–76 (2003). https://doi.org/10.1007/BF02946652

Received: 04 January 2001
Revised: 30 October 2002
Issue Date: January 2003
DOI: https://doi.org/10.1007/BF02946652
