Abstract
This paper presents a single-scan algorithm for clustering large datasets, based on a two-phase process that combines two well-known clustering methods. The Cobweb algorithm is modified to produce a balanced tree with subclusters at the leaves, and K-means is then applied to the resulting subclusters. The resulting method, Scalable Cobweb, is compared to a single-pass K-means algorithm and to standard K-means. The evaluation considers error, measured as the sum of squared error, and sensitivity to the order in which data points are processed.
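The second phase and the error measure can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how subclusters are summarised, so the code assumes each leaf subcluster is reduced to a centroid and a point count, and it leaves out the modified Cobweb phase entirely. The names `weighted_kmeans` and `sse`, and the choice of the first k centroids as initial centres, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(centroids: List[Point], weights: List[float],
                    k: int, iters: int = 50) -> List[Point]:
    """K-means over subcluster summaries (assumed: centroid + point count).

    Each subcluster centroid is treated as one point whose influence on
    the cluster mean is proportional to the number of points it summarises.
    """
    # Assumption: deterministic initialisation from the first k centroids.
    centers = [tuple(c) for c in centroids[:k]]
    for _ in range(iters):
        # Assignment step: nearest centre for every subcluster centroid.
        assign = [min(range(k), key=lambda j: sq_dist(c, centers[j]))
                  for c in centroids]
        new_centers = []
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                # Keep an empty cluster's centre where it was.
                new_centers.append(centers[j])
                continue
            dim = len(centers[j])
            # Update step: weighted mean of the member centroids.
            new_centers.append(tuple(
                sum(weights[i] * centroids[i][d] for i in members) / total
                for d in range(dim)))
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


def sse(points: List[Point], centers: List[Point]) -> float:
    """Sum of squared error: squared distance of each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    subclusters = [(0, 0), (10, 10), (0, 1), (10, 11)]
    counts = [3, 1, 1, 3]
    print(weighted_kmeans(subclusters, counts, k=2))
```

Weighting by point count keeps the final centres close to what K-means would find on the raw data, while the second phase only ever touches the (much smaller) set of subcluster summaries.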
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, M., Holmes, G., Pfahringer, B. (2004). Clustering Large Datasets Using Cobweb and K-Means in Tandem. In: Webb, G.I., Yu, X. (eds) AI 2004: Advances in Artificial Intelligence. AI 2004. Lecture Notes in Computer Science, vol 3339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30549-1_33
DOI: https://doi.org/10.1007/978-3-540-30549-1_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24059-4
Online ISBN: 978-3-540-30549-1
eBook Packages: Computer Science, Computer Science (R0)