An Efficient K-means Clustering Algorithm on MapReduce

Li, Qiuhong; Wang, Peng; Wang, Wei; Hu, Hao; Li, Zhongsheng; Li, Junxian

doi:10.1007/978-3-319-05810-8_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8421)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

An efficient k-means implementation on MapReduce is crucial in many applications that analyze massive data sets. In this paper, we propose a series of strategies to improve the efficiency of k-means for massive high-dimensional data points on MapReduce. First, we use locality-sensitive hashing (LSH) to map data points into buckets, based on which the original data points are converted into weighted representative points and outlier points. Then we propose an effective center initialization algorithm that yields higher-quality initial centers. Finally, we propose a pruning strategy that speeds up the iteration process by skipping unnecessary distance computations between centers and data points. An extensive empirical study shows that the proposed techniques greatly improve both the efficiency and the accuracy of k-means on MapReduce.
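
The first step of the abstract (hashing points into buckets and replacing each bucket with a weighted representative) can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not say which LSH family is used or how outliers are defined, so the sketch assumes Gaussian random-projection hashing and treats points in sparsely populated buckets as outliers. The names `make_hashes`, `bucket_key`, `summarize`, `width`, and `min_bucket_size` are hypothetical, and no MapReduce framework is involved.

```python
import math
import random
from collections import defaultdict


def make_hashes(dim, num_hashes, width, seed=0):
    """Draw (a, b) pairs: a is a Gaussian vector, b is uniform in [0, width)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(num_hashes)
    ]


def bucket_key(point, hashes, width):
    """Bucket id: the tuple of quantized projections floor((a.x + b) / width)."""
    return tuple(
        math.floor((sum(ai * xi for ai, xi in zip(a, point)) + b) / width)
        for a, b in hashes
    )


def summarize(points, hashes, width, min_bucket_size=2):
    """Return (representatives, outliers).

    Each representative is (centroid, weight) for one bucket.
    ASSUMPTION: buckets with fewer than min_bucket_size points are
    treated as outliers; the source does not specify this rule.
    """
    buckets = defaultdict(list)
    for p in points:
        buckets[bucket_key(p, hashes, width)].append(p)

    representatives, outliers = [], []
    for members in buckets.values():
        if len(members) < min_bucket_size:
            outliers.extend(members)
            continue
        dim = len(members[0])
        centroid = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
        representatives.append((centroid, len(members)))
    return representatives, outliers


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (50.0, 50.0)]
hashes = make_hashes(dim=2, num_hashes=2, width=4.0)
reps, outliers = summarize(points, hashes, width=4.0)
# Every input point is accounted for exactly once.
covered = sum(weight for _, weight in reps) + len(outliers)
```

A weighted k-means iteration could then run over `reps` (using each weight when averaging) plus the raw `outliers`, which is typically far fewer items than the original data set. How closely this matches the paper's conversion step cannot be determined from the abstract alone.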

Author information

Authors and Affiliations

School of Computer Science, Fudan University, Shanghai, China
Qiuhong Li, Peng Wang, Wei Wang, Hao Hu & Junxian Li
Shanghai Key Laboratory of Data Science, Fudan University, China
Qiuhong Li, Peng Wang, Wei Wang, Hao Hu & Junxian Li
Jiangnan Institute of Computing Technology, China
Zhongsheng Li

Editor information

Editors and Affiliations

School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore, Singapore
Sourav S. Bhowmick
Department of Computer Science, Utah State University, Old Main Hill, 4205, 84322-4205, Logan, UT, USA
Curtis E. Dyreson
Department of Computer Science, Aalborg University, Selma Lagerløfs Vej 300, 9220, Aalborg Øst, Denmark
Christian S. Jensen
Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
Mong Li Lee
Department of Computer Science, Udayana University, Jl. Kampus Unud Jimbaran Bali, 80364, Badung, Bali, Indonesia
Agus Muliantara
Information Systems Engineering, Christian-Albrechts-Universität zu Kiel, Olshausenstrasse 40, 24098, Kiel, Germany
Bernhard Thalheim

About this paper

Cite this paper

Li, Q., Wang, P., Wang, W., Hu, H., Li, Z., Li, J. (2014). An Efficient K-means Clustering Algorithm on MapReduce. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8421. Springer, Cham. https://doi.org/10.1007/978-3-319-05810-8_24

DOI: https://doi.org/10.1007/978-3-319-05810-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05809-2
Online ISBN: 978-3-319-05810-8
eBook Packages: Computer Science, Computer Science (R0)
