Fast Scalable k-means++ Algorithm with MapReduce

Xu, Yujie; Qu, Wenyu; Li, Zhiyang; Ji, Changqing; Li, Yuanyuan; Wu, Yinan

doi:10.1007/978-3-319-11194-0_2

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

K-means++ is one of the most important initialization algorithms for k-means owing to its provable approximation guarantee relative to the optimal solution. However, because it is inherently sequential, k-means++ requires a large number of iterations to complete the initialization and becomes inefficient as the size of the data increases. Scalable k-means++ drastically reduces the number of iterations and can be easily applied to MapReduce systems, but it still requires two MapReduce jobs in each round, which incurs a high I/O cost and is time-consuming. In this paper, we propose the Oversampling and Refining (OnR) method, which improves the efficiency of scalable k-means++ by using only one MapReduce job to obtain Ω(k) centers in each round. In addition to the oversampling factor ℓ of scalable k-means++, OnR uses a second oversampling factor o to further increase the number of chosen centers. Oversampling is executed in the Mapper phase; in the Reducer phase, a single Reducer removes the surplus centers introduced by o and outputs a set of centers that is the same as the output of scalable k-means++. To reduce the expensive network cost caused by an overly large o, OnR estimates the global cost from the local clustering cost and uses this estimate to discard some wrongly sampled points in the Mapper phase. Extensive experiments on real data indicate that OnR outperforms scalable k-means++ in both I/O cost and running time.
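The abstract does not state OnR's exact sampling rule, so the following Python sketch is only an illustration of the oversample-then-refine idea under stated assumptions, not the authors' implementation. It assumes the standard scalable k-means++ rule, in which each point x is sampled independently with probability ℓ·d²(x, C)/φ, where φ is the global clustering cost. The names `mapper`, `reducer`, and `two_pass_reference`, the estimate "local cost times number of splits", and the reuse of one uniform draw per point are all hypothetical choices made for this example.

```python
import random

def sq_dist(x, centers):
    """Squared Euclidean distance from x to its nearest center."""
    return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)

def mapper(split, centers, ell, o, num_splits, rng):
    """Oversample one input split; returns (local_cost, candidates)."""
    d2 = [sq_dist(x, centers) for x in split]
    local_cost = sum(d2)
    # Assumption: estimate the global cost by scaling the local cost.
    est_global = local_cost * num_splits
    candidates = []
    for x, d in zip(split, d2):
        u = rng.random()  # one uniform draw per point, reused by the reducer
        # Inflate the sampling probability by the extra factor o.
        if d > 0 and u < o * ell * d / est_global:
            candidates.append((x, d, u))
    return local_cost, candidates

def reducer(mapper_outputs, ell):
    """Refine: keep only candidates that pass the test with the exact global cost."""
    global_cost = sum(cost for cost, _ in mapper_outputs)
    return [x for _, cands in mapper_outputs
            for x, d, u in cands if u < ell * d / global_cost]

def two_pass_reference(splits, centers, ell, rng):
    """Baseline: compute the global cost first, then sample in a second pass."""
    points = [x for s in splits for x in s]
    d2 = [sq_dist(x, centers) for x in points]
    global_cost = sum(d2)
    return [x for x, d in zip(points, d2) if rng.random() < ell * d / global_cost]

# Toy data: two splits of 2-D points and one initial center.
splits = [[(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)],
          [(9.0, 9.0), (10.0, 10.0), (0.5, 0.5)]]
centers = [(0.0, 0.0)]
ell, o = 2, 8

rng = random.Random(0)
outputs = [mapper(s, centers, ell, o, len(splits), rng) for s in splits]
refined = reducer(outputs, ell)
reference = two_pass_reference(splits, centers, ell, random.Random(0))
print(refined == reference)
```

In this sketch the single-pass result matches the two-pass baseline only when the inflated Mapper probability is at least the true probability for every point, that is, when o is no smaller than the ratio of the true global cost to the Mapper's estimate. A larger o makes that condition easier to satisfy but sends more candidates over the network, which is the trade-off the abstract describes.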

Author information

Authors and Affiliations

School of Information Science and Technology, Dalian Maritime University, Dalian, China, 116026
Yujie Xu, Wenyu Qu, Zhiyang Li, Changqing Ji & Yuanyuan Li
School of Physical Science and Technology, Dalian University, Dalian, China, 116622
Changqing Ji
School of Software, Dalian Jiaotong University, Dalian, China, 116028
Yuanyuan Li
Department of Equipment, Unit 91550 of PLA, Dalian, China, 116023
Yinan Wu

Editor information

Editors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 60616-3793, Chicago, IL, USA
Xian-he Sun
School of Computer Science and Technology, Dalian Maritime University, 1 Linghai Road, 116026, Dalian, China
Wenyu Qu
SEECS, University of Ottawa, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Wanlei Zhou
Dalian Maritime University, No. 1 Linhai Road, Dalian, 116026, China
Zhiyang Li
BeiHang University, XueYuan Road No.37, HaiDian District, Beijing, China
Hua Guo
University of Bradford, BD7 1DP, Bradford, West Yorkshire, United Kingdom
Geyong Min
Dalian Maritime University, No. 1 Linhai Road, Dalian, China, 116026
Tingting Yang
Computer Network Information Center, Chinese Academy of Sciences, 100190, Beijing, China
Yulei Wu
Shandong University, 27 Shanda Nanlu, 250100, Jinan City, Shandong Province, China
Lei Liu

Cite this paper

Xu, Y., Qu, W., Li, Z., Ji, C., Li, Y., Wu, Y. (2014). Fast Scalable k-means++ Algorithm with MapReduce. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_2

DOI: https://doi.org/10.1007/978-3-319-11194-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)
