Abstract
K-means++ is one of the most important initialization algorithms for k-means owing to its provable approximation guarantee with respect to the optimal solution. However, because it is sequential, k-means++ needs a large number of iterations to complete the initialization, and it becomes inefficient as the size of the data increases. Scalable k-means++ drastically reduces the number of iterations and can easily be applied to MapReduce systems, but it still requires two MapReduce jobs in each round, which incurs a high I/O cost and is time-consuming. In this paper, we propose the Oversampling and Refining (OnR) method, which improves the efficiency of scalable k-means++ by using only one MapReduce job to obtain Ω(k) centers in each round. In addition to the oversampling factor ℓ of scalable k-means++, OnR uses a second oversampling factor o to further increase the number of chosen centers. Oversampling is executed in the Mapper phase; in the Reducer phase, a single Reducer removes the extra centers introduced by o and outputs a set of centers identical to the output of scalable k-means++. To reduce the expensive network cost caused by an overly large o, OnR estimates the global cost from the local clustering cost and uses the estimate to discard some of the wrongly chosen points in the Mapper phase. Extensive experiments on real data indicate that OnR outperforms scalable k-means++ in terms of I/O cost and running time.
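The following is a minimal, single-process sketch of one way to read the oversample-then-refine idea. It is built from the abstract alone and is not the authors' implementation. All function names are hypothetical, and the estimator used in the Mapper (local cost times the number of partitions) is an assumption made for illustration. Each Mapper keeps a point when its uniform draw falls below an inflated probability o·ℓ·d²/(estimated cost) and emits the draw with the candidate. The single Reducer sums the local costs to obtain the true global cost and keeps only candidates whose draw also falls below ℓ·d²/(global cost). `direct_round` is the two-pass baseline, which needs the global cost before it can sample.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_d2(x, centers):
    # Squared distance from x to its closest current center.
    return min(sq_dist(x, c) for c in centers)

def mapper(partition, centers, ell, o, num_partitions, seed):
    """Oversample locally using an estimate of the global cost."""
    rng = random.Random(seed)
    d2s = [nearest_d2(x, centers) for x in partition]
    local_cost = sum(d2s)
    # Assumed estimator: scale the local cost by the number of partitions.
    est_global = local_cost * num_partitions
    candidates = []
    for x, d2 in zip(partition, d2s):
        u = rng.random()  # one draw per point, reused by the reducer
        if est_global > 0 and u < o * ell * d2 / est_global:
            candidates.append((x, d2, u))
    return local_cost, candidates

def reducer(mapper_outputs, ell):
    """Refine: re-test every candidate against the true global cost."""
    global_cost = sum(cost for cost, _ in mapper_outputs)
    kept = [x
            for _, candidates in mapper_outputs
            for x, d2, u in candidates
            if global_cost > 0 and u < ell * d2 / global_cost]
    return global_cost, kept

def one_round(partitions, centers, ell, o, base_seed=0):
    """One simulated MapReduce job: all mappers, then a single reducer."""
    outputs = [mapper(p, centers, ell, o, len(partitions), base_seed + i)
               for i, p in enumerate(partitions)]
    return reducer(outputs, ell)

def direct_round(partitions, centers, ell, base_seed=0):
    """Two-pass baseline: compute the global cost first, then sample."""
    global_cost = sum(sum(nearest_d2(x, centers) for x in p)
                      for p in partitions)
    kept = []
    for i, p in enumerate(partitions):
        rng = random.Random(base_seed + i)
        for x in p:
            u = rng.random()
            if u < ell * nearest_d2(x, centers) / global_cost:
                kept.append(x)
    return global_cost, kept

if __name__ == "__main__":
    rng = random.Random(42)
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
    partitions = [points[0:100], points[100:200], points[200:300]]
    print(one_round(partitions, [points[0]], ell=4, o=3)[1])
```

In this sketch the refined output matches the two-pass baseline whenever each Mapper's estimate is at most o times the true global cost, because every point the baseline would keep then also passes the looser Mapper test. A larger o makes that condition easier to satisfy but sends more candidates to the Reducer, which is the network-cost trade-off the abstract describes.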
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Xu, Y., Qu, W., Li, Z., Ji, C., Li, Y., Wu, Y. (2014). Fast Scalable k-means++ Algorithm with MapReduce. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_2
DOI: https://doi.org/10.1007/978-3-319-11194-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)