
Fast Scalable k-means++ Algorithm with MapReduce

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Abstract

K-means++ is one of the most important initialization algorithms for k-means, owing to its provable approximation guarantee relative to the optimal solution. However, because of its sequential nature, k-means++ requires a large number of iterations to complete the initialization, and it becomes inefficient as the size of the data increases. Scalable k-means++ drastically reduces the number of iterations and is easily applied to MapReduce systems, but owing to its sequential nature it still requires two MapReduce jobs in each round, which incurs a large I/O cost and is time-consuming. In this paper, we propose the Oversampling and Refining (OnR) method, which improves the efficiency of scalable k-means++ by using only one MapReduce job to obtain Ω(k) centers in each round. In addition to the oversampling factor ℓ of scalable k-means++, OnR uses a second oversampling factor o to further increase the number of chosen centers. Oversampling is executed in the Mapper phase; in the Reducer phase, a single Reducer removes the surplus centers introduced by o and outputs a set of centers that is the same as the output of scalable k-means++. To reduce the expensive network cost caused by an overly large o, OnR estimates the global cost from the local clustering cost and uses this estimate to discard some wrongly chosen points in the Mapper phase. Extensive experiments on real data indicate that OnR outperforms scalable k-means++ in terms of I/O cost and running time.
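
The abstract does not give the exact OnR procedure, so the following is only a minimal sketch of how one oversample-then-refine round might be arranged; it should not be read as the authors' implementation. It assumes the sampling rule of scalable k-means++, in which each point x is kept independently with probability min(1, ℓ·d²(x, C)/φ), where d²(x, C) is the squared distance to the nearest current center and φ is the global clustering cost. The function names, the cost estimate (local cost times the number of splits), and the trick of forwarding each point's random draw to the Reducer are all assumptions made for illustration.

```python
import random

def sq_dist_to_centers(x, centers):
    """Squared Euclidean distance from point x to its nearest center."""
    return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)

def mapper(split, centers, ell, o, num_splits, rng):
    """Oversample one data split. Returns (local_cost, candidates).

    Assumption: the unknown global cost is estimated as
    local_cost * num_splits, i.e. splits are taken to be similar.
    """
    d2s = [sq_dist_to_centers(x, centers) for x in split]
    local_cost = sum(d2s)
    phi_estimate = local_cost * num_splits
    if phi_estimate == 0.0:
        return local_cost, []  # every point already coincides with a center
    candidates = []
    for x, d2 in zip(split, d2s):
        u = rng.random()  # one draw per point, forwarded to the reducer
        # Inflate the sampling probability by the extra factor o.
        if u < min(1.0, o * ell * d2 / phi_estimate):
            candidates.append((x, d2, u))
    return local_cost, candidates

def reducer(mapper_outputs, ell):
    """Refine: re-test each candidate against the exact global cost."""
    phi = sum(cost for cost, _ in mapper_outputs)
    if phi == 0.0:
        return []
    return [x
            for _, candidates in mapper_outputs
            for x, d2, u in candidates
            if u < min(1.0, ell * d2 / phi)]
```

In this sketch the refined set coincides with what the plain rule would have selected from the same random draws whenever every split's estimate is at most o times the true global cost, because then no point that passes the exact test can have been missed by the Mapper. A larger o makes that condition easier to meet but sends more candidates over the network, which is the trade-off the abstract describes.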




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Xu, Y., Qu, W., Li, Z., Ji, C., Li, Y., Wu, Y. (2014). Fast Scalable k-means++ Algorithm with MapReduce. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_2


  • DOI: https://doi.org/10.1007/978-3-319-11194-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11193-3

  • Online ISBN: 978-3-319-11194-0

  • eBook Packages: Computer Science, Computer Science (R0)
