Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method

  • Conference paper
Knowledge and Systems Engineering

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 326)

Abstract

Clustering very large datasets is a challenging problem in data mining and processing. MapReduce is a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most widely used clustering methods, and K-Means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle because the number of iterations grows as the dataset size and the number of clusters increase. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm that can be applied to clustering very large datasets. In tests on several very large datasets with real-valued attributes, the new method reduced the number of iterations by up to 30 percent while maintaining up to 98 percent accuracy. Based on these experimental results, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
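For orientation, the following is a minimal single-process sketch of how one standard K-Means iteration decomposes into a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). It is not the FMR.K-Means method: the abstract does not describe the cutting method, so none is shown here. All function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) and the `tol` stopping threshold are assumptions made for illustration, and a real deployment would run the map and reduce steps on a distributed framework rather than in a local loop.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def map_assign(point, centroids):
    """Map step (illustrative): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step (illustrative): average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(data, centroids):
    """One plain K-Means iteration written as map, shuffle (grouping), reduce."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [reduce_mean(groups[i]) if groups[i] else tuple(centroids[i])
            for i in range(len(centroids))]


def kmeans(data, centroids, max_iter=100, tol=1e-9):
    """Iterate until no centroid moves more than tol; return centroids and iteration count."""
    for iteration in range(1, max_iter + 1):
        updated = kmeans_iteration(data, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated, iteration
        centroids = updated
    return centroids, max_iter


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    centers, iterations = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
    print(centers, iterations)
```

Because every iteration is a full pass over the dataset (one MapReduce job in a distributed setting), the iteration count returned by `kmeans` is the quantity that the approach in the abstract aims to reduce.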

Author information

Corresponding author

Correspondence to Duong Van Hieu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Van Hieu, D., Meesad, P. (2015). Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_23

  • DOI: https://doi.org/10.1007/978-3-319-11680-8_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11679-2

  • Online ISBN: 978-3-319-11680-8

  • eBook Packages: Engineering, Engineering (R0)
