Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method

Van Hieu, Duong; Meesad, Phayung

doi:10.1007/978-3-319-11680-8_23

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 326)

Abstract

Clustering very large datasets is a challenging problem in data mining and processing. MapReduce is regarded as a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most widely used clustering methods, and K-Means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle because the number of iterations grows as the dataset size and the number of clusters increase. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm, which can be applied to clustering very large datasets. The new method reduces the number of iterations by up to 30 percent while maintaining up to 98 percent accuracy when tested on several very large datasets with real-valued attributes. Based on these experimental results, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
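For orientation, the baseline that the abstract builds on (K-Means expressed as repeated map and reduce steps) can be sketched in a few lines. This is a minimal single-process illustration of the general technique only, not the authors' FMR.K-Means: the cutting method is not described in the abstract and is therefore not shown. The function names (`map_phase`, `reduce_phase`, `kmeans`) and the stopping tolerance `tol` are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit (cluster index, (point, 1)) for every input point."""
    for p in points:
        yield nearest(p, centroids), (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: average the points of each cluster into a new centroid."""
    sums, counts = {}, defaultdict(int)
    for k, (p, n) in pairs:
        acc = sums.get(k)
        sums[k] = list(p) if acc is None else [a + b for a, b in zip(acc, p)]
        counts[k] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[k] for s in sums[k]) if k in sums else tuple(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Run map/reduce rounds until no centroid moves by more than tol.

    Returns the final centroids and the number of iterations used,
    which is the quantity the paper aims to reduce.
    """
    centroids = [tuple(c) for c in centroids]
    iterations = 0
    for iterations in range(1, max_iter + 1):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift <= tol:
            break
    return centroids, iterations


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a real MapReduce deployment each round is a separate distributed job, so every extra iteration adds a full pass over the dataset; this is why the iteration count returned by `kmeans` is the cost the abstract focuses on.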

Author information

Authors and Affiliations

Faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, 10800, Thailand
Duong Van Hieu & Phayung Meesad

Corresponding author

Correspondence to Duong Van Hieu.

Editor information

Editors and Affiliations

Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Viet-Ha Nguyen
Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Anh-Cuong Le
School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Van-Nam Huynh

About this paper

Cite this paper

Van Hieu, D., Meesad, P. (2015). Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method. In: Nguyen, VH., Le, AC., Huynh, VN. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 326. Springer, Cham. https://doi.org/10.1007/978-3-319-11680-8_23

DOI: https://doi.org/10.1007/978-3-319-11680-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11679-2
Online ISBN: 978-3-319-11680-8
eBook Packages: Engineering, Engineering (R0)
