Abstract
This paper introduces Parallel Two-Phase K-means (Par2PK-means), a new parallel version of Two-Phase K-means designed to overcome the limitations of existing parallel versions. Par2PK-means is developed and executed on the MapReduce framework and is divided into two phases. In the first phase, Mappers work independently on data segments to create intermediate data. In the second phase, the Reducer clusters the intermediate data collected from the Mappers to produce the final clustering result. In tests on large data sets, the proposed algorithm attained a good speedup ratio, close to linear, compared with the sequential Two-Phase K-means.
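The two phases described above can be illustrated with a minimal single-process sketch. The abstract does not specify the form of the intermediate data, so this sketch assumes each Mapper emits local centroids weighted by the number of points they summarize, and that the Reducer runs a weighted k-means over them. The names `weighted_kmeans`, `map_phase`, `reduce_phase`, and `par2pk_sketch`, and the parameters `k_local`, `iters`, and `seed`, are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points: Sequence[Point], weights: Sequence[float],
                    k: int, iters: int = 20, seed: int = 0):
    """Lloyd-style k-means on weighted points.

    Returns (centroids, cluster_weights), where cluster_weights[j] is the
    total weight assigned to centroid j in the final iteration.
    """
    k = min(k, len(points))
    dim = len(points[0])
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, totals


def map_phase(segment: Sequence[Point], k_local: int) -> List[Tuple[Point, float]]:
    """Phase 1 (assumed form): summarize one data segment as weighted centroids."""
    centroids, counts = weighted_kmeans(segment, [1.0] * len(segment), k_local)
    return [(c, n) for c, n in zip(centroids, counts) if n > 0]


def reduce_phase(intermediate: Sequence[Tuple[Point, float]], k: int):
    """Phase 2 (assumed form): cluster the collected weighted centroids."""
    points = [c for c, _ in intermediate]
    weights = [n for _, n in intermediate]
    return weighted_kmeans(points, weights, k)


def par2pk_sketch(segments: Sequence[Sequence[Point]], k_local: int, k: int):
    """Run the map phase per segment, then a single reduce phase.

    On a real MapReduce cluster the map calls would run in parallel;
    here they run one after another for illustration only.
    """
    intermediate = []
    for segment in segments:
        intermediate.extend(map_phase(segment, k_local))
    return reduce_phase(intermediate, k)
```

Under these assumptions, only the compact intermediate summaries reach the Reducer, which is why the map phase can scale with the number of data segments. The sketch makes no claim about how the paper handles initialization, buffer sizes, or convergence.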
This work is supported by DOST, Hochiminh City, under contract number 283/2012/HD-SKHCN.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, C.D., Nguyen, D.T., Pham, VH. (2013). Parallel Two-Phase K-Means. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2013. ICCSA 2013. Lecture Notes in Computer Science, vol 7975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39640-3_16
DOI: https://doi.org/10.1007/978-3-642-39640-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39639-7
Online ISBN: 978-3-642-39640-3
eBook Packages: Computer Science (R0)