Abstract
This paper proposes a novel parallel implementation of the Evolving Clustering Method (ECM). The original serial ECM is an online clustering method that processes the data in a single pass. The parallel version (Parallel ECM, or PECM) is implemented in the Apache Spark framework, which enables it to operate in real time. The parallelization is intended to handle large-volume datasets. Few existing clustering algorithms offer a parallel one-pass variant, and the proposed method addresses this gap. Its effectiveness is demonstrated on a credit card fraud dataset (297 MB) and on a Higgs dataset from particle physics, pertaining to particle detectors in an accelerator (1.4 GB). The experimental setup was a cluster of 10 machines, each with 32 GB of RAM, running the Hadoop Distributed File System (HDFS) and the Spark computational environment. The main result is a dramatic reduction in computational time compared with the serial ECM. In future work, the PECM will be hybridized with other machine learning algorithms to solve large-scale regression and classification problems.
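The abstract does not spell out the steps of the serial ECM, so the following is only a minimal sketch of a single-pass, distance-threshold clustering loop in the spirit of the ECM as it is commonly described. The function name `ecm`, the parameter `dthr` (the distance threshold), and the exact update rule are assumptions made for illustration; they are not taken from the paper, and the sketch does not attempt to show how the PECM distributes the work in Spark.

```python
import math


def ecm(points, dthr):
    """Single-pass clustering sketch (assumed ECM-style update rule).

    Returns a list of (center, radius) pairs. `dthr` is a hypothetical
    name for the distance threshold that bounds the cluster radius.
    """
    clusters = []  # each entry is [center (list of floats), radius]
    for raw in points:
        x = [float(v) for v in raw]
        if not clusters:
            # The first sample becomes the first cluster, with radius 0.
            clusters.append([x, 0.0])
            continue

        dists = [math.dist(x, center) for center, _ in clusters]

        # Case 1: the sample already lies inside some cluster; nothing changes.
        if any(d <= radius for d, (_, radius) in zip(dists, clusters)):
            continue

        # Case 2: pick the cluster that minimizes distance + radius.
        s, a = min((d + clusters[j][1], j) for j, d in enumerate(dists))

        if s > 2 * dthr:
            # Too far from every cluster: start a new one at the sample.
            clusters.append([x, 0.0])
        else:
            # Enlarge cluster a: the new radius is s / 2, and the center
            # moves along the line toward x until it is new_radius from x.
            new_radius = s / 2
            center = clusters[a][0]
            t = new_radius / dists[a]  # dists[a] > 0 because x is outside
            new_center = [xi + (ci - xi) * t for xi, ci in zip(x, center)]
            clusters[a] = [new_center, new_radius]

    return [(tuple(center), radius) for center, radius in clusters]
```

Calling `ecm([(0, 0), (1, 0), (10, 0)], dthr=1.0)` visits each point exactly once: the second point enlarges the first cluster, and the third is far enough away to start a new one. Because each sample is seen once and only the cluster summaries are kept, a loop of this shape is what makes a one-pass method attractive for large datasets.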
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kamaruddin, S., Ravi, V., Mayank, P. (2017). Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics. In: Reddy, P., Sureka, A., Chakravarthy, S., Bhalla, S. (eds) Big Data Analytics. BDA 2017. Lecture Notes in Computer Science, vol 10721. Springer, Cham. https://doi.org/10.1007/978-3-319-72413-3_19
DOI: https://doi.org/10.1007/978-3-319-72413-3_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72412-6
Online ISBN: 978-3-319-72413-3
eBook Packages: Computer Science (R0)