Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics

Kamaruddin, Sk; Ravi, Vadlamani; Mayank, Pritman

doi:10.1007/978-3-319-72413-3_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10721)

Included in the following conference series:

International Conference on Big Data Analytics


Abstract

This paper proposes a novel parallel implementation of the Evolving Clustering Method (ECM). The original serial ECM is an online, single-pass clustering method. The parallel version (Parallel ECM, or PECM) is implemented in the Apache Spark framework, which enables it to work in real time. The parallelization aims to handle large-volume datasets. Many extant clustering algorithms lack a parallel one-pass variant; the proposed method addresses this shortcoming. Its effectiveness is demonstrated on a credit card fraud dataset (297 MB) and on a Higgs dataset from physics, pertaining to particle detectors in an accelerator (1.4 GB). The experimental setup was a cluster of 10 machines, each with 32 GB of RAM, running the Hadoop Distributed File System (HDFS) and the Spark computational environment. A notable result of this research is a dramatic reduction in computational time compared with the serial version of the ECM. In future work, PECM will be hybridized with other machine learning algorithms to solve large-scale regression and classification problems.
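To make "online and single-pass" concrete, the following is a minimal sketch of a serial evolving-clustering loop in plain Python. It is a simplified reading of the ECM idea (each sample is seen once; a cluster is either left alone, enlarged, or newly created, governed by a distance threshold) and is not the authors' PECM or their Spark code. The function name `ecm`, the parameter `dthr`, and the use of Euclidean distance are assumptions made for illustration.

```python
import math

def ecm(points, dthr):
    """Simplified single-pass evolving clustering (illustrative sketch only).

    points: iterable of equal-length numeric tuples, visited exactly once.
    dthr:   assumed distance threshold; no cluster radius grows beyond it.
    Returns a list of (center, radius) pairs.
    """
    centers, radii = [], []
    for x in points:
        if not centers:
            # The first sample seeds the first cluster with radius 0.
            centers.append(list(x))
            radii.append(0.0)
            continue
        dists = [math.dist(x, c) for c in centers]
        # Case 1: the sample already lies inside some cluster; nothing changes.
        if any(d <= r for d, r in zip(dists, radii)):
            continue
        # Case 2: pick the cluster with the smallest (distance + radius).
        s = [d + r for d, r in zip(dists, radii)]
        j = min(range(len(s)), key=s.__getitem__)
        if s[j] > 2 * dthr:
            # Enlarging would exceed the threshold, so start a new cluster.
            centers.append(list(x))
            radii.append(0.0)
        else:
            # Enlarge cluster j: the new radius is half of (distance + radius),
            # and the center moves toward x until x sits on the new boundary.
            new_r = s[j] / 2
            t = (dists[j] - new_r) / dists[j]
            centers[j] = [c + t * (xi - c) for c, xi in zip(centers[j], x)]
            radii[j] = new_r
    return list(zip(centers, radii))
```

For example, `ecm([(0, 0), (1, 0), (10, 0)], dthr=1.0)` merges the first two points into one cluster and opens a second cluster for the distant third point. Because the loop never revisits earlier samples, its result depends on the order in which the data arrive.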



Author information

Authors and Affiliations

Centre of Excellence in Analytics, Institute for Development and Research in Banking Technology, Castle Hills Road No. 1, Masab Tank, Hyderabad, 500057, India
Sk Kamaruddin, Vadlamani Ravi & Pritman Mayank
SCIS, University of Hyderabad, Hyderabad, 500046, India
Sk Kamaruddin

Authors

Sk Kamaruddin
Vadlamani Ravi
Pritman Mayank

Corresponding author

Correspondence to Vadlamani Ravi.

Editor information

Editors and Affiliations

International Institute of Information Technology, Hyderabad, India
P. Krishna Reddy
Rajiv Gandhi Education City, Sonepat, India
Ashish Sureka
University of Texas at Arlington, Arlington, Texas, USA
Sharma Chakravarthy
University of Aizu, Aizu-Wakamatsu, Japan
Subhash Bhalla

About this paper

Cite this paper

Kamaruddin, S., Ravi, V., Mayank, P. (2017). Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics. In: Reddy, P., Sureka, A., Chakravarthy, S., Bhalla, S. (eds) Big Data Analytics. BDA 2017. Lecture Notes in Computer Science, vol 10721. Springer, Cham. https://doi.org/10.1007/978-3-319-72413-3_19


DOI: https://doi.org/10.1007/978-3-319-72413-3_19
Published: 25 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72412-6
Online ISBN: 978-3-319-72413-3
eBook Packages: Computer Science, Computer Science (R0)
