Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics

  • Conference paper
Big Data Analytics (BDA 2017)

Abstract

This paper proposes a novel parallel implementation of the Evolving Clustering Method (ECM). The original, serial ECM is an online clustering method that processes the data in a single pass. The parallel version (Parallel ECM, or PECM) is implemented in the Apache Spark framework, which enables it to work in real time. The parallelization aims to handle datasets of large volume. Many extant clustering algorithms have no parallel one-pass variant, and the proposed method addresses this shortcoming. Its effectiveness is demonstrated on a credit card fraud dataset (297 MB) and on a Higgs dataset from physics, pertaining to particle detectors in an accelerator (1.4 GB). The experimental setup was a cluster of 10 machines, each with 32 GB of RAM, running the Hadoop Distributed File System (HDFS) and the Spark computational environment. The main result is a dramatic reduction in computational time compared with the serial version of the ECM. In future work, the PECM will be hybridized with other machine learning algorithms to solve large-scale regression and classification problems.
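To make the idea of an online, single-pass clustering method concrete, the following is a minimal serial sketch in the spirit of ECM. It is an illustration under stated assumptions, not the authors' implementation: the function name `ecm`, the parameter name `dthr` (a distance threshold that bounds cluster radius), and the exact update rule are assumptions based on the commonly described form of the method, and the Spark parallelization that the abstract refers to is not shown.

```python
import math


def ecm(points, dthr):
    """Single-pass, ECM-style clustering sketch (hypothetical helper).

    Each point is seen exactly once. Returns a list of (center, radius) pairs.
    `dthr` is an assumed distance threshold: no radius grows beyond it.
    """
    clusters = []  # each entry is [center, radius]
    for x in points:
        if not clusters:
            # The first point seeds the first cluster with zero radius.
            clusters.append([list(x), 0.0])
            continue

        dists = [math.dist(x, center) for center, _ in clusters]

        # Case 1: the point already lies inside an existing cluster,
        # so no cluster is created or updated.
        if any(d <= radius for d, (_, radius) in zip(dists, clusters)):
            continue

        # Case 2: choose the cluster that minimises distance + radius.
        s_vals = [d + radius for d, (_, radius) in zip(dists, clusters)]
        a = min(range(len(clusters)), key=s_vals.__getitem__)

        if s_vals[a] > 2 * dthr:
            # Too far from every cluster: start a new one at this point.
            clusters.append([list(x), 0.0])
        else:
            # Enlarge cluster a and shift its center toward x so that
            # the distance from the new center to x equals the new radius.
            new_radius = s_vals[a] / 2
            step = (dists[a] - new_radius) / dists[a]
            center = clusters[a][0]
            clusters[a][0] = [c + step * (xi - c) for c, xi in zip(center, x)]
            clusters[a][1] = new_radius

    return [(tuple(center), radius) for center, radius in clusters]
```

For example, calling `ecm([(0.0,), (1.0,), (10.0,)], 1.0)` grows one cluster to cover the first two points and opens a second cluster for the third. Because every point is visited once and only the centers and radii are kept in memory, the sketch shows why a method of this kind suits streaming data.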

Author information

Corresponding author

Correspondence to Vadlamani Ravi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kamaruddin, S., Ravi, V., Mayank, P. (2017). Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics. In: Reddy, P., Sureka, A., Chakravarthy, S., Bhalla, S. (eds) Big Data Analytics. BDA 2017. Lecture Notes in Computer Science, vol 10721. Springer, Cham. https://doi.org/10.1007/978-3-319-72413-3_19

  • DOI: https://doi.org/10.1007/978-3-319-72413-3_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72412-6

  • Online ISBN: 978-3-319-72413-3

  • eBook Packages: Computer Science, Computer Science (R0)
