
Clustering Techniques for Big Data Mining

  • Conference paper
Business Intelligence (CBI 2021)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 416))


Abstract

This paper introduces clustering as an unsupervised machine learning technique in which the data are unlabeled. Many algorithms have been designed to solve clustering problems, and many approaches have been developed to address their deficiencies or to improve efficiency and effectiveness. These approaches are partitioning-based, hierarchical-based, density-based, grid-based, and model-based. With data volumes growing every second, we must deal with what is called big data, which has compelled researchers to extend the algorithms based on these approaches so that they can manage data warehouses quickly. Our main purpose is to compare a representative algorithm from each approach against most of the big data criteria known as the 4Vs. The comparison aims to determine which algorithms can efficiently mine information by clustering big data. The studied algorithms are FCM, CURE, OPTICS, BANG, and EM, respectively, one from each of the aforementioned approaches. Assessing these algorithms against the 4Vs criteria, namely Volume, Variety, Velocity, and Value, reveals deficiencies in some of them. All of the evaluated algorithms cluster large datasets well, but FCM and OPTICS suffer from the curse of dimensionality. FCM and EM are very sensitive to outliers, which badly affects the results. FCM, CURE, and EM require the number of clusters as input, which is a drawback when the optimal number is not chosen. FCM and EM produce spherical clusters, unlike CURE, OPTICS, and BANG, which produce arbitrarily shaped clusters, an advantage for cluster quality. FCM is the fastest on big data, whereas EM takes the longest training time. Regarding diversity of data types, CURE handles both numerical and categorical data. Consequently, the analysis leads us to conclude that both CURE and BANG are efficient at clustering big data, although we noticed that CURE lacks some accuracy in data assignment. We therefore consider BANG the most appropriate algorithm for clustering a large, high-dimensional dataset containing noise. BANG is based on a grid structure but implicitly combines the partitioning, hierarchical, and density-based approaches, which explains why it gives accurate results. Even so, the ultimate clustering accuracy has not yet been reached, although it is close. The lesson drawn from BANG, mixing approaches, should be applied to more algorithms in order to attain the accuracy and effectiveness that lead to sound future decisions.
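As a rough illustration of the kind of comparison described above (not the authors' experimental setup), the sketch below clusters a single synthetic dataset with three of the studied algorithms: a plain NumPy fuzzy c-means loop, scikit-learn's OPTICS, and a Gaussian mixture fitted by EM, then scores each result with the silhouette coefficient. The dataset, the parameters (number of clusters, min_samples, fuzzifier m), and the scoring choice are illustrative assumptions, not values taken from the paper.

# Hedged sketch of an algorithm comparison; the paper's own data and settings differ.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a large dataset (assumption: 4 well-separated groups).
X, _ = make_blobs(n_samples=2000, centers=4, n_features=8, random_state=0)

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means; returns hard labels from the membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((X.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)                 # memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = 1.0 / (d ** (2.0 / (m - 1.0)))            # standard FCM membership update
        u /= u.sum(axis=1, keepdims=True)
    return u.argmax(axis=1)

labels = {
    "FCM":    fuzzy_c_means(X, c=4),
    "EM":     GaussianMixture(n_components=4, random_state=0).fit_predict(X),
    "OPTICS": OPTICS(min_samples=20).fit_predict(X),
}
for name, lab in labels.items():
    mask = lab >= 0                                   # OPTICS labels noise points as -1
    print(name, round(silhouette_score(X[mask], lab[mask]), 3))

Note that FCM and EM require the number of clusters up front, while OPTICS does not; this mirrors the input-parameter deficiency discussed in the abstract.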





Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Fakir, Y., El Iklil, J. (2021). Clustering Techniques for Big Data Mining. In: Fakir, M., Baslam, M., El Ayachi, R. (eds) Business Intelligence. CBI 2021. Lecture Notes in Business Information Processing, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-030-76508-8_14


  • DOI: https://doi.org/10.1007/978-3-030-76508-8_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76507-1

  • Online ISBN: 978-3-030-76508-8

  • eBook Packages: Computer Science, Computer Science (R0)
