Abstract
This paper introduces clustering as an unsupervised machine learning task in which the data are unlabeled. Many algorithms have been designed to solve clustering problems, and many approaches have been developed to address their deficiencies or to improve efficiency and effectiveness. These approaches are partitioning-based, hierarchical-based, density-based, grid-based, and model-based. With data volumes growing every second, we now face what is called big data, which has compelled researchers to adapt algorithms based on these approaches so that they can process data warehouses quickly. Our main purpose is a comparison of representative algorithms from each approach with respect to the main big data criteria, known as the 4Vs. The comparison aims to determine which algorithms can efficiently mine information by clustering big data. The studied algorithms are FCM, CURE, OPTICS, BANG, and EM, one from each of the aforementioned approaches. Assessing these algorithms against the 4Vs of big data (Volume, Variety, Velocity, and Value) reveals deficiencies in some of them. All of the evaluated algorithms cluster large datasets well, but FCM and OPTICS suffer from the curse of dimensionality. FCM and EM are very sensitive to outliers, which badly affects the results. FCM, CURE, and EM require the number of clusters as input, which becomes a weakness if the optimal number is not chosen. FCM and EM produce spherical clusters, unlike CURE, OPTICS, and BANG, which find arbitrarily shaped clusters, an advantage for cluster quality. FCM is the fastest on big data, while EM takes the longest time to train. As for diversity in data types, CURE handles both numerical and categorical data.
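Two of FCM's traits noted above can be seen in a small sketch: the number of clusters must be fixed up front, and every point receives a fuzzy membership in each cluster. This is a minimal pure-Python illustration of fuzzy c-means on 1-D data, not the implementation evaluated in the paper; the data values and parameter defaults are invented for the example.

```python
import random

def fcm(points, k=2, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means on 1-D data (illustrative sketch).

    `m` is the fuzzifier; the number of clusters `k` must be supplied
    up front -- the weakness noted in the comparison.
    Returns (centers, memberships).
    """
    rng = random.Random(seed)
    # random initial memberships, normalized so each point's row sums to 1
    u = [[rng.random() for _ in range(k)] for _ in points]
    u = [[v / sum(row) for v in row] for row in u]
    centers = [0.0] * k
    for _ in range(iters):
        # update each center as the membership-weighted mean of all points
        for j in range(k):
            num = sum((u[i][j] ** m) * points[i] for i in range(len(points)))
            den = sum(u[i][j] ** m for i in range(len(points)))
            centers[j] = num / den
        # update memberships from distances to the new centers
        for i, x in enumerate(points):
            d = [abs(x - c) + 1e-12 for c in centers]
            for j in range(k):
                u[i][j] = 1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0))
                                    for l in range(k))
    return centers, u

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, u = fcm(data)
```

On well-separated data like this, the two centers converge near the crisp group means; an outlier added to `data` would drag a center toward it, which is the sensitivity to outliers the comparison highlights.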
Consequently, the analysis leads us to conclude that both CURE and BANG are efficient for clustering big data, although CURE lacks some accuracy in data assignment. We therefore consider BANG the most appropriate algorithm for clustering a large, high-dimensional dataset with noise in it. BANG is based on a grid structure but implicitly combines the partitioning, hierarchical, and density approaches, which explains its efficiency in producing accurate results. Even so, ultimate clustering accuracy has not yet been reached, though it is close. The lesson drawn from BANG should be applied to more algorithms: mixing approaches in order to attain the accuracy and effectiveness that lead, in turn, to accurate future decisions.
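The grid-plus-density idea behind BANG's behavior (no k as input, arbitrary cluster shapes, noise left out) can be sketched in a few lines. This is a deliberately simplified fixed-grid flood fill, not BANG itself, which uses an adaptive grid directory; the cell size, density threshold, and sample points are invented for the example.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell=1.0, min_pts=2):
    """Simplified grid/density clustering in the spirit of BANG
    (illustrative only; BANG uses an adaptive grid directory).

    Points are binned into fixed cells; cells holding at least
    `min_pts` points are dense, and adjacent dense cells are merged
    into one cluster by flood fill. Sparse cells are treated as noise.
    """
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell), int(p[1] // cell))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # flood-fill over the 8-neighbourhood of dense cells
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters

pts = [(0.1, 0.2), (0.4, 0.1), (0.9, 0.8),
       (5.0, 5.1), (5.2, 5.3), (9.9, 9.9)]
```

Here two dense regions yield two clusters and the isolated point at (9.9, 9.9) is discarded as noise, with no cluster count supplied in advance.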
© 2021 Springer Nature Switzerland AG
Cite this paper
Fakir, Y., El Iklil, J. (2021). Clustering Techniques for Big Data Mining. In: Fakir, M., Baslam, M., El Ayachi, R. (eds) Business Intelligence. CBI 2021. Lecture Notes in Business Information Processing, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-030-76508-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76507-1
Online ISBN: 978-3-030-76508-8