Abstract
This paper introduces clustering as an unsupervised machine learning task in which the data are unlabeled. Many algorithms have been designed to solve clustering problems, and many approaches have been developed to address their deficiencies or to improve efficiency and effectiveness. These approaches are partitioning-based, hierarchical-based, density-based, grid-based, and model-based. With data volumes growing every second, we now face what is called big data, which has compelled researchers to adapt algorithms based on these approaches so that they can process data warehouses quickly. Our main purpose is a comparison of representative algorithms from each approach with respect to the main big data criteria, known as the 4Vs. The comparison aims to determine which algorithms can efficiently mine information by clustering big data. The studied algorithms are FCM, CURE, OPTICS, BANG, and EM, one from each of the aforementioned approaches. Assessing these algorithms against the 4Vs of big data (Volume, Variety, Velocity, and Value) reveals deficiencies in some of them. All of the evaluated algorithms cluster large datasets well, but FCM and OPTICS suffer from the curse of dimensionality. FCM and EM are very sensitive to outliers, which badly affects the results. FCM, CURE, and EM require the number of clusters as input, which becomes a weakness if the optimal number is not chosen. FCM and EM produce spherical clusters, unlike CURE, OPTICS, and BANG, which find arbitrarily shaped clusters, an advantage for cluster quality. FCM is the fastest on big data, while EM takes the longest time to train. As for diversity in data types, CURE handles both numerical and categorical data.
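Two of FCM's traits noted above can be seen in a small sketch: the number of clusters must be fixed up front, and every point receives a fuzzy membership in each cluster. This is a minimal pure-Python illustration of fuzzy c-means on 1-D data, not the implementation evaluated in the paper; the data values and parameter defaults are invented for the example.

```python
import random

def fcm(points, k=2, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means on 1-D data (illustrative sketch).

    `m` is the fuzzifier; the number of clusters `k` must be supplied
    up front -- the weakness noted in the comparison.
    Returns (centers, memberships).
    """
    rng = random.Random(seed)
    # random initial memberships, normalized so each point's row sums to 1
    u = [[rng.random() for _ in range(k)] for _ in points]
    u = [[v / sum(row) for v in row] for row in u]
    centers = [0.0] * k
    for _ in range(iters):
        # update each center as the membership-weighted mean of all points
        for j in range(k):
            num = sum((u[i][j] ** m) * points[i] for i in range(len(points)))
            den = sum(u[i][j] ** m for i in range(len(points)))
            centers[j] = num / den
        # update memberships from distances to the new centers
        for i, x in enumerate(points):
            d = [abs(x - c) + 1e-12 for c in centers]
            for j in range(k):
                u[i][j] = 1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0))
                                    for l in range(k))
    return centers, u

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, u = fcm(data)
```

On well-separated data like this, the two centers converge near the crisp group means; an outlier added to `data` would drag a center toward it, which is the sensitivity to outliers the comparison highlights.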
Consequently, the analysis leads us to conclude that both CURE and BANG are efficient for clustering big data, although CURE lacks some accuracy in data assignment. We therefore consider BANG the most appropriate algorithm for clustering a large, high-dimensional dataset with noise in it. BANG is based on a grid structure but implicitly combines the partitioning, hierarchical, and density approaches, which explains its efficiency in producing accurate results. Even so, ultimate clustering accuracy has not yet been reached, though it is close. The lesson drawn from BANG should be applied to more algorithms: mixing approaches in order to attain the accuracy and effectiveness that lead, in turn, to accurate future decisions.
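The grid-plus-density idea behind BANG's behavior (no k as input, arbitrary cluster shapes, noise left out) can be sketched in a few lines. This is a deliberately simplified fixed-grid flood fill, not BANG itself, which uses an adaptive grid directory; the cell size, density threshold, and sample points are invented for the example.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell=1.0, min_pts=2):
    """Simplified grid/density clustering in the spirit of BANG
    (illustrative only; BANG uses an adaptive grid directory).

    Points are binned into fixed cells; cells holding at least
    `min_pts` points are dense, and adjacent dense cells are merged
    into one cluster by flood fill. Sparse cells are treated as noise.
    """
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell), int(p[1] // cell))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # flood-fill over the 8-neighbourhood of dense cells
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters

pts = [(0.1, 0.2), (0.4, 0.1), (0.9, 0.8),
       (5.0, 5.1), (5.2, 5.3), (9.9, 9.9)]
```

Here two dense regions yield two clusters and the isolated point at (9.9, 9.9) is discarded as noise, with no cluster count supplied in advance.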
© 2021 Springer Nature Switzerland AG
Cite this paper
Fakir, Y., El Iklil, J. (2021). Clustering Techniques for Big Data Mining. In: Fakir, M., Baslam, M., El Ayachi, R. (eds) Business Intelligence. CBI 2021. Lecture Notes in Business Information Processing, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-030-76508-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76507-1
Online ISBN: 978-3-030-76508-8