An extended visual methods to perform data cluster assessment in distributed data systems

The Journal of Supercomputing

Abstract

Assessing cluster tendency is one of the major problems in data clustering. The cluster tendency problem is that of deriving the number of clusters in an unlabeled dataset. This paper considers this preclustering problem for important clustering methods such as k-means and hierarchical clustering. Existing preclustering methods, such as visual assessment of tendency (VAT), effectively solve the cluster tendency problem (i.e., they determine k for k-means). Enhanced methods, such as the improved VAT (iVAT) and other related visual methods, have been highly successful in determining cluster tendency for complex and large datasets. Clustering using improved visual assessment of tendency (ClusiVAT) is a recent visual method that is widely used for large datasets. However, it focuses primarily on the amount of data rather than its dimensionality, whereas big data in real-time applications are both large and high-dimensional. ClusiVAT uses sampling to handle the volume of the original data, but it does not address high-dimensional big data. The proposed approach therefore develops scalable visual methods that use linear subspace learning (LSL) techniques to overcome the curse of dimensionality. An empirical analysis on benchmark datasets demonstrates the efficiency of the proposed LSL-based visual methods.
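
The idea summarized above, reducing dimensionality with a linear projection and then building a VAT-style reordered dissimilarity image, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which LSL techniques are used, so PCA is assumed here as one common linear subspace method, and the reordering follows the standard Prim-like VAT procedure. The function names (`pca_project`, `pairwise_distances`, `vat_order`) and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (assumed LSL step)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in the subspace


def pairwise_distances(X):
    """Euclidean dissimilarity matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)                  # clip tiny negatives from round-off
    D = np.sqrt(D2)
    np.fill_diagonal(D, 0.0)
    return D


def vat_order(D):
    """Return the VAT ordering of the objects described by dissimilarity matrix D.

    Starts from one endpoint of the largest dissimilarity, then repeatedly
    appends the unvisited object closest to the visited set (Prim-like).
    """
    n = D.shape[0]
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    visited = np.zeros(n, dtype=bool)
    visited[i] = True
    min_dist = D[i].copy()                       # distance of each object to the visited set
    for _ in range(n - 1):
        candidates = np.where(visited, np.inf, min_dist)
        j = int(np.argmin(candidates))
        order.append(j)
        visited[j] = True
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)


# Synthetic example (assumed data): two well-separated groups in 50 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 50)),
    rng.normal(0.0, 1.0, size=(20, 50)) + 10.0,
])

Z = pca_project(X, n_components=2)               # reduce 50 dimensions to 2
D = pairwise_distances(Z)
order = vat_order(D)
D_star = D[np.ix_(order, order)]                 # reordered dissimilarity matrix
```

Displaying `D_star` as a grayscale image (for example with an image plotting routine) would show dark square blocks along the diagonal, and the number of blocks is read as the suggested number of clusters. The projection step is what keeps the distance computation cheap when the original dimensionality is large; sampling for very large n, as in ClusiVAT, is not covered by this sketch.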

Author information

Corresponding author

Correspondence to K. Rajendra Prasad.

About this article

Cite this article

Subba Reddy, K., Rajendra Prasad, K., Kamatam, G.R. et al. An extended visual methods to perform data cluster assessment in distributed data systems. J Supercomput 78, 8810–8829 (2022). https://doi.org/10.1007/s11227-021-04243-z

  • DOI: https://doi.org/10.1007/s11227-021-04243-z
