Abstract
Assessment of clustering tendency is an important first step in crisp or fuzzy cluster analysis. One tool for assessing cluster tendency is the Visual Assessment of Tendency (VAT) algorithm. The VAT and improved VAT (iVAT) algorithms have been successful in revealing potential cluster structure in the form of visual images for various datasets, but they can be computationally expensive for datasets with a very large number of samples and/or dimensions. Scalable versions of VAT/iVAT, such as sVAT/siVAT, have been proposed for iVAT approximation, but they are still slow when the data are large in both the number of records and the number of dimensions. In this chapter, we introduce two new algorithms to obtain approximate iVAT images that can be used to visually estimate the potential number of clusters in big data. We compare the two proposed methods with the original version of siVAT on five large, high-dimensional datasets, and demonstrate that both new methods provide visual evidence about potential cluster structure in these datasets in significantly less time than siVAT with no apparent loss of accuracy or visual acuity.
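The abstract does not spell out how a VAT or iVAT image is computed. As a rough illustration only, the following Python sketch follows the commonly described formulations: a Prim-like reordering of the dissimilarity matrix (Bezdek and Hathaway, 2002) and the recursive minimax-path transform applied to the reordered matrix (Havens and Bezdek, 2012). The function names `vat_order` and `ivat_transform`, the toy data, and the use of Euclidean distance are assumptions made for this example. It is not the sVAT/siVAT sampling scheme, nor either of the two new algorithms introduced in the chapter.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering (index permutation) for a square
    dissimilarity matrix D. Hypothetical helper for illustration."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    remaining = np.ones(n, dtype=bool)
    remaining[i] = False
    # min_dist[q]: smallest dissimilarity from q to any already-ordered object.
    min_dist = D[i].astype(float)
    for _ in range(n - 1):
        # Pick the unordered object closest to the ordered set (Prim-like step).
        j = int(np.argmin(np.where(remaining, min_dist, np.inf)))
        order.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)

def ivat_transform(D_star):
    """Minimax-path transform of an already VAT-ordered matrix D_star."""
    n = D_star.shape[0]
    D_out = np.zeros((n, n), dtype=float)
    for r in range(1, n):
        # Nearest previously ordered object to object r.
        j = int(np.argmin(D_star[r, :r]))
        D_out[r, j] = D_star[r, j]
        # All other earlier objects: the largest edge on the path through j.
        c = np.delete(np.arange(r), j)
        D_out[r, c] = np.maximum(D_star[r, j], D_out[j, c])
        # Keep the matrix symmetric as it is filled in.
        D_out[:r, r] = D_out[r, :r]
    return D_out

# Toy data (assumed for this example): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(10.0, 0.1, (20, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

order = vat_order(D)
D_vat = D[np.ix_(order, order)]   # reordered dissimilarity matrix
D_ivat = ivat_transform(D_vat)    # matrix one would display as the iVAT image
```

Displaying `D_ivat` as a grayscale image (for instance with Matplotlib's `imshow`) should show dark blocks along the diagonal, one per apparent cluster. The reordering loop touches every pair of objects, which is the quadratic cost that motivates sampling-based approximations for large datasets.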
Notes
- 1. n = a few hundred samples is a good choice for most datasets. In [18], \(k'\) and n are chosen at random between 2k and 4k, and between 10k and 30k, respectively, where \(k',n \in \mathbb {Z}\) and k is the number of labeled subsets in the ground truth data. Here \(k'\) is an overestimate of k, i.e., \(k'>k\).
- 2.
References
Achlioptas, D.: Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281 (2001)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++: a clustering algorithm for data streams. J. Exp. Algorithmics (JEA) 17, 2–4 (2012)
Asuncion, A., Newman, D.: UCI machine learning repository (2007)
Bezdek, J.C.: Primer on Cluster Analysis: Four Basic Methods that (Usually) Work, vol. 1. First Edition Design Publishing (2017)
Bezdek, J.C., Hathaway, R.J.: VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2225–2230 (2002)
Bezdek, J.C., Ye, X., Popescu, M., Keller, J., Zare, A.: Random projection below the JL limit. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2414–2423 (2016)
Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM (2001)
Chen, K., Liu, L.: Detecting the change of clustering structure in categorical data streams. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 504–508. SIAM (2006)
Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J. Cybern. 3(3), 32–57 (1973)
Hathaway, R.J., Bezdek, J.C., Huband, J.M.: Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit. 39(7), 1315–1324 (2006)
Havens, T.C., Bezdek, J.C.: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Eng. 24(5), 813–822 (2012)
Havens, T.C., Bezdek, J.C., Palaniswami, M.: Scalable single linkage hierarchical clustering for big data. In: IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 396–401. IEEE (2013)
Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
Kumar, D., Bezdek, J.C., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens, T.C.: A hybrid approach to clustering in big data. IEEE Trans. Cybern. 46(10), 2372–2385 (2016)
Kumar, D., Palaniswami, M., Rajasegarar, S., Leckie, C., Bezdek, J.C., Havens, T.C.: clusiVAT: a mixed visual/numerical clustering algorithm for big data. In: IEEE International Conference on Big Data, pp. 112–117. IEEE (2013)
Lawson, R.G., Jurs, P.C.: New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci. 30(1), 36–41 (1990)
Rathore, P., Bezdek, J.C., Erfani, S.M., Rajasegarar, S., Palaniswami, M.: Ensemble fuzzy clustering using cumulative aggregation on random projections. IEEE Trans. Fuzzy Syst. 26(3), 1510–1524 (2018)
Rathore, P., Kumar, D., Bezdek, J.C., Rajasegarar, S., Palaniswami, M.S.: A rapid hybrid clustering algorithm for large volumes of high dimensional data. IEEE Trans. Knowl. Data Eng. (2018)
Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Rathore, P., Bezdek, J.C., Palaniswami, M. (2021). Fast Cluster Tendency Assessment for Big, High-Dimensional Data. In: Lesot, MJ., Marsala, C. (eds) Fuzzy Approaches for Soft Computing and Approximate Reasoning: Theories and Applications. Studies in Fuzziness and Soft Computing, vol 394. Springer, Cham. https://doi.org/10.1007/978-3-030-54341-9_12
DOI: https://doi.org/10.1007/978-3-030-54341-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54340-2
Online ISBN: 978-3-030-54341-9
eBook Packages: Intelligent Technologies and Robotics (R0)