Fast Cluster Tendency Assessment for Big, High-Dimensional Data

Rathore, Punit; Bezdek, James C.; Palaniswami, Marimuthu

doi:10.1007/978-3-030-54341-9_12

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 394)

Abstract

Assessment of clustering tendency is an important first step in crisp or fuzzy cluster analysis. One tool for assessing cluster tendency is the Visual Assessment of Tendency (VAT) algorithm. The VAT and improved VAT (iVAT) algorithms have been successful in revealing potential cluster structure in the form of visual images for various datasets, but they can be computationally expensive for datasets with a very large number of samples and/or dimensions. Scalable versions of VAT/iVAT, such as sVAT/siVAT, have been proposed for iVAT approximation, but they too are slow when the data are large in both the number of records and the number of dimensions. In this chapter, we introduce two new algorithms to obtain approximate iVAT images that can be used to visually estimate the potential number of clusters in big data. We compare the two proposed methods with the original version of siVAT on five large, high-dimensional datasets, and demonstrate that both new methods provide visual evidence about potential cluster structure in these datasets in significantly less time than siVAT, with no apparent loss of accuracy or visual acuity.
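
For readers unfamiliar with VAT, the reordering idea behind it can be sketched as follows. This is a minimal illustration of the basic VAT ordering as commonly described (a Prim-like traversal of a dissimilarity matrix), not the two algorithms proposed in this chapter and not sVAT/siVAT. The function names `vat_order` and `reorder` and the toy data are assumptions made for the example.

```python
def vat_order(D):
    """Return a VAT-style ordering of objects given a square dissimilarity matrix D."""
    n = len(D)
    # Start from an object that is an endpoint of the largest dissimilarity.
    start = max(range(n), key=lambda i: max(D[i]))
    order = [start]
    remaining = [i for i in range(n) if i != start]
    while remaining:
        # Pick the unselected object closest to any already-selected object.
        nxt = min(remaining, key=lambda c: min(D[r][c] for r in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def reorder(D, order):
    """Permute rows and columns of D by `order`; dark diagonal blocks suggest clusters."""
    return [[D[i][j] for j in order] for i in order]


# Toy 1-D data with two obvious groups: {0, 1} and {10, 11}.
points = [0.0, 10.0, 1.0, 11.0]
D = [[abs(a - b) for b in points] for a in points]
order = vat_order(D)
D_star = reorder(D, order)
```

Displaying `D_star` as a grayscale image would show two small low-dissimilarity blocks along the diagonal, which is the kind of visual evidence the abstract refers to.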


Notes

1. n = a few hundred samples is a good choice for most datasets. In [18], \(k'\) and n are randomly chosen between 2k and 4k, and between 10k and 30k, respectively, where \(k', n \in \mathbb{Z}\) and k is the number of labeled subsets in the ground truth data. Here \(k'\) is an overestimate of k, i.e., \(k' > k\).
2. These datasets can be found at the UCI machine learning data repository [2, 3]. Each feature was normalized to the interval [0, 1] by subtracting its minimum and then dividing by the resulting maximum, so that all features had the same scale.
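
The normalization described in note 2 can be sketched as below. This is a minimal illustration in plain Python, assuming the data are given as a list of rows with one column per feature. The function name `min_max_normalize` and the handling of constant columns (mapped to 0) are assumptions, not details taken from the chapter.

```python
def min_max_normalize(rows):
    """Scale every feature (column) of `rows` to the interval [0, 1]."""
    n_features = len(rows[0])
    # Per-feature minimum.
    mins = [min(r[j] for r in rows) for j in range(n_features)]
    # Subtract the minimum so each column starts at 0.
    shifted = [[r[j] - mins[j] for j in range(n_features)] for r in rows]
    # Maximum of each shifted column (the maximum after subtraction).
    maxs = [max(r[j] for r in shifted) for j in range(n_features)]
    # Divide by that maximum; constant columns are mapped to 0 (assumption).
    return [
        [r[j] / maxs[j] if maxs[j] > 0 else 0.0 for j in range(n_features)]
        for r in shifted
    ]


data = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled = min_max_normalize(data)
```

After scaling, every feature spans exactly [0, 1], so no single feature dominates a distance computation because of its units.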

References

1. Achlioptas, D.: Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281 (2001)
2. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++: a clustering algorithm for data streams. J. Exp. Algorithmics (JEA) 17, 2–4 (2012)
3. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
4. Bezdek, J.C.: Primer on Cluster Analysis: Four Basic Methods that (Usually) Work, vol. 1. First Edition Design Publishing (2017)
5. Bezdek, J.C., Hathaway, R.J.: VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2225–2230 (2002)
6. Bezdek, J.C., Ye, X., Popescu, M., Keller, J., Zare, A.: Random projection below the JL limit. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2414–2423 (2016)
7. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM (2001)
8. Chen, K., Liu, L.: Detecting the change of clustering structure in categorical data streams. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 504–508. SIAM (2006)
9. Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J. Cybern. 3(3), 32–57 (1973)
10. Hathaway, R.J., Bezdek, J.C., Huband, J.M.: Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit. 39(7), 1315–1324 (2006)
11. Havens, T.C., Bezdek, J.C.: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Eng. 24(5), 813–822 (2012)
12. Havens, T.C., Bezdek, J.C., Palaniswami, M.: Scalable single linkage hierarchical clustering for big data. In: IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 396–401. IEEE (2013)
13. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
14. Kumar, D., Bezdek, J.C., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens, T.C.: A hybrid approach to clustering in big data. IEEE Trans. Cybern. 46(10), 2372–2385 (2016)
15. Kumar, D., Palaniswami, M., Rajasegarar, S., Leckie, C., Bezdek, J.C., Havens, T.C.: clusiVAT: a mixed visual/numerical clustering algorithm for big data. In: IEEE International Conference on Big Data, pp. 112–117. IEEE (2013)
16. Lawson, R.G., Jurs, P.C.: New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci. 30(1), 36–41 (1990)
17. Rathore, P., Bezdek, J.C., Erfani, S.M., Rajasegarar, S., Palaniswami, M.: Ensemble fuzzy clustering using cumulative aggregation on random projections. IEEE Trans. Fuzzy Syst. 26(3), 1510–1524 (2018)
18. Rathore, P., Kumar, D., Bezdek, J.C., Rajasegarar, S., Palaniswami, M.S.: A rapid hybrid clustering algorithm for large volumes of high dimensional data. IEEE Trans. Knowl. Data Eng. (2018)
19. Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)

Author information

Authors and Affiliations

Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia
Punit Rathore & Marimuthu Palaniswami
School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
James C. Bezdek

Corresponding author

Correspondence to Punit Rathore.

Editor information

Editors and Affiliations

Sorbonne Université, CNRS, LIP6, Paris, France
Marie-Jeanne Lesot
Sorbonne Université, CNRS, LIP6, Paris, France
Christophe Marsala

About this chapter

Cite this chapter

Rathore, P., Bezdek, J.C., Palaniswami, M. (2021). Fast Cluster Tendency Assessment for Big, High-Dimensional Data. In: Lesot, MJ., Marsala, C. (eds) Fuzzy Approaches for Soft Computing and Approximate Reasoning: Theories and Applications. Studies in Fuzziness and Soft Computing, vol 394. Springer, Cham. https://doi.org/10.1007/978-3-030-54341-9_12

DOI: https://doi.org/10.1007/978-3-030-54341-9_12
Published: 27 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54340-2
Online ISBN: 978-3-030-54341-9
eBook Packages: Intelligent Technologies and Robotics (R0)
