Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 394)

Abstract

Assessment of clustering tendency is an important first step in crisp or fuzzy cluster analysis. One tool for assessing cluster tendency is the Visual Assessment of Tendency (VAT) algorithm. The VAT and improved VAT (iVAT) algorithms have been successful in revealing potential cluster structure as visual images for various datasets, but they can be computationally expensive for datasets with a very large number of samples and/or dimensions. Scalable versions of VAT/iVAT, such as sVAT/siVAT, have been proposed for approximating iVAT, but they remain slow when the data are large in both the number of records and the number of dimensions. In this chapter, we introduce two new algorithms for obtaining approximate iVAT images that can be used to visually estimate the potential number of clusters in big data. We compare the two proposed methods with the original version of siVAT on five large, high-dimensional datasets, and demonstrate that both new methods provide visual evidence about potential cluster structure in these datasets in significantly less time than siVAT, with no apparent loss of accuracy or visual acuity.
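
For readers unfamiliar with VAT, the following is a minimal sketch of the reordering step as it is commonly described for the original algorithm [5]: objects are reordered by a Prim-like pass over the dissimilarity matrix, so that tight groups show up as dark blocks along the diagonal of the reordered image. The function names `vat_order` and `vat_image` are illustrative choices, not names from the chapter, and the sketch covers neither the iVAT distance transform nor the sampling used by sVAT/siVAT or the two methods proposed here.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering of the objects behind dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in D.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    remaining = np.ones(n, dtype=bool)
    remaining[i] = False
    # min_dist[q]: smallest dissimilarity from q to any object already ordered.
    min_dist = D[i].copy()
    for _ in range(n - 1):
        # Pick the unordered object closest to the ordered set (Prim-like step).
        candidates = np.where(remaining, min_dist, np.inf)
        j = int(np.argmin(candidates))
        order.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)

def vat_image(D):
    """Return the reordered dissimilarity matrix and the ordering used."""
    order = vat_order(D)
    D = np.asarray(D, dtype=float)
    return D[np.ix_(order, order)], order
```

The reordered matrix can be displayed as a grayscale image; the number of dark diagonal blocks then serves as a visual estimate of the number of clusters. This pass costs on the order of n² operations for n objects, which is one reason sampling-based approximations are attractive for large n.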

Notes

  1. n = a few hundred samples is a good choice for most datasets. In [18], \(k'\) and n are randomly chosen between 2k and 4k, and between 10k and 30k, respectively, where \(k', n \in \mathbb{Z}\) and k is the number of labeled subsets in the ground truth data. Here \(k'\) is an overestimate of k, i.e., \(k' > k\).

  2. These datasets can be found at the UCI machine learning data repository [2, 3]. The features were normalized to the interval [0, 1] by subtracting the minimum and then dividing by the subsequent maximum so that they all had the same scale.
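
The normalization described in Note 2 can be sketched as follows. This assumes the minimum and maximum are taken per feature (per column), which the note implies but does not state outright; the function name `min_max_normalize` and the handling of constant features are illustrative assumptions.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    # Step 1: subtract the per-feature minimum, so each column starts at 0.
    shifted = X - X.min(axis=0)
    # Step 2: divide by the maximum of the shifted column.
    col_max = shifted.max(axis=0)
    # Assumption: a constant feature (maximum 0 after shifting) is left at 0
    # instead of being divided by zero.
    safe_max = np.where(col_max > 0, col_max, 1.0)
    return shifted / safe_max
```

For example, a feature with values 1, 3, 5 becomes 0, 0.5, 1 under this scaling.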

References

  1. Achlioptas, D.: Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281 (2001)

  2. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++: a clustering algorithm for data streams. J. Exp. Algorithmics (JEA) 17, 2–4 (2012)

  3. Asuncion, A., Newman, D.: UCI machine learning repository (2007)

  4. Bezdek, J.C.: Primer on Cluster Analysis: Four Basic Methods that (Usually) Work, vol. 1. First Edition Design Publishing (2017)

  5. Bezdek, J.C., Hathaway, R.J.: VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2225–2230 (2002)

  6. Bezdek, J.C., Ye, X., Popescu, M., Keller, J., Zare, A.: Random projection below the JL limit. In: Proceedings of International Joint Conference on Neural Networks (IJCNN), pp. 2414–2423 (2016)

  7. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM (2001)

  8. Chen, K., Liu, L.: Detecting the change of clustering structure in categorical data streams. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 504–508. SIAM (2006)

  9. Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J. Cybern. 3(3), 32–57 (1973)

  10. Hathaway, R.J., Bezdek, J.C., Huband, J.M.: Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit. 39(7), 1315–1324 (2006)

  11. Havens, T.C., Bezdek, J.C.: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Eng. 24(5), 813–822 (2012)

  12. Havens, T.C., Bezdek, J.C., Palaniswami, M.: Scalable single linkage hierarchical clustering for big data. In: IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 396–401. IEEE (2013)

  13. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)

  14. Kumar, D., Bezdek, J.C., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens, T.C.: A hybrid approach to clustering in big data. IEEE Trans. Cybern. 46(10), 2372–2385 (2016)

  15. Kumar, D., Palaniswami, M., Rajasegarar, S., Leckie, C., Bezdek, J.C., Havens, T.C.: clusiVAT: a mixed visual/numerical clustering algorithm for big data. In: IEEE International Conference on Big Data, pp. 112–117. IEEE (2013)

  16. Lawson, R.G., Jurs, P.C.: New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci. 30(1), 36–41 (1990)

  17. Rathore, P., Bezdek, J.C., Erfani, S.M., Rajasegarar, S., Palaniswami, M.: Ensemble fuzzy clustering using cumulative aggregation on random projections. IEEE Trans. Fuzzy Syst. 26(3), 1510–1524 (2018)

  18. Rathore, P., Kumar, D., Bezdek, J.C., Rajasegarar, S., Palaniswami, M.S.: A rapid hybrid clustering algorithm for large volumes of high dimensional data. IEEE Trans. Knowl. Data Eng. (2018)

  19. Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)

Author information

Corresponding author

Correspondence to Punit Rathore.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Rathore, P., Bezdek, J.C., Palaniswami, M. (2021). Fast Cluster Tendency Assessment for Big, High-Dimensional Data. In: Lesot, MJ., Marsala, C. (eds) Fuzzy Approaches for Soft Computing and Approximate Reasoning: Theories and Applications. Studies in Fuzziness and Soft Computing, vol 394. Springer, Cham. https://doi.org/10.1007/978-3-030-54341-9_12
