Validity indices for clusters of uncertain data objects

  • S.I.: Data Mining and Decision Analytics
Annals of Operations Research

Abstract

Clustering validity indices are the main tools for evaluating the quality of formed clusters and for determining the correct number of clusters. They can be applied to the output of clustering algorithms to validate the performance of those algorithms. In this paper, two clustering validity indices, named uncertain Silhouette and Order Statistic, are developed for uncertain data. To the best of our knowledge, no clustering validity index in the literature is designed for uncertain objects and can be used to validate the performance of uncertain clustering algorithms. The proposed validity indices use probabilistic distance measures to capture the distance between uncertain objects. They outperform existing validity indices designed for certain data in validating clusters of uncertain data objects, and they are robust to outliers. The Order Statistic index in particular, a general form of the uncertain Dunn validity index (also developed here), handles two difficult cases well: a single cluster that is relatively scattered (not compact) compared to the other clusters, and two clusters that are close (not well separated) compared to the other clusters. Such cases can cause existing clustering validity indices to fail to detect the correct number of clusters.
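
To make the idea of a validity index built on a probabilistic distance concrete, the following is a minimal sketch and not the paper's own definition, which is not reproduced in this preview. It assumes each uncertain object is a univariate Gaussian given as a (mean, standard deviation) pair, uses the closed-form Bhattacharyya distance between two Gaussians as the probabilistic distance, and plugs that distance into the classical silhouette of Rousseeuw (1987). The function names `bhattacharyya_gauss` and `silhouette` and the toy data are illustrative assumptions.

```python
import math


def bhattacharyya_gauss(p, q):
    """Bhattacharyya distance between two univariate Gaussians (mean, std).

    Assumption: uncertain objects are modeled as 1-D Gaussians; the paper
    may use other distributions or other probabilistic distances.
    """
    (m1, s1), (m2, s2) = p, q
    var_sum = s1 * s1 + s2 * s2
    return (0.25 * (m1 - m2) ** 2 / var_sum
            + 0.5 * math.log(var_sum / (2.0 * s1 * s2)))


def silhouette(objects, labels, dist=bhattacharyya_gauss):
    """Average classical silhouette width under a pairwise distance `dist`."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, obj in enumerate(objects):
        own = clusters[labels[i]]
        if len(own) == 1:
            # Convention from the classical silhouette: singletons score 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(dist(obj, objects[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(obj, objects[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated groups of uncertain objects.
objects = [(0.0, 1.0), (0.5, 1.2), (-0.3, 0.9),
           (10.0, 1.0), (10.4, 1.1), (9.7, 0.8)]
good = silhouette(objects, [0, 0, 0, 1, 1, 1])  # matches the two groups
bad = silhouette(objects, [0, 1, 0, 1, 0, 1])   # mixes the two groups
print(good > bad)
```

In this sketch a partition that matches the two groups scores higher than one that mixes them, which is how such an index would be used to compare candidate clusterings or candidate numbers of clusters.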

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their valuable comments and suggestions which helped to improve the quality of this paper.

Author information

Corresponding author

Correspondence to Myong K. Jeong.

Cite this article

Tavakkol, B., Jeong, M.K. & Albin, S.L. Validity indices for clusters of uncertain data objects. Ann Oper Res 303, 321–357 (2021). https://doi.org/10.1007/s10479-018-3043-4

  • DOI: https://doi.org/10.1007/s10479-018-3043-4
