Abstract
Clustering validity indices are the main tools for evaluating the quality of formed clusters and for determining the correct number of clusters. They can be applied to the results of clustering algorithms to validate the performance of those algorithms. In this paper, two clustering validity indices, named uncertain Silhouette and Order Statistic, are developed for uncertain data. To the best of our knowledge, no clustering validity index in the literature is designed for uncertain objects and can be used to validate the performance of uncertain clustering algorithms. The proposed validity indices use probabilistic distance measures to capture the distance between uncertain objects. They outperform existing validity indices designed for certain data in validating clusters of uncertain data objects, and they are robust to outliers. The Order Statistic index in particular, a general form of the uncertain Dunn validity index (also developed here), can handle instances in which a single cluster is relatively scattered (not compact) compared to the other clusters, or in which two clusters are close (not well separated) compared to the other clusters. Such instances can cause existing clustering validity indices to fail to detect the correct number of clusters.
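The abstract does not give the formulas for the proposed indices, so the following is only a rough sketch of the general idea it describes: a probabilistic distance between uncertain objects is plugged into a classical validity index. Several assumptions are made here that do not come from the paper: each uncertain object is modeled as a multivariate normal distribution given by a (mean, covariance) pair, the Bhattacharyya distance is used as the probabilistic distance, and the classical silhouette of Rousseeuw (1987) is computed on the resulting distance matrix. The function names are hypothetical, and the authors' uncertain Silhouette may be defined differently.

```python
import numpy as np


def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate normal distributions."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1 = np.atleast_2d(cov1).astype(float)
    cov2 = np.atleast_2d(cov2).astype(float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    # slogdet keeps the log-determinant ratio numerically stable.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return mean_term + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))


def pairwise_distances(objects):
    """Symmetric matrix of distances between (mean, covariance) objects."""
    n = len(objects)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = bhattacharyya_gaussian(*objects[i], *objects[j])
    return dist


def mean_silhouette(dist, labels):
    """Average classical silhouette width; needs at least two clusters."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            continue  # singleton cluster: silhouette is taken as 0
        # dist[i, i] is 0, so divide by the number of *other* members.
        a = dist[i, same].sum() / (same.sum() - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != own)
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


# Illustrative data (assumed): two objects near the origin, two near (10, 10).
identity = np.eye(2)
objects = [([0.0, 0.0], identity), ([0.5, 0.0], identity),
           ([10.0, 10.0], identity), ([10.0, 10.5], identity)]
dist = pairwise_distances(objects)
good = mean_silhouette(dist, [0, 0, 1, 1])  # groups nearby objects together
bad = mean_silhouette(dist, [0, 1, 0, 1])   # mixes the two groups
print(good > bad)
```

In this toy setting the partition that groups nearby objects scores higher than the mixed one. To choose the number of clusters, the same score would be computed for the partition obtained at each candidate number and the best-scoring one kept. Note that the Bhattacharyya distance does not satisfy the triangle inequality, which the silhouette computation does not require.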
Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of this paper.
Cite this article
Tavakkol, B., Jeong, M.K. & Albin, S.L. Validity indices for clusters of uncertain data objects. Ann Oper Res 303, 321–357 (2021). https://doi.org/10.1007/s10479-018-3043-4
DOI: https://doi.org/10.1007/s10479-018-3043-4