Validity indices for clusters of uncertain data objects

Tavakkol, Behnam; Jeong, Myong K.; Albin, Susan L.

doi:10.1007/s10479-018-3043-4

S.I.: Data Mining and Decision Analytics
Published: 10 September 2018

Volume 303, pages 321–357, (2021)
Annals of Operations Research

Behnam Tavakkol¹,
Myong K. Jeong² &
Susan L. Albin²

Abstract

Clustering validity indices are the main tools for evaluating the quality of formed clusters and for determining the correct number of clusters. They can be applied to the results of clustering algorithms to validate the performance of those algorithms. In this paper, two clustering validity indices for uncertain data, named uncertain Silhouette and Order Statistic, are developed. To the best of our knowledge, no clustering validity index in the literature is designed for uncertain objects and can be used to validate the performance of uncertain clustering algorithms. Our proposed validity indices use probabilistic distance measures to capture the distance between uncertain objects. They outperform existing validity indices for certain data in validating clusters of uncertain data objects and are robust to outliers. The Order Statistic index in particular, a general form of the uncertain Dunn validity index (also developed here), handles well the instances in which a single cluster is relatively scattered (not compact) compared to the other clusters, or in which two clusters are close (not well separated) compared to the other clusters. Such instances can cause existing clustering validity indices to fail to detect the correct number of clusters.
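The idea of computing a validity index from probabilistic distances can be illustrated with a minimal sketch. It does not reproduce the authors' definitions, which are not given in this abstract. It assumes, purely for illustration, that each uncertain object is a univariate Gaussian, that the probabilistic distance is the Bhattacharyya distance in its closed form for Gaussians, and that the classical Silhouette and Dunn formulas are applied unchanged to the resulting distance matrix. All function names are hypothetical. The Order Statistic index is not sketched because its definition is not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumption for illustration: an uncertain object is a univariate Gaussian
# given as (mean, standard deviation).
Gaussian = Tuple[float, float]


def bhattacharyya(p: Gaussian, q: Gaussian) -> float:
    """Closed-form Bhattacharyya distance between two univariate Gaussians."""
    (m1, s1), (m2, s2) = p, q
    var_sum = s1 * s1 + s2 * s2
    return 0.25 * (m1 - m2) ** 2 / var_sum + 0.5 * math.log(var_sum / (2.0 * s1 * s2))


def distance_matrix(objs: Sequence[Gaussian]) -> List[List[float]]:
    """Pairwise probabilistic distances between all uncertain objects."""
    return [[bhattacharyya(p, q) for q in objs] for p in objs]


def silhouette(dist: List[List[float]], labels: Sequence[int]) -> float:
    """Mean classical silhouette width over a precomputed distance matrix."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters")
    n = len(labels)
    total = 0.0
    for i in range(n):
        sums: Dict[int, float] = {}
        counts: Dict[int, int] = {}
        for j in range(n):
            if j != i:
                sums[labels[j]] = sums.get(labels[j], 0.0) + dist[i][j]
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            continue  # singleton cluster: its silhouette width is taken as 0
        a = sums[own] / counts[own]  # mean distance within the object's own cluster
        b = min(sums[c] / counts[c] for c in counts if c != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def dunn(dist: List[List[float]], labels: Sequence[int]) -> float:
    """Classical Dunn index: smallest separation over largest diameter."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters")
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    separation = min(dist[i][j] for i, j in pairs if labels[i] != labels[j])
    diameter = max((dist[i][j] for i, j in pairs if labels[i] == labels[j]), default=0.0)
    return separation / diameter if diameter > 0 else math.inf
```

Under these assumptions, a partition that groups objects with nearby distributions should score higher on both indices than one that mixes distant objects, so either score can be compared across candidate numbers of clusters. Because the Dunn ratio depends on a single minimum and a single maximum, one scattered cluster or one close pair of clusters can dominate it, which is the kind of instance the abstract says the Order Statistic index is meant to handle.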

Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions, which helped improve the quality of this paper.

Author information

Authors and Affiliations

Stockton University, Galloway, USA
Behnam Tavakkol
Rutgers University, Piscataway, USA
Myong K. Jeong & Susan L. Albin

Corresponding author

Correspondence to Myong K. Jeong.

About this article

Cite this article

Tavakkol, B., Jeong, M.K. & Albin, S.L. Validity indices for clusters of uncertain data objects. Ann Oper Res 303, 321–357 (2021). https://doi.org/10.1007/s10479-018-3043-4

Published: 10 September 2018
Issue Date: August 2021
DOI: https://doi.org/10.1007/s10479-018-3043-4
