Cluster Validating Techniques in the Presence of Duplicates

Jain, Ravi; Koronios, Andy

doi:10.1007/978-3-540-79474-5_9

Part of the book series: Studies in Computational Intelligence (SCI, volume 137)

Abstract

Detecting database records that are exact or approximate duplicates, whether caused by data entry errors, by differences in the detailed schemas of records from multiple databases, or by other reasons, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This chapter presents a comparative study of the effectiveness of these three cluster validation techniques, which involve measuring the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. Silhouette width, the Calinski & Harabasz index and the Baker & Hubert index are calculated before and after deliberately inserting exact and approximate duplicates into the data set. Comprehensive experiments on the glass, wine, iris and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, the Silhouette width, Calinski & Harabasz index and Baker & Hubert index values do not exceed those of the original data in the presence of approximate duplicates.
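
The before-and-after comparison described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the chapter's implementation: the toy data, the function names, the jitter level used for approximate duplicates, and the choice to keep the partition fixed (each duplicate inherits the label of the record it copies) are all hypothetical. The three indices follow their standard textbook definitions.

```python
import math
import random
from itertools import combinations

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def group_by_label(points, labels):
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return groups

def silhouette_width(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    groups = group_by_label(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        n_own = len(groups[lab]) - 1          # exclude the point itself
        if n_own == 0:
            continue                          # singleton cluster: s(i) = 0
        a = sum(math.dist(p, q) for q in groups[lab]) / n_own
        b = min(sum(math.dist(p, q) for q in g) / len(g)
                for other, g in groups.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (B / (k - 1)) / (W / (n - k))."""
    groups = group_by_label(points, labels)
    n, k = len(points), len(groups)
    overall = centroid(points)
    between = sum(len(g) * math.dist(centroid(g), overall) ** 2
                  for g in groups.values())
    within = sum(math.dist(p, centroid(g)) ** 2
                 for g in groups.values() for p in g)
    return (between / (k - 1)) / (within / (n - k))

def baker_hubert_gamma(points, labels):
    """Gamma = (S+ - S-) / (S+ + S-) over within/between distance pairs."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    concordant = sum(1 for w in within for b in between if w < b)
    discordant = sum(1 for w in within for b in between if w > b)
    return (concordant - discordant) / (concordant + discordant)

def with_duplicates(points, labels, n_dup, noise, seed=0):
    """Append n_dup copies of random records; noise=0.0 gives exact duplicates."""
    rng = random.Random(seed)
    pts, labs = list(points), list(labels)
    for _ in range(n_dup):
        i = rng.randrange(len(points))
        pts.append(tuple(x + rng.gauss(0.0, noise) for x in points[i]))
        labs.append(labels[i])                # duplicate keeps its source label
    return pts, labs

# Hypothetical toy data: two slightly overlapping clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 2.5),
          (4.0, 4.0), (4.5, 5.0), (5.0, 4.2), (3.2, 3.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

variants = {
    "original": (points, labels),
    "exact duplicates": with_duplicates(points, labels, n_dup=4, noise=0.0),
    "approximate duplicates": with_duplicates(points, labels, n_dup=4, noise=0.3),
}
for name, (pts, labs) in variants.items():
    print(f"{name:>24}: silhouette={silhouette_width(pts, labs):.3f} "
          f"CH={calinski_harabasz(pts, labs):.3f} "
          f"gamma={baker_hubert_gamma(pts, labs):.3f}")
```

Comparing the three printed rows shows how each index shifts once duplicates are added. The numbers from this toy data say nothing about the chapter's results on the glass, wine, iris and ruspini data sets.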

Author information

Authors and Affiliations

School of Computer and Information Sciences, University of South Australia, Australia
Ravi Jain & Andy Koronios

Editor information

Lakhmi C. Jain, Mika Sato-Ilic, Maria Virvou, George A. Tsihrintzis, Valentina Emilia Balas, Canicious Abeynayake

About this chapter

Cite this chapter

Jain, R., Koronios, A. (2008). Cluster Validating Techniques in the Presence of Duplicates. In: Jain, L.C., Sato-Ilic, M., Virvou, M., Tsihrintzis, G.A., Balas, V.E., Abeynayake, C. (eds) Computational Intelligence Paradigms. Studies in Computational Intelligence, vol 137. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79474-5_9

DOI: https://doi.org/10.1007/978-3-540-79474-5_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79473-8
Online ISBN: 978-3-540-79474-5
eBook Packages: Engineering (R0)
