
Cluster Validating Techniques in the Presence of Duplicates


Part of the book series: Studies in Computational Intelligence ((SCI,volume 137))

Abstract

Detecting database records that are exact or approximate duplicates, whether they arise from data entry errors, from differences in the detailed schemas of records drawn from multiple databases, or from other causes, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This chapter presents a comparative study of these three cluster validation techniques, which involve measuring the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. The Silhouette width, the Calinski & Harabasz index and the Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, in the presence of approximate duplicates, the Silhouette width, the Calinski & Harabasz index and the Baker & Hubert index do not exceed the values obtained on the original data.
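
The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the toy points, the fixed cluster labels, the jitter size and the helper names (`silhouette_width`, `calinski_harabasz`, `baker_hubert_gamma`, `with_duplicates`) are all assumptions made for illustration, each inserted duplicate simply inherits the label of the record it copies, and the indices are written out from their standard textbook definitions using only the Python standard library. The sketch does not attempt to reproduce the chapter's findings.

```python
import math
import random
from itertools import combinations


def silhouette_width(points, labels):
    """Average silhouette width: mean of (b - a) / max(a, b) over all points."""
    clusters = set(labels)
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            if ds:
                mean_dist[c] = sum(ds) / len(ds)
        a = mean_dist.get(labels[i])
        if a is None:
            continue  # singleton cluster contributes 0 by convention
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (B / (k - 1)) / (W / (n - k))."""
    n, k, dim = len(points), len(set(labels)), len(points[0])

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    overall = centroid(points)
    between = within = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        cen = centroid(members)
        between += len(members) * math.dist(cen, overall) ** 2
        within += sum(math.dist(p, cen) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


def baker_hubert_gamma(points, labels):
    """Gamma = (s+ - s-) / (s+ + s-), comparing every within-cluster
    distance with every between-cluster distance."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)


def with_duplicates(points, labels, indices, jitter=0.0, seed=0):
    """Append copies of the chosen records; jitter > 0 makes them approximate.
    (Hypothetical helper: uniform noise is only one way to model near-duplicates.)"""
    rng = random.Random(seed)
    new_points, new_labels = list(points), list(labels)
    for i in indices:
        new_points.append(tuple(x + rng.uniform(-jitter, jitter) for x in points[i]))
        new_labels.append(labels[i])
    return new_points, new_labels


# Toy two-cluster data set (assumed, not taken from the chapter).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 2.5),
          (3.5, 3.0), (4.0, 4.2), (4.5, 3.5), (5.0, 4.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
dup_idx = [0, 3, 4, 7]

variants = {
    "original": (points, labels),
    "exact duplicates": with_duplicates(points, labels, dup_idx),
    "approximate duplicates": with_duplicates(points, labels, dup_idx, jitter=0.2),
}
for name, (pts, labs) in variants.items():
    print(f"{name:24s} silhouette={silhouette_width(pts, labs):.3f} "
          f"CH={calinski_harabasz(pts, labs):.3f} "
          f"gamma={baker_hubert_gamma(pts, labs):.3f}")
```

Holding the partition fixed isolates the effect of the inserted records on each index. A fuller experiment would re-run the clustering algorithm on the augmented data, which this sketch leaves out.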



Editor information

Lakhmi C. Jain, Mika Sato-Ilic, Maria Virvou, George A. Tsihrintzis, Valentina Emilia Balas, Canicious Abeynayake

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Jain, R., Koronios, A. (2008). Cluster Validating Techniques in the Presence of Duplicates. In: Jain, L.C., Sato-Ilic, M., Virvou, M., Tsihrintzis, G.A., Balas, V.E., Abeynayake, C. (eds) Computational Intelligence Paradigms. Studies in Computational Intelligence, vol 137. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79474-5_9

  • DOI: https://doi.org/10.1007/978-3-540-79474-5_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-79473-8

  • Online ISBN: 978-3-540-79474-5

  • eBook Packages: Engineering, Engineering (R0)
