Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 10120))

Abstract

A cluster validity measure is an important component of the cluster validation process: once a clustering arrangement is found, it is compared with the actual clustering arrangement, or gold standard, if one is available. Several external cluster validity measures (VMs) exist for this purpose, but not all of them are equally suitable for a given clustering problem. In entity matching, for example, F-measure is preferred over the McNemar index because it satisfies a given set of desirable properties for the entity matching problem. We have observed, however, that even when all the existing desirable properties are satisfied, some VMs fail to detect important differences between two clustering arrangements. We therefore propose an additional property, termed sensitivity, to be added to the set of desirable properties and used alongside the existing ones in the cluster validation process. In this paper, the sensitivity property of a VM is formally introduced, and its value is computed using the proposed identity matrix based technique. A comprehensive analysis compares several existing VMs and determines their suitability for the entity matching technique. The paper thus helps improve the performance of the cluster validation process.
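
As a rough illustration of external cluster validation against a gold standard, the sketch below computes a pair-counting F-measure and the Rand index for two clusterings given as label lists. The pair-counting formulation and the function names (`co_clustered_pairs`, `pairwise_f_measure`, `rand_index`) are assumptions made for this example; the chapter's own definitions of its VMs, and its identity matrix based sensitivity computation, are not reproduced here.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f_measure(predicted, gold):
    """Pair-counting F-measure of a predicted clustering against a gold standard.

    Assumed convention for this sketch: the score is 0.0 when no pair is
    co-clustered in both arrangements.
    """
    pred_pairs = co_clustered_pairs(predicted)
    gold_pairs = co_clustered_pairs(gold)
    true_positive = len(pred_pairs & gold_pairs)
    if true_positive == 0:
        return 0.0
    precision = true_positive / len(pred_pairs)
    recall = true_positive / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


def rand_index(predicted, gold):
    """Fraction of item pairs on which the two clusterings agree."""
    n = len(gold)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    pred_pairs = co_clustered_pairs(predicted)
    gold_pairs = co_clustered_pairs(gold)
    same_in_both = len(pred_pairs & gold_pairs)
    apart_in_both = total_pairs - len(pred_pairs | gold_pairs)
    return (same_in_both + apart_in_both) / total_pairs


if __name__ == "__main__":
    gold = [0, 0, 1, 1]        # hypothetical gold standard: {a, b}, {c, d}
    predicted = [0, 0, 0, 1]   # hypothetical result:        {a, b, c}, {d}
    print(pairwise_f_measure(predicted, gold))
    print(rand_index(predicted, gold))
```

Two measures can score the same pair of arrangements quite differently, which is the kind of behaviour a property such as sensitivity is meant to characterise.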



Author information

Correspondence to Samrat Mondal.

A Appendix

Possible combinations of strings in X and Y. Strings that are not considered are marked with \(*\).

Table 11. Possible strings for \(n = 4\)
Table 12. Possible strings for \(n = 4\) by applying H1
Table 13. Possible strings for \(n = 4\) by applying H2
Table 14. Possible strings for \(n = 4\) by applying H3
Table 15. Possible strings for \(n = 4\) by applying heuristics H1 and H2
Table 16. Possible strings for \(n = 5\) by applying H3
Table 17. Possible strings for \(n = 4\) by applying heuristics H2 and H3
Table 18. Possible strings for \(n = 4\) by applying heuristics H1, H2 and H3
Table 19. Possible strings for \(n = 5\) by applying heuristics H2 and H3
Table 20. Possible strings for \(n = 5\) by applying heuristics H1, H2 and H3

Copyright information

© 2016 Springer-Verlag GmbH Germany

About this chapter

Cite this chapter

Mishra, S., Mondal, S., Saha, S. (2016). Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX. Lecture Notes in Computer Science, vol 10120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54037-4_1

  • DOI: https://doi.org/10.1007/978-3-662-54037-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-54036-7

  • Online ISBN: 978-3-662-54037-4

  • eBook Packages: Computer Science, Computer Science (R0)
