Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique

Mishra, Sumit; Mondal, Samrat; Saha, Sriparna

doi:10.1007/978-3-662-54037-4_1

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 10120)

Abstract

A cluster validity measure is an important component of the cluster validation process: once a clustering arrangement is found, it is compared with the actual clustering arrangement, or gold standard, if one is available. Several external cluster validity measures (VMs) exist for this purpose. However, not all measures are equally suitable for a given clustering problem. For example, in entity matching, the F-measure is preferred over the McNemar index because the former satisfies a given set of desirable properties for the entity matching problem. We have observed, however, that even when a VM satisfies all the existing desirable properties, it may still fail to detect some important differences between two clustering arrangements. We therefore propose an additional property, termed sensitivity, to be added to the set of desirable properties and used alongside the existing ones in the cluster validation process. In this paper, the sensitivity property of a VM is formally introduced, and its value is computed using the proposed identity-matrix-based technique. A comprehensive analysis compares several existing VMs and determines their suitability for entity matching. The paper thus helps to improve the performance of the cluster validation process.
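To make the idea of an external validity measure concrete, the following minimal Python sketch compares a clustering arrangement against a gold standard using pair counting. It is an illustration only, not the paper's method: the pairwise form of the F-measure, the use of the Rand index as a second measure, the function names, and the toy data are all assumptions introduced here, and the paper's formal definition of sensitivity and its identity-matrix-based computation are not reproduced.

```python
from itertools import combinations


def pair_counts(predicted, gold):
    """Classify every unordered pair of elements by agreement.

    Both arguments map each element to a cluster label.
    Returns (tp, fp, fn, tn): pairs together in both, together only in
    the prediction, together only in the gold standard, apart in both.
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(gold), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pairwise_f_measure(predicted, gold):
    """Harmonic mean of pairwise precision and recall (assumed variant)."""
    tp, fp, fn, _ = pair_counts(predicted, gold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rand_index(predicted, gold):
    """Fraction of element pairs on which the two clusterings agree."""
    tp, fp, fn, tn = pair_counts(predicted, gold)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


# Hypothetical toy data: four records, two true entities.
gold = {"a": 0, "b": 0, "c": 1, "d": 1}
split_cd = {"a": 0, "b": 0, "c": 1, "d": 2}  # wrongly separates c and d
split_ab = {"a": 0, "b": 1, "c": 2, "d": 2}  # wrongly separates a and b

for name, arrangement in [("split_cd", split_cd), ("split_ab", split_ab)]:
    print(name,
          round(pairwise_f_measure(arrangement, gold), 3),
          round(rand_index(arrangement, gold), 3))
```

In this toy case the two arrangements differ, yet both receive a pairwise F-measure of 2/3 and a Rand index of 5/6. This loosely illustrates how a measure can assign identical scores to distinct arrangements; whether it matches the notion of sensitivity formalized in the paper is not established here.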

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, 801103, Bihar, India
Sumit Mishra, Samrat Mondal & Sriparna Saha

Corresponding author

Correspondence to Samrat Mondal.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner

A Appendix

Possible combinations of strings in X and Y. Strings that are not considered are marked with \(*\).

Table 11. Possible strings for \(n = 4\)

Table 12. Possible strings for \(n = 4\) by applying H1

Table 13. Possible strings for \(n = 4\) by applying H2

Table 14. Possible strings for \(n = 4\) by applying H3

Table 15. Possible strings for \(n = 4\) by applying heuristics H1 and H2

Table 16. Possible strings for \(n = 5\) by applying H3

Table 17. Possible strings for \(n = 4\) by applying heuristics H2 and H3

Table 18. Possible strings for \(n = 4\) by applying heuristics H1, H2 and H3

Table 19. Possible strings for \(n = 5\) by applying heuristics H2 and H3

Table 20. Possible strings for \(n = 5\) by applying heuristics H1, H2 and H3

Cite this chapter

Mishra, S., Mondal, S., Saha, S. (2016). Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX. Lecture Notes in Computer Science, vol 10120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54037-4_1

DOI: https://doi.org/10.1007/978-3-662-54037-4_1
Published: 16 December 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-54036-7
Online ISBN: 978-3-662-54037-4
eBook Packages: Computer Science, Computer Science (R0)
