Abstract
A cluster validity measure is an important component of the cluster validation process: once a clustering arrangement is found, it is compared with the actual clustering arrangement, or gold standard, if one is available. Several external cluster validity measures (VMs) exist for this purpose. However, not all measures are equally suitable for a given clustering problem. For example, in entity matching, the F-measure is preferred over the McNemar index because it satisfies a given set of desirable properties for the entity matching problem. We have observed, however, that even when all the existing desirable properties are satisfied, some VMs still fail to detect important differences between two clustering arrangements. We therefore propose an additional property, termed sensitivity, which can be added to the set of desirable properties and used alongside the existing ones in the cluster validation process. In this paper, the sensitivity property of a VM is formally introduced, and its value is computed using the proposed identity matrix based technique. A comprehensive analysis compares several existing VMs and determines their suitability for the entity matching technique. In this way, the paper helps to improve the performance of the cluster validation process.
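To make the idea of an external VM concrete, the following is a minimal sketch of how a clustering arrangement might be scored against a gold standard. It assumes the common pair-counting formulation of the F-measure and, for comparison, the Rand index; the chapter's own definitions may differ, and the sketch does not implement the proposed sensitivity property. The function names and the example labels are illustrative, not taken from the chapter.

```python
from itertools import combinations


def pair_counts(predicted, gold):
    """Classify every pair of objects by whether each clustering groups it.

    Returns (both, only_pred, only_gold, neither): pairs placed together by
    both clusterings, by the predicted one only, by the gold standard only,
    and by neither.
    """
    both = only_pred = only_gold = neither = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            both += 1
        elif same_pred:
            only_pred += 1
        elif same_gold:
            only_gold += 1
        else:
            neither += 1
    return both, only_pred, only_gold, neither


def pairwise_f_measure(predicted, gold):
    """Harmonic mean of pairwise precision and recall (assumed formulation)."""
    both, only_pred, only_gold, _ = pair_counts(predicted, gold)
    if both == 0:
        # Convention chosen for this sketch: no shared pair means a score of 0.
        return 0.0
    precision = both / (both + only_pred)
    recall = both / (both + only_gold)
    return 2 * precision * recall / (precision + recall)


def rand_index(predicted, gold):
    """Fraction of object pairs on which the two clusterings agree."""
    both, only_pred, only_gold, neither = pair_counts(predicted, gold)
    total = both + only_pred + only_gold + neither
    return (both + neither) / total if total else 1.0


# Hypothetical labels: five records, cluster ids are arbitrary integers.
gold = [0, 0, 1, 1, 2]
predicted = [0, 0, 1, 2, 2]
print(pairwise_f_measure(predicted, gold), rand_index(predicted, gold))
```

In this toy example the two measures give different scores for the same pair of arrangements, because the Rand index also credits the many pairs that both clusterings keep apart while the F-measure ignores them.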
A Appendix
Possible combinations of strings in X and Y. Strings that are not considered are marked with \(*\).
Copyright information
© 2016 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Mishra, S., Mondal, S., Saha, S. (2016). Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX. Lecture Notes in Computer Science, vol 10120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54037-4_1
DOI: https://doi.org/10.1007/978-3-662-54037-4_1
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-54036-7
Online ISBN: 978-3-662-54037-4
eBook Packages: Computer Science, Computer Science (R0)