Abstract
The basic idea of semi-supervised hierarchical clustering (ssHC) is to leverage domain knowledge, in the form of triple-wise constraints, when grouping data into clusters. In this paper, we perform extensive experiments to evaluate the effects of different distance metrics, linkage methods and constraints on the performance of two ssHC algorithms: IPoptim and UltraTran. The algorithms are run with varying proportions of constraints, ranging from 10% to 60%, on seven datasets. We found that IPoptim and UltraTran performed almost equally across the seven datasets. Interestingly, increasing the number of constraints does not always improve ssHC performance, and including too many classes degrades clustering quality. The experimental results show that ssHC with the Canberra distance performs well, alongside ssHC with well-known distances such as the Euclidean and standardized Euclidean distances. With complete linkage and a small proportion of constraints (10%), ssHC achieves F-scores close to 0.8 or above on four of the seven datasets. Moreover, non-parametric statistical tests show that the UltraTran algorithm combined with the Manhattan distance metric and the Ward.D linkage method provides the best results, and that both IPoptim and UltraTran perform better with the Canberra distance measure on the given datasets.
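To make the metric/linkage combinations concrete, the sketch below implements plain (constraint-free) agglomerative clustering with the Canberra distance and complete linkage in pure Python. This is not the authors' IPoptim or UltraTran: the ssHC algorithms additionally fold triple-wise constraints into the dissimilarity structure, a step omitted here; the toy dataset `pts` is likewise an illustrative assumption.

```python
def canberra(x, y):
    """Canberra distance: sum over coordinates of |x_i - y_i| / (|x_i| + |y_i|)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # skip coordinates where both values are zero
            total += abs(a - b) / denom
    return total

def complete_linkage(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose complete-linkage (maximum pairwise) Canberra distance is
    smallest, until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: inter-cluster distance is the largest
                # distance between any pair of members
                d = max(canberra(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.8, 8.3)]
print(complete_linkage(pts, 2))  # → [[0, 1], [2, 3]]
```

Note that the Canberra distance normalizes each coordinate's difference by the coordinates' magnitudes, which makes it sensitive to relative rather than absolute differences; this is one plausible reason it behaves differently from the Euclidean family on the datasets studied.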
Acknowledgement
We would like to thank Professor Jonathan Garibaldi for sharing the NTBC dataset with us.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Aljohani, A., Lai, D.T.C., Bell, P.C., Edirisinghe, E.A. (2017). A Comparison of Distance Metrics in Semi-supervised Hierarchical Clustering Methods. In: Huang, D.S., Hussain, A., Han, K., Gromiha, M. (eds.) Intelligent Computing Methodologies. ICIC 2017. Lecture Notes in Computer Science, vol. 10363. Springer, Cham. https://doi.org/10.1007/978-3-319-63315-2_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63314-5
Online ISBN: 978-3-319-63315-2