Abstract
The basic idea of semi-supervised hierarchical clustering (ssHC) is to leverage domain knowledge, in the form of triple-wise constraints, when grouping data into clusters. In this paper, we perform extensive experiments to evaluate the effects of different distance metrics, linkage methods and constraints on the performance of two ssHC algorithms: IPoptim and UltraTran. The algorithms are run with varying proportions of constraints, ranging from 10% to 60%, on seven datasets. We found that IPoptim and UltraTran performed almost equally across the seven datasets. Interestingly, increasing the number of constraints does not always improve ssHC performance, and including too many classes degrades clustering quality. The experimental results show that ssHC with the Canberra distance performs well, alongside ssHC with well-known distances such as the Euclidean and standardized Euclidean distances. With complete linkage and a small proportion of constraints (10%), ssHC achieves F-scores close to 0.8 or above on four of the seven datasets. Moreover, non-parametric statistical tests show that the UltraTran algorithm combined with the Manhattan distance metric and the Ward.D linkage method provides the best results, and that both IPoptim and UltraTran perform better with the Canberra distance measure on the given datasets.
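To make the metric/linkage combinations concrete, the sketch below implements plain (constraint-free) agglomerative clustering with the Canberra distance and complete linkage in pure Python. This is not the authors' IPoptim or UltraTran: the ssHC algorithms additionally fold triple-wise constraints into the dissimilarity structure, a step omitted here; the toy dataset `pts` is likewise an illustrative assumption.

```python
def canberra(x, y):
    """Canberra distance: sum over coordinates of |x_i - y_i| / (|x_i| + |y_i|)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # skip coordinates where both values are zero
            total += abs(a - b) / denom
    return total

def complete_linkage(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose complete-linkage (maximum pairwise) Canberra distance is
    smallest, until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: inter-cluster distance is the largest
                # distance between any pair of members
                d = max(canberra(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.8, 8.3)]
print(complete_linkage(pts, 2))  # → [[0, 1], [2, 3]]
```

Note that the Canberra distance normalizes each coordinate's difference by the coordinates' magnitudes, which makes it sensitive to relative rather than absolute differences; this is one plausible reason it behaves differently from the Euclidean family on the datasets studied.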
Acknowledgement
We would like to thank Professor Jonathan Garibaldi for sharing the NTBC dataset with us.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Aljohani, A., Lai, D.T.C., Bell, P.C., Edirisinghe, E.A. (2017). A Comparison of Distance Metrics in Semi-supervised Hierarchical Clustering Methods. In: Huang, D.S., Hussain, A., Han, K., Gromiha, M. (eds.) Intelligent Computing Methodologies. ICIC 2017. Lecture Notes in Computer Science, vol. 10363. Springer, Cham. https://doi.org/10.1007/978-3-319-63315-2_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63314-5
Online ISBN: 978-3-319-63315-2