
A Comparison of Distance Metrics in Semi-supervised Hierarchical Clustering Methods

  • Conference paper
  • Published in: Intelligent Computing Methodologies (ICIC 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10363)

Abstract

The basic idea of semi-supervised hierarchical clustering (ssHC) is to leverage domain knowledge, in the form of triple-wise constraints, to group data into clusters. In this paper, we perform extensive experiments to evaluate the effects of different distance metrics, linkage measures and constraints on the performance of two ssHC algorithms: IPoptim and UltraTran. The algorithms are run with varying proportions of constraints on the different datasets, ranging from 10% to 60%. We found that IPoptim and UltraTran performed almost equally across the seven datasets. Interestingly, increasing the number of constraints does not always improve ssHC performance, and including too many classes degrades clustering performance. The experimental results show that ssHC with the Canberra distance performs well, in addition to ssHC with well-known distances such as the Euclidean and standardized Euclidean distances. With complete linkage and a small proportion of constraints (10%), ssHC achieves an F-score of approximately 0.8 or above for four of the seven datasets. Moreover, non-parametric statistical tests show that the UltraTran algorithm combined with the Manhattan distance metric and the Ward.D linkage method provides the best results, and that IPoptim and UltraTran with the Canberra distance measure also perform well on the given datasets.
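For reference, the distance metrics compared in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' code; the function names are our own, and the zero-denominator convention for the Canberra distance follows the common definition:

```python
def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|).
    Terms where both coordinates are zero are treated as 0
    (the usual convention), which makes the metric sensitive
    to small differences near the origin."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def euclidean(x, y):
    """Ordinary Euclidean (L2) distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Manhattan (L1, city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

In practice these metrics would be supplied to a hierarchical clustering routine (e.g. as the pairwise-distance function used when merging clusters under complete or Ward linkage).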



Acknowledgement

We would like to thank Professor Jonathan Garibaldi for sharing the NTBC dataset with us.

Author information


Correspondence to Abeer Aljohani.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Aljohani, A., Lai, D.T.C., Bell, P.C., Edirisinghe, E.A. (2017). A Comparison of Distance Metrics in Semi-supervised Hierarchical Clustering Methods. In: Huang, D.-S., Hussain, A., Han, K., Gromiha, M. (eds.) Intelligent Computing Methodologies. ICIC 2017. Lecture Notes in Computer Science, vol. 10363. Springer, Cham. https://doi.org/10.1007/978-3-319-63315-2_63

  • DOI: https://doi.org/10.1007/978-3-319-63315-2_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63314-5

  • Online ISBN: 978-3-319-63315-2

  • eBook Packages: Computer Science, Computer Science (R0)
