Analytical review of clustering techniques and proximity measures

Mehta, Vivek; Bawa, Seema; Singh, Jasmeet

doi:10.1007/s10462-020-09840-7

Published: 02 May 2020

Volume 53, pages 5995–6023 (2020)

Artificial Intelligence Review

Abstract

One of the most fundamental ways to learn from any type of data is to organize it into meaningful groups (clusters) and then analyze those groups, a process known as cluster analysis. During grouping, proximity measures play a significant role in deciding how similar two objects are. Moreover, before any learning algorithm is applied to a dataset, several preprocessing aspects must be considered, such as handling data sparsity, leveraging the correlation among features, and normalizing the scales of different features. In this study, first, various proximity measures are discussed and analyzed with respect to these aspects. In addition, a theoretical procedure for selecting a proximity measure for clustering is proposed; this procedure can also be used when designing a new proximity measure. Second, clustering algorithms from different categories are reviewed and experimentally compared on datasets from different domains. The datasets are chosen to range from a very low to a very high number of dimensions. Finally, the effect of using different proximity measures in partitional and hierarchical clustering techniques is analyzed experimentally.
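
As a rough illustration of why the choice of proximity measure and the normalization of feature scales matter, the following minimal Python sketch computes three common measures and shows how min-max scaling can change which object counts as the nearest neighbor. It is not taken from the article: the function names, the toy data (age and income), and the use of min-max scaling are all assumptions made for this example.

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends on direction only, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def min_max_scale(rows):
    # Rescale every feature (column) to the range [0, 1].
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest(rows, query_index, distance):
    # Index of the object closest to rows[query_index] under `distance`.
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: distance(query, rows[i]))

# Hypothetical toy data: (age in years, income in currency units).
data = [
    [25.0, 50000.0],  # object 0 (the query)
    [60.0, 50500.0],  # object 1: similar income, very different age
    [26.0, 52000.0],  # object 2: similar age, somewhat different income
    [40.0, 58000.0],  # object 3: different on both features
]

raw_neighbor = nearest(data, 0, euclidean)
scaled_neighbor = nearest(min_max_scale(data), 0, euclidean)
print(raw_neighbor, scaled_neighbor)
```

On this toy data the unscaled Euclidean distance is dominated by the income column, so object 1 is the nearest neighbor of object 0; after min-max scaling both features contribute comparably and object 2 becomes the nearest. Passing `manhattan` or `cosine_distance` to `nearest` instead lets the same comparison be repeated for other measures.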

Notes

https://archive.ics.uci.edu/ml/datasets/iris, Accessed: 2019-05-12.
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/, Accessed: 2019-05-14.
http://featureselection.asu.edu/datasets.php, Accessed: 2019-05-15.
https://archive.ics.uci.edu/ml/datasets/Multiple+Features, Accessed: 2019-05-15.
https://www.openml.org/d/41070, Accessed: 2019-05-16.
https://cs.nyu.edu/~roweis/data.html, Accessed: 2019-05-16.
https://archive.ics.uci.edu/ml/datasets/glass+identification, Accessed: 2019-05-16.

Author information

Authors and Affiliations

Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab, 147001, India
Vivek Mehta, Seema Bawa & Jasmeet Singh

Corresponding author

Correspondence to Vivek Mehta.

About this article

Cite this article

Mehta, V., Bawa, S. & Singh, J. Analytical review of clustering techniques and proximity measures. Artif Intell Rev 53, 5995–6023 (2020). https://doi.org/10.1007/s10462-020-09840-7

Published: 02 May 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10462-020-09840-7
