DOI: 10.1145/3543873.3587307
poster

C-Affinity: A Novel Similarity Measure for Effective Data Clustering

Published: 30 April 2023

Abstract

Clustering is one of the most useful data mining techniques and is widely employed in various applications. In clustering, a similarity measure, which defines how similar a pair of data objects are, plays an important role, and it is usually chosen according to the target dataset’s characteristics. However, existing similarity measures (or distances) do not reflect how data objects are distributed in the dataset, which may limit clustering accuracy. In this paper, we propose c-affinity, a new notion of a similarity measure that reflects the distribution of objects in the given dataset from a clustering point of view. By learning the data distribution, we design the c-affinity between two objects to take a higher value the more likely they are to belong to the same cluster. We use random walk with restart (RWR) on the k-nearest neighbor graph of the given dataset to measure (1) how similar a pair of objects are and (2) how densely other objects are distributed between them. Via extensive experiments on sixteen synthetic and real-world datasets, we verify that replacing an existing similarity measure with c-affinity significantly improves clustering accuracy.
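The abstract does not give the exact definition of c-affinity, so the following is only a minimal sketch of the general idea it describes: build a k-nearest neighbor graph, run random walk with restart from every object, and read pairwise scores off the result. The function names (`knn_graph`, `rwr_scores`, `rwr_affinity`), the unweighted symmetric graph, the restart probability of 0.15, and the final symmetrization step are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric, unweighted k-nearest-neighbor adjacency matrix (Euclidean)."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]      # indices of the k closest points
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                # treat edges as undirected

def rwr_scores(A, restart=0.15):
    """Column i holds the RWR stationary distribution for a walker seeded at i."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    # Closed-form solution of r_i = (1 - c) P^T r_i + c e_i for every seed i.
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * P.T)

def rwr_affinity(X, k=5, restart=0.15):
    """Illustrative pairwise affinity: symmetrized RWR scores on the kNN graph.

    This is an assumed stand-in, not the paper's c-affinity formula.
    """
    R = rwr_scores(knn_graph(X, k), restart)
    return 0.5 * (R + R.T)
```

Under these assumptions, a walker seeded at one object reaches another more often when many short paths through intermediate objects connect them, so pairs in the same dense region tend to score higher than pairs separated by a sparse gap. The dense matrix inverse is only practical for small datasets; reference [13] in the paper's bibliography concerns faster RWR computation.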


Cited By

  • (2024) A Comprehensive Evaluation of Rough Sets Clustering in Uncertainty Driven Contexts. Studia Universitatis Babeș-Bolyai Informatica 69:1 (41-56). DOI: 10.24193/subbi.2024.1.03. Online publication date: 10-Jun-2024

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023
1567 pages
ISBN:9781450394192
DOI:10.1145/3543873
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. clustering affinity
  3. nearest neighbor graph
  4. similarity measure

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

WWW '23: The ACM Web Conference 2023
April 30 - May 4, 2023
Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 38
  • Downloads (last 6 weeks): 1
Reflects downloads up to 17 Jan 2025
