Abstract
Structural graph clustering (SCAN) is a classic graph clustering algorithm. A key step in SCAN is computing the structural similarity between vertices from the overlap ratio of their one-hop neighborhoods. Given two vertices u and v, existing studies consider only the case where u and v are neighbors. Consequently, the structural similarity between non-neighboring vertices in SCAN is always zero, and on weighted graphs, using only one-hop neighbors discards the edge weights. Both limitations may fail to reflect the true closeness of two vertices and can lead to low-quality clustering results.
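As a minimal sketch of the one-hop similarity described above, the following assumes the commonly cited SCAN formulation, in which the overlap of closed neighborhoods is normalized by the geometric mean of their sizes. The function name `scan_similarity`, the adjacency-dict representation, and the toy graph are illustrative assumptions. Returning zero for non-adjacent pairs follows the abstract's statement rather than the formula itself.

```python
from math import sqrt

def scan_similarity(adj, u, v):
    """One-hop structural similarity between u and v (illustrative sketch).

    adj maps each vertex to a list of its neighbors (unweighted, undirected).
    Assumption: non-adjacent pairs get 0.0, as stated in the abstract.
    """
    if v not in adj[u]:
        return 0.0
    # Closed neighborhoods: each vertex counts as its own neighbor.
    nu = set(adj[u]) | {u}
    nv = set(adj[v]) | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

print(scan_similarity(adj, "a", "b"))  # adjacent, identical closed neighborhoods
print(scan_similarity(adj, "a", "d"))  # share neighbor c, but not adjacent
```

In this toy graph, `a` and `d` share the neighbor `c` yet receive similarity zero because they are not adjacent, which is the limitation the abstract points to.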
To tackle this issue, we define and study the distance-based structural graph clustering problem. Given a distance threshold d and two vertices u and v, the structural similarity between u and v is defined as the ratio of their respective neighbors within distance at most d. We show that the newly defined distance-based SCAN achieves better clustering results than the vanilla version of SCAN. However, the new definition makes the final clustering results challenging to compute efficiently. To address this, we propose DistanceSCAN, an efficient approximate algorithm for solving the distance-based SCAN problem. The main idea of DistanceSCAN is to use all-distances bottom-k sketches (ADS) to speed up the computation of similarities. Given the ADS, we can derive the similarity between two vertices at a bounded cost of O(k).
However, to ensure that the estimated similarity has an approximation guarantee, k must still be set to a value in the thousands. This incurs a high computational cost when computing the similarities between neighboring vertices. To address this, we further construct histograms to prune the structural similarity computations for vertex pairs. Extensive experiments on real datasets validate the effectiveness and efficiency of DistanceSCAN.
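The following sketch illustrates the two ideas in this paragraph under stated assumptions; it is not the DistanceSCAN implementation. It assumes the "ratio" is a Jaccard-style overlap of the two sets of vertices within distance d, and it builds a plain bottom-k sketch of a single d-neighborhood directly instead of a full all-distances sketch. All function names, the hash-based rank, and the toy graph are hypothetical.

```python
import hashlib
import heapq

def neighbors_within(adj, source, d):
    """Vertices at shortest-path distance <= d from source (Dijkstra with a cutoff)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = du + w
            if nd <= d and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)

def exact_similarity(adj, u, v, d):
    """Assumed Jaccard overlap of the two d-neighborhoods."""
    nu, nv = neighbors_within(adj, u, d), neighbors_within(adj, v, d)
    return len(nu & nv) / len(nu | nv)

def rank(vertex):
    """Deterministic pseudo-random rank in [0, 1) derived from a hash."""
    digest = hashlib.sha256(str(vertex).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k(vertices, k):
    """Bottom-k sketch: the k smallest ranks of the set, in sorted order."""
    return sorted(rank(x) for x in vertices)[:k]

def estimated_similarity(sketch_u, sketch_v, k):
    """Bottom-k Jaccard estimate: among the k smallest ranks of the union,
    count the fraction that appears in both sketches."""
    su, sv = set(sketch_u), set(sketch_v)
    union = sorted(su | sv)[:k]
    hits = sum(1 for r in union if r in su and r in sv)
    return hits / len(union)

# Hypothetical weighted graph: vertex -> list of (neighbor, edge weight).
adj = {
    "a": [("b", 1.0), ("c", 2.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("a", 2.0), ("b", 1.0), ("d", 4.0)],
    "d": [("c", 4.0)],
}

d, k = 4.0, 8
sk_a = bottom_k(neighbors_within(adj, "a", d), k)
sk_d = bottom_k(neighbors_within(adj, "d", d), k)
print(exact_similarity(adj, "a", "d", d), estimated_similarity(sk_a, sk_d, k))
```

Here `a` and `d` are not adjacent, yet they obtain a nonzero similarity because `c` lies within distance 4 of both. Since k exceeds both neighborhood sizes in this toy example, the estimate coincides with the exact value; on large neighborhoods it is only an approximation. The sort in `estimated_similarity` is used for brevity, and a linear merge of the two sorted sketches would bring the per-pair cost in line with the O(k) bound mentioned above.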
Index Terms
- An Efficient Algorithm for Distance-based Structural Graph Clustering