An Efficient Algorithm for Distance-based Structural Graph Clustering

Published: 30 May 2023

Abstract

Structural graph clustering (SCAN) is a classic graph clustering algorithm. A key step in SCAN is computing the structural similarity between vertices from the overlap ratio of their one-hop neighborhoods. Given two vertices u and v, existing studies consider only the case where u and v are neighbors. However, the structural similarity between non-neighboring vertices in SCAN is always zero, and using only one-hop neighbors on weighted graphs discards the edge weights. Both limitations may fail to reflect the true closeness of two vertices and may lead to low-quality clustering results.
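As a rough illustration of the one-hop structural similarity described above, the following Python sketch computes the overlap ratio of two closed neighborhoods. It assumes the cosine-style normalization commonly associated with SCAN (shared neighbors divided by the geometric mean of the two neighborhood sizes); the function name `structural_similarity` and the toy adjacency map `adj` are hypothetical and do not come from the article.

```python
from math import sqrt


def structural_similarity(adj, u, v):
    """Overlap ratio of the closed one-hop neighborhoods of u and v.

    Assumption: cosine-style normalization, i.e. the number of shared
    neighbors divided by the geometric mean of the neighborhood sizes.
    """
    nu = adj[u] | {u}  # closed neighborhood: neighbors plus the vertex itself
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

sim_ab = structural_similarity(adj, "a", "b")  # identical closed neighborhoods
sim_cd = structural_similarity(adj, "c", "d")  # d shares only part of c's neighborhood
```

Because only one-hop neighbors enter the computation, edge weights play no role here, which is the limitation the abstract points out for weighted graphs.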

To tackle this issue, we define and study the distance-based structural graph clustering problem. Given a distance threshold d and two vertices u and v, the structural similarity between u and v is defined as the ratio of their respective neighbors within distance at most d. We show that the newly defined distance-based SCAN achieves better clustering results than the vanilla version of SCAN. However, the new definition makes computing the final clustering results challenging. To tackle this efficiency issue, we propose DistanceSCAN, an efficient approximate algorithm for solving the distance-based SCAN problem. The main idea of DistanceSCAN is to use all-distances bottom-k sketches (ADS) to speed up the computation of similarities. Given the ADS, we can derive the similarity between two vertices with a bounded cost of O(k).

However, to ensure that the estimated similarity has an approximation guarantee, the value of k may still need to be in the thousands. This makes computing the similarities between neighboring vertices expensive. To address this, we further construct histograms to prune structural similarity computations for vertex pairs. Extensive experiments on real datasets validate the effectiveness and efficiency of DistanceSCAN.
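The distance-based similarity and its sketch-based estimate can be illustrated with a minimal Python sketch. It is not the authors' DistanceSCAN implementation. It assumes that the similarity is a Jaccard ratio over the sets of vertices within distance d, and it uses a plain bottom-k sketch per neighborhood instead of a full ADS (which, roughly speaking, supports such sketches for every distance). All names (`d_neighborhood`, `bottom_k`, `estimate_jaccard`) and the toy graph are hypothetical.

```python
import heapq
import random


def d_neighborhood(graph, source, d):
    """Vertices within weighted distance d of source (Dijkstra with a cutoff)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            dv = du + w
            if dv <= d and dv < dist.get(v, float("inf")):
                dist[v] = dv
                heapq.heappush(heap, (dv, v))
    return set(dist)


def exact_jaccard(a, b):
    """Assumed similarity: shared vertices over all vertices in either set."""
    return len(a & b) / len(a | b)


def bottom_k(items, rank, k):
    """Sorted list of the k smallest random ranks among the items."""
    return sorted(rank[x] for x in items)[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard from two sorted bottom-k sketches in O(k) time."""
    in_a, in_b = set(sketch_a), set(sketch_b)
    seen = shared = 0
    last = None
    # Merging the sorted sketches yields the bottom-k ranks of the union.
    for r in heapq.merge(sketch_a, sketch_b):
        if r == last:
            continue  # the same rank appears in both sketches
        last = r
        seen += 1
        if r in in_a and r in in_b:
            shared += 1
        if seen == k:
            break
    return shared / seen if seen else 0.0


# Hypothetical weighted toy graph and parameters.
graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 1.0, "d": 4.0},
    "c": {"a": 2.0, "b": 1.0, "d": 1.0},
    "d": {"b": 4.0, "c": 1.0, "e": 1.0},
    "e": {"d": 1.0},
}
d, k = 2.0, 8
rng = random.Random(0)
rank = {v: rng.random() for v in graph}  # one shared random rank per vertex

na, nd = d_neighborhood(graph, "a", d), d_neighborhood(graph, "d", d)
exact = exact_jaccard(na, nd)
estimate = estimate_jaccard(bottom_k(na, rank, k), bottom_k(nd, rank, k), k)
```

On this toy graph k exceeds the size of both neighborhoods, so the estimate coincides with the exact value. On large graphs the sketches keep only k ranks per vertex, and the estimate becomes approximate, which is why the choice of k matters for the guarantee.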

Supplemental Material

PACMMOD-V1mod045.mp4 (mp4, 21.7 MB)


    • Published in

      Proceedings of the ACM on Management of Data, Volume 1, Issue 1
      PACMMOD
      May 2023
      2807 pages
      EISSN: 2836-6573
      DOI: 10.1145/3603164

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States
