Cluster-preserving sampling algorithm for large-scale graphs

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Graph sampling is an effective way to address scalability issues when analyzing large-scale graphs. Many sampling algorithms have been proposed, and sample quality has typically been quantified using explicit properties of the sample (e.g., degree distribution). However, existing sampling techniques are inadequate for a task that has become important: sampling the clustering structure, which is a crucial property of modern networks. In this paper, we propose two novel top-leader sampling methods, TLS-e and TLS-i, which use different expansion strategies to obtain representative samples that effectively preserve the clustering structure. The rationale behind them is to first select the top-leader nodes of most clusters into the sample and then heuristically incorporate peripheral nodes using specific expansion strategies. Extensive experiments are conducted to investigate how well sampling techniques preserve the clustering structure of graphs. Our empirical results show that the proposed sampling algorithms preserve the clustering structure of the population graph well and provide feasible solutions for sampling the clustering structure of large-scale graphs.
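
To make the two-phase idea concrete, the following is a minimal illustrative sketch and not the TLS-e or TLS-i procedure, whose details are not given in this abstract. It assumes that a "leader" can be approximated by a high-degree node that is not adjacent to an already chosen leader, and that expansion greedily adds the frontier node with the most neighbors already in the sample. The function names `build_graph`, `pick_leaders`, and `expand_sample` and both heuristics are hypothetical choices made for illustration only.

```python
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[int, Set[int]]


def build_graph(edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency map from an edge list."""
    adj: Graph = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pick_leaders(adj: Graph, n_leaders: int) -> List[int]:
    """Assumed leader heuristic: scan nodes by descending degree and
    skip any node adjacent to an earlier leader, so that leaders tend
    to land in different dense regions."""
    leaders: List[int] = []
    blocked: Set[int] = set()
    for node in sorted(adj, key=lambda v: (-len(adj[v]), v)):
        if len(leaders) == n_leaders:
            break
        if node in blocked:
            continue
        leaders.append(node)
        blocked.add(node)
        blocked.update(adj[node])
    return leaders


def expand_sample(adj: Graph, leaders: List[int], sample_size: int) -> Set[int]:
    """Assumed expansion heuristic: repeatedly add the frontier node
    with the most neighbors inside the sample (ties go to the smaller id)."""
    sample: Set[int] = set(leaders)
    while len(sample) < sample_size:
        frontier = {u for v in sample for u in adj[v]} - sample
        if not frontier:
            break  # nothing reachable is left to add
        best = min(frontier, key=lambda u: (-len(adj[u] & sample), u))
        sample.add(best)
    return sample


# Toy graph: two 4-cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
clique_a = [(i, j) for i in range(4) for j in range(i + 1, 4)]
clique_b = [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
toy = build_graph(clique_a + clique_b + [(3, 4)])

toy_leaders = pick_leaders(toy, n_leaders=2)
toy_sample = expand_sample(toy, toy_leaders, sample_size=4)
print(toy_leaders, sorted(toy_sample))
```

On this toy graph the leader step places one leader in each clique, which is the behavior the abstract's rationale aims for. The greedy expansion shown here is only one possible strategy and may fill one cluster before another.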

Acknowledgements

This work was supported by Natural Science Foundation Youth Fund Project (Grant No. 62002384), China Postdoctoral Science Foundation Funded Project (Grant No. 47689), and Zhengzhou City Collaborative Innovation Major Project (Grant No. 162/32410218).

Author information

Corresponding author

Correspondence to Jianpeng Zhang.

About this article

Cite this article

Zhang, J., Chen, H., Yu, D. et al. Cluster-preserving sampling algorithm for large-scale graphs. Sci. China Inf. Sci. 66, 112103 (2023). https://doi.org/10.1007/s11432-021-3370-4

  • DOI: https://doi.org/10.1007/s11432-021-3370-4
