Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Random-walk-based sampling is an efficient way to extract and analyze the properties of large, complex graphs that represent social networks. However, existing random-walk-based sampling schemes can rarely reach the desired node distribution in practice, because the sampling budget (i.e., the number of samples or sampling steps) required to do so is indeterminate for graphs with large volumes of data. Under a small sampling budget, on the other hand, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and to large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk (NCRW), which strikes a good balance between the estimation error and the sampling budget by producing unique samples with low correlations. We show, both theoretically and experimentally, that NCRW reduces the deviation from the desired node distribution and the estimation errors under a constrained sampling budget. As a result, the sampling costs, which are closely related to the sampling budget, are also reduced. Our extensive experimental evaluation on real-world datasets further confirms that NCRW significantly improves the quality of samples and the accuracy of estimations at much lower cost than existing random-walk-based sampling schemes, especially when estimating higher-order node attributes.
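To make the small-budget problem concrete, the following is a minimal Python sketch of a plain simple random walk, the kind of baseline the abstract contrasts with; it is not NCRW, whose construction is not given in this abstract. The toy graph and the names `simple_random_walk`, `repeat_ratio`, and `estimate_average_degree` are hypothetical and chosen for illustration only. The estimator relies on the standard fact that a simple random walk on a connected undirected graph visits nodes with probability proportional to their degree, so samples are re-weighted by 1/degree.

```python
import random


def simple_random_walk(adj, start, budget, seed=0):
    """Collect `budget` node samples by moving to a uniformly chosen neighbor."""
    rng = random.Random(seed)
    node = start
    samples = []
    for _ in range(budget):
        samples.append(node)
        node = rng.choice(adj[node])
    return samples


def repeat_ratio(samples):
    """Fraction of samples that duplicate an already-visited node."""
    return 1.0 - len(set(samples)) / len(samples)


def estimate_average_degree(adj, samples):
    """Re-weighted (harmonic-mean) estimate of the average degree.

    Each sample is weighted by 1/degree to correct the degree bias
    of the walk's stationary distribution.
    """
    return len(samples) / sum(1.0 / len(adj[v]) for v in samples)


# Hypothetical toy graph (undirected, connected); true average degree is 2.4.
adj = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [3],
}

samples = simple_random_walk(adj, start=0, budget=20)
print("repeat ratio:", repeat_ratio(samples))
print("estimated average degree:", estimate_average_degree(adj, samples))
```

On a graph this small, repeats are unavoidable once the budget exceeds the number of nodes; on large graphs the same effect appears because consecutive steps of a walk stay within the same neighborhood, which is the correlation problem the abstract describes.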

Acknowledgements

We thank all the reviewers of this paper. This work is supported by NSFC (Nos. 61772216, 61832020, 61821003), Wuhan application basic research project 2017010201010103, and a fund from the Science, Technology and Innovation Commission of Shenzhen Municipality (JCYJ20170307172248636).

Author information

Corresponding author

Correspondence to Lingling Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, L., Wang, F., Jiang, H. et al. Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs. Knowl Inf Syst 64, 1909–1935 (2022). https://doi.org/10.1007/s10115-022-01691-8

  • DOI: https://doi.org/10.1007/s10115-022-01691-8
