Abstract
Random-walk-based sampling is an efficient way to extract and analyze the properties of large, complex graphs that represent social networks. However, existing random-walk-based sampling schemes can rarely reach the desired node distribution in practice, because the sampling budget (i.e., the number of samples or sampling steps) required to do so on graphs with large volumes of data is indeterminate. Conversely, under a small sampling budget, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and to large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk (NCRW), which strikes a good balance between the estimation error and the sampling budget by producing unique samples with low correlations. We show, both theoretically and experimentally, that NCRW reduces the deviation from the desired node distribution and the estimation errors under the constraint of the sampling budget. As a result, the sampling costs, which are closely related to the sampling budget, are also reduced. Our extensive experimental evaluation on real-world datasets further confirms that NCRW significantly increases the quality of samples and the accuracy of estimations at much lower cost than existing random-walk-based sampling schemes, especially when estimating higher-order node attributes.
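To make the repeated-sample problem concrete, the following is a minimal sketch of a plain random walk under a fixed sampling budget. It is not NCRW and is not taken from the paper: the toy graph `adj` and the helper names `simple_random_walk` and `repeat_fraction` are assumptions introduced only for illustration.

```python
import random
from collections import Counter


def simple_random_walk(adj, start, budget, seed=0):
    """Return the nodes visited by a plain random walk of `budget` steps."""
    rng = random.Random(seed)
    node, samples = start, []
    for _ in range(budget):
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbour
        samples.append(node)
    return samples


def repeat_fraction(samples):
    """Fraction of samples that revisit an already-sampled node."""
    return 1 - len(set(samples)) / len(samples)


# Hypothetical toy graph (assumption): two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

samples = simple_random_walk(adj, start=0, budget=20)
print("visits per node:", dict(Counter(samples)))
print("repeat fraction:", repeat_fraction(samples))
```

In this toy graph the repeats are forced by its size (20 samples drawn over 6 nodes), so the sketch only shows how repeats in a walk can be measured, not how often they occur on real social graphs.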

Acknowledgements
We thank all the reviewers of this paper. This work is supported by NSFC No. 61772216, 61832020, and 61821003; Wuhan application basic research project 2017010201010103; and a fund from the Science, Technology and Innovation Commission of Shenzhen Municipality (JCYJ20170307172248636).
About this article
Cite this article
Zhang, L., Wang, F., Jiang, H. et al. Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs. Knowl Inf Syst 64, 1909–1935 (2022). https://doi.org/10.1007/s10115-022-01691-8
DOI: https://doi.org/10.1007/s10115-022-01691-8