Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs

Zhang, Lingling; Wang, Fang; Jiang, Hong; Feng, Dan; Xie, Yanwen; Zhang, Zhiwei; Wang, Guoren

doi:10.1007/s10115-022-01691-8

Regular Paper
Published: 23 June 2022

Knowledge and Information Systems, Volume 64, pages 1909–1935 (2022)

Lingling Zhang¹,² (ORCID: orcid.org/0000-0003-4845-9868),
Fang Wang²,
Hong Jiang³,
Dan Feng²,
Yanwen Xie²,
Zhiwei Zhang¹ &
Guoren Wang¹

Abstract

Random-walk-based sampling is an efficient way to extract and analyze the properties of large, complex graphs representing social networks. However, existing random-walk-based sampling schemes can rarely reach the desired node distribution in practice, because the sampling budget (i.e., the number of samples or sampling steps) needed to do so on graphs with large volumes of data is indeterminate. Under a small sampling budget, on the other hand, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and to large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk (NCRW), which strikes a good balance between estimation error and sampling budget by producing unique samples with low correlations. Under a sampling-budget constraint, NCRW reduces both the deviation from the desired node distribution and the estimation errors, theoretically and experimentally, and thereby reduces the sampling costs, which are closely tied to the sampling budget. Our extensive experimental evaluation on real-world datasets further confirms that NCRW significantly increases the quality of samples and the accuracy of estimations at much lower cost than existing random-walk-based sampling schemes, especially when estimating higher-order node attributes.
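The abstract does not describe NCRW's mechanics, so the sketch below illustrates only the kind of baseline it contrasts against: a plain simple random walk with a degree-reweighted estimator. It is not the authors' method. The toy graph, the function names, and the step count are all assumptions made for illustration. The sketch shows how repeated samples accumulate in a walk, and how the average degree can still be estimated by correcting for the walk's bias toward high-degree nodes.

```python
import random

def simple_random_walk(adj, start, steps, rng):
    """Visit `steps` nodes, moving to a uniformly chosen neighbor each step."""
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited

def estimate_average_degree(adj, samples):
    """Reweighted estimate of the average degree.

    A simple random walk visits node v with stationary probability
    proportional to deg(v), so each sample is weighted by 1/deg(v);
    the result is the harmonic mean of the sampled degrees.
    """
    return len(samples) / sum(1.0 / len(adj[v]) for v in samples)

def repeat_fraction(samples):
    """Fraction of samples that revisit an already-sampled node."""
    return 1.0 - len(set(samples)) / len(samples)

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
adj = {
    0: [1, 2, 3],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [0, 1, 2, 4],
    4: [3, 5],
    5: [4],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = simple_random_walk(adj, start=0, steps=2000, rng=rng)

true_avg = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
est_avg = estimate_average_degree(adj, samples)

print(f"true average degree:      {true_avg:.3f}")
print(f"estimated average degree: {est_avg:.3f}")
print(f"fraction of repeats:      {repeat_fraction(samples):.3f}")
```

On a graph this small nearly every sample is a repeat. On a large graph under a small budget, repeats and correlated consecutive samples are the quality problem the abstract attributes to existing schemes.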

Acknowledgements

We thank all the reviewers of this paper. This work is supported by NSFC (Nos. 61772216, 61832020, 61821003), the Wuhan application basic research project (2017010201010103), and a fund from the Science, Technology and Innovation Commission of Shenzhen Municipality (JCYJ20170307172248636).

Author information

Authors and Affiliations

Beijing Institute of Technology, Beijing, China
Lingling Zhang, Zhiwei Zhang & Guoren Wang
Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, Engineering Research Center of Data Storage Systems and Technology, Ministry of Education of China, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Lingling Zhang, Fang Wang, Dan Feng & Yanwen Xie
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, USA
Hong Jiang

Corresponding author

Correspondence to Lingling Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, L., Wang, F., Jiang, H. et al. Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs. Knowl Inf Syst 64, 1909–1935 (2022). https://doi.org/10.1007/s10115-022-01691-8

Received: 11 December 2018
Revised: 18 May 2022
Accepted: 21 May 2022
Published: 23 June 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s10115-022-01691-8
