Guided sampling for large graphs

Data Mining and Knowledge Discovery

Abstract

Studying large real-world graphs demands substantial memory and computational power, which makes their full analysis extremely challenging. To understand the structure and properties of such graphs, we extract a small representative subgraph from a large graph while preserving its topology and characteristics. In this work, we aim to produce good samples with sample sizes as low as 0.1% while maintaining the structure and some of the key properties of the network. We exploit the fact that the average degree and average clustering coefficient of a graph can be estimated accurately and efficiently. We use these estimates to guide the sampling process and extract tiny samples that preserve the properties of the graph and closely approximate their distributions in the original graph. The distinguishing feature of our work is that we apply traversal-based sampling, which uses only the local information of nodes rather than global information about the network; this makes our approach a practical choice for crawling online networks. We evaluate the effectiveness of our sampling technique on real-world datasets and show that it outperforms existing methods.
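As an illustration of how an average degree can be estimated from local information alone, the sketch below uses a simple random walk with inverse-degree re-weighting. This is a well-known general-purpose estimator and is shown only as an assumption; the abstract does not state which estimator the authors use. The function name `estimate_average_degree`, the adjacency-dictionary input, and the parameters `start`, `steps`, and `seed` are hypothetical choices made for this example.

```python
import random


def estimate_average_degree(neighbors, start, steps, seed=0):
    """Estimate the average degree of an undirected graph by random walk.

    neighbors: dict mapping each node to a list of its adjacent nodes.
    The walk only ever looks at the neighbor list of the current node,
    so no global knowledge of the graph is needed.

    A simple random walk visits nodes with probability proportional to
    their degree. Weighting each visit by 1/degree corrects that bias,
    which gives the harmonic-mean estimator: steps / sum(1 / degree).
    """
    rng = random.Random(seed)
    node = start
    inverse_degree_sum = 0.0
    for _ in range(steps):
        degree = len(neighbors[node])
        inverse_degree_sum += 1.0 / degree
        node = rng.choice(neighbors[node])  # move to a uniformly random neighbor
    return steps / inverse_degree_sum


# Hypothetical toy graph: a ring of 10 nodes, where every node has degree 2.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(estimate_average_degree(ring, start=0, steps=1000))
```

On the ring every visited node has degree 2, so the estimate is exactly 2.0 regardless of the path taken. On irregular graphs the estimate is approximate and generally improves with longer walks, provided the graph is connected.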

Acknowledgements

This work was supported by the Korea Institute of Science and Technology (KIST) under the project “HERO Part 1: Development of core technology of ambient intelligence for proactive service in digital in-home care”, and by the Technology Innovation Program (20006489, “Development of the embedded robot equipment control system equipped with an AI vision module for industrial environment”) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Author information

Corresponding author

Correspondence to Suhyun Kim.

Additional information

Responsible editor: Hanghang Tong

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yousuf, M.I., Kim, S. Guided sampling for large graphs. Data Min Knowl Disc 34, 905–948 (2020). https://doi.org/10.1007/s10618-020-00683-y

  • DOI: https://doi.org/10.1007/s10618-020-00683-y
