Skip to main content
Log in

Active exploration for large graphs

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Modern information networks, such as social networks, communication networks, and citation networks, are often characterized by very large sizes and dynamically changing structures. Common solutions to graph mining tasks (e.g., node classification) usually employ an unrestricted sampling-then-mining paradigm to reduce a large network to a manageable size, followed by subsequent mining tasks. However, real-world networks may be unaccessible at once and must be crawled progressively. This can be due to the fact that the size of the network is too large, or some privacy/legal concerns. In this paper, we propose an Active Exploration framework for large graphs, where the goal is to simultaneously carry out network sampling and node labeling in order to build a sampled network from which the trained classifier can have the maximum node classification accuracy. To achieve this goal, we consider a network as a Markov chain and compute the stationary distribution of the nodes by deriving supervised random walks. The stationary distribution helps identify specific nodes to be sampled in the next step, and the labeling process labels the most informative node which in turn strengthens the sampling of the network. To improve the scalability of active exploration for large graphs, we also propose a more efficient multi-seed algorithm that simultaneously runs multiple, parallel exploration processes, and makes joint decisions to determine which nodes are to be sampled and labeled next. The simultaneous, mutually enhanced sampling and labeling processes ensure that the final sampled network contains a maximum number of nodes directly related to the underlying mining tasks. Experiments on both synthetic and real-world networks demonstrate that our active exploration algorithms have much better chance to include target nodes in the sampled networks than baseline methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

Notes

  1. www.facebook.com.

  2. www.informatik.uni-trier.de/~ley/db/.

  3. www.statisticbrain.com/facebook-statistics/.

  4. www.flickr.com/photos/franckmichel/6855169886/.

  5. http://www.cs.umd.edu/projects/linqs/projects/lbc.

References

  • Ahmed N, Neville J, Kompella R (2014) Network sampling: from static to streaming graphs. ACM Trans Knowl Discov Data 8(2):1–56

    Article  Google Scholar 

  • Ahn Y-Y, Han S, Kwak H, Moon S, Jeong H (2007) Analysis of topological characteristics of huge online social networking services. In: Proceedings of the 16th international conference on world wide web, pp 835–844. ACM, Banff, Alberta

  • Alamgir M, Von Luxburg U (2010) Multi-agent random walks for local clustering on graphs. In: Proceedings of the 10th IEEE international conference on data mining, pp 18–27. IEEE, Sydney

  • Alon N, Avin C, Koucky M, Kozma G, Lotker Z, Tuttle MR (2008) Many random walks are faster than one. In: Proceedings of the 20th annual symposium on parallelism in algorithms and architectures, pp. 119–128. ACM, Munich, Germany

  • Backstrom L, Leskovec J (2011) Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the 4th ACM international conference on web search and data mining, pp 635–644. ACM, Hong Kong

  • Becchetti L, Castillo C, Donato D, Fazzone A (2006) A comparison of sampling techniques for web graph characterization. In: Proceedings of the 2006 workshop on link analysis: dynamics and static of large networks, Philadelphia, PA

  • Belkin M, Matveeva I, Niyogi P (2004) Regularization and semi-supervised learning on large graphs. Learning theory. Springer, Berlin, pp 624–638

    Chapter  Google Scholar 

  • Bilgic M, Getoor L (2008) Effective label acquisition for collective classification. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 43–51. ACM

  • Bilgic M, Mihalkova L, Getoor L (2010) Active learning for networked data. In: Proceedings of the 27th international conference on machine learning, pp 79–86. Haifa, Israel

  • Catanese S, Meo P, Ferrara E, Fiumara G, Provetti A (2011) Crawling facebook for social network analysis purposes. In: Proceedings of the 1st international conference on web intelligence, mining and semantics, Article No. 52. ACM, Sogndal

  • Cesa-Bianchi N, Gentile C, Vitale F, Zappella G (2010) Active learning on trees and graphs. In: Proceedings of the 23rd annual conference on learning theory, pp 320–332. Haifa, Israel

  • Erdős P, Rényi A (1959) On random graphs I. Publ Math Debrecen 6:290–297

    MathSciNet  MATH  Google Scholar 

  • Fang M, Tao D (2014) Networked bandits with disjoint linear payoffs. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1106–1115. ACM

  • Fang M, Yin J, Zhu X (2013) Active exploration: Simultaneous sampling and labeling for large graphs. In: Proceedings of the 22nd ACM international conference on information and knowledge management, pp 829–834. ACM, Burlingame, CA

  • Gjoka M, Kurant M, Butts C, Markopoulou A (2010) Walking in facebook: a case study of unbiased sampling of osns. In: Proceedings of the 29th conference on computer communications, pp 1–9. San Diego, CA

  • Gkantsidis C, Mihail M, Saberi A (2004) Random walks in peer-to-peer networks. In: Proceedings of the 23rd annual joint conference of the IEEE computer and communications societie, pp 120–130. IEEE

  • Gyongyi Z, Garcia-Molina H, Pedersen J (2006) Web content categorization using link information. Technical report, Stanford

  • Halperin S, Zwick U (1994) An optimal randomized logarithmic time connectivity algorithm for the EREW PRAM. In: Proceedings of the 6th annual ACM symposium on parallel algorithms and architectures, pp 1–10. ACM, Cape May, NJ

  • He J, Carbonell JG, Liu Y (2007) Graph-based semi-supervised learning as a generative model. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 2492–2497

  • Henzinger MR, Heydon A, Mitzenmacher M, Najork M (2000) On near-uniform URL sampling. Comput Netw 33(1):295–308

    Article  Google Scholar 

  • Hübler C, Kriegel H-P, Borgwardt K, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: Proceedings of the 8th IEEE international conference on data mining, pp 283–292. Pisa

  • Karger DR, Nisan N, Parnas M (1992) Fast connected components algorithms for the erew pram. In: Proceedings of the 4th annual ACM symposium on parallel algorithms and architectures, pp 373–381. ACM, San Diego, CA

  • Kuwadekar A, Neville J (2011) Relational active learning for joint collective classification models. In: Proceedings of the 28th international conference on machine learning, pp 385–392. Bellevue, WA

  • Lee S, Kim P, Jeong H (2006) Statistical properties of sampled networks. Phys Rev E 73(1):016102

    Article  Google Scholar 

  • Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 631–636. Philadelphia, PA

  • Lin F, Cohen WW (2010) Semi-supervised classification of network data using very few labels. In: Proceedings of the 2010 international conference on advances in social networks analysis and mining, pp 192–199. IEEE, Adense

  • Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2(1):1–46

    MathSciNet  Google Scholar 

  • Macskassy SA (2007) Improving learning in networked data by combining explicit and mined links. In: Proceedings of the 22nd conference on artificial intelligence, pp. 590–595. AAAI Press, Vancouver, BC

  • Maiya A, Berger-Wolf T (2010) Online sampling of high centrality individuals in social networks. In: Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining, pp 91–98. Hyderabad, India

  • Mislove A, Koppula H, Gummadi K, Druschel P, Bhattacharjee B (2008) Growth of the flickr social network. In: Proceedings of the 1st workshop on online social networks, pp 25–30. ACM, Seattle, WA

  • Mislove A, Marcon M, Gummadi K, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on internet measurement, pp 29–42. ACM, San Diego, CA

  • Namata G, Sen P, Bilgic M, Getoor L, Sahami M, Srivastava A (2009) Collective classification for text classification. In: Text mining: classification, clustering, and applications. CRC Press, Boca Rotan, pp 51–69

  • Papagelis M, Das G, Koudas N (2013) Sampling online social networks. IEEE Trans Knowl Data Eng 25:662–676

    Article  Google Scholar 

  • Pfeiffer III J, Neville J, Bennett P (2012) Active sampling of networks. In:Proceedings of the ICML workshop on mining and learning with graphs, Edinburgh, Scotland

  • Rafiei D, Curial S (2005) Effectively visualizing large networks through sampling. In: Proceedings of IEEE visualization, pp 375–382. IEEE, Minneapolis, MN

  • Rasti A, Torkjazi M, Rejaie R, Duffield N, Willinger W, Stutzbach D (2009) Respondent-driven sampling for characterizing unstructured overlays. In: Proceedings of the 28th conference on computer communications, pp 2701–2705. IEEE, Rio de Janeiro

  • Sarma A, Collapudi S, Panigrahy R (2011) Estimating pagerank on graph streams. J ACM 58(3):13

    Article  MathSciNet  MATH  Google Scholar 

  • Sen P, Namata G, Bilgic M, Getoor L, Galligher B, Eliassi-Rad T (2008) Collective classification in network data. AI Mag 29(3):93

    Google Scholar 

  • Stanton I, Kliot G (2012) Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Beijing

  • Stutzbach D, Rejaie R, Duffield N, Sen S, Willinger W (2009) On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans Netw 17(2):377–390

    Article  Google Scholar 

  • Tong H, Faloutsos C, Pan J-Y (2006) Fast random walk with restart and its applications. In: Proceedings of the 6th international conference on data mining, pp 613–622. IEEE

  • Viswanath B, Mislove A, Cha M, Gummadi K (2009) On the evolution of user interaction in facebook. In: Proceedings of the 2nd ACM SIGCOMM workshop on online social networks, pp 37–42. ACM, Barcelona

  • Wasserman S, Faust K (1995) Social network analysis: methods and applications. Cambridge University Press, Cambridge

    MATH  Google Scholar 

  • Wilson C, Boe B, Sala A, Puttaswamy K, Zhao B (2009) User interactions in social networks and their implications. In: Proceedings of the 4th ACM European conference on computer systems, pp 205–218. ACM, Nuremberg

  • Ye S, Lang J, Wu F (2010) Crawling online social graphs. In: Proceedings of the 12th International Asia-Pacific conference on web conference, pp 236–242. IEEE, Busan

  • Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. Adv Neural Inf Process Syst 16(16):321–328

    Google Scholar 

  • Zhou Z, Zhang N, Gong Z, Das G (2013) Faster random walks by rewiring online social networks on-the-fly. In: Proceedings of the 29th IEEE international conference on data engineering, pp 769–780. IEEE, Brisbane

  • Zhu X, Ghahramani Z (2002) Learning from labeled and unlabeled data with label propagation. Technical report, CMU-CALD-02-107, Carnegie Mellon University

  • Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning, vol 3, pp 912–919. Washington, DC

  • Zhu X, Lafferty J, Ghahramani Z (2003) Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 2003 ICML workshop on the continuum from labeled to unlabeled data in machine learning and data mining, pp 58–65. Washington, DC

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Meng Fang.

Additional information

Responsible editor: Thomas Seidl.

Most of this research was done while Meng Fang was at the University of Technology, Sydney.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (doc 25 KB)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fang, M., Yin, J. & Zhu, X. Active exploration for large graphs. Data Min Knowl Disc 30, 511–549 (2016). https://doi.org/10.1007/s10618-015-0424-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-015-0424-z

Keywords

Navigation