
Fast PageRank approximation by adaptive sampling

Knowledge and Information Systems (Regular Paper)

Abstract

PageRank is typically computed from powers of the transition matrix in a Markov chain model. It is therefore computationally expensive, and efficient approximation methods are needed to accelerate the computation, especially on large graphs. In this paper, we propose two sampling algorithms for efficient PageRank approximation: direct sampling and adaptive sampling. Both methods sample the transition matrix and use the sample in the PageRank computation. The direct sampling method samples the transition matrix once and uses the sample directly in the PageRank computation, whereas the adaptive sampling method samples the transition matrix multiple times with an adaptive sample rate that is adjusted iteratively as the computation proceeds. This adaptive sample rate is designed for a good trade-off between accuracy and efficiency in PageRank approximation. We provide a detailed theoretical analysis of the error bounds of both methods. We also compare them with several state-of-the-art PageRank approximation methods, including power extrapolation and the inner–outer power iteration algorithm. Experimental results on several real-world datasets show that our methods achieve significantly higher efficiency while attaining accuracy comparable to that of state-of-the-art methods.
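For intuition, a minimal sketch of the two schemes follows. This is our own illustration, not the paper's exact algorithms: the uniform keep probability q and all function names are simplifications, whereas the paper uses entry-dependent sampling probabilities (its Eq. 4) and proves error bounds for them.

```python
# Minimal illustrative sketch of direct vs. adaptive sampling for PageRank.
# The uniform keep probability `q` is our simplification; the paper samples
# entries with entry-dependent probabilities (its Eq. 4).
import numpy as np

rng = np.random.default_rng(0)

def sample_matrix(P, q):
    """Keep each entry with probability q and rescale by 1/q, so E[P_tilde] = P."""
    mask = rng.random(P.shape) < q
    return np.where(mask, P / q, 0.0)

def pagerank(P, c=0.85, iters=50):
    """Power iteration pi <- c * P^T pi + (1 - c) * v for row-stochastic P."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    pi = v.copy()
    for _ in range(iters):
        pi = c * P.T @ pi + (1 - c) * v
    return pi

def pagerank_direct(P, q, **kw):
    """Direct sampling: sample the transition matrix once, then iterate on it."""
    return pagerank(sample_matrix(P, q), **kw)

def pagerank_adaptive(P, qs, c=0.85):
    """Adaptive sampling: a fresh sample with its own rate at every iteration."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    pi = v.copy()
    for q in qs:
        pi = c * sample_matrix(P, q).T @ pi + (1 - c) * v
    return pi
```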


References

1. Achlioptas D, McSherry F (2007) Fast computation of low-rank matrix approximations. J ACM 54(2):9

  2. Avrachenkov K, Lebedev D (2006) PageRank of scale-free growing networks. Internet Math 3(2):207–232

  3. Berkhin P (2005) A survey on PageRank computing. Internet Math 2(1):73–120

  4. Benczúr A, Csalogány K, Sarlós T (2005) On the feasibility of low-rank approximation for personalized PageRank. In: Proceedings of the 14th international conference on World Wide Web, Chiba, Japan, May 2005, pp 972–973

  5. Borodin A, Roberts GO, Rosenthal JS, Tsaparas P (2005) Link analysis ranking: algorithms, theory, and experiments. ACM Trans Internet Technol 5(1):231–297

  6. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999

  7. Candès EJ, Plan Y (2010) Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. CoRR abs/1001.0339

  8. Drineas P, Kannan R (2001) Fast Monte-Carlo algorithms for approximate matrix multiplication. In: 42nd annual symposium on foundations of computer science, Las Vegas, Nevada, USA, October 2001, pp 452–459

  9. Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36(1):132–157

  10. Gleich DF, Gray AP, Greif C, Lau T (2010) An inner–outer iteration for computing PageRank. SIAM J Sci Comput 32(1):349–371

  11. Haveliwala T, Kamvar S, Klein D, Manning C, Golub G (2003) Computing PageRank using power extrapolation. Technical report, Stanford University, July 2003

  12. He G, Feng H, Li C, Chen H (2010) Parallel SimRank computation on large graphs with iterative aggregation. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, July 2010, pp 543–552

  13. Kamvar S, Haveliwala T, Golub G (2003) Adaptive methods for the computation of PageRank. Technical Report 2003-26, Stanford InfoLab, April 2003

  14. Kamvar S, Haveliwala T, Manning C, Golub G (2003) Extrapolation methods for accelerating PageRank computations. In: Proceedings of the 12th international World Wide Web conference, Budapest, Hungary, May 2003, pp 261–270

  15. Kwong MK, Zettl A (1991) Norm inequalities for the powers of a matrix. Am Math Mon 98(6):533–538

  16. Langville AN, Meyer CD (2003) Deeper inside PageRank. Internet Math 1(3):335–380

  17. Lee CP, Golub GH, Zenios SA (2007) A two-stage algorithm for computing PageRank and multistage generalizations. Internet Math 4(4):299–327

  18. McSherry F (2005) A uniform approach to accelerated PageRank computation. In: Proceedings of the 14th international conference on World Wide Web, Chiba, Japan, May 2005, pp 575–582

  19. Osborne JRS, Wiggins E (2009) On accelerating the PageRank computation. Internet Math 6(2):157–172

  20. Sidi A (2008) Methods for acceleration of convergence (extrapolation) of vector sequences. In: Wah BW (ed) Wiley encyclopedia of computer science and engineering. Wiley, New York

  21. Leskovec J (2007) Stanford Network Analysis Platform (SNAP): standard large network dataset collection. http://snap.stanford.edu/data/index.html

  22. Wicks J, Greenwald AR (2007) More efficient parallel computation of PageRank. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, Amsterdam, The Netherlands, July 2007, pp 861–862

  23. Wu G, Wei Y (2010) Arnoldi versus GMRES for computing PageRank: a theoretical contribution to Google's PageRank problem. ACM Trans Inf Syst 28(3):11:1–11:28

  24. Xue GR, Yang Q, Zeng HJ, Yu Y, Chen Z (2005) Exploiting the hierarchical structure for link analysis. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, Salvador, Brazil, August 2005, pp 186–193

  25. Zhu Y, Ye S, Li X (2005) Distributed PageRank computation based on iterative aggregation-disaggregation methods. In: Proceedings of the 14th ACM conference on information and knowledge management (CIKM), Bremen, Germany, November 2005, pp 578–585


Author information

Correspondence to Wenting Liu.

Appendix

The proof of Theorem 1 is as follows.

Proof

By Theorem 3.1 in Achlioptas and McSherry [1], if \(E(\widetilde{A}_{ij})=A_{ij}\), \(Var(\widetilde{A}_{ij})\le \delta ^2\), and \(|\widetilde{A}_{ij}-A_{ij}|\le \delta K\), where \(K=\left( \frac{\log (1+\epsilon )}{\log (2n)}\right) ^2\times \sqrt{2n}\) for any fixed \(\epsilon >0\), then for any \(\omega >0\) and \(2n>152\), \(\Vert \widetilde{A}-A\Vert _2 \le 2(1+\epsilon +\omega )\delta \sqrt{2n}\) holds w.h.p.

Since \(|\widetilde{A}_{ij}-A_{ij}|\le \delta \le \delta K\) when we choose \(K>1\), it follows that \(\Vert \widetilde{A}-A\Vert _2 \le 2(1+\epsilon +\omega )\delta \sqrt{2n}\) holds w.h.p. for any \(\omega >0\) and \(n>76\).

In particular, for a sparse transition matrix \(A\), the number of nonzero entries is \(N=dn\), where \(d\) is the average degree; since \(d\le 50\) for most sparse real-world datasets, \(\sqrt{2n}/\sqrt{N}\) is a constant. Let \(\eta =2(1+\epsilon +\omega )\sqrt{2n}/\sqrt{N}\); then \(\Vert \widetilde{A}-A\Vert _2 \le \eta \sqrt{N} \delta \) holds w.h.p. \(\square \)
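As a quick numerical sanity check of this bound (our own experiment, not from the paper), one can draw bounded, zero-mean entrywise noise with \(|\widetilde{A}_{ij}-A_{ij}|\le \delta \) and compare the spectral norm of the perturbation against \(2(1+\epsilon +\omega )\delta \sqrt{2n}\):

```python
# Sanity check of the Theorem 1 bound: uniform(-delta, delta) noise is
# zero-mean per entry, has variance delta^2 / 3 <= delta^2, and is bounded
# by delta, so ||A_tilde - A||_2 should fall below the bound w.h.p.
import numpy as np

rng = np.random.default_rng(0)
n, delta, eps, omega = 500, 0.01, 0.1, 0.1
noise = rng.uniform(-delta, delta, size=(n, n))   # A_tilde - A
spectral = np.linalg.norm(noise, 2)
bound = 2 * (1 + eps + omega) * delta * np.sqrt(2 * n)
print(spectral, bound)                            # spectral norm sits well below
```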

The proof of Lemma 2 is as follows.

Proof

We can compute \(E(\widetilde{A}_{ij}^2)\) by Eq. 4 as follows

$$\begin{aligned} E(\widetilde{A}_{ij}^2) = \widetilde{A}_{ij}^2 \cdot p_{ij} = \frac{A_{ij}^2}{p_{ij}} \le {\left\{ \begin{array}{ll} \frac{\Vert A\Vert _F^2}{s} &{} \text {if }A_{ij} > \frac{\theta \Vert A\Vert _F}{\sqrt{s}} \\ A_{ij} \frac{\theta \Vert A\Vert _F}{\sqrt{s}} \le \frac{\theta ^2\Vert A\Vert _F^2}{s} < \frac{\Vert A\Vert _F^2}{s} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

It is obvious that

$$\begin{aligned} E\left( \widetilde{A}_{ij}^2\right) \le \frac{\Vert A\Vert _F^2}{s} \end{aligned}$$
(12)

The upper bound for the variance of \(\widetilde{A}_{ij}\) is given as

$$\begin{aligned} Var(\widetilde{A}_{ij})&= E\left[ (\widetilde{A}_{ij}-A_{ij})^2\right] \nonumber \\&= E(\widetilde{A}_{ij}^2)-A_{ij}^2 \le E(\widetilde{A}_{ij}^2) \le \frac{\Vert A\Vert _F^2}{s} \end{aligned}$$
(13)

Let \(\delta =\Vert A\Vert _F/\sqrt{s}\). Then, \(|\widetilde{A}_{ij}-A_{ij}| \le \delta \theta < \delta \hbox { and }Var(\widetilde{A}_{ij}) \le \delta ^2\). According to Eq. 2 and Theorem 1, we have

$$\begin{aligned} \Vert \widetilde{A}-A\Vert _2 \le \sqrt{N}\delta \eta = \eta \Vert A\Vert _F \sqrt{N/s}, \end{aligned}$$

holds w.h.p., where \(\eta \) is a small constant.

We now prove Eq. 6. According to Eq. 12, we can derive an upper bound on \(\Vert \widetilde{A}\Vert _F\).

We have \(E\left( \Vert \widetilde{A}\Vert _F\right) \le \sqrt{E\left( \Vert \widetilde{A}\Vert _F^2\right) } \le \sqrt{\sum _{ij}E\left( \widetilde{A}_{ij}^2\right) } \le \sqrt{s\frac{\Vert A\Vert _F^2}{s}}=\Vert A\Vert _F\). By the Chernoff bound, \(Pr\left[ \Vert \widetilde{A}\Vert _F\le \Vert A\Vert _F \right] \ge 1-\exp (-\Omega (\Vert A\Vert _F))\). Thus,

$$\begin{aligned} \Vert \widetilde{A}\Vert _F \le \Vert A\Vert _F \end{aligned}$$

holds w.h.p.

Since \(\Vert \widetilde{A}\Vert _2 \le \Vert \widetilde{A}\Vert _F\), we have

$$\begin{aligned} \Vert \widetilde{A}\Vert _2 \le \Vert \widetilde{A}\Vert _F \le \Vert A\Vert _F. \end{aligned}$$

\(\square \)
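The case analysis above pins down the entrywise sampling distribution; the paper's official definition is its Eq. 4, which is not reproduced here, so the following sketch is our reconstruction under the stated assumptions:

```python
# Reconstructed entrywise sampling from the Lemma 2 case analysis: entries
# above the threshold theta * ||A||_F / sqrt(s) survive with probability
# p_ij = s * A_ij^2 / ||A||_F^2, smaller entries with
# p_ij = A_ij * sqrt(s) / (theta * ||A||_F); survivors are rescaled to
# A_ij / p_ij, so E[A_tilde_ij] = A_ij and E[A_tilde_ij^2] <= ||A||_F^2 / s.
import numpy as np

def sample_entries(A, s, theta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    fro = np.linalg.norm(A, 'fro')
    cut = theta * fro / np.sqrt(s)
    p = np.where(A > cut,
                 s * A ** 2 / fro ** 2,
                 A * np.sqrt(s) / (theta * fro))
    p = np.clip(p, 0.0, 1.0)
    keep = rng.random(A.shape) < p
    with np.errstate(divide='ignore', invalid='ignore'):
        rescaled = A / p                   # undefined only where A_ij = 0 ...
    return np.where(keep, rescaled, 0.0)   # ... and there keep is False anyway
```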

The proof of Theorem 3 is as follows.

Proof

According to Lemma 2, we have \(\Vert \widetilde{P}-P\Vert _2 \le \eta \alpha \Vert P\Vert _F\), where \(\alpha = \sqrt{\frac{N}{s}}\) and \(\eta >0\) is a small constant, and \(\Vert \widetilde{P}\Vert _2 \le \Vert P\Vert _F\).

Let \(R_k =\widetilde{P}^k-P^k,\hbox { where }k\in \{1,\ldots ,K\}\), then,

$$\begin{aligned} \Vert R_k\Vert _2 = \Vert \widetilde{P}^k-P^k\Vert _2&= \Vert \widetilde{P}\widetilde{P}^{k-1}-PP^{k-1}\Vert _2 \nonumber \\&= \Vert \widetilde{P}(\widetilde{P}^{k-1} - P^{k-1} + P^{k-1})-PP^{k-1}\Vert _2 \nonumber \\&= \Vert \widetilde{P} R_{k-1}+ \widetilde{P} P^{k-1}-P P^{k-1}\Vert _2 \nonumber \\&\le \Vert \widetilde{P} R_{k-1} \Vert _2 + \Vert (\widetilde{P}- P)P^{k-1}\Vert _2 \nonumber \\&\le \Vert \widetilde{P}\Vert _2 \Vert R_{k-1} \Vert _2 + \Vert \widetilde{P}- P\Vert _2 \Vert P\Vert _2^{k-1} \nonumber \\&\le \Vert P\Vert _F \Vert R_{k-1} \Vert _2 + \eta \alpha \Vert P\Vert _F^k \end{aligned}$$
(14)

From the above steps, we see that \(\Vert R_k \Vert _2\) can be bounded in terms of \(\Vert R_{k-1} \Vert _2\). Next, we show that

$$\begin{aligned} \Vert R_k\Vert _2 \le k\eta \alpha \Vert P\Vert _F^k \end{aligned}$$
(15)

holds by mathematical induction, as follows.

Basis: Initially,

$$\begin{aligned} \Vert R_1\Vert _2=\Vert \widetilde{P}- P\Vert _2 \le \eta \alpha \Vert P\Vert _F. \end{aligned}$$
(16)

When \(k=2\), according to Eqs. 14 and 16,

$$\begin{aligned} \Vert R_2\Vert _2 \le \Vert P\Vert _F \eta \alpha \Vert P\Vert _F + \eta \alpha \Vert P\Vert _F^2 = 2\eta \alpha \Vert P\Vert _F^2. \end{aligned}$$

Thus, Eq. 15 holds for the first step.

Inductive step: Assume that Eq. 15 holds for the \((k-1)\)th step,

$$\begin{aligned} \Vert R_{k-1} \Vert _2 \le (k-1)\eta \alpha \Vert P\Vert _F^{k-1}. \end{aligned}$$
(17)

Then, we deduce the \(k\)th step according to Eqs. 14 and  17 as follows.

$$\begin{aligned} \Vert R_k \Vert _2 \le \Vert P\Vert _F (k-1)\eta \alpha \Vert P\Vert _F^{k-1} + \eta \alpha \Vert P\Vert _F^k = k\eta \alpha \Vert P\Vert _F^k \end{aligned}$$

that is, Eq. 15 also holds for the \(k\)th step.

Having established both the basis and the inductive step, we conclude by mathematical induction that Eq. 15 holds for all natural numbers \(k\). As such, we have

$$\begin{aligned} \Vert \widetilde{P}^k-P^k\Vert _2 \le k\eta \alpha \Vert P\Vert _F^k, \end{aligned}$$
(18)

where \(\alpha = \sqrt{\frac{N}{s}}\) and \(\eta >0\) is a small constant. \(\square \)
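For concreteness (our own illustration, not from the paper), the bound says direct sampling pays a factor \(k\) on top of \(\Vert P\Vert _F^k\), because the one-shot sampling error compounds at every step:

```python
# Eq. 18 calculator (illustrative): the direct-sampling error bound
# k * eta * alpha * ||P||_F^k grows linearly in the number of steps k.
def theorem3_bound(p_fro, eta, alpha, k):
    return k * eta * alpha * p_fro ** k

print([theorem3_bound(1.0, 0.1, 1.5, k) for k in (1, 5, 10, 20)])
```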

The proof of Theorem 4 is as follows.

Proof

Let \(B_{i}\) be the input matrix \(B\) at the \(i\)th iteration of Algorithm 8, where \(1 \le i \le k\). Then \(B_{i}=B_{i-1}\widetilde{P}_{i}\), where \(\widetilde{P}_{i}\) is the sampled matrix of \(P\) with \(\alpha _{i}=\sqrt{\frac{N}{s_i}}\), \(N\) is the number of nonzero entries of \(P\), and \(s_i\) is the sample size of \(\widetilde{P}_{i}\). According to Lemma 2, \(\Vert \widetilde{P}_{i}- P\Vert _2 \le \eta \alpha _{i}\Vert P\Vert _F\), so we have

$$\begin{aligned} \Vert B_{i}- B_{i-1} P\Vert _2&= \Vert B_{i-1}\widetilde{P}_{i}-B_{i-1} P\Vert _2 \nonumber \\&\le \Vert B_{i-1}\Vert _2 \Vert \widetilde{P}_{i}- P\Vert _2 \nonumber \\&\le \Vert B_{i-1}\Vert _F \eta \alpha _{i} \Vert P\Vert _F. \end{aligned}$$
(19)

From Lemma 2, \(\Vert \widetilde{P}_i\Vert _F \le \Vert P\Vert _F\), thus,

$$\begin{aligned} \Vert B_i\Vert _F = \Vert B_{i-1}\widetilde{P}_{i}\Vert _F \le \Vert B_{i-1}\Vert _F \Vert \widetilde{P}_{i}\Vert _F \le \Vert B_{i-1}\Vert _F \Vert P\Vert _F. \end{aligned}$$
(20)

Since \(\Vert B_1\Vert _F = \Vert \widetilde{P}_1\Vert _F \le \Vert P\Vert _F \) and from Eq. 20, we have

$$\begin{aligned} \Vert B_i\Vert _F \le \Vert P\Vert _F^i. \end{aligned}$$
(21)

Since the estimate of \(P^k\) produced by Algorithm 8 is \(\widetilde{P^k} = B_k = B_{k-1}\widetilde{P}_k\), the total error of estimating \(P^k\) in Algorithm 8 is \(\Vert B_k-P^k\Vert _2\). Note that \(B_0=I\) is the identity matrix. According to Eqs. 19 and 21, we have

$$\begin{aligned} \Vert \widetilde{P^k}-P^k\Vert _2&= \Vert B_k-P^k\Vert _2 \nonumber \\&= \Vert (B_k -B_{k-1}P)+ (B_{k-1}P-B_{k-2}P^2)+\cdots + (B_{k-i}P^i-B_{k-i-1}P^{i+1}) \nonumber \\&\quad +\cdots + (B_1 P^{k-1}-P^k)\Vert _2 \nonumber \\&= \left\| \sum _{i=0}^{k-1}(B_{k-i}- B_{k-i-1}P)P^i\right\| _2 \nonumber \\&\le \sum _{i=0}^{k-1} \Vert (B_{k-i}- B_{k-i-1}P)P^i\Vert _2 \nonumber \\&\le \sum _{i=0}^{k-1} \Vert B_{k-i-1}\Vert _F \eta \alpha _{k-i} \Vert P\Vert _F \Vert P^i\Vert _2 \nonumber \\&\le \sum _{i=0}^{k-1} \Vert P\Vert _F^{k-i-1} \eta \alpha _{k-i} \Vert P\Vert _F \Vert P\Vert _F^i \nonumber \\&= \eta \Vert P\Vert _F^k \sum _{i=0}^{k-1}\alpha _{k-i} = \eta \Vert P\Vert _F^k \sum _{i=1}^{k}\alpha _{i} \end{aligned}$$
(22)

If we adaptively choose \(\alpha _i=a\alpha _{i-1}\) with \(\alpha _1 = a\), then \(\alpha _i=a^i\), and we obtain the following error bound:

$$\begin{aligned} \Vert \widetilde{P^k}-P^k\Vert _2&\le \eta \Vert P\Vert _F^{k} \displaystyle \sum _{i=1}^k a^i = \eta \Vert P\Vert _F^{k} \dfrac{a(1-a^k)}{1-a}. \end{aligned}$$
(23)

\(\square \)
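To make the bound concrete, here is a small calculator (our own; reading \(\alpha _i=\sqrt{N/s_i}\), each rate \(\alpha _i\) corresponds to a per-iteration sample size \(s_i = N/\alpha _i^2\), capped at the \(N\) nonzero entries, which is our assumption rather than the paper's statement):

```python
# Eq. 23 calculator plus the sample sizes implied by alpha_i = a^i under
# our reading s_i = N / alpha_i^2 (an assumption, capped at N).
import math

def theorem4_bound(p_fro, eta, a, k):
    """eta * ||P||_F^k * a * (1 - a^k) / (1 - a), valid for a != 1."""
    return eta * p_fro ** k * a * (1 - a ** k) / (1 - a)

def implied_sample_sizes(N, a, K):
    return [min(N, math.ceil(N / a ** (2 * i))) for i in range(1, K + 1)]

print(implied_sample_sizes(N=10**6, a=1.2, K=5))   # shrinking samples for a > 1
print(theorem4_bound(p_fro=1.0, eta=0.1, a=1.2, k=5))
```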

The proof of Theorem 5 is as follows.

Proof

The total error of estimating \(\pi \) satisfies

$$\begin{aligned} \Vert \tilde{\pi }-\pi \Vert _2 \le (1-c)\left[ \sum _{k=1}^K c^k \Vert \widetilde{P^k} - P^k\Vert _2\right] \Vert v\Vert _2 \end{aligned}$$
(24)

where \(c\) is a constant and \(v\) is a constant vector.

According to the error analysis of the direct sampling of \(P^k\) in Theorem 3,

$$\begin{aligned} \Vert \widetilde{P}^k-P^k\Vert _2 \le k\eta \alpha \Vert P\Vert _F^k, \end{aligned}$$

where \(\eta >0\) is a small constant; thus, the total error of estimating \(\pi \) by the direct sampling method is proportional to

$$\begin{aligned} \sum _{k=1}^K c^k \Vert \widetilde{P}^k - P^k\Vert _2 \le \sum _{k=1}^{K}c^k \Vert P\Vert _F^{k} k\alpha \end{aligned}$$

Since

$$\begin{aligned} \sum _{k=1}^{K}c^k \Vert P\Vert _F^{k} k\alpha&= \alpha \left[ c \Vert P\Vert _F + 2c^2 \Vert P\Vert _F^2 +\cdots + k c^k \Vert P\Vert _F^{k} +\cdots + K c^K \Vert P\Vert _F^{K}\right] \nonumber \\&= \alpha \left[ \left( c \Vert P\Vert _F +\cdots + c^k \Vert P\Vert _F^{k} +\cdots + c^K \Vert P\Vert _F^{K}\right) \right. \nonumber \\&\quad \left. + \left( c^2 \Vert P\Vert _F^2 +\cdots + c^k \Vert P\Vert _F^{k} +\cdots + c^K \Vert P\Vert _F^{K}\right) +\cdots + c^K\Vert P\Vert _F^{K}\right] \nonumber \\&= \alpha \sum _{k=1}^{K} c^k \Vert P\Vert _F^{k} + \alpha \sum _{k=2}^{K} c^k \Vert P\Vert _F^{k} +\cdots + \alpha c^K\Vert P\Vert _F^{K} \nonumber \end{aligned}$$

we have

$$\begin{aligned} \sum _{k=1}^K c^k \Vert \widetilde{P}^k - P^k\Vert _2&\le \sum _{k=1}^{K}c^k \Vert P\Vert _F^{k} k\alpha \nonumber \\&= \alpha \sum _{k=1}^{K} c^k \Vert P\Vert _F^{k} + \alpha \sum _{k=2}^{K} c^k \Vert P\Vert _F^{k} +\cdots + \alpha c^K\Vert P\Vert _F^{K}. \end{aligned}$$
(25)

With a certain choice of \(\alpha _i\) for \(1 \le i \le K\), according to the error analysis of the adaptive sampling of \(P^k\) in Theorem 4,

$$\begin{aligned} \Vert \widetilde{P^k}-P^k\Vert _2 \le \eta \Vert P\Vert _F^k \sum _{i=1}^{k}\alpha _{i}, \end{aligned}$$

where \(\eta >0\) is a small constant; thus, the total error of estimating \(\pi \) by the adaptive sampling method is proportional to

$$\begin{aligned} \sum _{k=1}^K c^k \Vert \widetilde{P^k} - P^k\Vert _2&\le \sum _{k=1}^{K}c^k\Vert P\Vert _F^{k}\left( \sum _{i=1}^{k} \alpha _i\right) \nonumber \\&= c \Vert P\Vert _F \alpha _1 + c^2\Vert P\Vert _F^{2}(\alpha _1 + \alpha _2)\nonumber \\&\quad \!+\!\cdots + c^k\Vert P\Vert _F^{k}(\alpha _1 +\cdots + \alpha _k) \!+\!\cdots \!+\! c^K\Vert P\Vert _F^{K}(\alpha _1 +\cdots + \alpha _K) \nonumber \\&= \alpha _1 \sum _{k=1}^{K} c^k \Vert P\Vert _F^{k} + \alpha _2 \sum _{k=2}^{K} c^k \Vert P\Vert _F^{k} +\cdots + \alpha _{K} c^K\Vert P\Vert _F^{K} \end{aligned}$$
(26)

\(\square \)
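To see how the two coefficient patterns compare, the following snippet (our own numeric illustration) evaluates both totals: direct sampling pays the same \(\alpha \) in every inner sum (Eq. 25), whereas adaptive sampling pays \(\alpha _i\) in the \(i\)th inner sum (Eq. 26), so larger, cheaper rates can be shifted onto the later, geometrically smaller sums:

```python
# Evaluate the regrouped totals of Eqs. 25 and 26 for chosen rates.
def inner_sum(c, p_fro, lo, K):
    return sum(c ** k * p_fro ** k for k in range(lo, K + 1))

c, p_fro, K = 0.85, 1.0, 20
alpha = 1.5                                    # fixed rate (direct sampling)
alphas = [1.1 ** i for i in range(1, K + 1)]   # adaptive rates alpha_i = a^i

direct = sum(alpha * inner_sum(c, p_fro, i, K) for i in range(1, K + 1))
adaptive = sum(a_i * inner_sum(c, p_fro, i, K)
               for i, a_i in enumerate(alphas, 1))
print(direct, adaptive)
```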


About this article

Cite this article

Liu, W., Li, G. & Cheng, J. Fast PageRank approximation by adaptive sampling. Knowl Inf Syst 42, 127–146 (2015). https://doi.org/10.1007/s10115-013-0691-1

