A new statistical approach to estimate global file populations from local observations in the eDonkey P2P file sharing system

Annals of Telecommunications - annales des télécommunications

Abstract

In this paper, we propose a new statistical approach for estimating global population statistics from local observations, based on what are known in biology as capture–recapture methods. Evaluating population sizes in P2P systems has received much attention lately, as these sizes may be useful to set system parameters, to derive other system statistics, or to predict system performance. Because these systems are very large, encompassing several million users, and highly distributed, estimating population sizes is a challenging task. More precisely, we are interested in estimating the number of file replicas in the system, i.e., the size of the population of users possessing given files. To this end, we propose a capture–recapture method that is both computationally efficient and accurate. The proposed method allows global population statistics to be derived from local, time-limited observations. We apply the method to a measurement data set collected over several days on a residential network, and we compare the results obtained from direct counting procedures with those derived with the proposed methodology.
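
As a rough illustration of the capture–recapture idea only, the sketch below implements the classical two-sample Lincoln–Petersen estimator from the ecology literature. It is not the estimator developed in the paper. The function name and the peer identifiers are hypothetical, and the population is assumed to be closed between the two observations.

```python
def lincoln_petersen(first_sample, second_sample):
    """Classical two-sample capture-recapture estimate of a closed population.

    Illustration only: this is the textbook Lincoln-Petersen estimator,
    not the method proposed in the paper.
    """
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)  # individuals seen in both samples
    if recaptured == 0:
        raise ValueError("no individual seen in both samples; estimate undefined")
    return len(first) * len(second) / recaptured


# Hypothetical peer identifiers: 100 peers seen in the first observation,
# 80 in the second, and 30 of them in both.
sample_a = range(0, 100)
sample_b = range(70, 150)
print(lincoln_petersen(sample_a, sample_b))
```

With these made-up samples, only 150 distinct peers are observed directly, while the estimate is 100 × 80 / 30, about 267, which is the kind of gap between direct counting and estimation that the abstract refers to.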

Notes

  1. Assumption \(({\cal H}1)\) is crucial in traditional population estimation, as the samples are taken in successive time periods. In the current context, we apply analogous methods to samples taken over identical time periods. If the population varies during the observation and assumption \(({\cal H}2)\) is valid, we estimate the total number of peers that belonged to the population at any time during the observation, rather than the average population. Both values are close if the average time in the system is long compared to the observation period.

References

  1. Anagnostopoulos I, Stavropoulos P, Kouzas G, Anagnostopoulos C, Vergados DD (2006) Estimating the evolution of categorized web page populations. In: ICWE ’06: workshop proceedings of the sixth international conference on Web engineering. ACM, New York, p 13

  2. Bawa M, Garcia-Molina H, Gionis A, Motwani R (2003) Estimating aggregates on a peer-to-peer network. Technical report, Dept of Computer Science, Stanford University

  3. Brown P, Petrovic S (2009) Large scale analysis of the eDonkey P2P file sharing system. In: INFOCOM, Rio de Janeiro, Brazil, pp 2746–2750

  4. Brown P, Petrovic S (2009) A new statistical approach to estimate global file populations in the eDonkey P2P file sharing system. In: International teletraffic congress, Paris, France

  5. Feller W (1968) An introduction to probability theory and its applications, vol 1, 3rd edn. Wiley, New York

  6. Fessant FL, Handurukande SB, Kermarrec AM, Massoulié L (2004) Clustering in peer-to-peer file sharing workloads. In: IPTPS, lecture notes in computer science, vol 3279. Springer, Berlin, pp 217–226

  7. Gazey W, Staley M (1986) Population estimation from mark–recapture experiments using a sequential Bayes algorithm. Ecology 67:941–951

  8. Handurukande S, Kermarrec A, Fessant FL, Massoulié L, Patarin S (2006) Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In: EuroSys’06. Leuven, Belgium

  9. Krebs CJ (1989) Ecological methodology. Harper and Row, New York

  10. Massoulié L, Merrer EL, Kermarrec AM, Ganesh A (2006) Peer counting and sampling in overlay networks: random walk methods. In: PODC ’06: proceedings of the twenty-fifth annual ACM symposium on principles of distributed computing, New York, NY, USA, pp 123–132

  11. Petrovic S (2008) Towards a better understanding of eMule. Ph.D. thesis, University of Nice–Sophia Antipolis

  12. Petrovic S, Brown P, Costeux JL (2007) Unfairness in the e-mule file sharing system. In: International teletraffic congress, Ottawa, Canada, pp 594–605

  13. Plissonneau L, Costeux JL, Brown P (2006) Detailed analysis of eDonkey transfers on ADSL. In: 2nd EuroNGI conference on next generation internet design and engineering, Valencia, Spain

  14. Ricker WE (1975) Computation and interpretation of biological statistics of fish populations. Fish Res Board Can 191:1–382

  15. Schumacher FX, Eschmeyer RW (1943) The estimate of fish population in lakes or ponds. J Tenn Acad Sci 18:228–249

  16. Schwarz C, Seber G (1999) Estimating animal abundance: review III. Stat Sci 14:427–56

  17. Seber G (1982) The estimation of animal abundance and related parameters, 2nd edn. Charles Griffin & Co, London

  18. Steiner M, Biersack EW, En Najjary T (2007) Actively monitoring peers in KAD. In: IPTPS’07, 6th international workshop on peer-to-peer systems. Bellevue, USA

  19. Stutzbach D, Rejaie R (2006) Understanding churn in peer-to-peer networks. In: Internet measurement conference

Author information

Corresponding author

Correspondence to Patrick Brown.

Additional information

This paper is an extended version of a paper presented at the 21st International Teletraffic Congress, Paris, 15–17 September 2009 [4].

Appendix (Proof of proposition 4)

Proof

Denote \(M = M_S\). We prove that the ratio \(\frac{r(n)}{r(n-1)}\) is larger than one below a threshold \(n_0\) and smaller beyond it. It is sufficient to prove that, for \(x \in [0,\frac{1}{M}[\), the function

$$ f(x)=\ln\left(\frac{1}{1-x M}\prod\limits_{i=0}^{S-1}(1-x C_i )\right) $$

is negative in a neighborhood of 0 and changes sign only once. Note that f(0) = 0. Also:

$$ f'(x)=\frac{M}{1-x M}-\sum\limits_{i=0}^{S-1}\frac{C_i}{1-x C_i }. $$

We have \(f'(0)=M-\sum_{i=0}^{S-1}C_i<0\), as the number of distinct peers obtained in the end, M, must be smaller than the sum of the numbers of peers obtained in each sample set, \(\sum_{i=0}^{S-1}C_i\). This proves that f(x) < 0 on some interval ]0,ε[. As \(\lim_{x\rightarrow {\frac{1}{M}}} f(x)=\infty\), f(x) changes sign at least once. We now prove that f(x) is convex at any point where its derivative is positive or null. The derivative must be positive or null when f(x) becomes positive for the first time. Thus, its derivative will stay positive from then on, which completes the proof. To prove convexity at points of positive derivative, we write:

$$ f''(x)=\left(\frac{M}{1-x M}\right)^2-\sum_{i=0}^{S-1}\left(\frac{C_i}{1-x C_i }\right)^2. $$

Let \(a=\frac{M}{1-x M}>0\) and \(b_i=\frac{C_i}{1-x C_i }>0\). Then:

$$\begin{array}{rll} f'(x)\geq 0 &\Rightarrow& a\geq\sum_{i=0}^{S-1}b_i\\ &\Rightarrow& a^2\geq\left(\sum_{i=0}^{S-1}b_i\right)^2 >\sum_{i=0}^{S-1}b_i^2 \\ &\Rightarrow& f''(x)>0. \end{array} $$

Thus f(.) changes sign only once. □
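
The single sign change can also be checked numerically. The sketch below is a minimal illustration, not part of the proof: it evaluates f(x) as defined above and locates its positive root by bisection. The values of M and C_i are hypothetical, the function names are our own, and reading 1/x at the root as the threshold \(n_0\) assumes that x stands for 1/n in the ratio \(\frac{r(n)}{r(n-1)}\).

```python
import math


def f(x, M, C):
    """f(x) = ln( prod_i (1 - x*C_i) / (1 - x*M) ), for 0 <= x < 1/M."""
    return sum(math.log1p(-x * c) for c in C) - math.log1p(-x * M)


def sign_change_point(M, C, iterations=200):
    """Bisection for the positive root of f in ]0, 1/M[.

    Relies on the property shown in the proof: f is negative just after 0
    and changes sign exactly once. Requires max(C) < M < sum(C).
    """
    if not (max(C) < M < sum(C)):
        raise ValueError("expected max(C) < M < sum(C)")
    lo, hi = 0.0, 1.0 / M  # f <= 0 at lo; f tends to +infinity at hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid, M, C) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical data: two samples of 100 and 80 peers, 150 distinct peers overall.
M, C = 150, [100, 80]
x_star = sign_change_point(M, C)
print(x_star, 1.0 / x_star)
```

For two samples the root can be found by hand: (1 − 100x)(1 − 80x) = 1 − 150x gives x = 30/8000, so 1/x is about 267, which the bisection reproduces.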

About this article

Cite this article

Brown, P., Petrovic, S. A new statistical approach to estimate global file populations from local observations in the eDonkey P2P file sharing system. Ann. Telecommun. 66, 5–16 (2011). https://doi.org/10.1007/s12243-010-0202-2

  • DOI: https://doi.org/10.1007/s12243-010-0202-2
