
Revisiting the coupon collector’s problem to unveil users’ online sessions in networked systems

Peer-to-Peer Networking and Applications

Abstract

Accurate comprehension of users’ behavior is paramount for understanding the dynamics of several systems, such as e-commerce platforms, social networks, and mobile computing. To this end, several strategies have been proposed to obtain data sets based on the capture of usage information, which can then serve for user analytics. A popular strategy consists of taking periodic snapshots of online users, a practical instance of the coupon collector’s problem tailored to user monitoring in networked systems. Due to system-specific limitations, however, users may fail to appear in some snapshots even though they are online. To bridge this gap, we present a methodology to correct ill-collected snapshots and build more accurate data sets. In summary, we formally model user snapshotting as an instance of the coupon collector’s problem, estimate the probability that some users are missing in a given snapshot following a Bernoulli process, and correct those snapshots should the probability exceed a given threshold.
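As a rough illustration of the idea, the sketch below works under a simplified model that is an assumption of this example and is not taken from the paper: `n_online` users are online, the monitor issues `requests` list requests per snapshot, and each request returns `list_size` distinct users drawn uniformly at random, independently of the other requests. A given user is then absent from one list with probability 1 - list_size/n_online, and absent from the whole snapshot with that value raised to the number of requests. The function names, the decision rule, and the default threshold are all hypothetical.

```python
def miss_probability(n_online: int, list_size: int, requests: int) -> float:
    """Probability that one given online user shows up in none of the lists.

    Assumption (not from the paper): every request returns `list_size`
    distinct users drawn uniformly at random, independently across requests.
    """
    if n_online <= 0 or list_size < 0 or requests < 0:
        raise ValueError("n_online must be positive; other arguments non-negative")
    hit = min(list_size, n_online) / n_online  # chance of being in one list
    return (1.0 - hit) ** requests


def expected_missing(n_online: int, list_size: int, requests: int) -> float:
    """Expected number of online users absent from the snapshot.

    Follows from linearity of expectation over the per-user miss events.
    """
    return n_online * miss_probability(n_online, list_size, requests)


def needs_correction(n_online: int, list_size: int, requests: int,
                     threshold: float = 0.05) -> bool:
    """Hypothetical rule: flag the snapshot when the per-user miss
    probability exceeds `threshold`."""
    return miss_probability(n_online, list_size, requests) > threshold
```

For instance, with 100 online users, lists of 50 users, and two requests per snapshot, a given user is missed with probability 0.25 under this model, so 25 users are missed on average and the snapshot would be flagged at a 0.05 threshold.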

Notes

  1. Link to trace files: https://www.inf.ufrgs.br/~wlccordeiro/swarm_traces/

  2. Link to the Github repo with scripts and source code of the software used in this work: https://github.com/ComputerNetworks-UFRGS/TraceCollection/

  3. Parameter min interval, in https://wiki.theory.org/BitTorrentSpecification

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Author information

Corresponding author

Correspondence to Weverton Cordeiro.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 A Formal Proof of Theorems 1 and 2

Theorem 1

Let c be the budget required for capturing snapshots from G, and let \(L(v_{i}, b_{k})\) be the acquaintance list returned by \(b_{k} \in B\) upon request from \(v_{i} \in V\). Considering that crawler \(v_{i}\) makes only one acquaintance list request per snapshot, we then have \(s_{t} = \bigcup _{i=1}^{c} L(v_{i}, b_{i}) = V_{t} \iff L_{max} \to \infty \), \(c \geq |B|\), and each bootstrap entity is queried at least once.

Proof

\(A \Rightarrow B\). Suppose that each user \(v_{i} \in V\) is regarded as currently online by a single bootstrap entity \(b_{k} \in B\) only. We thus have from Definition 1 that \(\bigcup _{k=1}^{|B|} e(b_{k}) = V_{t}\) and \(\bigcap _{k=1}^{|B|} e(b_{k}) = \emptyset \). From Definition 3, we have \(L(v_{i}, b_{k}) \subseteq e({b_{k}})\). Now consider the restriction that a crawler \(v_{i} \in V\) makes only one acquaintance list request per snapshot. In this case, obtaining a full snapshot of the system (\(s_{t} = V_{t}\)) requires contacting each of the bootstrap entities in the system and having \(L(v_{i}, b_{k}) = e(b_{k})\). The condition for \(L(v_{i}, b_{k}) = e(b_{k})\) being true regardless of the size of a bootstrap acquaintance list is that \(L_{max} \to \infty \), whereas the condition for \(\bigcup _{i=1}^{c} L(v_{i}, b_{i}) = V_{t}\) is that there is at least one crawler for each bootstrap entity (i.e., \(c \geq |B|\)), and that each bootstrap entity be queried by a crawler at least once.

\(B \Rightarrow A\). Suppose that \(L_{max} \to \infty \). In this case, \(L(v_{i}, b_{k}) = e(b_{k})\). Suppose also that \(c = |B|\) and that each bootstrap entity is queried at least once when building a snapshot set. In this case, we have \(\bigcup _{i=1}^{c} L(v_{i}, b_{i}) = V_{t}\), which also respects the restriction that each crawler makes only one acquaintance list request per snapshot. □
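The conditions of Theorem 1 can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: the names `bootstrap_snapshot`, `entities`, and `online` are hypothetical, the sets \(e(b_{k})\) are disjoint as in the proof, and a bounded list is modelled by returning the first `l_max` users in sorted order, whereas a real bootstrap entity would more likely return an arbitrary subset.

```python
def bootstrap_snapshot(entities, queried, l_max=None):
    """Union of the acquaintance lists returned by the queried bootstrap entities.

    entities: dict mapping a bootstrap id b_k to the set e(b_k) of users it
              regards as online.
    queried:  iterable of bootstrap ids, one list request per crawler.
    l_max:    maximum list size; None models L_max -> infinity.
    """
    snapshot = set()
    for b in queried:
        known = sorted(entities[b])
        # Assumption: a bounded list is the first l_max users in sorted order,
        # so that L(v_i, b_k) is always a subset of e(b_k).
        returned = known if l_max is None else known[:l_max]
        snapshot.update(returned)
    return snapshot


# Toy system: two bootstrap entities with disjoint views of the online users.
entities = {"b1": {"u1", "u2", "u3"}, "b2": {"u4", "u5"}}
online = set().union(*entities.values())

full = bootstrap_snapshot(entities, ["b1", "b2"])               # c = |B|, unbounded lists
too_few = bootstrap_snapshot(entities, ["b1"])                  # c < |B|
truncated = bootstrap_snapshot(entities, ["b1", "b2"], l_max=2) # bounded lists
```

Only `full` equals `online`. Dropping a bootstrap entity loses the users only it knows about, and bounding the list size loses `u3`, which matches the two conditions used in the proof.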

Theorem 2

Let c be the budget required for capturing snapshots from G, and let \(L(v'_{i}, v_{j})\) be the acquaintance list returned by \(v_{j} \in V_{t}\) upon request from an instrumented client \(v'_{i} \in V\). Considering that client \(v'_{i}\) makes one acquaintance list request per snapshot, we then have \(s_{t} = \bigcup _{i=1}^{c} L(v'_{i}, v_{i}) = V_{t} \iff \bigcup _{j=1}^{|V_{t}|} e({v_{j}}) = V_{t}\), \(L_{max} \to \infty \), c ≥ |VtV|, and each user \(v_{i}\) is queried at least once.

Proof

\(A \Rightarrow B\). Suppose that each ordinary user \(v_{i} \in V\) is regarded as online by at least one user \(v_{j} \in V\) (i.e., \(E_{t} = V_{t} \times V_{t}\)). This situation may lead to several possible configurations in the acquaintance graph; it ranges from one in which the graph forms a single cycle, to another one in which just a single user \(v_{i}\) is aware of all other users \(v_{k}\) in the system, and some other user \(v_{j}\) is aware of \(v_{i}\) as being online. We thus have from Definition 2 that \(\bigcup _{j=1}^{|V_{t}|} e(v_{j}) = V_{t}\) and \(\bigcap _{j=1}^{|V_{t}|} e(v_{j}) = \emptyset \). From Definition 4, we have \(L(v_{i}, v_{j}) \subseteq e({v_{j}})\). Now consider the restriction that a crawler \(v_{i} \in V\) makes only one acquaintance list request per snapshot. In this case, obtaining a full snapshot of the system (\(s_{t} = V_{t}\)) requires contacting each of the users in the system and having \(L(v'_{i}, v_{j}) = e(v_{j})\). The condition for \(L(v'_{i}, v_{j}) = e(v_{j})\) being true regardless of the size of a user’s acquaintance list is that \(L_{max} \to \infty \). The condition for \(\bigcup _{i=1}^{c} L(v'_{i}, v_{i}) = V_{t}\) is that (i) there is at least one crawler for each ordinary user (c ≥ |VtV|), (ii) each ordinary user is queried by a crawler at least once, and (iii) every user is regarded as currently online by at least one other user in the system (which results in \(\bigcup _{j=1}^{|V_{t}|} e({v_{j}}) = V_{t}\)).

\(B \Rightarrow A\). Suppose that \(L_{max} \to \infty \). In this case, \(L(v'_{i}, v_{i}) = e(v_{i})\). Suppose also that c ≥ |VtV|, that each ordinary user is queried at least once when building a snapshot set, and that every user is regarded as currently online by at least one other user in the system (i.e., \(\bigcup _{j=1}^{|V_{t}|} e({v_{j}}) = V_{t}\)). In this case, by querying all ordinary users with the available budget and taking the union of the peer lists obtained, we obtain a full snapshot of the system (\(\bigcup _{i=1}^{c} L(v'_{i}, v_{i}) = V_{t}\)). Observe that since we have a crawler for each ordinary user, the restriction that each crawler makes only one acquaintance list request per snapshot is respected. □
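A toy acquaintance graph makes the extra condition of Theorem 2 concrete. This is a minimal sketch under stated assumptions: `peer_snapshot`, `ring`, and `unknown_user` are hypothetical names, each dictionary maps a user \(v_{j}\) to the set \(e(v_{j})\) of users it regards as online, and a bounded list is modelled as the first `l_max` users in sorted order.

```python
def peer_snapshot(acquaintances, queried, l_max=None):
    """Union of the acquaintance lists returned by the queried users.

    acquaintances: dict mapping a user v_j to the set e(v_j) of users it
                   regards as online.
    queried:       iterable of users, one list request per instrumented client.
    l_max:         maximum list size; None models L_max -> infinity.
    """
    snapshot = set()
    for v in queried:
        known = sorted(acquaintances[v])
        # Assumption: a bounded list is the first l_max users in sorted order.
        snapshot.update(known if l_max is None else known[:l_max])
    return snapshot


# Single-cycle configuration from the proof: every user is known by exactly one peer.
ring = {"u1": {"u2"}, "u2": {"u3"}, "u3": {"u1"}}
ring_all = peer_snapshot(ring, ring)              # every user queried once
ring_partial = peer_snapshot(ring, ["u1", "u2"])  # u3 is never queried

# Here no peer regards u3 as online, so the union of e(v_j) does not cover V_t.
unknown_user = {"u1": {"u2"}, "u2": {"u1"}, "u3": {"u1"}}
unknown_all = peer_snapshot(unknown_user, unknown_user)
```

In the ring, querying every user recovers all three users, while skipping `u3` loses `u1`, the only user `u3` knows about. In the second graph the snapshot misses `u3` even though every user is queried with unbounded lists, because no peer lists `u3` as online.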

About this article

Cite this article

Cordeiro, W., Gaspary, L., Beltran, R. et al. Revisiting the coupon collector’s problem to unveil users’ online sessions in networked systems. Peer-to-Peer Netw. Appl. 14, 687–707 (2021). https://doi.org/10.1007/s12083-020-01012-2

  • DOI: https://doi.org/10.1007/s12083-020-01012-2
