Fine-grained probability counting for cardinality estimation of data streams

World Wide Web

Abstract

Estimating the number of distinct flows in a stream, known as its cardinality, is important in many network applications, such as traffic measurement and anomaly detection. The challenge is to achieve high accuracy at line speed with only a small amount of auxiliary memory. The Flajolet-Martin, LogLog, and HyperLogLog algorithms form a line of work in this area, each improving on its predecessor. In this paper, we propose refined versions of these algorithms that achieve higher accuracy. The key observations are that (1) the “leftmost” hash functions used by these algorithms can be generalized to reach higher accuracy, and (2) the amendment coefficient can be highly biased for certain streams or datasets, so setting it dynamically instead of using the value derived purely mathematically can lead to much better accuracy. Experimental results show that the refined versions are substantially more accurate and more stable than the original algorithms.
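For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal Python sketch of the standard HyperLogLog estimator, not the refined algorithms proposed in the paper. The class and function names, the choice of SHA-1 as the hash, and the default of 1024 registers are assumptions made for illustration. The function `rank` plays the role of the "leftmost" function (position of the leftmost 1-bit), and `alpha` is the fixed, mathematically derived constant that presumably corresponds to the abstract's "amendment coefficient".

```python
import hashlib
import math


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rank(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit word.

    Returns width + 1 when w is zero (no 1-bit found).
    """
    return width - w.bit_length() + 1


class HyperLogLog:
    """Standard HyperLogLog with m = 2**b registers (illustrative only)."""

    def __init__(self, b: int = 10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # Fixed bias-correction constant from the HyperLogLog analysis,
        # valid for m >= 128. The paper argues for setting such a
        # coefficient dynamically; that refinement is not shown here.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        x = hash64(item)
        j = x >> (64 - self.b)               # first b bits pick a register
        w = x & ((1 << (64 - self.b)) - 1)   # remaining bits feed rank()
        self.registers[j] = max(self.registers[j], rank(w, 64 - self.b))

    def estimate(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Each register stores only the maximum rank seen, so re-adding an item never changes the state, which is what makes the structure count distinct items rather than total items. With 1024 registers the expected relative standard error of the standard estimator is roughly 1.04 / sqrt(1024), about 3%.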

Acknowledgments

This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), and the Open Project Funding of the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences.

Author information

Corresponding author

Correspondence to Tong Yang.

Additional information

This work was done by Lun Wang, Zekun Cai, and Hao Wang under the guidance of their mentor: Tong Yang.

This article belongs to the Topical Collection: Special Issue on Big Data Management and Intelligent Analytics

Guest Editors: Junping Du, Panos Kalnis, Wenling Li, and Shuo Shang

About this article

Cite this article

Wang, L., Yang, T., Wang, H. et al. Fine-grained probability counting for cardinality estimation of data streams. World Wide Web 22, 2065–2081 (2019). https://doi.org/10.1007/s11280-018-0583-0

  • DOI: https://doi.org/10.1007/s11280-018-0583-0
