Fine-grained probability counting for cardinality estimation of data streams

Wang, Lun; Yang, Tong; Wang, Hao; Jiang, Jie; Cai, Zekun; Cui, Bin; Li, Xiaoming

doi:10.1007/s11280-018-0583-0

Published: 04 May 2018

Volume 22, pages 2065–2081 (2019)

World Wide Web

Abstract

Estimating the number of distinct flows in a stream, known as its cardinality, is important in many network applications, such as traffic measurement and anomaly detection. The challenge is to achieve high accuracy at line speed with little auxiliary memory. The Flajolet-Martin, LogLog, and HyperLogLog algorithms form a line of work in this area with successively improving performance. In this paper, we propose refined versions of these algorithms that achieve higher accuracy. The key observations are that (1) the “leftmost” hash functions used by these algorithms can be generalized to reach higher accuracy, and (2) the amendment coefficient can be highly biased on certain streams or datasets, so setting it dynamically, instead of using the value derived mathematically, can yield much better accuracy. Experimental results show that the refined versions substantially improve accuracy and stability over the original algorithms.
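
For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal version of the standard HyperLogLog estimator, not the refined algorithms proposed in the paper, whose details are not given in this abstract. The class name `HyperLogLog`, the parameter `b`, and the use of SHA-1 as the hash function are assumptions made for illustration. The rank computed in `add` is the position of the leftmost 1-bit, which is presumably what the abstract calls the “leftmost” hash function, and the constant `alpha` presumably plays the role of the amendment coefficient derived mathematically.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog sketch (illustrative baseline, not the paper's refinement)."""

    HASH_BITS = 64

    def __init__(self, b=10):
        # b index bits give m = 2**b registers; the alpha formula below
        # is the standard approximation, valid for m >= 128 (b >= 7).
        if b < 7 or b > 16:
            raise ValueError("b must be between 7 and 16 in this sketch")
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash64(self, item):
        # Assumption: SHA-1 truncated to 64 bits stands in for a uniform hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        tail_bits = self.HASH_BITS - self.b
        index = x >> tail_bits                 # first b bits select a register
        tail = x & ((1 << tail_bits) - 1)      # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in the tail
        # (tail_bits + 1 if the tail is all zeros).
        rank = tail_bits - tail.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        # Harmonic mean of 2**register, scaled by alpha * m**2.
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With `b = 10` the sketch keeps 1024 small registers, and the standard analysis gives a relative standard error of roughly 1.04/√m, about 3% here. Feeding the same item repeatedly leaves the registers unchanged, which is why the structure counts distinct flows rather than packets.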

Acknowledgments

This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), and the Open Project Funding of the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences.

Author information

Authors and Affiliations

Department of Computer Science, Peking University, Beijing, China
Lun Wang, Tong Yang, Hao Wang, Jie Jiang, Zekun Cai, Bin Cui & Xiaoming Li

Corresponding author

Correspondence to Tong Yang.

Additional information

This work was done by Lun Wang, Zekun Cai, and Hao Wang under the guidance of their mentor: Tong Yang.

This article belongs to the Topical Collection: Special Issue on Big Data Management and Intelligent Analytics

Guest Editors: Junping Du, Panos Kalnis, Wenling Li, and Shuo Shang

About this article

Cite this article

Wang, L., Yang, T., Wang, H. et al. Fine-grained probability counting for cardinality estimation of data streams. World Wide Web 22, 2065–2081 (2019). https://doi.org/10.1007/s11280-018-0583-0

Received: 07 March 2018
Revised: 24 April 2018
Accepted: 26 April 2018
Published: 04 May 2018
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s11280-018-0583-0
