Abstract
Estimating the number of distinct flows, also called the cardinality, is an important problem in many network applications, such as traffic measurement and anomaly detection. The challenge is to achieve high accuracy at line speed with a small amount of auxiliary memory. The Flajolet-Martin, LogLog, and HyperLogLog algorithms form a line of work in this area with successively improving performance. In this paper, we propose refined versions of these algorithms that achieve higher accuracy. The key observations are that (1) the “leftmost” hash functions used by these algorithms can be generalized to reach higher accuracy, and (2) the amendment coefficient can be highly biased on certain streams or datasets, so setting it dynamically, instead of using the value derived purely mathematically, can yield much better accuracy. Experimental results show that the refined versions substantially improve accuracy and stability over the original algorithms.
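For orientation, the following is a minimal Python sketch of the standard HyperLogLog baseline that the abstract refers to. It is not the refined algorithm proposed in the paper. The helper names (`_hash64`, `rank`, `hyperloglog`), the SHA-1-based hash, and the register count are illustrative assumptions. We also assume that the "amendment coefficient" corresponds to the fixed bias-correction constant, called `alpha` here, and that the "leftmost" function corresponds to `rank`, the position of the leftmost 1-bit.

```python
import hashlib
import math


def _hash64(item: str) -> int:
    """64-bit hash of item. SHA-1 truncation is an assumption; any uniform hash works."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def rank(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit word; width + 1 if w == 0."""
    return width - w.bit_length() + 1


def hyperloglog(items, b: int = 10) -> float:
    """Estimate the number of distinct items using m = 2**b registers (assumes b >= 7)."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = _hash64(item)
        j = h >> (64 - b)                 # first b bits select a register
        w = h & ((1 << (64 - b)) - 1)     # remaining 64 - b bits
        registers[j] = max(registers[j], rank(w, 64 - b))

    # Fixed, mathematically derived constant (valid for m >= 128).
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    flows = [f"flow-{i}" for i in range(100_000)]
    print(round(hyperloglog(flows)))
```

With m = 1024 registers the expected relative standard error is roughly 1.04/sqrt(m), about 3%. In this sketch `alpha` depends only on m; the abstract's second observation is that such a fixed value can be biased on particular datasets.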
Acknowledgments
This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), and the Open Project Funding of the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences.
Additional information
This work was done by Lun Wang, Zekun Cai, and Hao Wang under the guidance of their mentor: Tong Yang.
This article belongs to the Topical Collection: Special Issue on Big Data Management and Intelligent Analytics
Guest Editors: Junping Du, Panos Kalnis, Wenling Li, and Shuo Shang
Cite this article
Wang, L., Yang, T., Wang, H. et al. Fine-grained probability counting for cardinality estimation of data streams. World Wide Web 22, 2065–2081 (2019). https://doi.org/10.1007/s11280-018-0583-0
DOI: https://doi.org/10.1007/s11280-018-0583-0