Optimizing the confidence bound of count-min sketches to estimate the streaming big data query results more precisely

Computing

Abstract

A count-min sketch is a probabilistic data structure that serves as a frequency table of events when processing a stream of big data. It uses hash functions to map events to frequencies. Querying a count-min sketch returns the targeted event along with an estimated frequency, which is never less than the actual frequency. The estimation error, i.e., the difference between the estimated and the actual frequency, can be measured by a pre-defined confidence bound. However, the originally defined bound is too loose, because the Markov inequality used to derive it does not give a tight estimate. In this paper, we define a tighter bound based on the binomial distribution and the central limit theorem. We show that the reliability of the bound is related to the deviation of the data, which can be measured by the data’s coefficient of standard deviation. Our extensive experiments support the effectiveness and efficiency of the new bound.
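
As an illustration of the structure described above, the following Python sketch implements a basic count-min sketch sized by the classical parameters of Cormode and Muthukrishnan: a width of ceil(e/epsilon) and a depth of ceil(ln(1/delta)) keep the estimate within epsilon * N of the true frequency with probability at least 1 - delta, where N is the total count inserted. It shows the original bound only, not the tighter bound proposed in this paper. The class name, the salted BLAKE2b hashing, and the sample stream are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    """Minimal count-min sketch with the classical (epsilon, delta) sizing."""

    def __init__(self, epsilon: float, delta: float) -> None:
        # Classical sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.total = 0  # N: sum of all counts added so far
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row: int, event: str) -> int:
        # Illustrative choice (an assumption): one salted BLAKE2b hash per row.
        digest = hashlib.blake2b(
            event.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, event: str, count: int = 1) -> None:
        """Record `count` occurrences of `event` in every row."""
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, event)] += count

    def estimate(self, event: str) -> int:
        """Return the minimum counter over all rows.

        Collisions can only inflate a counter, so the estimate is never
        below the true frequency.
        """
        return min(
            self.table[row][self._index(row, event)]
            for row in range(self.depth)
        )


# Hypothetical sample stream, used only to exercise the sketch.
stream = ["login"] * 50 + ["logout"] * 7 + ["purchase"] * 3
true_counts = Counter(stream)

sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for event in stream:
    sketch.add(event)

for event, true_count in sorted(true_counts.items()):
    # Each estimate is at least the true count; the classical bound says it
    # exceeds it by more than epsilon * N with probability at most delta.
    print(event, true_count, sketch.estimate(event))
```

Taking the minimum across rows is what makes the estimate one-sided: every counter an event maps to includes its own count plus whatever collided with it, so the smallest counter is the least contaminated one.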

Notes

  1. http://snap.stanford.edu/data/loc-gowalla.html.

Acknowledgements

The study is partially supported by the National Natural Science Foundation of China under Grant Nos. U1711266 and U1711267, and by the Fundamental Research Funds for National University, China University of Geosciences (Wuhan), under Grant No. 1610491B22.

Author information

Corresponding author

Correspondence to Feng Zhang.

About this article

Cite this article

Guo, R., Xue, E., Zhang, F. et al. Optimizing the confidence bound of count-min sketches to estimate the streaming big data query results more precisely. Computing 102, 1419–1445 (2020). https://doi.org/10.1007/s00607-018-00695-z

  • DOI: https://doi.org/10.1007/s00607-018-00695-z
