WavingSketch: an unbiased and generic sketch for finding top-k items in data streams

  • Regular Paper
The VLDB Journal

Abstract

Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is widely acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch provides unbiased estimation, and we derive its error bound. WavingSketch is generic to measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch achieves \(10\times\) faster speed and \(10^3\times\) smaller error on finding frequent items. For the other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).
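
To make the notion of unbiased estimation concrete, the following minimal sketch uses a single shared signed counter in the style of Count Sketch [4]. It is not the WavingSketch data structure; the function names `signed_counter_estimate` and `mean_estimate` and the toy frequencies are illustrative assumptions. Every item adds its frequency to one counter with a random sign of +1 or -1, and an item is estimated by multiplying the counter by its own sign. Averaging over all equally likely sign assignments recovers the true frequency.

```python
from itertools import product


def signed_counter_estimate(freqs, signs, item):
    """Estimate `item`'s frequency from one counter shared by all items."""
    # The counter accumulates sign(e) * frequency(e) over every item e.
    counter = sum(signs[e] * f for e, f in freqs.items())
    # Multiplying by the item's own sign cancels its sign in the sum.
    return signs[item] * counter


def mean_estimate(freqs, item):
    """Average the estimate over every equally likely sign assignment."""
    items = list(freqs)
    total = 0
    count = 0
    for combo in product((+1, -1), repeat=len(items)):
        signs = dict(zip(items, combo))
        total += signed_counter_estimate(freqs, signs, item)
        count += 1
    return total / count


# Toy stream summary (assumed values): true frequency of each item.
freqs = {"a": 5, "b": 3, "c": 1}
for item in freqs:
    print(item, mean_estimate(freqs, item))
```

A single draw can be far off (with all signs equal to +1, the estimate for `a` is 9 rather than 5), but the contributions of the other items cancel in expectation, which is the property the abstract refers to.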

Notes

  1. Notice that in practice the sizes of data streams are often skewed (e.g., following a power-law distribution) [30, 38], where heavy data streams have more items and light data streams have fewer items.

  2. Notice that in our implementation, we always use the Heavy Part rearrangement technique for higher processing speed, even without enabling SIMD acceleration.

  3. We assume \(\alpha >1\) so that the series \(\sum _{i=1}^{\infty } \frac{1}{i^\alpha }\) converges.

  4. Notice that Z is a constant determined by data stream \(\sigma \).

  5. We have formally defined the persistence of an item as the number of time windows in which it appears (see Sect. 2.1).
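
For note 3, a standard integral comparison (a textbook argument, not taken from the paper) shows why \(\alpha > 1\) is enough for convergence and gives an explicit upper bound:

```latex
\[
\sum_{i=1}^{\infty} \frac{1}{i^{\alpha}}
\;\le\; 1 + \int_{1}^{\infty} \frac{dx}{x^{\alpha}}
\;=\; 1 + \frac{1}{\alpha - 1}
\;=\; \frac{\alpha}{\alpha - 1},
\qquad \alpha > 1 .
\]
```

For \(\alpha \le 1\) the series diverges; the case \(\alpha = 1\) is the harmonic series.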

References

  1. Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: IMC. ACM, (2003)

  2. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28(1), 51–55 (2003)

  3. Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng., (2009)

  4. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Automata, Languages and Programming. Springer, (2002)

  5. Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proc. ACM SIGMOD, pp. 795–810. ACM, (2015)

  6. Schweller, R., Li, Z., Chen, Y., et al.: Reversible sketches: enabling monitoring and analysis over high-speed data streams. IEEE/ACM Trans. Netw. 15(5), 1059–1072 (2007)

  7. Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 234–247. ACM, (2003)

  8. Li, Y., Miao, R., Kim, C., Yu, M.: Flowradar: a better netflow for data centers. In: USENIX NSDI, pp. 311–324. USENIX Association, (2016)

  9. Dai, H., Shahzad, M., Liu, A.X., Zhong, Y.: Finding persistent items in data streams. Proc. VLDB Endow. 10(4), 289–300 (2016)

  10. Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New streaming algorithms for fast detection of superspreaders. In: NDSS, (2005)

  11. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)

  12. Estan, C., Varghese, G.: New directions in traffic measurement and accounting. ACM SIGCOMM CCR, 32(4), (2002)

  13. Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory. Springer, (2005)

  14. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proc. VLDB, pp. 346–357, (2002)

  15. Ting, D.: Data sketches for disaggregated subset sum and frequent item estimation. In: SIGMOD Conference, (2018)

  16. Roy, P., Khan, A., Alonso, G.: Augmented sketch: Faster and more accurate stream processing. In: Proc. ACM SIGMOD (2016)

  17. Yang, D., Li, B., Rettig, L., Cudré-Mauroux, P.: \(\text{D}^{2}\)HistoSketch: discriminative and dynamic similarity-preserving sketching of streaming histograms. IEEE Trans. Knowl. Data Eng. 31(10), 1898–1911 (2018)

  18. Buddhika, T., Malensek, M., Pallickara, S.L., Pallickara, S.: Synopsis: A distributed sketch over voluminous spatiotemporal observational streams. IEEE Trans. Knowl. Data Eng. 29(11), 2552–2566 (2017)

  19. Zhao, B., Li, X., Tian, B., Mei, Z., Wu, W.: Dhs: Adaptive memory layout organization of sketch slots for fast and accurate data stream processing. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2285–2293, (2021)

  20. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)

  21. Yang, T., Gong, J., Zhang, H., Zou, L., Shi, L., Li, X.: Heavyguardian: Separate and guard hot items in data streams. In: SIGKDD, (2018)

  22. Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: A meta-framework for faster and more accurate stream processing. In: SIGMOD Conference, (2018)

  23. Huang, Q., Lee, P.P.: Ld-sketch: A distributed sketching design for accurate and scalable anomaly detection in network data streams. In: IEEE INFOCOM 2014-IEEE Conference on Computer Communications, pp. 1420–1428. IEEE, (2014)

  24. Shokrollahi, A.: Raptor codes. IEEE Trans. Inf. Theory 52(6), 2551–2567 (2006)

  25. Lahiri, B., Chandrashekar, J., Tirthapura, S.: Space-efficient tracking of persistent items in a massive data stream. Stat. Anal. Data Mining 7, 70–92 (2011)

  26. Zhang, Y., Li, J., Lei, Y., Yang, T., Li, Z., Zhang, G., Cui, B.: On-off sketch: a fast and accurate sketch on persistence. Proc. VLDB Endow. 14(2), 128–140 (2020)

  27. Yu, M., Jose, L., Miao, R.: Software defined traffic measurement with opensketch. In: NSDI 2013, (2013)

  28. Tang, L., Huang, Q., Lee, P.P.: Spreadsketch: Toward invertible and network-wide detection of superspreaders. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pp. 1608–1617. IEEE, (2020)

  29. Estan, C., Varghese, G., Fisk, M.: Bitmap algorithms for counting active flows on high speed links. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 153–166, (2003)

  30. Zhao, Y., Han, W., Zhong, Z., Zhang, Y., Yang, T., Cui, B.: Double-anonymous sketch: achieving top-k-fairness for finding global top-k frequent items. Proc. ACM Manag. Data 1(1), 1–26 (2023)

  31. Wang, F., Chen, Q., Li, Y., Yang, T., Tu, Y., Yu, L., Cui, B.: Joinsketch: a sketch algorithm for accurate and unbiased inner-product estimation. Proc. ACM Manag. Data 1(1), 1–26 (2023)

  32. Cormode, G., Garofalakis, M.: Sketching streams through the net: Distributed approximate query tracking. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 13–24, (2005)

  33. Liu, Z., Manousis, A., Vorsanger, G., Sekar, V., Braverman, V.: One sketch to rule them all: Rethinking network flow monitoring with univmon. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 101–114, (2016)

  34. Miao, R., Zhang, Y., Qu, G., Yang, K., Yang, T., Cui, B.: Hyper-uss: Answering subset query over multi-attribute data stream. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1698–1709, (2023)

  35. Zhang, Y., Liu, Z., Wang, R., Yang, T., Li, J., Miao, R., Liu, P., Zhang, R., Jiang, J.: Cocosketch: High-performance sketch-based measurement over arbitrary partial key query. In: Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pp. 207–222, (2021)

  36. Rekhter, Y., Li, T., Hares, S.: A border gateway protocol 4 (bgp-4). Technical report, (2006)

  37. Sobrinho, J.L.: Network routing with path vector protocols: Theory and applications. In: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 49–60, (2003)

  38. Park, K., Lee, H.: On the effectiveness of route-based packet filtering for distributed dos attack prevention in power-law internets. ACM SIGCOMM Comput. Commun. Rev. 31(4), 15–26 (2001)

  39. Murmur hashing source codes. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp

  40. Yang, T., Jiang, J., Liu, P., Huang, Q., Gong, J., Zhou, Y., Miao, R., Li, X., Uhlig, S.: Elastic sketch: Adaptive and fast network-wide measurements. In: ACM SIGCOMM, vol. 2018, pp. 561–575 (2018)

  41. Kim, W., Yun, J., Jung, H.: Evaluation of high-frequency financial transaction processing in distributed memory systems. In: Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems, pp. 362–364, (2014)

  42. Zhang, H., Liu, Z., Chen, B., Zhao, Y., Zhao, T., Yang, T., Cui, B.: Cafe: Towards compact, adaptive, and fast embedding for large-scale recommendation models. In: Proceedings of the 2024 ACM International Conference on Management of Data (SIGMOD), (2024)

  43. Flynn, M.: Some computer organizations and their effectiveness. IEEE Trans. Comput. 100, 948–960 (1972)

  44. Li, Y., Wang, F., Yu, X., Yang, Y., Yang, K., Yang, T., Ma, Z., Cui, B., Uhlig, S.: Ladderfilter: Filtering infrequent items with small memory and time overhead. Proc. ACM Manag. Data 1(1), 1–21 (2023)

  45. Liu, Z., Kong, C., Yang, K., Yang, T., Miao, R., Chen, Q., Zhao, Y., Tu, Y., Cui, B.: Hypercalm sketch: One-pass mining periodic batches in data streams. In: 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, (2023)

  46. Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: A meta-framework for faster and more accurate stream processing. In: Proceedings of the 2018 international conference on management of data, pp. 741–756, (2018)

  47. Supplementary materials of wavingsketch. https://github.com/WavingSketch/Waving-Sketch/blob/master/WavingSketch_Supplementary.pdf

  48. Li, J., Li, Z., Xu, Y., Jiang, S., Yang, T., Cui, B., Dai, Y., Zhang, G.: Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1574–1584, (2020)

  49. Powers, D.M.: Applications and explanations of Zipf’s law. In: Proc. EMNLP-CoNLL, Association for Computational Linguistics (1998)

  50. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  51. Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: How good are query optimizers, really? Proc. VLDB Endow. 9(3), 204–215 (2015)

  52. Izenov, Y., Datta, A., Rusu, F., Shin, J.H.: Compass: Online sketch-based query optimization for in-memory databases. In: Proceedings of the 2021 International Conference on Management of Data, pp. 804–816, (2021)

  53. Leis, V., Radke, B., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J. 27(5), 643–668 (2018)

  54. Wang, Y., Yi, K.: Secure yannakakis: Join-aggregate queries over private data. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1969–1981, (2021)

  55. Kutzkov, K., Ahmed, M., Nikitaki, S.: Weighted similarity estimation in data streams. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1051–1060, (2015)

  56. Pruthi, G., Liu, F., Kale, S., Sundararajan, M.: Estimating training data influence by tracing gradient descent. Adv. Neural. Inf. Process. Syst. 33, 19920–19930 (2020)

  57. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 10–20 (1999)

  58. Ganguly, S., Garofalakis, M., Rastogi, R.: Processing data-stream join aggregates using skimmed sketches. In: International Conference on Extending Database Technology, pp. 569–586. Springer, (2004)

  59. Ganguly, S., Kesh, D., Saha, C.: Practical algorithms for tracking database join sizes. In: International Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 297–309. Springer, (2005)

  60. Rusu, F., Dobra, A.: Sketches for size of join estimation. ACM Trans. Database Syst. 33(3), 1–46 (2008)

  61. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 187–198, (2007)

  62. Cai, W., Balazinska, M., Suciu, D.: Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In: Proceedings of the 2019 International Conference on Management of Data, pp. 18–35, (2019)

  63. CAIDA [online]. Available: http://www.caida.org/home

  64. Real-life transactional dataset. http://fimi.ua.ac.be/data/

  65. The Network dataset Internet Traces. http://snap.stanford.edu/data/

  66. Rousskov, A., Wessels, D.: High-performance benchmarking with web polygraph. Software: Practice and Experience (2004)

  67. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Committee Data Eng., 36(4), (2015)

  68. Source code related to WavingSketch. https://github.com/WavingSketch/Waving-Sketch

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2022YFB2901504), the National Natural Science Foundation of China (NSFC) (No. U20A20179, 62372009, 623B2005), and research grant No. SH-2024JK29.

Author information

Corresponding author

Correspondence to Tong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Z., Dong, F., Liu, C. et al. WavingSketch: an unbiased and generic sketch for finding top-k items in data streams. The VLDB Journal 33, 1697–1722 (2024). https://doi.org/10.1007/s00778-024-00869-6

  • DOI: https://doi.org/10.1007/s00778-024-00869-6
