WavingSketch: an unbiased and generic sketch for finding top-k items in data streams

Liu, Zirui; Dong, Fenghao; Liu, Chengwu; Deng, Xiangwei; Yang, Tong; Zhao, Yikai; Li, Jizhou; Cui, Bin; Zhang, Gong

doi:10.1007/s00778-024-00869-6

Regular Paper
Published: 29 July 2024

Volume 33, pages 1697–1722, (2024)

The VLDB Journal

Zirui Liu¹,
Fenghao Dong²,
Chengwu Liu¹,
Xiangwei Deng¹,
Tong Yang¹ (ORCID: orcid.org/0000-0003-2402-5854),
Yikai Zhao¹,
Jizhou Li¹,
Bin Cui¹ &
Gong Zhang³

Abstract

Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is widely acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch provides unbiased estimation, and we derive its error bound. WavingSketch is generic to measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch achieves $10\times$ faster speed and $10^3\times$ smaller error on finding frequent items. For the other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).

Notes

Note that in practice the sizes of data streams are often skewed (e.g., following a power-law distribution) [30, 38]: heavy data streams have more items and light data streams have fewer items.
Note that our implementation always uses the Heavy Part rearrangement technique for higher processing speed, even when SIMD acceleration is not enabled.
We assume $\alpha > 1$ so that the series $\sum_{i=1}^{\infty} \frac{1}{i^\alpha}$ converges.
Note that $Z$ is a constant determined by the data stream $\sigma$.
We formally define the persistence of an item as the number of time windows in which it appears (see Sect. 2.1).
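
The persistence definition in the last note can be illustrated with a small exact (non-sketch) computation. This is only a minimal sketch under stated assumptions, not the paper's method: the function name `exact_persistence` is hypothetical, and time windows are assumed here to be fixed-size, non-overlapping, count-based windows, which this page does not specify.

```python
from collections import defaultdict


def exact_persistence(stream, window_size):
    """Return {item: number of distinct time windows in which the item appears}.

    Assumption (not from the source): window w covers stream positions
    [w * window_size, (w + 1) * window_size).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows_seen = defaultdict(set)
    for position, item in enumerate(stream):
        # Repeated occurrences inside one window count only once.
        windows_seen[item].add(position // window_size)
    return {item: len(windows) for item, windows in windows_seen.items()}


# Three windows of two items each: (a, b), (a, c), (a, a).
result = exact_persistence(["a", "b", "a", "c", "a", "a"], window_size=2)
```

An exact computation like this stores one set per item, so its memory grows with the number of distinct items; that cost is what a sketch for persistent items trades away in exchange for approximate answers.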

References

Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: IMC. ACM, (2003)
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28(1), 51–55 (2003)
Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng., (2009)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Automata, Languages and Programming. Springer, (2002)
Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proc. ACM SIGMOD, pp. 795–810. ACM, (2015)
Schweller, R., Li, Z., Chen, Y., et al.: Reversible sketches: enabling monitoring and analysis over high-speed data streams. IEEE/ACM Trans. Netw. 15(5), 1059–1072 (2007)
Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 234–247. ACM, (2003)
Li, Y., Miao, R., Kim, C., Yu, M.: Flowradar: a better netflow for data centers. In: USENIX NSDI, pp. 311–324. USENIX Association, (2016)
Dai, H., Shahzad, M., Liu, A.X., Zhong, Y.: Finding persistent items in data streams. Proc. VLDB Endow. 10(4), 289–300 (2016)
Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New streaming algorithms for fast detection of superspreaders. In: NDSS, (2005)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Estan, C., Varghese, G.: New directions in traffic measurement and accounting. ACM SIGCOMM CCR 32(4), (2002)
Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory. Springer, (2005)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proc. VLDB, pp. 346–357, (2002)
Ting, D.: Data sketches for disaggregated subset sum and frequent item estimation. In: SIGMOD Conference, (2018)
Roy, P., Khan, A., Alonso, G.: Augmented sketch: Faster and more accurate stream processing. In: Proc. ACM SIGMOD, (2016)
Yang, D., Li, B., Rettig, L., Cudré-Mauroux, P.: $\text{D}^{2}$ histosketch: discriminative and dynamic similarity-preserving sketching of streaming histograms. IEEE Trans. Knowl. Data Eng. 31(10), 1898–1911 (2018)
Buddhika, T., Malensek, M., Pallickara, S.L., Pallickara, S.: Synopsis: A distributed sketch over voluminous spatiotemporal observational streams. IEEE Trans. Knowl. Data Eng. 29(11), 2552–2566 (2017)
Zhao, B., Li, X., Tian, B., Mei, Z., Wu, W.: Dhs: Adaptive memory layout organization of sketch slots for fast and accurate data stream processing. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2285–2293, (2021)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Yang, T., Gong, J., Zhang, H., Zou, L., Shi, L., Li, X.: Heavyguardian: Separate and guard hot items in data streams. In: SIGKDD, (2018)
Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: A meta-framework for faster and more accurate stream processing. In: SIGMOD Conference, (2018)
Huang, Q., Lee, P.P.: Ld-sketch: A distributed sketching design for accurate and scalable anomaly detection in network data streams. In: IEEE INFOCOM 2014-IEEE Conference on Computer Communications, pp. 1420–1428. IEEE, (2014)
Shokrollahi, A.: Raptor codes. IEEE Trans. Inf. Theory 52(6), 2551–2567 (2006)
Lahiri, B., Chandrashekar, J., Tirthapura, S.: Space-efficient tracking of persistent items in a massive data stream. Stat. Anal. Data Mining 7, 70–92 (2011)
Zhang, Y., Li, J., Lei, Y., Yang, T., Li, Z., Zhang, G., Cui, B.: On-off sketch: a fast and accurate sketch on persistence. Proc. VLDB Endow. 14(2), 128–140 (2020)
Yu, M., Jose, L., Miao, R.: Software defined traffic measurement with opensketch. In: NSDI 2013, (2013)
Tang, L., Huang, Q., Lee, P.P.: Spreadsketch: Toward invertible and network-wide detection of superspreaders. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pp. 1608–1617. IEEE, (2020)
Estan, C., Varghese, G., Fisk, M.: Bitmap algorithms for counting active flows on high speed links. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 153–166, (2003)
Zhao, Y., Han, W., Zhong, Z., Zhang, Y., Yang, T., Cui, B.: Double-anonymous sketch: achieving top-k-fairness for finding global top-k frequent items. Proc. ACM Manag. Data 1(1), 1–26 (2023)
Wang, F., Chen, Q., Li, Y., Yang, T., Tu, Y., Yu, L., Cui, B.: Joinsketch: a sketch algorithm for accurate and unbiased inner-product estimation. Proc. ACM Manag. Data 1(1), 1–26 (2023)
Cormode, G., Garofalakis, M.: Sketching streams through the net: Distributed approximate query tracking. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 13–24, (2005)
Liu, Z., Manousis, A., Vorsanger, G., Sekar, V., Braverman, V.: One sketch to rule them all: Rethinking network flow monitoring with univmon. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 101–114, (2016)
Miao, R., Zhang, Y., Qu, G., Yang, K., Yang, T., Cui, B.: Hyper-uss: Answering subset query over multi-attribute data stream. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1698–1709, (2023)
Zhang, Y., Liu, Z., Wang, R., Yang, T., Li, J., Miao, R., Liu, P., Zhang, R., Jiang, J.: Cocosketch: High-performance sketch-based measurement over arbitrary partial key query. In: Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pp. 207–222, (2021)
Rekhter, Y., Li, T., Hares, S.: A border gateway protocol 4 (bgp-4). Technical report, (2006)
Sobrinho, J.L.: Network routing with path vector protocols: Theory and applications. In: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 49–60, (2003)
Park, K., Lee, H.: On the effectiveness of route-based packet filtering for distributed dos attack prevention in power-law internets. ACM SIGCOMM Comput. Commun. Rev. 31(4), 15–26 (2001)
Murmur hashing source codes. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
Yang, T., Jiang, J., Liu, P., Huang, Q., Gong, J., Zhou, Y., Miao, R., Li, X., Uhlig, S.: Elastic sketch: Adaptive and fast network-wide measurements. In: ACM SIGCOMM, vol. 2018, pp. 561–575 (2018)
Kim, W., Yun, J., Jung, H.: Evaluation of high-frequency financial transaction processing in distributed memory systems. In: Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems, pp. 362–364, (2014)
Zhang, H., Liu, Z., Chen, B., Zhao, Y., Zhao, T., Yang, T., Cui, B.: Cafe: Towards compact, adaptive, and fast embedding for large-scale recommendation models. In: Proceedings of the 2024 ACM International Conference on Management of Data (SIGMOD), (2024)
Flynn, M.: Some computer organizations and their effectiveness. IEEE Trans. Comput. 100, 948–960 (1972)
Li, Y., Wang, F., Yu, X., Yang, Y., Yang, K., Yang, T., Ma, Z., Cui, B., Uhlig, S.: Ladderfilter: Filtering infrequent items with small memory and time overhead. Proc. ACM Manag. Data 1(1), 1–21 (2023)
Liu, Z., Kong, C., Yang, K., Yang, T., Miao, R., Chen, Q., Zhao, Y., Tu, Y., Cui, B.: Hypercalm sketch: One-pass mining periodic batches in data streams. In: 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, (2023)
Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: A meta-framework for faster and more accurate stream processing. In: Proceedings of the 2018 international conference on management of data, pp. 741–756, (2018)
Supplementary materials of wavingsketch. https://github.com/WavingSketch/Waving-Sketch/blob/master/WavingSketch_Supplementary.pdf
Li, J., Li, Z., Xu, Y., Jiang, S., Yang, T., Cui, B., Dai, Y., Zhang, G.: Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1574–1584, (2020)
Powers, D.M.: Applications and explanations of Zipf’s law. In: Proc. EMNLP-CoNLL, Association for Computational Linguistics (1998)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: How good are query optimizers, really? Proc. VLDB Endow. 9(3), 204–215 (2015)
Izenov, Y., Datta, A., Rusu, F., Shin, J.H.: Compass: Online sketch-based query optimization for in-memory databases. In: Proceedings of the 2021 International Conference on Management of Data, pp. 804–816, (2021)
Leis, V., Radke, B., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J. 27(5), 643–668 (2018)
Wang, Y., Yi, K.: Secure yannakakis: Join-aggregate queries over private data. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1969–1981, (2021)
Kutzkov, K., Ahmed, M., Nikitaki, S.: Weighted similarity estimation in data streams. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1051–1060, (2015)
Pruthi, G., Liu, F., Kale, S., Sundararajan, M.: Estimating training data influence by tracing gradient descent. Adv. Neural. Inf. Process. Syst. 33, 19920–19930 (2020)
Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 10–20, (1999)
Ganguly, S., Garofalakis, M., Rastogi, R.: Processing data-stream join aggregates using skimmed sketches. In: International Conference on Extending Database Technology, pp. 569–586. Springer, (2004)
Ganguly, S., Kesh, D., Saha, C.: Practical algorithms for tracking database join sizes. In: International Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 297–309. Springer, (2005)
Rusu, F., Dobra, A.: Sketches for size of join estimation. ACM Trans. Database Syst. 33(3), 1–46 (2008)
Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 187–198, (2007)
Cai, W., Balazinska, M., Suciu, D.: Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In: Proceedings of the 2019 International Conference on Management of Data, pp. 18–35, (2019)
CAIDA [online]. Available: http://www.caida.org/home
Real-life transactional dataset. http://fimi.ua.ac.be/data/
The Network dataset Internet Traces. http://snap.stanford.edu/data/
Rousskov, A., Wessels, D.: High-performance benchmarking with web polygraph. Software: Practice and Experience (2004)
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Committee Data Eng., 36(4), (2015)
Source code related to WavingSketch. https://github.com/WavingSketch/Waving-Sketch

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2022YFB2901504), the National Natural Science Foundation of China (NSFC) (Nos. U20A20179, 62372009, 623B2005), and research grant No. SH-2024JK29.

Author information

Authors and Affiliations

Institute: School of Computer Science, Peking University, Beijing, China
Zirui Liu, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li & Bin Cui
Institute: Carnegie Mellon University, Pittsburgh, USA
Fenghao Dong
Institute: Huawei Theory Lab, Shenzhen, China
Gong Zhang

Corresponding author

Correspondence to Tong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Z., Dong, F., Liu, C. et al. WavingSketch: an unbiased and generic sketch for finding top-k items in data streams. The VLDB Journal 33, 1697–1722 (2024). https://doi.org/10.1007/s00778-024-00869-6

Received: 21 December 2022
Revised: 08 February 2024
Accepted: 28 June 2024
Published: 29 July 2024
Issue Date: September 2024
DOI: https://doi.org/10.1007/s00778-024-00869-6
