Abstract
Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is widely acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch provides unbiased estimation, and we derive its error bound. WavingSketch is generic across measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch is \(10 \times \) faster and has \(10^3 \times \) smaller error when finding frequent items. For the other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).
Notes
Note that our implementation always uses the Heavy Part rearrangement technique for higher processing speed, even when SIMD acceleration is not enabled.
We assume \(\alpha >1\) so that the series \(\sum _{i=1}^{\infty } \frac{1}{i^\alpha }\) converges.
Note that Z is a constant determined by the data stream \(\sigma \).
We formally define the persistence of an item as the number of time windows in which it appears (see Sect. 2.1).
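The definition of persistence in the last note can be illustrated with a small exact computation. The following is a minimal sketch, not WavingSketch itself: it stores every item explicitly, and it assumes the stream has already been split into time windows (how windows are delimited is an assumption here). The function names `persistence` and `top_k_persistent` are hypothetical and do not come from the paper or its repository.

```python
from collections import defaultdict


def persistence(windows):
    """Return, for each item, the number of time windows in which it appears.

    `windows` is an iterable of windows; each window is an iterable of items.
    This is an exact baseline that keeps one counter per distinct item.
    """
    counts = defaultdict(int)
    for window in windows:
        # An item contributes at most 1 per window, however often it repeats.
        for item in set(window):
            counts[item] += 1
    return dict(counts)


def top_k_persistent(windows, k):
    """Return the k items with the largest persistence as (item, count) pairs.

    Ties are broken by the string form of the item so the result is stable.
    """
    counts = persistence(windows)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:k]


if __name__ == "__main__":
    # Three time windows; "a" repeats inside windows but counts once per window.
    stream_windows = [["a", "b", "a"], ["a", "c"], ["b", "a", "a", "a"]]
    print(persistence(stream_windows))
    print(top_k_persistent(stream_windows, 2))
```

In this example the item "a" occurs six times in total but has persistence 3, because persistence counts windows rather than occurrences. This is the distinction between the persistent-items task and the frequent-items task.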
Acknowledgements
This work is supported by the National Key R&D Program of China (No. 2022YFB2901504), the National Natural Science Foundation of China (NSFC) (Nos. U20A20179, 62372009, 623B2005), and research grant No. SH-2024JK29.
About this article
Cite this article
Liu, Z., Dong, F., Liu, C. et al. WavingSketch: an unbiased and generic sketch for finding top-k items in data streams. The VLDB Journal 33, 1697–1722 (2024). https://doi.org/10.1007/s00778-024-00869-6
DOI: https://doi.org/10.1007/s00778-024-00869-6