
WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams

Published: 20 August 2020

Abstract

Finding top-k items in data streams is a fundamental problem in data mining. Existing algorithms that can achieve unbiased estimation suffer from poor accuracy. In this paper, we propose a new sketch, WavingSketch, which is much more accurate than existing unbiased algorithms. WavingSketch is generic, and we show how it can be applied to four applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, and finding top-k Super-Spreaders. We theoretically prove that WavingSketch provides unbiased estimation, and then give an error bound for our algorithm. Our experimental results show that, compared with the state-of-the-art, WavingSketch has 4.50 times higher insertion speed and up to 9 × 10^6 times (2 × 10^4 times on average) lower error rate in finding frequent items when memory size is tight. For other applications, WavingSketch can also achieve up to 286 times lower error rate. All related code is open-sourced and available on GitHub anonymously.
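To make the notion of "unbiased estimation" concrete, the following is a minimal Python sketch of a single row of signed counters, in the style of the classic Count Sketch by Charikar, Chen, and Farach-Colton. It is not the WavingSketch data structure, whose design the abstract does not describe. The names `SignedCounterRow`, `_hash64`, and `mean_estimate` are hypothetical, and the use of BLAKE2b as the hash function is an assumption made only for illustration.

```python
import hashlib


def _hash64(item: str, salt: str) -> int:
    """Deterministic 64-bit hash of (salt, item). Assumption: any well-mixed hash would do."""
    digest = hashlib.blake2b(f"{salt}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class SignedCounterRow:
    """One row of signed counters (Count Sketch style), NOT WavingSketch itself."""

    def __init__(self, width: int, seed: int = 0):
        self.width = width
        self.seed = seed
        self.counters = [0] * width

    def _bucket(self, item: str) -> int:
        # Which counter the item is mapped to.
        return _hash64(item, f"bucket-{self.seed}") % self.width

    def _sign(self, item: str) -> int:
        # Pseudo-random +1 or -1, drawn from a hash independent of the bucket hash.
        return 1 if _hash64(item, f"sign-{self.seed}") & 1 else -1

    def insert(self, item: str, count: int = 1) -> None:
        # Add the item's signed weight to its counter.
        self.counters[self._bucket(item)] += self._sign(item) * count

    def estimate(self, item: str) -> int:
        # Multiplying by the item's own sign recovers its contribution exactly;
        # colliding items contribute +f or -f with equal probability, so their
        # expected contribution is zero and the estimate is unbiased.
        return self._sign(item) * self.counters[self._bucket(item)]


def mean_estimate(stream_counts: dict, item: str, width: int, trials: int) -> float:
    """Average the estimate for `item` over `trials` independently seeded rows."""
    total = 0
    for seed in range(trials):
        row = SignedCounterRow(width, seed)
        for key, count in stream_counts.items():
            row.insert(key, count)
        total += row.estimate(item)
    return total / trials
```

A single row can overestimate or underestimate an item when other items collide with it, but averaging over many independently seeded rows converges to the true frequency. That is the property "unbiased" refers to; the accuracy of any one estimate depends on the variance, which is where the abstract says existing unbiased algorithms fall short.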


      Published In

      KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
      August 2020
      3664 pages
      ISBN: 9781450379984
      DOI: 10.1145/3394486
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data stream mining
      2. top-k item
      3. unbiased estimation
      4. waving counter

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China
      • PKU-Baidu Fund
      • National Natural Science Foundation of China

      Conference

      KDD '20
      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
