DOI: 10.1145/3580305.3599383
research-article

Hyper-USS: Answering Subset Query Over Multi-Attribute Data Stream

Published: 04 August 2023

Abstract

Sketching algorithms are promising solutions for answering approximate queries on massive data streams. In real scenarios, many problems can be abstracted as subset queries over multiple attributes. Existing sketches are designed for queries on a single attribute and are therefore inefficient for queries on multiple attributes. In this work, we propose Hyper-USS, an innovative sketching algorithm that supports subset queries over multiple attributes accurately and efficiently. To the best of our knowledge, this is the first sketching algorithm designed to answer approximate queries over multi-attribute data streams. Our key technique, Joint Variance Optimization, guarantees high estimation accuracy on all attributes. Experimental results show that, compared with state-of-the-art (SOTA) sketches that support subset queries on a single attribute, Hyper-USS improves accuracy by 16.67× and throughput by 8.54×. The code is open-sourced on GitHub (https://github.com/HyperUSS/HyperUSS).
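
To make the problem statement concrete, the following is a minimal sketch of what a subset query over a multi-attribute stream could compute when answered exactly, with no sketching at all. It is not the Hyper-USS algorithm. The stream model (each item is a key plus one numeric value per attribute), the function name `exact_subset_query`, and the toy data are all assumptions made for illustration, not taken from the paper.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

# Assumed stream model: each item is (key, values), with one value per attribute.
Item = Tuple[Hashable, Sequence[float]]


def exact_subset_query(
    stream: Iterable[Item],
    in_subset: Callable[[Hashable], bool],
    num_attrs: int,
) -> List[float]:
    """Return the per-attribute sums over all items whose key is in the subset.

    This is the exact (brute-force) answer; an approximate sketch would
    estimate the same quantities using bounded memory.
    """
    totals = [0.0] * num_attrs
    for key, values in stream:
        if in_subset(key):
            for i in range(num_attrs):
                totals[i] += values[i]
    return totals


# Hypothetical toy stream: key = (source, destination),
# attributes = (packet count, byte count).
toy_stream = [
    (("10.0.0.1", "10.0.0.9"), (1, 100)),
    (("10.0.0.2", "10.0.0.9"), (2, 300)),
    (("10.0.0.1", "10.0.0.7"), (1, 50)),
]

# Subset: every key whose source is 10.0.0.1.
result = exact_subset_query(toy_stream, lambda k: k[0] == "10.0.0.1", 2)
print(result)
```

The exact version must scan every item and keeps no summary, so its cost grows with the stream; that cost is the motivation for answering such queries approximately with a sketch.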

Supplementary Material

MP4 File (rtfp0356-2min-promo.mp4)
Hyper-USS is a novel sketching solution for subset queries over multi-attribute data streams that achieves high accuracy and high processing efficiency.

Cited By

  • (2024) WavingSketch: an unbiased and generic sketch for finding top-k items in data streams. The VLDB Journal, 33:5 (1697-1722). DOI: 10.1007/s00778-024-00869-6. Online publication date: 29-Jul-2024

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN: 9798400701030
DOI: 10.1145/3580305
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multi-attribute data stream
  2. sketch
  3. subset query

Qualifiers

  • Research-article

Funding Sources

  • Key-Area Research and Development Program of Guangdong Province
  • National Natural Science Foundation of China (NSFC)

Conference

KDD '23

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 143
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 20 Jan 2025
