research-article

Hyper-USS: Answering Subset Query Over Multi-Attribute Data Stream

Authors:

Bin Cui

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 1698 - 1709

https://doi.org/10.1145/3580305.3599383

Published: 04 August 2023

Abstract

Sketching algorithms are considered promising solutions for answering approximate queries on massive data streams. In real scenarios, many problems can be abstracted as subset queries over multiple attributes. Existing sketches are designed for queries on a single attribute and are therefore inefficient for queries on multiple attributes. In this work, we propose Hyper-USS, a sketching algorithm that supports subset queries over multiple attributes accurately and efficiently. To the best of our knowledge, it is the first sketching algorithm designed to answer approximate queries over multi-attribute data streams. Its key technique, Joint Variance Optimization, guarantees high estimation accuracy on all attributes. Experimental results show that, compared with state-of-the-art (SOTA) sketches that support subset queries on a single attribute, Hyper-USS improves accuracy by 16.67× and throughput by 8.54×. The code is open-sourced on GitHub: https://github.com/HyperUSS/HyperUSS.
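To make the problem statement concrete, the following is a minimal sketch of what a subset query over a multi-attribute stream computes, written as an exact baseline that stores every key. It is not the Hyper-USS algorithm, whose details are not given on this page. The function name `exact_subset_query`, the `(src, dst)` keys, and the two attributes (packets, bytes) are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, Tuple

# Assumed item layout: (key, [one value per attribute]).
Item = Tuple[Hashable, List[float]]


def exact_subset_query(
    stream: Iterable[Item],
    num_attrs: int,
    in_subset: Callable[[Hashable], bool],
) -> List[float]:
    """Sum every attribute over all keys selected by `in_subset`.

    This exact baseline keeps one counter vector per distinct key, so its
    memory grows with the number of keys. A sketch would approximate the
    same answer in bounded memory.
    """
    totals = defaultdict(lambda: [0.0] * num_attrs)
    for key, values in stream:
        acc = totals[key]
        for i, v in enumerate(values):
            acc[i] += v

    result = [0.0] * num_attrs
    for key, acc in totals.items():
        if in_subset(key):
            for i in range(num_attrs):
                result[i] += acc[i]
    return result


# Hypothetical stream: key = (src, dst), attributes = (packets, bytes).
demo_stream = [
    (("a", "x"), [1, 100]),
    (("b", "x"), [2, 50]),
    (("a", "y"), [1, 10]),
    (("a", "x"), [3, 40]),
]

# Subset query: all keys whose source is "a", answered for both attributes.
answer = exact_subset_query(demo_stream, 2, lambda k: k[0] == "a")
print(answer)
```

Because the subset predicate is chosen only at query time, a single-attribute sketch would need one instance per attribute to answer this query, which is the inefficiency the abstract refers to.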

Supplementary Material

MP4 File (rtfp0356-2min-promo.mp4)

Hyper-USS is a novel sketching solution for subset query over multi-attribute data stream, which achieves high accuracy and high processing efficiency.


Cited By

Liu Z, Dong F, Liu C, Deng X, Yang T, Zhao Y, Li J, Cui B, Zhang G (2024). WavingSketch: an unbiased and generic sketch for finding top-k items in data streams. The VLDB Journal, 33:5 (1697-1722). Online publication date: 29-Jul-2024. https://doi.org/10.1007/s00778-024-00869-6

Index Terms

Hyper-USS: Answering Subset Query Over Multi-Attribute Data Stream
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Data stream mining
2. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2023

5996 pages

ISBN:9798400701030

DOI:10.1145/3580305

General Chairs: Ambuj Singh (UC Santa Barbara, USA), Yizhou Sun (UC Los Angeles, USA)

Program Chairs: Leman Akoglu (Carnegie Mellon University, USA), Dimitrios Gunopulos (University of Athens, Greece), Xifeng Yan (UC Santa Barbara, USA), Ravi Kumar (Google, USA), Fatma Ozcan (Google, USA), Jieping Ye (Alibaba DAMO Academy)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Key-Area Research and Development Program of Guangdong Province
National Natural Science Foundation of China (NSFC)

Conference

KDD '23

KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 6 - 10, 2023

Long Beach, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 1
Total Downloads: 254
Downloads (Last 12 months): 143
Downloads (Last 6 weeks): 5

Reflects downloads up to 20 Jan 2025
