DOI:10.1145/3448016.3452784
research-article

Out of Many We are One: Measuring Item Batch with Clock-Sketch

Published: 18 June 2021

Abstract

An item batch is a consecutive sequence of identical items that arrive close together in time in a data stream. It is a useful data stream pattern in caching, burst detection, APT detection, etc. Basic item batch measurement tasks include membership, cardinality, time span, and size. Currently, no algorithm is tailored for item batch measurement. The greatest challenge lies in accurately estimating the time gap between two consecutive identical items. In this paper, we propose Clock-sketch, a framework that introduces the well-known CLOCK algorithm into item batch measurement. The methodology of Clock-sketch is to clean outdated information as much as possible, while guaranteeing that the information of all items visited within the time window $\mathcal{T}$ is preserved. We conduct experiments on three real-world datasets that exhibit item batch patterns. We compare the accuracy and throughput of Clock-sketch against the state-of-the-art and two naive approaches that do not use the Clock-sketch technique. Results on item batch activeness show that Clock-sketch outperforms the state-of-the-art SWAMP, achieving a false positive rate 50 times lower when memory is small. All source code is open-sourced and released on GitHub.
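To make the idea of cleaning outdated information while preserving everything seen within $\mathcal{T}$ more concrete, the following is a minimal Python sketch of a CLOCK-style filter for the membership (activeness) task. It is an illustration under stated assumptions, not the authors' implementation: the class name `ClockFilter`, its parameters (`n_cells`, `bits`, `k`, `window`), the SHA-256 based hashing, and the sweep rate of `2**bits - 2` passes per window are all choices made for this example. Timestamps are assumed to be non-decreasing integers.

```python
import hashlib


class ClockFilter:
    """Bloom-filter-like array of small counters aged by a sweeping clock hand.

    Hypothetical illustration only; names and parameters are assumptions.
    """

    def __init__(self, n_cells: int, bits: int, k: int, window: int):
        if bits < 2:
            raise ValueError("need at least 2 bits per cell")
        self.n = n_cells
        self.k = k
        self.window = window              # time window T, in integer ticks
        self.max_val = (1 << bits) - 1    # value written on every arrival
        self.cells = [0] * n_cells
        self.steps_done = 0               # total hand steps performed so far

    def _positions(self, key: str):
        # k cell indices derived from salted SHA-256 digests (deterministic).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def _advance(self, t: int) -> None:
        # The hand completes (max_val - 1) full sweeps per window, so a cell
        # written less than T ago has been decremented at most max_val - 1
        # times and is therefore still nonzero.
        target = (t * self.n * (self.max_val - 1)) // self.window
        if target <= self.steps_done:
            return
        # Beyond n * max_val steps every cell is already 0, so cap the work.
        steps = min(target - self.steps_done, self.n * self.max_val)
        for j in range(steps):
            pos = (self.steps_done + j) % self.n
            if self.cells[pos] > 0:
                self.cells[pos] -= 1
        self.steps_done = target

    def is_active(self, key: str, t: int) -> bool:
        """True if key may have been seen within the last T ticks."""
        self._advance(t)
        return all(self.cells[p] > 0 for p in self._positions(key))

    def observe(self, key: str, t: int) -> bool:
        """Record an arrival; return True if it appears to start a new batch."""
        new_batch = not self.is_active(key, t)
        for p in self._positions(key):
            self.cells[p] = self.max_val
        return new_batch


cf = ClockFilter(n_cells=64, bits=2, k=3, window=100)
print(cf.observe("a", 0))      # first arrival of "a"
print(cf.observe("a", 50))     # gap of 50 ticks, inside the window
print(cf.is_active("a", 1000)) # long after the last arrival
```

In this sketch an item seen less than `window` ticks ago is always reported active, while stale items fade out as the hand sweeps past their cells. As with a Bloom filter, hash collisions can make a stale item look active, which is the kind of false positive the abstract's comparison refers to.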

Supplementary Material

MP4 File (3448016.3452784.mp4)

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clock
  2. data stream mining
  3. item batch
  4. sketch

Qualifiers

  • Research-article

Funding Sources

  • FANet: PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications
  • Key-Area Research and Development Program of Guangdong Province
  • National Natural Science Foundation of China (NSFC)

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 62
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 15 Feb 2025

Citations

Cited By

  • (2025) Pandora: An Efficient and Rapid Solution for Persistence-Based Tasks in High-Speed Data Streams. Proceedings of the ACM on Management of Data 3:1 (1-26). DOI:10.1145/3709711. Online publication date: 11-Feb-2025
  • (2025) In Search of a Memory-Efficient Framework for Online Cardinality Estimation. IEEE Transactions on Knowledge and Data Engineering 37:1 (392-407). DOI:10.1109/TKDE.2024.3486571. Online publication date: Jan-2025
  • (2024) A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded Deletions. Proceedings of the ACM on Management of Data 2:6 (1-28). DOI:10.1145/3698799. Online publication date: 20-Dec-2024
  • (2024) Unbiased Real-Time Traffic Sketching. IEEE Transactions on Network Science and Engineering 11:3 (2371-2383). DOI:10.1109/TNSE.2023.3284004. Online publication date: May-2024
  • (2024) Bamboo Filters: Make Resizing Smooth and Adaptive. IEEE/ACM Transactions on Networking 32:5 (3776-3791). DOI:10.1109/TNET.2024.3403997. Online publication date: Oct-2024
  • (2024) A Unified Framework for Mining Batch and Periodic Batch in Data Streams. IEEE Transactions on Knowledge and Data Engineering 36:11 (5544-5561). DOI:10.1109/TKDE.2024.3399024. Online publication date: Nov-2024
  • (2024) Priority Sketch: A Priority-aware Measurement Framework. 2024 International Conference on Satellite Internet (SAT-NET) (18-23). DOI:10.1109/SAT-NET62854.2024.00012. Online publication date: 25-Oct-2024
  • (2024) M4: A Framework for Per-Flow Quantile Estimation. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4787-4800). DOI:10.1109/ICDE60146.2024.00364. Online publication date: 13-May-2024
  • (2024) Newton Sketches: Estimating Node Intimacy in Dynamic Graphs Using Newton's Law of Cooling. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2904-2916). DOI:10.1109/ICDE60146.2024.00225. Online publication date: 13-May-2024
  • (2024) Online Detection of Outstanding Quantiles with QuantileFilter. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (831-844). DOI:10.1109/ICDE60146.2024.00069. Online publication date: 13-May-2024
