Review article
OrderSketch: An Unbiased and Fast Sketch for Frequency Estimation of Data Streams
Introduction
In many big data scenarios, such as real-time IP traffic, web clicks, and sensor measurements, massive data are generated as high-rate streams [2], [3], [4], [5]. Estimating the frequency of each distinct item in a data stream is one of the most fundamental problems [6], [7], [8]. Because of their high speed and massive volume, data streams are often processed in a single pass, and exact computation of item frequencies is impractical. Approximate flow frequency estimation has therefore become popular and widely accepted [9], [10], [11].
As probabilistic data structures, sketches are widely used to estimate item frequency in data streams because they provide rigorous accuracy guarantees [12], [13], [14], [15], [16], [17]. Sketches summarize massive data streams within a limited space by using multiple hash functions. However, existing sketches still have several problems. First, they are not fast enough in many scenarios because they require too many hash computations and memory accesses. For example, Count-Min Sketch and Count Sketch deployed in software switches do not reach 10 million packets per second (10 Mpps). This is not surprising, because sketches are theoretically designed for memory efficiency and are not optimized for speed [2], [18]. Second, many extended sketches achieve high accuracy but require more complicated configuration, which is a heavy burden for users [7], [19]. Third, existing sketches can hardly guarantee unbiased estimation, which is beneficial for various global estimation problems, such as global heavy hitters, global distribution, and global entropy [13], [20].
In summary, these problems require a sketch to be fast, generic, and accurate. First, to be fast, the processing time of each packet should be constant and small. Second, to be generic and convenient for users, the sketch should be simple to configure and deploy. Third, to be accurate, the error rate should be small for a given amount of memory.
Toward these design goals, we propose OrderSketch in this paper. Before describing our approach, we briefly describe how the conventional Count-Min sketch [6] works, because several design choices in OrderSketch build on it. As shown in Fig. 1, the Count-Min sketch is a two-dimensional array of counters with d rows. For each incoming item, Count-Min uses d independent hash functions, one per row, to compute d insert positions and increments the d counters at those positions. We found that these multiple hash computations and insert operations greatly slow down insertion, which inspired our work. The first key idea of OrderSketch is to use an ordered strategy to map each arriving packet to a unique row. Specifically, to reduce the burden of hashing and memory access, we no longer map each item to all rows of the sketch but to one specific row chosen by the ordered strategy. The proposed ordered hash strategy ensures that randomly arriving items are distributed approximately evenly across the rows, and it also supports deletion operations to meet the needs of some special tasks.
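For reference, the conventional Count-Min update described above, together with its standard minimum query, can be sketched as follows. This is a minimal illustration and not the authors' C++ implementation: the class name `CountMinSketch`, the width parameter `w`, and the use of a row-salted CRC32 in place of d independent hash functions are all assumptions made for the example.

```python
import zlib


class CountMinSketch:
    """Conventional Count-Min sketch: d rows of w counters, one hash per row."""

    def __init__(self, d: int, w: int) -> None:
        self.d = d  # number of rows (and hash functions)
        self.w = w  # counters per row (hypothetical parameter name)
        self.rows = [[0] * w for _ in range(d)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: CRC32 salted with the row number stands in for the
        # d independent hash functions.
        return zlib.crc32(f"{row}:{item}".encode()) % self.w

    def insert(self, item: str, count: int = 1) -> None:
        # One hash computation and one memory access per row: d of each.
        for r in range(self.d):
            self.rows[r][self._index(r, item)] += count

    def query(self, item: str) -> int:
        # Collisions only ever add to a counter, so the minimum over the
        # d mapped counters never falls below the true frequency.
        return min(self.rows[r][self._index(r, item)] for r in range(self.d))


cm = CountMinSketch(d=3, w=64)
for flow in ["a", "b", "a", "c", "a"]:
    cm.insert(flow)
print(cm.query("a"))  # at least 3, the true frequency of "a"
```

The loop in `insert` makes the per-item cost explicit: d hash computations and d counter updates, which is the overhead that mapping each item to a single row is meant to avoid.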
Additionally, when querying the frequency of an item, Count-Min computes the d hash functions and returns the smallest value among the d hashed counters. However, because Count-Min uses a compact data structure to process massive amounts of data, collisions are inevitable and lead to overestimation of flow frequency. The second key idea is to estimate the noise that collisions add to a counter and subtract it at query time, so that the frequency estimate of an item is unbiased.
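To make the idea of noise subtraction concrete, here is a generic illustration and not the estimator defined in this paper, whose exact formula is not given in this excerpt. It assumes a single row of w counters, a uniform hash, and a running total N of all inserted counts. Under those assumptions a counter C that an item with true frequency f maps to has expectation f + (N - f)/w, so (w·C - N)/(w - 1) is an unbiased estimate of f. The class name `NoiseCorrectedRow` and the CRC32 hash are hypothetical choices for the example.

```python
import zlib


class NoiseCorrectedRow:
    """One row of w counters with a collision-noise correction at query time."""

    def __init__(self, w: int) -> None:
        self.w = w
        self.counters = [0] * w
        self.total = 0  # N: sum of all inserted counts

    def _index(self, item: str) -> int:
        # Assumption: CRC32 stands in for a uniform hash function.
        return zlib.crc32(item.encode()) % self.w

    def insert(self, item: str, count: int = 1) -> None:
        # A single hash computation and a single counter update per item.
        self.counters[self._index(item)] += count
        self.total += count

    def query(self, item: str) -> float:
        # E[C] = f + (N - f) / w under uniform hashing, so solving for f
        # gives the unbiased estimate (w * C - N) / (w - 1).
        c = self.counters[self._index(item)]
        return (self.w * c - self.total) / (self.w - 1)


row = NoiseCorrectedRow(w=4)
row.insert("a", 10)
print(row.query("a"))  # 10.0: with no other items there is no noise to remove
```

Unlike the Count-Min minimum, an estimate of this kind can fall below the true frequency and can even be negative for rare items; that is the price of removing the systematic overestimation.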
The main contributions of this paper are as follows.
1. We propose a novel sketch for flow frequency estimation, namely OrderSketch, which is fast and generic and provides unbiased estimation.
2. We theoretically prove that OrderSketch provides unbiased estimation and give an error bound for our algorithm.
3. We implement SimpleHash, CM sketch, CU sketch, Count sketch, Pyramid sketch, ColdFilter, and OrderSketch, and carry out extensive experiments to evaluate and compare their performance. Experimental results show that OrderSketch outperforms the other sketches by up to 3 times in insertion speed.
Section snippets
Related work
Sketches have been widely applied to estimating item frequency in data streams. Typical sketches include Count-Min (CM) [6], CU [21], Count [22], SimpleHash [23], and others. A comprehensive survey of sketch algorithms is provided in [24]. These sketches usually consist of arrays of counters, and each array has its own independent hash function for updating its counters. When inserting an item, CM increments all the mapped counters. Because hash collisions
The OrderSketch design
OrderSketch is a novel sketch data structure for frequency estimation built on two key techniques. In this section, we first present the main idea of OrderSketch, then introduce its data structure, and finally show its basic operations (insert and query). The symbols frequently used in this paper and their meanings are listed in Table 1.
Formal analysis of OrderSketch
In this section, we prove that OrderSketch provides an unbiased frequency estimate for each item, and then we give the error bound of OrderSketch.
Experimental setup
Implementation: We have implemented CM sketch, CU sketch, Count sketch, Pyramid sketch, ColdFilter, SimpleHash, and our OrderSketch in C++. The hash functions are implemented using the 32-bit Bob Hash, obtained from an open-source website [29].
Traces: We use a one-hour public traffic trace collected at the Equinix-Chicago monitor from CAIDA [30]. As the default trace, we use the CAIDA trace with a monitoring time interval of 5 s, which contains 1.1M to 2.8M packets with 60K to 110K flows
Conclusion
The sketch is a probabilistic data structure that is widely used to store and query the frequency of each distinct item in data streams. In this paper, we propose a new sketch called OrderSketch. It provides unbiased estimation, and its insertion speed is about 3 times faster than that of existing algorithms. We believe our sketch can be applied in various fields, especially those with demanding processing-speed requirements, such as software switches and data centers. In the future, we will deploy
CRediT authorship contribution statement
Lu Jie: Data curation, Writing – original draft, Methodology. Chen Hongchang: Investigation. Sun Penghao: Conceptualization, Methodology. Hu Tao: Visualization. Zhang Zhen: Methodology.
Declaration of Competing Interest
We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work.
Lu Jie (corresponding author) received the bachelor's degree from Zhejiang University in 2016. He is currently pursuing a doctoral degree at Information Engineering University, Zhengzhou, China. His research interests include software-defined networking, network management, and network security.
References (30)
- et al. An improved data stream summary: The count-min sketch and its applications. J. Algorithms (2005)
- “Source codes of OrderSketch and related sketches.” [Online]. Available:...
- NitroSketch: Robust and General Sketch-based Monitoring in Software Switches
- et al. Per-flow traffic measurement through randomized counter sharing. IEEE/ACM Trans. Netw. (2012)
- et al. gSketch: On query estimation in graph streams. Proc. VLDB Endow. (2011)
- et al. Robust aggregation algorithm in sensor networks. Ruan Jian Xue Bao/Journal Softw. (2009)
- et al. Pyramid sketch: A sketch framework for frequency estimation of data streams. Proc. VLDB Endow. (2017)
- SF-Sketch: A Two-Stage Sketch for Data Streams. IEEE Trans. Parallel Distrib. Syst. (2020)
- Fine-grained probability counting: Refined loglog algorithm
- et al. Hash-based techniques for high-speed packet processing. Algorithms for Next Generation Networks (2010)
- On efficient query processing of stream counts on the cell processor
- Augmented Sketch: Faster and More Accurate Stream Processing. SIGMOD
- WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams. SIGKDD
- HeavyKeeper: An accurate algorithm for finding top-k elephant flows
- One sketch to rule them all: Rethinking network flow monitoring with UnivMon
Chen Hongchang is a professor, deputy director of the National Digital Switching System Engineering Technology Research Center, and leader of the innovation team of the National Science and Technology Progress Award for Network Communication and Switching Technology. His main research areas are cyberspace security, big data, and artificial intelligence.
Sun Penghao received the bachelor's degree from Information Engineering University in 2014. He is currently pursuing the Ph.D. degree at Information Engineering University, Zhengzhou, China. His research interests include software-defined networking, network management, and AI.
Hu Tao received the bachelor's degree from Xi'an Jiaotong University in 2015. He is currently pursuing the Ph.D. degree at Information Engineering University, Zhengzhou, China. His research interests include software-defined networking, DDoS, and network security.
Zhang Zhen is currently an associate professor with the National Digital Switching System Engineering and Technological Research and Development Center. His research interests include network measurement and network management.