ABSTRACT
Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory in a single pass.
We design cache-resident, shared-nothing structures that hold only the most frequent elements. Our algorithm works in three phases. It first samples the input to pick heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these candidates. Finally, a validation step identifies the true heavy hitters from among the candidates.
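The three phases above can be sketched in scalar pseudocode. This is a minimal illustration, not the paper's cache-resident implementation: the function name, parameters (`sample_size`, `capacity`), and the use of a plain dictionary in place of the compact hash table are our assumptions.

```python
from collections import Counter
import random

def heavy_hitters(stream, threshold, sample_size=1000, capacity=64, seed=0):
    """Three-phase heavy hitter aggregation (illustrative sketch only)."""
    stream = list(stream)
    rng = random.Random(seed)

    # Phase 1: sample the input and pick the most frequent sample
    # values as heavy hitter candidates (at most `capacity` of them).
    sample = [rng.choice(stream) for _ in range(sample_size)]
    candidates = [k for k, _ in Counter(sample).most_common(capacity)]

    # Phase 2: one pass over the data, aggregating exactly -- but only
    # for the candidates (stands in for the cache-resident hash table).
    table = {k: 0 for k in candidates}
    for item in stream:
        if item in table:
            table[item] += 1

    # Phase 3: validation -- keep only candidates whose exact count
    # reaches the frequency threshold.
    n = len(stream)
    return {k: c for k, c in table.items() if c >= threshold * n}
```

On a skewed input, the candidate set stays small enough that the exact aggregation in Phase 2 touches only a compact, cache-sized table.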
We identify trade-offs between the hash table configuration and performance. A configuration consists of the probing algorithm and the table capacity, which determines how many candidates can be aggregated. The probing algorithm can be perfect hashing, cuckoo hashing, or bucketized hashing, each offering a different trade-off between size and speed.
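The probing trade-off can be illustrated with a toy cuckoo table: each key has two possible slots, so a lookup probes at most two fixed locations regardless of table occupancy. This is a sketch under our own assumptions (hash functions, capacity, eviction limit), not the paper's design.

```python
class CuckooTable:
    """Toy two-choice cuckoo hash table for integer keys (illustrative)."""

    def __init__(self, capacity=16):
        self.cap = capacity
        self.slots = [None] * capacity  # each entry is a (key, value) pair

    def _h1(self, key):
        return key % self.cap

    def _h2(self, key):
        # Knuth-style multiplicative hash; choice is arbitrary here.
        return (key * 2654435761 >> 4) % self.cap

    def get(self, key):
        # A lookup touches at most two fixed locations.
        for idx in (self._h1(key), self._h2(key)):
            entry = self.slots[idx]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value, max_kicks=32):
        entry = (key, value)
        idx = self._h1(key)
        for _ in range(max_kicks):
            if self.slots[idx] is None or self.slots[idx][0] == entry[0]:
                self.slots[idx] = entry
                return True
            # Evict the occupant and move it to its alternate slot.
            entry, self.slots[idx] = self.slots[idx], entry
            k = entry[0]
            idx = self._h2(k) if idx == self._h1(k) else self._h1(k)
        return False  # table too full; a real system would resize or rebuild
```

The bounded probe count is what makes cuckoo hashing attractive for a cache-resident table: lookups never degrade into long chains, at the cost of occasional insert-time evictions.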
We optimize performance using SIMD instructions in novel ways, beyond single vectorized operations, to minimize cache accesses and the instruction footprint.