DOI: 10.1145/2485278.2485284
research-article

High throughput heavy hitter aggregation for modern SIMD processors

Published: 24 June 2013

ABSTRACT

Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory in a single pass.

We design cache-resident, shared-nothing structures that hold only the most frequent elements. Our algorithm works in three phases. It first samples and picks heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these elements. Finally, a validation step identifies the true heavy hitters from among the candidates.
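The three phases can be illustrated with a minimal single-threaded sketch. It is not the authors' implementation: the function name, the parameters (`capacity`, `sample_size`, `min_fraction`), the use of a Python dict in place of a cache-resident hash table, and the frequency threshold used for validation are all assumptions made for illustration.

```python
import random
from collections import Counter


def heavy_hitter_aggregates(rows, capacity, sample_size, min_fraction, seed=0):
    """Return {key: (count, total)} for candidates that pass validation.

    rows is a list of (key, value) pairs. All parameter names are
    hypothetical and chosen only for this sketch.
    """
    rng = random.Random(seed)

    # Phase 1: sample the input and keep the most frequent sampled keys
    # as heavy hitter candidates (at most `capacity` of them).
    sample = rng.sample(rows, min(sample_size, len(rows)))
    sample_counts = Counter(key for key, _ in sample)
    candidates = [key for key, _ in sample_counts.most_common(capacity)]

    # Phase 2: one pass over the full input, computing exact aggregates
    # (count and sum) for the candidates only.
    table = {key: [0, 0] for key in candidates}
    for key, value in rows:
        slot = table.get(key)
        if slot is not None:
            slot[0] += 1
            slot[1] += value
        # Non-candidate tuples are ignored here; a complete system would
        # have to process them separately.

    # Phase 3: validate candidates against their exact counts
    # (assumed criterion: at least min_fraction of all rows).
    threshold = min_fraction * len(rows)
    return {k: (c, s) for k, (c, s) in table.items() if c >= threshold}
```

In this sketch the candidate table is bounded by `capacity`, which mirrors the idea that only a small number of keys need to stay in fast memory during the pass.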

We identify trade-offs between the hash table configuration and performance. A configuration consists of the probing algorithm and the table capacity, which determines how many candidates can be aggregated. We consider perfect hashing, cuckoo hashing, and bucketized hashing as probing algorithms to explore trade-offs between size and speed.
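As one illustration of these options, the sketch below builds a small cuckoo-hashed candidate table and aggregates sums with at most two probes per key. The toy hash functions `h1` and `h2`, the helper names, and the `max_kicks` limit are assumptions for this example and are not taken from the paper; keys are assumed to be distinct non-negative integers.

```python
def h1(key, n):
    # Toy first hash function (assumption): low-order part of the key.
    return key % n


def h2(key, n):
    # Toy second hash function (assumption): next-higher part of the key.
    return (key // n) % n


def build_cuckoo(candidates, n, max_kicks=32):
    """Place distinct integer keys into n slots; return None on failure."""
    slots = [None] * n
    for key in candidates:
        cur, pos = key, h1(key, n)
        for _ in range(max_kicks):
            if slots[pos] is None:
                slots[pos] = cur
                break
            # Evict the resident key and move it to its alternate slot.
            slots[pos], cur = cur, slots[pos]
            pos = h2(cur, n) if pos == h1(cur, n) else h1(cur, n)
        else:
            return None  # too many evictions; caller may retry with a larger n
    return slots


def probe(slots, key):
    """Return the slot index holding key, checking at most two positions."""
    n = len(slots)
    for pos in (h1(key, n), h2(key, n)):
        if slots[pos] == key:
            return pos
    return None


def aggregate(slots, rows):
    """Sum values per candidate key; rows whose key is not in the table are skipped."""
    sums = [0] * len(slots)
    for key, value in rows:
        pos = probe(slots, key)
        if pos is not None:
            sums[pos] += value
    return {slots[i]: sums[i] for i in range(len(slots)) if slots[i] is not None}
```

A lookup touches at most two slots regardless of how full the table is, which is the property that makes cuckoo hashing attractive for a fixed-size table; the cost is that building the table can fail and require a retry.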

We optimize performance with SIMD instructions, used in novel ways beyond single vectorized operations, to minimize cache accesses and the instruction footprint.


Published in

    DaMoN '13: Proceedings of the Ninth International Workshop on Data Management on New Hardware
    June 2013
    65 pages
    ISBN: 9781450321969
    DOI: 10.1145/2485278

    Copyright © 2013 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    Overall Acceptance Rate: 80 of 102 submissions, 78%
