FAST: fast architecture sensitive tree search on modern CPUs and GPUs

Authors:
Changkyu Kim (Intel Corporation, Santa Clara, CA, USA)
Jatin Chhugani (Intel Corporation, Santa Clara, CA, USA)
Nadathur Satish (Intel Corporation, Santa Clara, CA, USA)
Eric Sedlar (Oracle Corporation, Redwood Shores, CA, USA)
Anthony D. Nguyen (Intel Corporation, Santa Clara, CA, USA)
Tim Kaldewey (Oracle Corporation, Redwood Shores, CA, USA)
Victor W. Lee (Intel Corporation, Santa Clara, CA, USA)
Scott A. Brandt (University of California at Santa Cruz, Santa Cruz, CA, USA)
Pradeep Dubey (Intel Corporation, Santa Clara, CA, USA)

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
Pages 339–350
https://doi.org/10.1145/1807167.1807206

Published: 06 June 2010

ABSTRACT

In-memory tree-structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join, and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal.

In this paper, we present FAST, an extremely fast architecture-sensitive layout of the index tree. FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware. FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X (GPU) faster than the best previously reported performance on the same architectures. FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys and naturally integrates compression techniques, overcoming the memory bandwidth bottleneck and achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.
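
To make the idea of a "logically organized" binary tree more concrete, the following is a minimal sketch of a single level of blocking, not the paper's implementation. It assumes that each block holds a small complete binary subtree (3 keys for a block depth of 2, roughly what one 128-bit SIMD register of 32-bit keys could hold) stored contiguously, and that blocks are laid out breadth-first. The names `bfs_order`, `build_blocked`, and `search` are hypothetical, the comparisons are scalar rather than SIMD, and the further cache-line and page-level blocking described in the abstract is omitted.

```python
from collections import deque

INF = float("inf")  # padding sentinel (assumption: all real keys are finite)


def bfs_order(m):
    """Indices 0..m-1 of a sorted run, listed in breadth-first order of a
    complete binary search tree built over it (m is assumed to be 2^d - 1)."""
    order, queue = [], deque([(0, m)])
    while queue:
        lo, hi = queue.popleft()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        queue.append((lo, mid))
        queue.append((mid + 1, hi))
    return order


def build_blocked(sorted_keys, block_depth=2):
    """Return (layout, levels): a flat list in which every block of
    2^block_depth - 1 keys is one small binary subtree, blocks stored
    breadth-first, plus the number of block levels."""
    fanout = 1 << block_depth              # children per block
    levels = 1
    while fanout ** levels - 1 < len(sorted_keys):
        levels += 1
    padding = fanout ** levels - 1 - len(sorted_keys)
    padded = list(sorted_keys) + [INF] * padding
    inner = bfs_order(fanout - 1)          # in-block key order
    layout, queue = [], deque([(0, levels)])
    while queue:
        lo, lv = queue.popleft()
        stride = fanout ** (lv - 1)        # one child subtree plus one separator
        seps = [padded[lo + (j + 1) * stride - 1] for j in range(fanout - 1)]
        layout.extend(seps[i] for i in inner)
        if lv > 1:
            for j in range(fanout):
                queue.append((lo + j * stride, lv - 1))
    return layout, levels


def search(layout, levels, query, block_depth=2):
    """Return the number of keys strictly less than query."""
    fanout = 1 << block_depth
    keys_per_block = fanout - 1
    node, rank = 0, 0
    for lv in range(levels, 0, -1):
        base = node * keys_per_block       # all keys of this block are adjacent
        pos = 0
        for _ in range(block_depth):       # binary descent inside the block
            pos = 2 * pos + (2 if query > layout[base + pos] else 1)
        child = pos - keys_per_block       # which child block to visit next
        rank += child * fanout ** (lv - 1)
        node = node * fanout + 1 + child
    return rank


keys = list(range(0, 200, 2))
layout, levels = build_blocked(keys)
print(search(layout, levels, 7))           # count of keys smaller than 7
```

In this sketch each block touches one contiguous run of keys, so a wider block depth trades more comparisons per memory access for fewer accesses overall. Applying the same grouping again at cache-line and page granularity is the part of the design that this example leaves out.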

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN:9781450300322
DOI:10.1145/1807167
General Chair: Ahmed Elmagarmid (Purdue University, USA)
Program Chair: Divyakant Agrawal (University of California at Santa Barbara, USA)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
compression
cpu
data-level parallelism
gpu
thread-level parallelism
tree search
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 212
- Total Downloads: 3,002
- Downloads (Last 12 months): 165
- Downloads (Last 6 weeks): 17