DOI: 10.1145/2723372.2737787
research-article

A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew

Published: 27 May 2015

Abstract

In-memory data analytic systems that use vertical bit-parallel scan methods generally use encoding techniques. We observe that in such environments, there is an opportunity to turn skew in both the data and predicate distributions (usually a problem for query processing) into a benefit that can be leveraged to encode the column values. This paper proposes a padded encoding scheme to address this opportunity. The proposed scheme creates encodings that map common attribute values to codes that can easily be distinguished from other codes by only examining a few bits in the full code. Consequently, scans on columns stored using the padded encoding scheme can safely prune the computation without examining all the bits in the code, thereby reducing the memory bandwidth and CPU cycles that are consumed when evaluating scan queries. Our padded encoding method results in a fixed-length encoding, as fixed-length encodings are easier to manage. However, the proposed padded encoding may produce longer (fixed-length) codes than those produced by popular order-preserving encoding methods, such as dictionary-based encoding. This additional space overhead has the potential to negate the gains from early pruning of the scan computation. However, as we demonstrate empirically, the additional space overhead is generally small, and the padded encoding scheme provides significant performance improvements.
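The early-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the code table `CODES`, the prefix metadata `PREFIX_LEN`, and the helpers `to_bit_slices` and `scan_equal` are hypothetical names chosen for this example, and the code assignment is made by hand. The sketch stores a column vertically (one machine word per bit position) and shows that an equality scan for a common value whose code has a unique leading prefix can stop after reading only the prefix bits, while a scan without that knowledge reads every bit slice.

```python
# Hypothetical 4-bit padded codes. The common value "US" is the only code
# starting with bit 1, so its 1-bit prefix identifies it; the rest of its
# code is zero padding. A plain dictionary encoding of these five values
# would need only 3 bits, so the padding costs one extra bit per value.
WIDTH = 4
CODES = {"CA": 0b0000, "DE": 0b0001, "FR": 0b0010, "JP": 0b0011, "US": 0b1000}
PREFIX_LEN = {"US": 1}  # assumed encoder metadata: bits needed to identify a value


def to_bit_slices(codes, width):
    """Vertical layout: slices[i] packs bit i (most significant first) of every code."""
    slices = []
    for i in range(width):
        shift = width - 1 - i
        word = 0
        for row, code in enumerate(codes):
            word |= ((code >> shift) & 1) << row
        slices.append(word)
    return slices


def scan_equal(slices, n_rows, const, width, prefix_len=None):
    """Evaluate `column == const`; return (match bitmap, number of slices read)."""
    all_rows = (1 << n_rows) - 1
    candidates = all_rows
    limit = width if prefix_len is None else prefix_len
    slices_read = 0
    for i in range(limit):
        const_bit = (const >> (width - 1 - i)) & 1
        word = slices[i]
        # Keep only the rows whose bit i agrees with the constant's bit i.
        candidates &= word if const_bit else (~word & all_rows)
        slices_read += 1
        if candidates == 0:  # no row can match any more
            break
    return candidates, slices_read


column = ["US", "US", "CA", "US", "DE", "US", "FR", "JP"]
slices = to_bit_slices([CODES[v] for v in column], WIDTH)

pruned = scan_equal(slices, len(column), CODES["US"], WIDTH, PREFIX_LEN["US"])
full = scan_equal(slices, len(column), CODES["US"], WIDTH)
matching_rows = [r for r in range(len(column)) if (pruned[0] >> r) & 1]
print(matching_rows, "slices read:", pruned[1], "vs", full[1])
```

Both scans produce the same match bitmap, but the pruned scan touches one bit slice instead of four, which is the source of the memory-bandwidth and CPU savings the abstract describes. The extra code bit relative to a 3-bit dictionary encoding is the space overhead the abstract warns about.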



    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. analytics
    2. bit-parallel
    3. encoding
    4. scan
    5. skew

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
