research-article

Open access

Automated Multidimensional Data Layouts in Amazon Redshift

Authors:

Sanghita Bandyopadhyay,

Luciano Di Palma,

Orestis Polychroniou,

Tim KraskaAuthors Info & Claims

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

Pages 55 - 67

https://doi.org/10.1145/3626246.3653379

Published: 09 June 2024 Publication History

Abstract

Analytic data systems typically use data layouts to improve the performance of scanning and filtering data. Common data layout techniques include single-column sort keys, compound sort keys, and more complex multidimensional data layouts such as the Z-order. An appropriately-selected data layout over a table, in combination with metadata such as zone maps, enables the system to skip irrelevant data blocks when scanning the table, which reduces the amount of data scanned and improves query performance.

In this paper, we introduce Multidimensional Data Layouts (MDDL), a new data layout technique which outperforms existing data layout techniques for query workloads with repetitive scan filters. Unlike existing data layout approaches, which typically sort tables based on columns, MDDL sorts tables based on a collection of predicates, which enables a much higher degree of specialization to the user's workload. We additionally introduce an algorithm for automatically learning the best MDDL for each table based on telemetry collected from the historical workload. We implemented MDDL within Amazon Redshift. Benchmarks on internal datasets and workloads show that MDDL achieves up to 85% reduction in end-to-end workload runtime compared to using traditional column-based data layout techniques. MDDL is, to the best of our knowledge, the first data layout technique in a commercial product that sorts based on predicates and automatically learns the best predicates.

References

[1]

[n.d.]. ALTER TABLE APPEND - Amazon Redshift-docs.aws.amazon.com. https: //docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html. [Accessed 25--11--2023].

[2]

[n.d.]. Amazon Redshift Advisor recommendations - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dgadvisorrecommendations.html. [Accessed 20--11--2023].

[3]

[n.d.]. Amazon Redshift Engineering's Advanced Table Design Playbook: Compound and Interleaved Sort Keys | Amazon Web Services - aws.amazon.com. https://aws.amazon.com/blogs/big-data/amazon-redshift-engineeringsadvanced- table-design-playbook-compound-and-interleaved-sort-keys/. [Accessed 28--11--2023].

[4]

[n.d.]. Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort - aws.amazon.com. https://aws.amazon.com/aboutaws/ whats-new/2019/11/amazon-redshift-introduces-automatic-table-sortalternative- vacuum-sort/. [Accessed 20--11--2023].

[5]

[n.d.]. Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering - databricks.com. https://www.databricks.com/blog/announcing-deltalake- 30-new-universal-format-and-liquid-clustering. [Accessed 28--11--2023].

[6]

[n.d.]. Automated materialized views - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-automv. html. [Accessed 28--11--2023].

[7]

[n.d.]. Clustering Keys & Clustered Tables | Snowflake Documentation - docs.snowflake.com. https://docs.snowflake.com/en/user-guide/tablesclustering- keys. [Accessed 28--11--2023].

[8]

[n.d.]. Data skipping with Z-order indexes for Delta Lake - docs.databricks.com. https://docs.databricks.com/en/delta/data-skipping.html. [Accessed 28--11--2023].

[9]

[n.d.]. DataWarehousing Guide - docs.oracle.com. https://docs.oracle.com/en/ database/oracle/oracle-database/19/dwhsg/using-zone-maps.html. [Accessed 28--11--2023].

[10]

[n.d.]. Indexing Expressions - docs.oracle.com. https://docs.oracle.com/ cd/E16338_01/appdev.112/e14919/exprn_indexexpressions.htm. [Accessed 28--11--2023].

[11]

[n.d.]. Micro-partitions & Data Clustering | Snowflake Documentation - docs.snowflake.com. https://docs.snowflake.com/en/user-guide/tablesclustering- micropartitions. [Accessed 28--11--2023].

[12]

[n.d.]. Partitioning tables in dedicated SQL pool - Azure Synapse Analytics - learn.microsoft.com. https://learn.microsoft.com/en-us/azure/synapseanalytics/ sql-data-warehouse/sql-data-warehouse-tables-partition. [Accessed 28--11--2023].

[13]

[n.d.]. SUPER type - Amazon Redshift - docs.aws.amazon.com. https://docs.aws. amazon.com/redshift/latest/dg/r_SUPER_type.html. [Accessed 21--11--2023].

[14]

[n.d.]. Working with automatic table optimization - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dg/t_ Creating_tables.html. [Accessed 20--11--2023].

[15]

Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. 2000. Automated Selection of Materialized Views and Indexes in SQL Databases. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 496--505.

Digital Library

[16]

Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill. 2002. Data Page Layouts for Relational Databases on Deep Memory Hierarchies. The VLDB Journal 11, 3 (nov 2002), 198--215. https://doi.org/10.1007/s00778-002-0074--9

Digital Library

[17]

Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Polychroniou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Subramanian, and Doug Terry. 2022. Amazon Redshift Re-Invented. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 2205--2217. https://doi.org/10.1145/3514221.3526045

Digital Library

[18]

Nemanja Boric, Hinnerk Gildhoff, Menelaos Karavelas, Ippokratis Pandis, and Ioanna Tsalouchidou. 2020. Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 2781--2784. https://doi.org/10.1145/3318464.3384704

Digital Library

[19]

Nicolas Bruno and Surajit Chaudhuri. 2005. Automatic Physical Database Tuning: A Relaxation-Based Approach. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (Baltimore, Maryland) (SIGMOD '05). Association for Computing Machinery, New York, NY, USA, 227--238. https://doi.org/10.1145/1066157.1066184

Digital Library

[20]

Jialin Ding, Ryan Marcus, Andreas Kipf, Vikram Nathan, Aniruddha Nrusimha, Kapil Vaidya, Alexander van Renen, and Tim Kraska. 2022. SageDB: An Instance-Optimized Data Analytics System. Proc. VLDB Endow. 15, 13 (2022), 4062--4078. https://www.vldb.org/pvldb/vol15/p4062-ding.pdf

Digital Library

[21]

Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, ChiWang, Yinan Li, Ying Li, DonaldKossmann, Johannes Gehrke, and Tim Kraska. 2021. Instance-Optimized Data Layouts for CloudAnalyticsWorkloads. Association for Computing Machinery, New York, NY, USA, 418--431. https://doi.org/10.1145/3448016.3457270

Digital Library

[22]

Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proc. VLDB Endow. 14, 2 (Oct. 2020), 74--86. https://doi.org/10.14778/3425879.3425880

Digital Library

[23]

Guido Moerkotte. 1998. Small Materialized Aggregates: A LightWeight Index Structure for DataWarehousing. In Proceedings of the 24rd International Conference on Very Large Data Bases (VLDB '98). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 476--487.

[24]

Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-dimensional Indexes. In Proceedings of the 2020 International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3318464.3380579

Digital Library

[25]

Panos Parchas, Yonatan Naamad, Peter Van Bouwel, Christos Faloutsos, and Michalis Petropoulos. 2020. Fast and Effective Distribution-Key Recommendation for Amazon Redshift. Proc. VLDB Endow. 13, 12 (jul 2020), 2411--2423. https://doi.org/10.14778/3407790.3407834

Digital Library

[26]

Gaurav Saxena, Mohammad Rahman, Naresh Chainani, Chunbin Lin, George Caragea, Fahim Chowdhury, Ryan Marcus, Tim Kraska, Ippokratis Pandis, and Balakrishnan (Murali) Narayanaswamy. 2023. Auto-WLM: Machine Learning EnhancedWorkload Management in Amazon Redshift. In Companion of the 2023 International Conference on Management of Data (Seattle, WA, USA) (SIGMOD '23). Association for Computing Machinery, New York, NY, USA, 225--237. https://doi.org/10.1145/3555041.3589677

Digital Library

[27]

Sivaprasad Sudhir, Michael Cafarella, and Samuel Madden. 2021. Replicated Layout for In-Memory Database Systems. Proc. VLDB Endow. 15, 4 (dec 2021), 984--997. https://doi.org/10.14778/3503585.3503606

Digital Library

[28]

Sivaprasad Sudhir, Wenbo Tao, Nikolay Laptev, Cyrille Habis, Michael Cafarella, and Samuel Madden. 2023. Pando: Enhanced Data Skipping with Logical Data Partitioning. Proc. VLDB Endow. 16, 9 (may 2023), 2316--2329. https://doi.org/10.14778/3598581.3598601

Digital Library

[29]

Christoforos Svingos, Andre Hernich, Hinnerk Gildhoff, Yannis Papakonstantinou, and Yannis Ioannidis. 2023. Foreign Keys Open the Door for Faster Incremental View Maintenance. Proc. ACM Manag. Data 1, 1, Article 40 (may 2023), 25 pages. https://doi.org/10.1145/3588720

Digital Library

[30]

Zongheng Yang, Badrish Chandramouli, ChiWang, Johannes Gehrke, Yinan Li, Umar F. Minhas, Per-Åke Larson, Donald Kossmann, and Rajeev Acharya. 2020. Qd-tree: Learning Data Layouts for Big Data Analytics. In Proceedings of the 2020 International Conference on Management of Data.

Digital Library

[31]

Zack Slayton. 2017. Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB. https://aws.amazon.com/blogs/database/z-order-indexing-formultifaceted- queries-in-amazon-dynamodb-part-1/.

Cited By

Hansert PMichel S(2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682013

Index Terms

Automated Multidimensional Data Layouts in Amazon Redshift
1. Information systems
  1. Data management systems

Recommendations

Alliance Rules for Data Warehouse Cleansing
ICSPS '09: Proceedings of the 2009 International Conference on Signal Processing Systems

Data Cleansing is an activity performed on the data sets of data warehouse to enhance and maintain the quality and consistency of the data. This paper addresses the problems related with dirty data, entrance of dirty data and detection of dirty data in ...
Multidimensional benchmarking in data warehouses

Benchmarking is among the most widely adopted practices in business today. However, to the best of our knowledge, conducting multidimensional benchmarking in data warehouses has not been explored from a technical efficiency perspective. In this ...
Present and future directions in data warehousing

Many large organizations have developed data warehouses to support decision making. The data in a warehouse are subject oriented, integrated, time variant, and nonvolatile. A data warehouse contains five types of data: current detail data, older detail ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

June 2024

694 pages

ISBN:9798400704222

DOI:10.1145/3626246

General Chairs:
Pablo Barcelo
Universidad Catolica, Chile
,
Nayat Sanchez-Pi
INRIA Chile
,
Program Chairs:
Alexandra Meliou
University of Massachusetts Amherst, USA
,
S. Sudarshan
Indian Institute of Technology Bombay

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2024

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago AA, Chile

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
433
Total Downloads

Downloads (Last 12 months)433
Downloads (Last 6 weeks)84

Reflects downloads up to 01 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hansert PMichel S(2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682013

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten