skip to main content
10.1145/3626246.3653379acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Automated Multidimensional Data Layouts in Amazon Redshift

Published: 09 June 2024 Publication History

Abstract

Analytic data systems typically use data layouts to improve the performance of scanning and filtering data. Common data layout techniques include single-column sort keys, compound sort keys, and more complex multidimensional data layouts such as the Z-order. An appropriately-selected data layout over a table, in combination with metadata such as zone maps, enables the system to skip irrelevant data blocks when scanning the table, which reduces the amount of data scanned and improves query performance.
In this paper, we introduce Multidimensional Data Layouts (MDDL), a new data layout technique which outperforms existing data layout techniques for query workloads with repetitive scan filters. Unlike existing data layout approaches, which typically sort tables based on columns, MDDL sorts tables based on a collection of predicates, which enables a much higher degree of specialization to the user's workload. We additionally introduce an algorithm for automatically learning the best MDDL for each table based on telemetry collected from the historical workload. We implemented MDDL within Amazon Redshift. Benchmarks on internal datasets and workloads show that MDDL achieves up to 85% reduction in end-to-end workload runtime compared to using traditional column-based data layout techniques. MDDL is, to the best of our knowledge, the first data layout technique in a commercial product that sorts based on predicates and automatically learns the best predicates.

References

[1]
[n.d.]. ALTER TABLE APPEND - Amazon Redshift-docs.aws.amazon.com. https: //docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html. [Accessed 25--11--2023].
[2]
[n.d.]. Amazon Redshift Advisor recommendations - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dgadvisorrecommendations.html. [Accessed 20--11--2023].
[3]
[n.d.]. Amazon Redshift Engineering's Advanced Table Design Playbook: Compound and Interleaved Sort Keys | Amazon Web Services - aws.amazon.com. https://aws.amazon.com/blogs/big-data/amazon-redshift-engineeringsadvanced- table-design-playbook-compound-and-interleaved-sort-keys/. [Accessed 28--11--2023].
[4]
[n.d.]. Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort - aws.amazon.com. https://aws.amazon.com/aboutaws/ whats-new/2019/11/amazon-redshift-introduces-automatic-table-sortalternative- vacuum-sort/. [Accessed 20--11--2023].
[5]
[n.d.]. Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering - databricks.com. https://www.databricks.com/blog/announcing-deltalake- 30-new-universal-format-and-liquid-clustering. [Accessed 28--11--2023].
[6]
[n.d.]. Automated materialized views - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-automv. html. [Accessed 28--11--2023].
[7]
[n.d.]. Clustering Keys & Clustered Tables | Snowflake Documentation - docs.snowflake.com. https://docs.snowflake.com/en/user-guide/tablesclustering- keys. [Accessed 28--11--2023].
[8]
[n.d.]. Data skipping with Z-order indexes for Delta Lake - docs.databricks.com. https://docs.databricks.com/en/delta/data-skipping.html. [Accessed 28--11--2023].
[9]
[n.d.]. DataWarehousing Guide - docs.oracle.com. https://docs.oracle.com/en/ database/oracle/oracle-database/19/dwhsg/using-zone-maps.html. [Accessed 28--11--2023].
[10]
[n.d.]. Indexing Expressions - docs.oracle.com. https://docs.oracle.com/ cd/E16338_01/appdev.112/e14919/exprn_indexexpressions.htm. [Accessed 28--11--2023].
[11]
[n.d.]. Micro-partitions & Data Clustering | Snowflake Documentation - docs.snowflake.com. https://docs.snowflake.com/en/user-guide/tablesclustering- micropartitions. [Accessed 28--11--2023].
[12]
[n.d.]. Partitioning tables in dedicated SQL pool - Azure Synapse Analytics - learn.microsoft.com. https://learn.microsoft.com/en-us/azure/synapseanalytics/ sql-data-warehouse/sql-data-warehouse-tables-partition. [Accessed 28--11--2023].
[13]
[n.d.]. SUPER type - Amazon Redshift - docs.aws.amazon.com. https://docs.aws. amazon.com/redshift/latest/dg/r_SUPER_type.html. [Accessed 21--11--2023].
[14]
[n.d.]. Working with automatic table optimization - Amazon Redshift - docs.aws.amazon.com. https://docs.aws.amazon.com/redshift/latest/dg/t_ Creating_tables.html. [Accessed 20--11--2023].
[15]
Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. 2000. Automated Selection of Materialized Views and Indexes in SQL Databases. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 496--505.
[16]
Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill. 2002. Data Page Layouts for Relational Databases on Deep Memory Hierarchies. The VLDB Journal 11, 3 (nov 2002), 198--215. https://doi.org/10.1007/s00778-002-0074--9
[17]
Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Polychroniou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Subramanian, and Doug Terry. 2022. Amazon Redshift Re-Invented. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 2205--2217. https://doi.org/10.1145/3514221.3526045
[18]
Nemanja Boric, Hinnerk Gildhoff, Menelaos Karavelas, Ippokratis Pandis, and Ioanna Tsalouchidou. 2020. Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 2781--2784. https://doi.org/10.1145/3318464.3384704
[19]
Nicolas Bruno and Surajit Chaudhuri. 2005. Automatic Physical Database Tuning: A Relaxation-Based Approach. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (Baltimore, Maryland) (SIGMOD '05). Association for Computing Machinery, New York, NY, USA, 227--238. https://doi.org/10.1145/1066157.1066184
[20]
Jialin Ding, Ryan Marcus, Andreas Kipf, Vikram Nathan, Aniruddha Nrusimha, Kapil Vaidya, Alexander van Renen, and Tim Kraska. 2022. SageDB: An Instance-Optimized Data Analytics System. Proc. VLDB Endow. 15, 13 (2022), 4062--4078. https://www.vldb.org/pvldb/vol15/p4062-ding.pdf
[21]
Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, ChiWang, Yinan Li, Ying Li, DonaldKossmann, Johannes Gehrke, and Tim Kraska. 2021. Instance-Optimized Data Layouts for CloudAnalyticsWorkloads. Association for Computing Machinery, New York, NY, USA, 418--431. https://doi.org/10.1145/3448016.3457270
[22]
Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proc. VLDB Endow. 14, 2 (Oct. 2020), 74--86. https://doi.org/10.14778/3425879.3425880
[23]
Guido Moerkotte. 1998. Small Materialized Aggregates: A LightWeight Index Structure for DataWarehousing. In Proceedings of the 24rd International Conference on Very Large Data Bases (VLDB '98). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 476--487.
[24]
Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-dimensional Indexes. In Proceedings of the 2020 International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3318464.3380579
[25]
Panos Parchas, Yonatan Naamad, Peter Van Bouwel, Christos Faloutsos, and Michalis Petropoulos. 2020. Fast and Effective Distribution-Key Recommendation for Amazon Redshift. Proc. VLDB Endow. 13, 12 (jul 2020), 2411--2423. https://doi.org/10.14778/3407790.3407834
[26]
Gaurav Saxena, Mohammad Rahman, Naresh Chainani, Chunbin Lin, George Caragea, Fahim Chowdhury, Ryan Marcus, Tim Kraska, Ippokratis Pandis, and Balakrishnan (Murali) Narayanaswamy. 2023. Auto-WLM: Machine Learning EnhancedWorkload Management in Amazon Redshift. In Companion of the 2023 International Conference on Management of Data (Seattle, WA, USA) (SIGMOD '23). Association for Computing Machinery, New York, NY, USA, 225--237. https://doi.org/10.1145/3555041.3589677
[27]
Sivaprasad Sudhir, Michael Cafarella, and Samuel Madden. 2021. Replicated Layout for In-Memory Database Systems. Proc. VLDB Endow. 15, 4 (dec 2021), 984--997. https://doi.org/10.14778/3503585.3503606
[28]
Sivaprasad Sudhir, Wenbo Tao, Nikolay Laptev, Cyrille Habis, Michael Cafarella, and Samuel Madden. 2023. Pando: Enhanced Data Skipping with Logical Data Partitioning. Proc. VLDB Endow. 16, 9 (may 2023), 2316--2329. https://doi.org/10.14778/3598581.3598601
[29]
Christoforos Svingos, Andre Hernich, Hinnerk Gildhoff, Yannis Papakonstantinou, and Yannis Ioannidis. 2023. Foreign Keys Open the Door for Faster Incremental View Maintenance. Proc. ACM Manag. Data 1, 1, Article 40 (may 2023), 25 pages. https://doi.org/10.1145/3588720
[30]
Zongheng Yang, Badrish Chandramouli, ChiWang, Johannes Gehrke, Yinan Li, Umar F. Minhas, Per-Åke Larson, Donald Kossmann, and Rajeev Acharya. 2020. Qd-tree: Learning Data Layouts for Big Data Analytics. In Proceedings of the 2020 International Conference on Management of Data.
[31]
Zack Slayton. 2017. Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB. https://aws.amazon.com/blogs/database/z-order-indexing-formultifaceted- queries-in-amazon-dynamodb-part-1/.

Cited By

View all
  • (2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
June 2024
694 pages
ISBN:9798400704222
DOI:10.1145/3626246
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2024

Check for updates

Author Tags

  1. analytic database
  2. data warehouse
  3. machine learning
  4. sort key

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '24
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)433
  • Downloads (Last 6 weeks)84
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media