Redshift Table Design Best Practices

Amazon Redshift

Amazon Redshift is a fully managed petabyte scale datawarehouse designed to handle large scale datasets, perform data analysis and business intelligence reporting.

Architecture

Amazon Redshift is based on MPP architecture in which cluster is the core component. A cluster is composed on leader and compute nodes.

  • Stores metadata of objects in the cluster
  • Performs aggregations and user communication
Redshift Architecture

Table Design Principles

Redshift is designed under the principles of Massively Parallel Processing (MPP) which is different from traditional datawarehouses. Indexes, Partitions, constraints (PK, FK) are often used in the traditional databases to improve query performance but redshift table natively doesn’t have indexes or partition keys.

  1. Data Distribution
  2. Data Compression
  3. Constraints

Data Sorting

Amazon Redshift stores the data on disk in sorted order based on the sort key of the table. There can be one or more columns as sort keys for a table. But how does sorting the data in disk improves the query performance? The answer is ‘Zone Maps’.

Zone Maps in Redshift
  • Define four or less sort key columns — more will result in marginal gains and increased ingestion overhead

Data Distribution

Distribution style property defines how the table’s data is distributed throughout the cluster. Distributing the data spread across the compute nodes will result in a high degree of parallelism, minimize data movement across nodes (same as shuffle behavior) and avoid data skewness. There are three types of distribution styles in redshift.

Compression Encoding

Redshift being a columnar database enables compressions at the column-level that reduces the size of the data when its stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

  1. Use COPY command to analyze and apply compression
  • Avoid compressing SORT KEYS in the table
  • Changing column compression requires table rebuild. Column encoding utility available here.

Constraints

Constraints are not enforced by Redshift. Redshift assumes all keys in validated when loaded with an exception of NOT NULL column constraints

Summary

  • Sort keys should be added on the potential columns that will be used in query filters. Column order matters.
  • Avoid compressing the sort key column to reduce data scan.
  • Avoid distribution keys on columns with low cardinality.
  • Use ANALYZE COMPRESSION to choose the optimal compression for the columns.
  • Always use table constraints for better query planning.
  • Identify sort key metrics using system tables (SVV_TABLE_INFO)

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store