Dimension Modeling: Late Arriving Dimensions
Overview
A data warehouse designed using the principles of dimensional modeling techniques will require all the dimensions loads to be completed before starting the fact table ETL load.In certain cases the fact records appear before the dimension data is loaded.
Example:Online ordering analytics data warehouse receives orders data from order processing ETL before new product data is loaded into products dimension table.
In this article I will cover the various design approaches to handle early arriving fact or late arriving dimension data.
Design Choices
Late arriving dimension data is common scenario in the data warehousing environment.This can happen in various scenarios like ETL job failures, systems going out of sync or replication delays. Below are the options in using which the late arriving dimension scenarios can be handled.
- Default Dimension Record
- HOLD Fact Record
- Infer Dimension Record
Option 1: Default Dimension Record
First option is to assign the “unknown” dimension value to the fact records.The unknown records will be maintained the backlog table which can be reprocessed with the actual dimension value once the dimension data is loaded.
Pros:
- No delay in loading the fact record
Cons:
- Late Dimension records are grouped in unknown bucket which will lead to inaccuracy in reporting.
Option 2: HOLD the FACT Record
Second option is to hold the fact record in the a backlog table until the dimension record arrives. Backlog table contains a flag which denotes the processed records. Every day ETL job checks for the unprocessed records against the dimensions and updates the flag accordingly.
Pros:
- Easier to implement
- Suitable for batch processing with nightly loads which are not sensitive for data delays
Cons:
- Fact record is not available for reporting until the dimension record the loaded. Business teams might be impacted due to missing data.
Option 3: Infer the Dimension record
Third option is to insert the dimension record with a new surrogate key and use the same value to load the fact table. This option is less feasible when enough information about the dimension is not available in the fact table.
Summary
There are multiple ways to handle the late arriving date problem but choosing the right approach entirely depends on the business use cases ,stakeholder requirements and nature of the data. Impact of Data accuracy and missing data plays a vital role in the design choice.
Thanks for reading!