Designing a Data Platform

Dinesh Shankar
3 min read · Jun 12, 2022


Motivation

Data platform design is not only about ETL pipelines; it also covers the other components that make the data analytics journey efficient and robust. The purpose of this article is to introduce each component of a complete data platform and outline its scope.

Below are the six main pillars of a data platform.

Pillars of Data Platform

  1. ELT Layer
  2. Data Discovery Layer
  3. Data Lineage
  4. Automated Monitoring and Alerts
  5. Data Quality
  6. Data Governance (Audit and Access Control)

Prerequisites for Designing a Data Platform

1. Identify the source systems

These are the systems where transactional data is stored and made available for analytics consumption. Common examples (a connection sketch follows this list):

  • Transactional relational databases backing services (MySQL, Postgres, SQL Server)
  • NoSQL databases (AWS DynamoDB, MongoDB, etc.)
  • Object storage (S3, Azure Blob Storage, etc.)
  • Message queues (SQS, MSMQ, etc.)
  • Streaming sources (AWS Kinesis, Kafka; consumed via Spark Streaming, etc.)
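
To make this concrete, here is a minimal sketch of pulling data from two of these source types, a Postgres table and an S3 object. All hosts, credentials, buckets, and table names are hypothetical.

```python
# Minimal sketch: reading from two common source types.
# All connection details (host, credentials, bucket, key) are hypothetical.
import os

import boto3
import psycopg2

# Transactional relational database (Postgres)
conn = psycopg2.connect(
    host="orders-db.example.com",
    dbname="orders",
    user="reader",
    password=os.environ["PGPASSWORD"],
)
with conn.cursor() as cur:
    cur.execute("SELECT order_id, amount, created_at FROM orders LIMIT 100")
    rows = cur.fetchall()

# Object storage (S3)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="raw-events-bucket", Key="events/2022/06/12/part-0.json")
payload = obj["Body"].read()
```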

2. Identify stakeholders and data access patterns

Identify the stakeholders who will consume data from the platform for insights.

  • Executives — Leadership dashboards that provide high-level summaries
  • Product owners — Identifying new business opportunities
  • Data scientists — Training ML models and running A/B tests
  • Software engineers (SWE) — Programmatic access to DW data
  • Data engineers/BIEs — Building data marts and BI reporting

Access patterns vary widely across stakeholders, and the data platform needs to be designed to handle these diverse patterns without compromising performance. One such pattern is sketched after the list below.

Common Data Access patterns

  • Batch data loads — Data access from ETL tools
  • Interactive analytics — Access from query tools and IDEs (ad hoc SQL queries)
  • Data science — Data access via Python ML libraries
  • Reporting — Data access from reporting tools like Tableau, QuickSight, etc.
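
As one example, the programmatic (SWE) pattern might look like the sketch below, which runs an ad hoc SQL query through Amazon Athena via boto3. The database, table, and S3 output location are assumptions for illustration.

```python
# Minimal sketch: programmatic access to DW data via Athena.
# Database, table, and output location are hypothetical.
import time

import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT user_id, SUM(amount) AS total FROM sales GROUP BY user_id",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/adhoc/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```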

Designing the Data Platform

Below is a high-level overview of the data platform.

High Level Overview of Data Platform

1. ELT — Extract, Load and Transform

The ELT layer extracts data from the source systems and loads it into the data lake in its native format; the data is then cleansed, transformed, and loaded into a new layer that serves data analytics. Below are the properties of the data lake layer (a load sketch follows the list):

  • Schema on read
  • Stored with compression (gzip, snappy, etc.) — chosen based on the file format
  • Storage file format (Parquet, Avro, CSV, TSV, JSON)
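
Here is a minimal sketch of the load-and-transform step, assuming pandas with pyarrow and s3fs available; the paths, columns, and partition scheme are hypothetical.

```python
# Minimal sketch: land raw data as-is, then write a cleansed copy as
# snappy-compressed Parquet. Paths and columns are hypothetical; reading
# s3:// paths with pandas assumes s3fs is installed.
import pandas as pd

# Raw layer: native format (JSON here); schema is applied on read.
raw = pd.read_json("s3://data-lake/raw/orders/2022-06-12.json", lines=True)

# Cleanse and transform: drop bad records, normalize types, derive a partition key.
clean = raw.dropna(subset=["order_id"])
clean["created_at"] = pd.to_datetime(clean["created_at"])
clean["dt"] = clean["created_at"].dt.date

# Analytics layer: columnar file format with compression, partitioned for pruning.
clean.to_parquet(
    "s3://data-lake/analytics/orders/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["dt"],
)
```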

2. Data Discovery

The data discovery layer enables consumers to discover the data available in the platform. It typically exposes:

  • Table and Column Details
  • Column descriptions
  • Partition key information
  • Dataset Owner
  • Access details (who can access the data)

AWS Glue Data Catalog and the Hive Metastore are services that enable data discovery.
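
For example, with the Glue Data Catalog most of the details above can be fetched through boto3; the database and table names below are hypothetical.

```python
# Minimal sketch: look up table metadata in the AWS Glue Data Catalog.
# Database and table names are hypothetical.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="analytics", Name="orders")["Table"]

columns = table["StorageDescriptor"]["Columns"]   # table and column details
partition_keys = table.get("PartitionKeys", [])   # partition key information
owner = table.get("Owner", "unknown")             # dataset owner

for col in columns:
    print(col["Name"], col["Type"], col.get("Comment", ""))
```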

3. Data Lineage

The data lineage layer provides the ability to track upstream and downstream dependencies for a dataset. In a typical DW environment, data flows through 0 to N levels of transformations. Tracking lineage is essential for fixing data quality issues and assessing the impact of changes.
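
Dedicated tools exist for this, but a lineage store can be as simple as a dependency graph. The sketch below uses hypothetical dataset names and walks downstream edges to find everything impacted by a change.

```python
# Minimal sketch: lineage as a dependency graph, with a breadth-first walk
# of downstream edges for impact analysis. Dataset names are hypothetical.
from collections import deque

# dataset -> list of datasets that consume it (downstream dependencies)
downstream = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_sales", "mart.customer_ltv"],
    "mart.daily_sales": ["report.exec_dashboard"],
}

def impacted(dataset: str) -> set:
    """Return every dataset downstream of the given one."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(impacted("raw.orders"))
# {'staging.orders', 'mart.daily_sales', 'mart.customer_ltv', 'report.exec_dashboard'}
```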

4. Pipeline Monitoring and Alerting

Data accuracy and meeting SLAs are critical for any data analytics system to satisfy its end users. DW pipelines do fail for transient reasons, so automated monitoring and alerting are required to manage a data platform effectively. Two common mechanisms (a retry sketch follows the list):

  • Automated retries (with retry interval and backoff limit)
  • Alerting for manual actions (Email or tickets)
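
Below is a minimal sketch of both mechanisms, with exponential backoff between retries; the send_alert hook is a hypothetical stand-in for an email or ticketing integration.

```python
# Minimal sketch: automated retries with exponential backoff, falling back to
# an alert when retries are exhausted. send_alert is a hypothetical stand-in
# for an email/ticketing integration.
import time

def send_alert(message: str) -> None:
    # Placeholder: wire this to email, Slack, or a ticketing system.
    print(f"ALERT: {message}")

def run_with_retries(task, max_retries=3, base_delay=30):
    """Run a pipeline task, doubling the wait after each transient failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                send_alert(f"Task failed after {max_retries} attempts: {exc}")
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # 30s, 60s, 120s, ...
```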

5. Data Quality

Data quality is a critical component of the entire platform. Every dataset should have checks for the attributes below:

  • Completeness
  • Accuracy
  • Timeliness
  • Consistency

Data quality checks can be created based on past trends or purely based on statistics.

Tools: Deequ, Great Expectations, and dbt
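
As an illustration, here is a minimal sketch using Great Expectations' classic pandas interface (ge.from_pandas); the dataset path, columns, and thresholds are all hypothetical.

```python
# Minimal sketch: dataset checks with Great Expectations' classic pandas API.
# Dataset path, column names, and thresholds are hypothetical.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("s3://data-lake/analytics/orders/")
gdf = ge.from_pandas(df)

# Completeness: key columns must not be null.
gdf.expect_column_values_to_not_be_null("order_id")
# Accuracy: amounts must fall within a sane range.
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
# Consistency: status must come from a known set.
gdf.expect_column_values_to_be_in_set("status", ["placed", "shipped", "cancelled"])

results = gdf.validate()
print(results["success"])
```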

6. Data Governance(Auditing and Privacy)

Data governance ensures that data usage is tracked and that the right people have access to the right datasets.

With compliance regulations like GDPR and CCPA gaining importance, it is essential to design a data platform that can extract and delete data at the most granular level (record level in a table).

An access control system enforces fine-grained data access, i.e., access policies at the table, column, or partition level.

Tools for CRUD on the data lake: Apache Iceberg, Apache Hudi, Delta Lake
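
For example, with Delta Lake a GDPR-style record-level delete is a single SQL statement. The sketch below assumes a Spark session with the Delta extensions configured; the table path and user id are hypothetical.

```python
# Minimal sketch: record-level delete on a Delta Lake table via Spark SQL.
# Table path and user id are hypothetical; assumes the delta-spark package
# is available so the Delta extensions below can load.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delete a single user's records from a table stored in the data lake.
spark.sql("""
    DELETE FROM delta.`s3://data-lake/analytics/orders`
    WHERE user_id = 'user-123'
""")
```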

Summary

Putting everything together, below is an example design using AWS technologies.

Each service can be replaced with an open-source alternative or an equivalent service from another cloud platform.
