Data is one of the most important assets a company holds today, and data engineering has become an increasingly familiar role in recent years. The purpose of this article is to share the skills required to be a successful data engineer.

What is Data Engineering?

Data engineering is an engineering discipline that focuses on the data transformation process so that business teams receive actionable insights. It covers the full lifecycle of data, from raw format to actionable insight. Below are the primary responsibilities of a data engineering team.

  • Building ETL pipelines for data translation…

Overview

AWS announced a new generation of Redshift node types, named RA3, which decouple compute and storage so that users can manage and scale each independently based on their needs.

The purpose of this article is to introduce RA3 node types and compare them with the existing node types. At the end, I will also discuss the migration options to RA3 nodes. Let's get started.

What is Redshift RA3?

RA3 nodes are built on next-generation, Nitro-powered compute instances, which come with high-bandwidth networking and managed storage that uses local SSD-based storage backed by Amazon Simple Storage Service (S3).

Redshift Managed Storage Architecture

RA3 instances with managed storage use high performance SSDs for…


Overview

Flask is a web framework for Python that provides a simple interface for dynamically generating responses to web requests.

Docker is an open-source application that allows administrators to create, manage, deploy, and replicate applications using containers.

The purpose of this article is to provide step-by-step instructions for running a Flask app integrated with Gunicorn and NGINX inside a single container hosted on AWS EC2.

Components

  • Flask — Python-based web application backend.
  • Gunicorn — Python WSGI HTTP server for web applications.
  • NGINX — HTTP cache, load balancer, and reverse proxy server.
  • Docker — Tool designed to make it easier to create…
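To make the relationship between the first two components concrete, here is a minimal sketch of the WSGI interface that Gunicorn speaks, written with the standard library only. It is an illustration under assumptions, not the application from this article: the names `app` and `call_app` and the `/health` path are hypothetical. A Flask application object is itself a WSGI callable of this shape, which is why Gunicorn can serve it.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A bare WSGI application: takes the request environ, returns body chunks."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Invoke the WSGI callable directly, roughly as a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        # Record what the application reports instead of writing to a socket.
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app("/health")
```

In a real deployment, the server is typically started with a command of the form `gunicorn <module>:<callable>`, and NGINX sits in front of it as the reverse proxy.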


Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse designed to handle large-scale datasets and to support data analysis and business intelligence reporting.

Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and by parallelizing queries across multiple nodes.
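The I/O benefit of columnar storage can be illustrated with a small Python sketch. This is a toy model, not how Redshift stores blocks internally; the sample rows, column names, and the `to_columns` helper are assumptions made up for the example.

```python
# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"order_id": 1, "region": "us", "amount": 120.0},
    {"order_id": 2, "region": "eu", "amount": 80.0},
    {"order_id": 3, "region": "us", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is scanned.
total_from_columns = sum(columns["amount"])
```

Both totals are the same, but the columnar version never touches `order_id` or `region`. For an analytic query that aggregates a few columns of a wide table, that is where the I/O saving comes from.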

The scope of this article is to share the table design practices that showed significant performance improvements.

Architecture

Amazon Redshift is based on a massively parallel processing (MPP) architecture in which the cluster is the core component. A cluster is composed of a leader node and one or more compute nodes.
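As a rough illustration of the MPP idea, the sketch below splits rows across workers, lets each worker aggregate its own slice, and merges the partial results. It is a simplified model under assumptions: the function names, the modulo distribution rule, and the use of threads as stand-ins for compute nodes are all hypothetical and do not reflect Redshift internals.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_nodes):
    """Assign each (key, value) row to a node slice by key modulo node count."""
    slices = [[] for _ in range(num_nodes)]
    for key, value in rows:
        slices[key % num_nodes].append((key, value))
    return slices


def compute_node_sum(node_rows):
    """Work done on one compute node: aggregate only its local rows."""
    return sum(value for _, value in node_rows)


def leader_sum(rows, num_nodes=3):
    """Leader role: fan the work out, then merge the partial results."""
    slices = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(compute_node_sum, slices))
    return sum(partials)


rows = [(i, i * 10) for i in range(1, 7)]  # hypothetical sample data
total = leader_sum(rows)
```

How evenly the rows are spread across slices determines how much each node has to do. That is one reason table design choices can have a large effect on query performance.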

Leader Node: Coordinates the compute nodes and handles external communication.

  • SQL endpoint to the cluster
  • Stores metadata of…

Dishan
