Data Skew : 101
What is Data Skew?
Data skew is the pattern in which data in the table is unevenly distributed across the partitions leading to performance bottlenecks.
Issues with Data Skew
- Jobs taking long time to run
- Jobs stuck in the final stage (90+% tasks completed in minutes but final task running for hours)
- Jobs failing with OOM Errors (Out of Memory)
Approach to handle Data Skewness
1. Filter skewed records due to unknown values
Skewness occurs when column used in the join or aggregate operation contains values much higher than the other key values. Often times the skewness is triggered by bad records/unknown values ( NULL,Blank string, unknown values)
It will be good idea to filter these filter wherever possible.
GROUP BY 1
ORDER BY 2 DESC
2. Handling skews in Joins
When the values present in the legit values then we have to use a different technique to solve the skewness.
For example : Popular sports star is followed by millions of followers in a social media platform compared to normal users. Data for the celebrity will be skewed and all the data for key is valid.
Joins strategies :
Broadcast Join : When one side of the join operation table is very small compared to the other table then broadcasting the smaller table is better than shuffles. Broadcast joins doest trigger a shuffle operation
SortMerge Join/Shuffle Join: Join techqniue used by spark/hive to scan the data in specific order and perform the join. This type of join is non skew resistant and requires data to be partitioned .
Using Skew Hints: Skew joins are hybrid joins which process the skewed records using broadcast join and remaining non skewed values using shuffle hash join or sort merge join.
- Identify the key values in the join stage
- Add skew hints
/*+ SKEWED_ON(a.userid='12345', a.userid='123455') */
FROM TABLEA a
INNER JOIN TABLEB b
ON a.userid = b.userid
- Supports only on Join operations (Except Full outer join)
- LEFT JOIN — skewed data can be only on the left side
- Right JOIN — skewed data can be only on the right side
Spark Configurations to handle skewness.
Auto Skew Mitigation:
Spark detects the skewness automatically and handles it accordingly to reduce shuffles. Supported only in joins.
3. Handling Skews in Aggregates
Data skew in aggregate operations like group by affects the jobs in the aggregation phase and slows down the job increasing the overall runtime.
Partial aggregation : Spark feature which adds a stage before the aggregate phase to partially aggregate the data before shuffling. It reduces the amount of data sent to the final stages of aggregate phase.
4. Salting Technique
Salting technique is used when there is no key can be used to evenly distribute the data. Idea is generate a new ket which guarantees the even distribution of data. ‘salt’ is the random key which will repartition the data into different partitions in the evenly distributed manner.
withColumn and repartition functions can be used in this method.
Other important technique to handle data skewness is to use AQE skew mitigation.
Thanks for reading!