Author: Mayank Malhotra (Big Data Engineer)
Apache Spark has become an indispensable tool for big data processing workloads, whether in the cloud or on on-premise infrastructure. Because Spark processes data in memory, memory utilization can become a bottleneck, especially when your Spark jobs are not carefully optimized.
In this post I am going to share some Spark job optimization techniques that anyone creating Spark applications should know:
1. Repartition vs. Coalesce
Repartition and coalesce are both used to change the number of partitions of a DataFrame/RDD. Repartition allows you to increase or decrease the number of partitions, whereas coalesce can only decrease it.
Repartition also performs a shuffle, which coalesce does not. Let's consider a DataFrame df1 which has 6 partitions.
Calling df1.repartition(3) evenly distributes the data of the 6 partitions into 3 partitions; because of the shuffle, data from any input partition can end up in any output partition. In contrast, df1.coalesce(3) involves no shuffle.
It simply takes the data of the last 3 partitions and merges it into the first 3 partitions; data already in the first 3 partitions stays where it is. This can lead to uneven distribution of data across partitions and cause job bottlenecks. On the other hand, coalesce is fast precisely because no shuffle takes place.
In a nutshell, both coalesce and repartition have their pros and cons depending upon your use case.
2. Broadcast Join
Broadcasting a DataFrame becomes super useful when one of your DataFrames is much smaller than the others. Broadcasting is the method in which the driver sends the smaller DataFrame to all the executors, so an executor does not need to fetch matching data from other executors; it can treat the broadcast DataFrame as a local lookup table.
This can increase job performance many fold. However, keep in mind that the broadcast DataFrame should not be very large; otherwise the driver may take a long time to send it to the executors, hurting job performance instead of improving it.
3. Cache/Persist
You may be aware that Spark performs lazy evaluation: it builds a lineage of all the transformations and only starts executing them when an action is invoked.
If you need a DataFrame multiple times in your Spark job, consider persisting it in memory to save recompute time. Cache/persist come in handy here.
When you cache a DataFrame, Spark stores it in memory and keeps it there until the end of the job (or until you unpersist it). Do consider the size of the DataFrame, though: persisting a DataFrame larger than the available executor memory can cause data to spill to disk, which can decrease performance.
I hope you find these suggestions useful in your work. If you have other suggestions you would like to share with us, feel free to do so.