Databricks Spark jobs optimization techniques: Multi-threading
Spark is known for its parallel processing, which means a data frame or a resilient distributed dataset (RDD) is being distributed across the worker nodes to gain maximum performance while processing. However, one problem we could face while running Spark jobs in Databricks is this: How do we process multiple data frames or notebooks at the same time (multi-threading)?
The benefits of parallel running are obvious: We can run the end-to-end pipeline faster, reduce the code deployed and maximize cluster utilization to save costs. Let’s see what this looks like with an example comparing sequential loading and multi-threading.
Sequential loading vs. Multi-threading
The following scenario shows an example when we have multiple sources to read from, coalesce into one parquet file, and then write in the destination location for each part. In this scenario, coalescing into one partition can only work on one CPU core in Spark, so all the other cores will become idle. By using a multi-threading pool, each CPU will have jobs to work on, which not only saves time but also creates a better load balance.
Our test cluster has one 4 cores/8 GB master node with two 4 cores/8 GB worker nodes.
Without multi-threading, under the sequential method, we read each part from the source, filter the data frame and write the result as one parquet file in the destination, which took about 20 secs to load 8 tables.
Under the same functions, after applying ThreadPool (8 threads at the same time), 8 tables can be loaded within 5 secs which is 4x faster than the sequential loading method.
Multi-threading is relatively quick to set up compared with other optimization methods. The improvement could be unlimited if we have a large enough cluster and plenty of jobs to run parallelly (under suitable scenarios). We can also use the multi-threading pool to parallel run multiple notebooks which do not have dependencies on each other even if we do not have the same scenario as shown above.
The purpose of using multi-threading is not only to save time, but also to fully utilize the clusters’ compute power to save cost by finishing the same amount of jobs within less time, or within the same amount of time on a smaller cluster, which gives us more options to manage the end-to-end pipeline.
Possible scenarios for the multi-threading technique
- Optimize bottlenecks in a pipeline to save end-to-end running time
- Parallel run independent notebooks to optimize load balance, saving both time and cost
- Read/write data from/to multiple tables
A multi-threading pool can also be developed by the concurrent.futures.ThreadPoolExecutor library in Python or the scala.concurrent.ExecutionContext library in Scala.
Want to learn more about Databricks Spark job optimization? Check out my previous blog on the topic to learn about the shuffle partition technique.