Spark - repartition() vs coalesce()

Spark – repartition() vs coalesce()

IT Nursery

May 10, 2022

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation.
Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

18 Answers
18

18 Answers 18

Leave a Reply Cancel reply

18 Answers
18