According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation.
Spark also has an optimized version ofrepartition()
calledcoalesce()
that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
One difference I get is that with repartition()
the number of partitions can be increased/decreased, but with coalesce()
the number of partitions can only be decreased.
If the partitions are spread across multiple machines and coalesce()
is run, how can it avoid data movement?