rdd – IT Nursery

(Why) do we need to call cache or persist on an RDD

June 4, 2022 by IT Nursery

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call “cache” or “persist” explicitly to store the RDD data in memory? Or is the RDD data stored in memory in a distributed way by default?

val textFile = sc.textFile("/user/emp.txt")

As per …
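The question above can be explored with a small experiment. The sketch below is only an illustration under stated assumptions: it runs in local mode, and the object name `CacheSketch`, the helper `countMapCalls`, and the sample data are hypothetical, not taken from the original post. It counts how often a map function runs across two actions, with and without `cache()`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CacheSketch {
  // Hypothetical helper: runs two actions on a mapped RDD and returns
  // how many times the map function was invoked in total.
  def countMapCalls(sc: SparkContext, useCache: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val mapped = sc.parallelize(1 to 100).map { n => calls.add(1L); n * 2 }
    val rdd = if (useCache) mapped.cache() else mapped

    rdd.count() // first action: evaluates the lineage
    rdd.count() // second action: recomputes unless partitions were cached
    if (useCache) rdd.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    println(s"map calls without cache: ${countMapCalls(sc, useCache = false)}")
    println(s"map calls with cache:    ${countMapCalls(sc, useCache = true)}")

    spark.stop()
  }
}
```

If the cached partitions fit in memory and no task is retried, the cached run should report about half as many map calls as the uncached one, because transformations are lazy and an unpersisted RDD is recomputed from its lineage on every action.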

Spark performance for Scala vs Python

May 31, 2022 by IT Nursery

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in the Scala version than in the Python version, for obvious reasons. With that assumption, I decided to learn Scala and write the Scala version of some very common preprocessing code for about 1 GB of data. Data …

What is the difference between cache and persist?

May 26, 2022 by IT Nursery

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
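For RDDs, `cache()` is shorthand for `persist()` with the default storage level `MEMORY_ONLY`, while `persist(level)` lets the caller pick a different level. A minimal sketch to check this, assuming a local session (the object name `PersistLevels` and the sample data are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistLevels {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-levels")
      .getOrCreate()
    val sc = spark.sparkContext

    val cached    = sc.parallelize(1 to 10).cache()
    val persisted = sc.parallelize(1 to 10).persist()
    val spillable = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)

    // Both cache() and the no-argument persist() mark the RDD as MEMORY_ONLY.
    println(cached.getStorageLevel == StorageLevel.MEMORY_ONLY)
    println(persisted.getStorageLevel == StorageLevel.MEMORY_ONLY)
    // An explicit level is recorded as given.
    println(spillable.getStorageLevel == StorageLevel.MEMORY_AND_DISK)

    // Release the storage marks once the RDDs are no longer needed.
    Seq(cached, persisted, spillable).foreach(_.unpersist())
    spark.stop()
  }
}
```

Marking an RDD for persistence does not compute it; the partitions are only stored when the first action runs.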

Difference between DataFrame, Dataset, and RDD in Spark

May 15, 2022 by IT Nursery

I’m just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
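On the conversion part of the question, a rough sketch may help. It assumes a local session and a hypothetical `Employee` case class invented purely for illustration; the conversions themselves (`toDF`, `as[T]`, `.rdd`) are standard Spark SQL calls.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder.
case class Employee(name: String, age: Int)

object ConvertSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("convert-sketch")
      .getOrCreate()
    import spark.implicits._ // brings toDF/toDS and encoders into scope

    val rdd: RDD[Employee] =
      spark.sparkContext.parallelize(Seq(Employee("Ann", 31), Employee("Bob", 45)))

    val df: DataFrame = rdd.toDF()             // RDD -> DataFrame (untyped rows)
    val ds: Dataset[Employee] = df.as[Employee] // DataFrame -> typed Dataset

    val rows: RDD[Row] = df.rdd                // DataFrame -> RDD[Row]
    val typed: RDD[Employee] = ds.rdd          // Dataset -> RDD[Employee]

    df.printSchema()
    println(s"rows: ${rows.count()}, typed: ${typed.count()}")

    spark.stop()
  }
}
```

Going from a DataFrame back to an RDD yields generic `Row` objects, whereas a typed Dataset gives back the original element type.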

Spark – repartition() vs coalesce()

May 10, 2022 by IT Nursery

According to Learning Spark: “Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.” One difference I get is that with repartition() the number of partitions can …
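The behaviour described in the quote can be checked by looking at partition counts. This is a minimal sketch assuming a local session; the object name `PartitionSketch`, the helper `partitionCounts`, and the partition numbers are hypothetical choices for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Hypothetical helper: returns the partition counts after each operation,
  // starting from an RDD with 8 partitions.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    val rdd = sc.parallelize(1 to 1000, 8)
    Seq(
      rdd.coalesce(2).getNumPartitions,                  // 2: merges partitions, no shuffle
      rdd.coalesce(16).getNumPartitions,                 // 8: cannot grow without a shuffle
      rdd.repartition(16).getNumPartitions,              // 16: always shuffles
      rdd.coalesce(16, shuffle = true).getNumPartitions  // 16: same as repartition(16)
    )
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-sketch")
      .getOrCreate()

    println(partitionCounts(spark.sparkContext).mkString(", "))

    spark.stop()
  }
}
```

For RDDs, `repartition(n)` is implemented as `coalesce(n, shuffle = true)`, which is why it can both increase and decrease the partition count at the cost of moving data across the cluster.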