(Why) do we need to call cache or persist on an RDD
When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call … Read more
True… it has been discussed quite a lot. However, there is a lot of ambiguity and some of the answers provided … including … Read more
I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in the … Read more
I’d like to stop the various messages that appear in the Spark shell. I tried to edit the log4j.properties file in order to stop … Read more
In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
I’m trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. … Read more
Getting strange behavior when calling a function outside of a closure: when the function is in an object everything works; when the function is in … Read more
My cluster: 1 master, 11 slaves, each node has 6 GB memory. My settings: spark.executor.memory=4g, Dspark.akka.frameSize=512 Here is the problem: First, I read … Read more
I read Cluster Mode Overview and I still can’t understand the different processes in the Spark Standalone cluster and the parallelism. Is the … Read more
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column … Read more