apache-spark Archives

Programming, scala

IT Nursery

(Why) do we need to call cache or persist on a RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call ...

June 4, 2022

Java, Programming

Add JAR files to a Spark job – spark-submit

True… it has been discussed quite a lot. However, there is a lot of ambiguity and some of the answers provided … including ...

June 2, 2022

Programming, scala

Spark performance for Scala vs Python

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in the ...

May 31, 2022

apache-spark, Programming

How to stop INFO messages displaying on spark console?

I’d like to stop the various messages that appear in the Spark shell. I tried to edit the log4j.properties file in order to stop ...

May 29, 2022

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

May 26, 2022

Apache Spark: The number of cores vs. the number of executors

I’m trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. ...

May 25, 2022

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Getting strange behavior when calling a function outside of a closure: when the function is in an object everything works; when the function is in ...

May 23, 2022

Spark java.lang.OutOfMemoryError: Java heap space

My cluster: 1 master, 11 slaves, each node has 6 GB memory. My settings: spark.executor.memory=4g, Dspark.akka.frameSize=512. Here is the problem: First, I read ...

May 20, 2022

What are workers, executors, cores in Spark Standalone cluster?

I read Cluster Mode Overview and I still can’t understand the different processes in the Spark Standalone cluster and the parallelism. Is the ...

May 20, 2022

How to change dataframe column names in pyspark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column ...

May 18, 2022