How to stop INFO messages displaying on spark console?

I’d like to stop various messages that are coming on the Spark shell. I tried to edit the log4j.properties file in order to stop these messages. Here are the contents of log4j.properties:

```
# Define the root logger with appender file
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party …
```

Read more
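Besides editing log4j.properties, the log level can also be changed for the current session from inside the shell. A minimal sketch, assuming the active `SparkContext` is bound to `sc` as it is in spark-shell (API available since Spark 1.4):

```scala
// Raise the threshold so INFO messages are suppressed for this
// session only; log4j.properties on disk is left untouched.
sc.setLogLevel("WARN")
```

This is convenient for experimentation, while the properties-file edit is the right place for a permanent default.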

What are workers, executors, cores in Spark Standalone cluster?

I read Cluster Mode Overview and I still can’t understand the different processes in the Spark Standalone cluster and the parallelism. Is the worker a JVM process or not? I ran bin\start-slave.sh and found that it spawned the worker, which is actually a JVM. As per the above link, an executor is a process … Read more
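The worker/executor/core relationship usually becomes concrete when an application is configured. A hedged sketch in Scala, assuming a standalone master URL of `spark://master:7077` (host name hypothetical); `spark.executor.memory` and `spark.cores.max` are the standard standalone sizing properties:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Each worker node runs one Worker JVM. For every application,
// the Worker launches Executor JVMs within these limits.
val conf = new SparkConf()
  .setAppName("sizing-sketch")
  .setMaster("spark://master:7077")          // hypothetical master URL
  .set("spark.executor.memory", "2g")        // heap per Executor JVM
  .set("spark.cores.max", "8")               // total cores across all executors
val sc = new SparkContext(conf)
```

Tasks then run as threads inside the executors, one task per core, which is where the parallelism comes from.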

How to show full column content in a Spark Dataframe?

I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content:

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("my.csv")
df.registerTempTable("tasks")
val results = sqlContext.sql("select col from tasks")
results.show()
```

The col seems truncated:

```
scala> results.show()
+--------------------+
|                 col|
+--------------------+
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 …
```

Read more
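For context, `show()` truncates string cells to 20 characters by default; the truncation can be switched off through its overloads. A minimal sketch, assuming `results` is the DataFrame from the query above:

```scala
// Default 20 rows, but cells printed in full.
results.show(false)

// First 100 rows, untruncated.
results.show(100, false)
```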

Spark – repartition() vs coalesce()

According to Learning Spark:

> Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I get is that with repartition() the number of partitions can … Read more
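The distinction can be sketched on an RDD; assuming an existing `rdd` with, say, 100 partitions:

```scala
// coalesce() merges existing partitions locally, avoiding a full
// shuffle — which is why it can only decrease the partition count
// (unless shuffle = true is passed explicitly).
val fewer = rdd.coalesce(10)

// repartition() always shuffles and can increase or decrease the
// count; internally it is coalesce(n, shuffle = true).
val more = rdd.repartition(200)
```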