Add JAR files to a Spark job – spark-submit

True… it has been discussed quite a lot.

However, there is a lot of ambiguity in some of the answers provided, including ones that duplicate JAR references across the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, or omitted details should be clarified for each option:

  • How the ClassPath is affected
    • Driver
    • Executor (for running tasks)
    • Both
    • Not at all
  • Separation character: comma, colon, or semicolon
  • Whether the provided files are automatically distributed
    • for the tasks (to each executor)
    • for the remote Driver (if run in cluster mode)
  • Type of URI accepted: local file, HDFS, HTTP, etc.
  • If copied into a common location, where that location is (HDFS, local?)

The options that this affects:

  1. --jars
  2. SparkContext.addJar(...) method (see the sketch after this list)
  3. SparkContext.addFile(...) method (also shown in that sketch)
  4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
  5. --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
  6. --conf spark.executor.extraClassPath=...
  7. --conf spark.executor.extraLibraryPath=...
  8. And, not to forget, the last parameter of spark-submit is also a .jar file.
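
For reference, here is roughly how I am calling the two SparkContext methods from options 2 and 3 today (a minimal sketch; the object name and all paths are placeholders I made up):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddJarSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("add-jar-sketch"))

    // addJar ships a JAR for this job; my understanding is that it
    // affects the tasks on the executors, not the driver's own ClassPath.
    sc.addJar("/local/path/additional1.jar")

    // addFile distributes an arbitrary file to every node; the local
    // copy on each node is then resolved through SparkFiles.get.
    sc.addFile("/local/path/lookup.txt")
    val localPath = SparkFiles.get("lookup.txt")

    sc.stop()
  }
}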

I am aware of where to find the main Apache Spark documentation, specifically the pages on submitting applications, the available options, and the JavaDoc. However, that still left quite a few holes for me, even though my questions were partially answered there.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from the documentation, it seems that --jars and the SparkContext addJar and addFile methods are the ones that automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that, for simplicity, I can add additional application JAR files using the three main options at the same time?

spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar
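
For what it is worth, this is how I have been inspecting what actually lands on the ClassPath on each side (a rough sketch of my own; it just prints the java.class.path JVM property on the driver and from one executor task):

import org.apache.spark.{SparkConf, SparkContext}

object ClassPathCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("classpath-check"))

    // Driver-side ClassPath: printed wherever the driver runs
    // (locally in client mode, on the cluster in cluster mode).
    println("driver  : " + sys.props("java.class.path"))

    // Executor-side ClassPath: run one trivial task and collect the
    // same property from the JVM that executed it.
    val executorCp = sc.parallelize(Seq(1), numSlices = 1)
      .map(_ => sys.props("java.class.path"))
      .collect()
      .head
    println("executor: " + executorCp)

    sc.stop()
  }
}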

I found a nice article in an answer to another posting. However, I learned nothing new from it. The poster does make a good remark on the difference between a local driver (yarn-client) and a remote driver (yarn-cluster), which is definitely important to keep in mind.
