What determines Kafka consumer offset?

I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I …
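
For orientation, here is a minimal sketch of how the starting offset is usually resolved. It is a toy model in plain Python, not Kafka client code: the function name `resolve_start_offset` and its parameters are made up for illustration. It assumes the commonly documented behaviour that a valid committed offset for the consumer group takes precedence, and that auto.offset.reset only applies when there is no committed offset or the committed one is out of range.

```python
def resolve_start_offset(committed, log_start, log_end, auto_offset_reset="latest"):
    """Toy model of where a consumer begins reading one partition.

    committed         -- last committed offset for the group, or None (hypothetical input)
    log_start/log_end -- first offset still retained / next offset to be written
    auto_offset_reset -- "earliest", "latest" or "none", as in the Kafka setting
    """
    # A committed offset that still points inside the log wins outright.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # Otherwise the reset policy decides.
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    if auto_offset_reset == "none":
        raise LookupError("no valid committed offset and auto.offset.reset=none")
    raise ValueError(f"unknown auto.offset.reset value: {auto_offset_reset!r}")


print(resolve_start_offset(42, 0, 100, "earliest"))    # committed offset is used
print(resolve_start_offset(None, 5, 100, "earliest"))  # falls back to log start
print(resolve_start_offset(None, 5, 100, "latest"))    # falls back to log end
```

Under these assumptions, changing auto.offset.reset has no visible effect for a group that already has valid committed offsets, which is a frequent source of confusion when experimenting.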

What are workers, executors, cores in Spark Standalone cluster?

I read Cluster Mode Overview and I still can’t understand the different processes in the Spark Standalone cluster and the parallelism. Is the worker a JVM process or not? I ran the bin\start-slave.sh and found that it spawned the worker, which is actually a JVM. As per the above link, an executor is a process …

Spark – repartition() vs coalesce()

According to Learning Spark: "Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions." One difference I get is that with repartition() the number of partitions can …
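
The distinction in the quoted passage can be illustrated with a toy model. This is plain Python over lists, not Spark code, and the two helper functions are hypothetical stand-ins. It assumes that coalesce() without a shuffle merges whole existing partitions, so it can only reduce their number, while repartition() redistributes every record. Real Spark groups partitions by locality and distributes records differently, so take this as a sketch of the idea only.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups; no record is split off its partition."""
    n = min(n, len(partitions))  # without a shuffle the partition count cannot grow
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous grouping is a simplification of Spark's locality-aware grouping.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Full shuffle: every record is reassigned round-robin to one of n new partitions."""
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


parts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(coalesce(parts, 2))          # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(repartition(parts, 2))       # [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
print(len(coalesce(parts, 6)))     # 4, the count cannot increase
print(len(repartition(parts, 6)))  # 6
```

In this model, coalesce() keeps each original partition intact inside one merged partition, which is why it avoids moving most of the data, at the cost of possibly uneven partition sizes.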

Explaining Apache ZooKeeper

I am trying to understand ZooKeeper, how it works and what it does. Is there any application which is comparable to ZooKeeper? If you know, then how would you describe ZooKeeper to a layman? I have tried the Apache wiki and the ZooKeeper SourceForge page, but I am still not able to relate to it. I just read through http://zookeeper.sourceforge.net/index.sf.shtml, …