When you start an application, you need to think about where it’s going to run, and also how it’s going to run.
Basically, I use Spark in 2 ways.
As a Developer
I just declare the Spark binaries (jars) as dependencies in my Maven POM. In the app, when I need Spark to do something, I just call the local master (quick example here).
Pro: this is the super-duper easy & lazy way, works like a charm, and can be set up in under 5 minutes with one arm behind your back and blindfolded.
Con: well, I have a MacBook Air, a nice MacBook Air, but it is still only a MacBook Air, with 8 GB of RAM and 2 cores… My analysis never finishes (but a subset does).
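To make "calling the local master" a bit more concrete, here is a minimal sketch of the master string involved. Spark accepts `local[N]` (N worker threads) and `local[*]` (as many worker threads as logical cores); the `LocalMaster` class is a hypothetical helper of mine, not part of Spark, and it only builds the string.

```java
// Hypothetical helper (not a Spark API): builds the master string for local mode.
final class LocalMaster {

    /** Returns "local[*]" when threads is zero or negative, otherwise "local[N]". */
    static String url(int threads) {
        return threads <= 0 ? "local[*]" : "local[" + threads + "]";
    }

    public static void main(String[] args) {
        System.out.println(url(2)); // local[2]
        System.out.println(url(0)); // local[*]
    }
}
```

In the app, the resulting string is what you would hand to `SparkConf.setMaster(...)` or `SparkSession.builder().master(...)`, depending on your Spark version.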
As a Database User
OK, some will probably find this shocking, but I mostly use Spark as a database on a remote computer (my sweet Micha). The app connects to Spark, tells it what to do, and then “consumes” the results of the data crunching done by Spark on Micha (a bit more on the architecture).
Pro: this can scale like crazy (I have benchmarks scheduled)
Con: well… once you have gone through all the issues I had, I don’t see many issues anymore (except that I still can’t set the number of executors, which is starting to make sense, as I run in standalone mode).
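For this remote setup, the only thing that changes on the application side is the master URL. A standalone master is addressed as `spark://HOST:PORT`, with 7077 as the default port. The sketch below assumes the machine is reachable under the hostname `micha`, which is my guess for illustration; `StandaloneMaster` is a hypothetical helper, not a Spark class.

```java
// Hypothetical helper (not a Spark API): builds the URL of a standalone master.
final class StandaloneMaster {

    // Default port a standalone master listens on.
    static final int DEFAULT_PORT = 7077;

    static String url(String host, int port) {
        if (host == null || host.trim().isEmpty()) {
            throw new IllegalArgumentException("host is required");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return "spark://" + host.trim() + ":" + port;
    }

    static String url(String host) {
        return url(host, DEFAULT_PORT);
    }

    public static void main(String[] args) {
        // "micha" is an assumed hostname.
        System.out.println(url("micha")); // spark://micha:7077
    }
}
```

As with local mode, the string goes to `SparkConf.setMaster(...)` or `SparkSession.builder().master(...)`; the rest of the app stays the same.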
And of course, as in every list of 2 elements, here is the third. It is also a bit different, and less natural to software engineers.
As a Data Scientist
You prepare your “batch” as a jar and submit it to Spark. I remember using mainframes this way (and submitting jobs to SAS).
Pro: very friendly to data scientists / researchers as they are used to this batch model.
Con: you need to prepare the batch, send it… The jar also needs to do something with the results: save them in a database? Send an email? Send a PDF? Call the police?
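Sending the batch usually means calling `spark-submit` with the jar. Here is a rough sketch of how that command line is put together, using the `--class` and `--master` options. The main class, jar name and master URL are placeholders I made up, and `SubmitCommand` is a hypothetical helper that only builds the argument list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles a spark-submit command line, it does not run it.
final class SubmitCommand {

    static List<String> build(String mainClass, String masterUrl,
                              String jarPath, String... appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);   // entry point inside the jar
        cmd.add("--master");
        cmd.add(masterUrl);   // where the job should run
        cmd.add(jarPath);     // the "batch" itself
        for (String arg : appArgs) {
            cmd.add(arg);     // arguments passed through to the application
        }
        return cmd;
    }

    public static void main(String[] args) {
        // All three values below are made-up placeholders.
        List<String> cmd = build("com.example.NightlyReport",
                                 "spark://micha:7077",
                                 "nightly-report.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

The list can be passed to a `ProcessBuilder` if you want the submission itself to be scripted from Java. What the jar does with its results is still up to you.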
More Ways to Run Spark
In this context, I run Spark as a standalone server. Not all configuration options are available in this scenario (like the number of executors).
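As far as I understand it, in standalone mode the executor count is not set directly. It follows from the total cores the application may take (`spark.cores.max`) and the cores given to each executor (`spark.executor.cores`). The sketch below is only that arithmetic, with made-up numbers. It is an upper bound, since memory and the cores actually free on the workers can lower it, and `ExecutorEstimate` is a hypothetical helper, not a Spark class.

```java
// Hypothetical helper: rough upper bound on executors in standalone mode.
final class ExecutorEstimate {

    /**
     * @param coresMax         value assumed for spark.cores.max
     * @param coresPerExecutor value assumed for spark.executor.cores
     * @return how many executors of that size fit in the core budget
     */
    static int upperBound(int coresMax, int coresPerExecutor) {
        if (coresMax < 1 || coresPerExecutor < 1) {
            throw new IllegalArgumentException("core counts must be positive");
        }
        return coresMax / coresPerExecutor; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(upperBound(8, 2)); // 4
        System.out.println(upperBound(8, 3)); // 2
    }
}
```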
So how do you run your apps?