Unlike the new iPhone, the release of Apache Spark v2.0.0 did not gather thousands of people in a room, but it is a very important event in the small world of analytics. This article will not cover all the updates, only a few that I consider important or that affect my day-to-day life. Note that I have not deployed v2 yet (I may be less excited afterwards).
Not counting the new features, the release includes 2,500+ patches from 300+ contributors. Thanks to this awesome community.
Definitely Dataset, Goodbye to the Rest
For me, the most important feature of version 2 is the standardization on Dataset for everything. Goodbye to RDD and DataFrame as separate things to juggle: DataFrame is now simply a Dataset of Row (a type alias in Scala, Dataset<Row> in Java). They will not completely disappear, but they are less in the way, and application developers can focus on their code using the Dataset object and all the APIs built on it.
Welcome to sessions! Like other data-oriented (and web-oriented) tools, Spark now has sessions, which replace SQLContext (and HiveContext) as the entry point. They come with a new (supposedly simpler) API. Welcome SparkSession.
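To make the new entry point a bit more concrete, here is a minimal PySpark sketch. The helper name `create_session`, the application name, and the `local[*]` master are illustrative assumptions for a local test run, not something taken from the release notes. The import sits inside the function only so the snippet can be loaded on a machine without Spark installed.

```python
def create_session(app_name="spark2-session-sketch", master="local[*]"):
    """Return a SparkSession, the Spark 2.x entry point that replaces SQLContext.

    app_name and master are hypothetical defaults chosen for a local run.
    """
    # Imported lazily so this helper can be defined without Spark on the path.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)   # name shown in the Spark UI
        .master(master)      # assumption: local mode, using all available cores
        .getOrCreate()       # reuses the active session if one already exists
    )
```

With Spark 2.x installed, `spark = create_session()` should give you one object exposing `spark.sql(...)`, `spark.read` and `spark.range(...)`, where Spark 1.x needed a SparkContext plus a SQLContext. Call `spark.stop()` when you are done.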
Lots of SQL improvements, such as subqueries and support for both ANSI SQL and HiveQL… I still need to find out more about the native DDL command implementations.
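As an illustration of the subquery support, here is a small PySpark sketch that uses an uncorrelated scalar subquery in a WHERE clause. The `books` view, its `title` and `price` columns, and the function name are all hypothetical; `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical query: keep the books that cost more than the average book.
ABOVE_AVERAGE_SQL = """
    SELECT title, price
    FROM books
    WHERE price > (SELECT AVG(price) FROM books)
"""


def books_above_average(spark, books_df):
    """Return the rows of books_df whose price is above the average price.

    Assumes books_df has at least `title` and `price` columns.
    """
    # Register the DataFrame under a name that SQL statements can refer to.
    books_df.createOrReplaceTempView("books")
    # The inner SELECT is evaluated once and compared against every row.
    return spark.sql(ABOVE_AVERAGE_SQL)
```

For example, with `books_df = spark.createDataFrame([("A", 10.0), ("B", 30.0)], ["title", "price"])`, the average price is 20.0, so only the row for B qualifies.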
CSV Natively Crunched
Wasn’t it a little miserable to have to use Databricks’ spark-csv library to digest CSV? Well, now we don’t need it: CSV support is built in. Thanks to Databricks for doing it in the first place (really appreciated) and, most certainly, thanks for giving it to the community.
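Here is roughly what the difference looks like in PySpark. This is a minimal sketch: the function name and the choice of options are mine, and `spark` is assumed to be an existing SparkSession.

```python
def read_csv(spark, path):
    """Load a CSV file into a DataFrame with the reader built into Spark 2.x.

    The Spark 1.x equivalent needed the external package, along the lines of:
        sqlContext.read.format("com.databricks.spark.csv")
                  .option("header", "true").load(path)
    """
    return spark.read.csv(
        path,
        header=True,       # treat the first line as column names
        inferSchema=True,  # costs an extra pass over the data to guess types
    )
```

Calling `read_csv(spark, "books.csv")` (a hypothetical file) returns a regular DataFrame, so `printSchema()` and `show()` work as usual.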
OMG – Other Miraculous Goody
Off-heap memory management for both caching and runtime execution. I need to repeat that and put it in bold: off-heap memory management for both caching and runtime execution. This may seem like nothing, but it should simplify the developer’s life considerably, as they will not have to tweak the JVM heap every time it is a little on the “low memory” side. Sysadmins will certainly be pleased to receive fewer calls at 2:27 am because the big job has just crashed the heap on the production server.
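Off-heap memory is driven by two configuration properties, `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size`, and as far as I can tell from the configuration documentation it is not switched on by default. Below is a minimal sketch of how it could be enabled when building a session. The helper name and the 2 GiB figure are assumptions for illustration, not a recommendation.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def with_off_heap(builder, size_in_bytes=2 * GIB):
    """Enable off-heap memory on a SparkSession builder and return the builder.

    size_in_bytes is a hypothetical budget; it must be positive when
    off-heap memory is enabled.
    """
    return (
        builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", str(size_in_bytes))
    )
```

It would be used as `spark = with_off_heap(SparkSession.builder.appName("demo")).getOrCreate()`. The same two properties can also go in `spark-defaults.conf` or be passed with `--conf` to `spark-submit`.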
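As I wrote earlier, the generalization of Dataset/DataFrame is great, and it applies to MLlib too: the DataFrame-based API (the spark.ml package) is now the primary machine learning API, while the RDD-based one goes into maintenance mode. This will simplify all your machine learning applications (more on that soon).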
Note that some features are deprecated in the 2.x branch:
- Fine-grained mode in Apache Mesos.
- Support for Java v7.
- Support for Python v2.6.
Check out the details of what has been removed in Spark v2.0.0.
I am surely missing a lot of features and improvements. I will also update my recipes to make sure they work with v2, and I have added a new tag to this blog (very originally named Spark v2.0) for v2.0-specific features and articles. Fantastic new release, I❤️💥. What’s your favorite enhancement?