File Ingestion in Apache Spark

The road to ingestion is paved with good intentions, as you will discover in chapter 7 of Spark with Java.

In a typical Big Data analytics scenario, you will probably be tempted to ingest files. You know, those pesky CSV files where the comma is sometimes a semicolon or a tab or a pipe.

Well, Spark is great about that too.
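
To make that concrete, here is a minimal sketch (not taken from the book's labs) of reading a semicolon-separated file with Spark's built-in CSV reader. The class name, the helper method, and the file path data/books.csv are assumptions made for illustration; the header, sep, and inferSchema options belong to Spark's CSV data source.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DelimitedCsvIngestion {

  // Hypothetical helper: reads a delimited text file into a dataframe,
  // using `sep` as the field separator instead of the default comma.
  static Dataset<Row> readDelimited(SparkSession spark, String path, String sep) {
    return spark.read()
        .format("csv")
        .option("header", "true")      // first line holds the column names
        .option("sep", sep)            // ";", "\t", "|", ...
        .option("inferSchema", "true") // let Spark guess the column types
        .load(path);
  }

  public static void main(String[] args) {
    // Local session for experimentation; a cluster would use another master.
    SparkSession spark = SparkSession.builder()
        .appName("Delimited CSV ingestion")
        .master("local[*]")
        .getOrCreate();

    // data/books.csv is an assumed path, not a file shipped with the book.
    Dataset<Row> df = readDelimited(spark, "data/books.csv", ";");
    df.show(5);
    df.printSchema();

    spark.stop();
  }
}
```

Passing "\t" or "|" as the separator should be enough for tab- or pipe-delimited files; the rest of the call stays the same.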

Chapter 7 of Spark with Java covers file ingestion (JSON, XML, text, and even those pesky CSV files) in extensive detail and walks through the various options. Thanks to this chapter, you should not need any preparation outside of Apache Spark.

Chapter 7 kicks off the more pragmatic, hands-on chapters, so if you were bored with theory, here comes the fun.

Chapter 8 details how to ingest from databases (both RDBMS and Elasticsearch). It came back from my editors, and I will update it soon so it can join the MEAP in a few weeks.

I just finished chapter 9’s first draft, which focuses on building your own data sources. It is a more complicated chapter, but the derived value should be great if you need to import data from weird sources (as we all do, right?).

All those chapters are linked to appendix I, a useful ingestion reference that lists all the options, by version, for the XML, JSON, CSV, and text parsers (and more!).

Finally, chapter 7’s source code is available on GitHub. Go love it and fork it. There are lots of labs there.

I hope you continue to enjoy the book and share discounts like Manning’s Deal of the Day for April 28th, 2018, celebrating the new chapter with half off Spark with Java. Use code dotd042818au at https://goo.gl/4chGxN. I knew that writing a book would not make me rich…

Let's be social

jgperrin.substack

/in/jgperrin

/jgperrin
