Data predictions for 2019, without the need to read in the stars

Last November, Veracity Solutions asked me to present at their annual Converge conference on what to expect in data management for 2019. I prepared a bubble-based chart, similar to a mind map, to illustrate my talk.

For this article, I reused the bubble chart, but added the explanations I gave during my talk. You do not have to agree with those predictions, and I am happy to debate!

I divided the data world into four continents: data shapes, usages, governance, and, my favorite, people.

Eight very hot trends for data engineering and data science in 2019, available in PDF on demand

Here is the legend for the illustration; explanations follow.

  • Grey indicates declining.
  • Green means mature, growing.
  • Orange means under scrutiny, with a certain growth and interest.
  • Red indicates the hottest areas in 2019.

Data shapes

Shapes cover all the forms data takes: relational data, files, streamed data… The data shape is the way you store the data.

What stays

Relational data stores have been mature for a long time and continue to be. Document stores are also more and more mature as their use cases become more common. Hybrid solutions, like the ones built into IBM Informix or PostgreSQL, have been out long enough to be mature as well.

What is of interest

Streaming! Streaming! Streaming! In a world where everybody wants more and more real-time operations, it seems that the only way to achieve this is via streaming. The ecosystem around streaming seems to be maturing significantly: you can see it in interoperability issues between versions (for example, Apache Kafka v0.8 and v0.10), the growing adoption of Kinesis on the AWS platform, and more. The maturing of structured streaming in Apache Spark confirms this trend, allowing analytics on the fly as well.
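
To make "analytics on the fly" a bit more concrete, here is a minimal sketch in plain Python. It is not the Kafka, Kinesis, or Spark API: the sensor_readings source and its values are made up for illustration. It only shows the idea of updating a result as each event arrives, without storing the whole stream.

```python
def running_average(events):
    """Yield the average seen so far after each event, without storing the stream."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    """Hypothetical source; a real one would read from a broker such as Kafka or Kinesis."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value


# The average is refreshed event by event, not once the data is "complete".
for average in running_average(sensor_readings()):
    print(average)  # prints 20.0, 21.0, 21.0, 22.0 (one per line)
```

A streaming engine adds what this sketch leaves out: distribution, fault tolerance, and windowing over time.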

Who thought of that? Files are coming back. Of course, not CSV (which will still be there and remain a pain to parse), XML, and JSON, but those “newer” file formats like Parquet, ORC, and Avro. They are not that new: Parquet was introduced in 2013 by Cloudera (and Twitter), while ORC (optimized row columnar) dates back to just a month before Parquet and was launched by Hortonworks. On its side, Avro found its sweet spot with Kafka, backed by Confluent. With the merger of Hortonworks and Cloudera, who knows how this file format scene will shake out. Of course, AWS Athena, which queries files in place in AWS S3, also brings an interesting new approach.

What is hot?

1) Blockchain (and please, let’s not talk cryptocurrencies) makes it possible to build immutable storage, like WORM (write once, read many) drives did in the 80s/90s, but this time, it’s all software. More usages still have to be discovered, but the technology, on the storage continent, is hot.
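
To illustrate why such storage is hard to alter, here is a minimal sketch of a hash-linked, append-only ledger in Python. It is a deliberate simplification and an assumption on my side: a real blockchain adds distribution and consensus between nodes, which are left out here, and the function names and sample events are made up for the example.

```python
import hashlib
import json


def block_hash(index, payload, previous_hash):
    """Hash the block content together with the hash of the previous block."""
    content = json.dumps(
        {"index": index, "payload": payload, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block linked to the last one; existing blocks are never rewritten."""
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "previous_hash": previous_hash,
            "hash": block_hash(index, payload, previous_hash),
        }
    )


def is_valid(chain):
    """Recompute every hash and check every link; any edit breaks the chain."""
    previous_hash = "0" * 64
    for block in chain:
        if block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["payload"], previous_hash):
            return False
        previous_hash = block["hash"]
    return True


ledger = []
append_block(ledger, {"event": "record created"})
append_block(ledger, {"event": "record read"})
print(is_valid(ledger))  # True

# Tampering with an old block no longer matches its stored hash.
ledger[0]["payload"]["event"] = "record deleted"
print(is_valid(ledger))  # False
```

The data is not physically write-once as on a WORM drive; the point is that any rewrite becomes detectable.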

What’s going away?

I am pretty sure I can start a controversy here, but, for me, both EDWs (enterprise data warehouses) and data lakes are going away. The idea of making data available for more analytics, including ML (machine learning) and AI (artificial intelligence), is not going away. Data warehouses were too complex, and the promise of data lakes is too simple. Something else must come along, maybe less tied to Hadoop and more aware of governance? Maybe the data (in the lake, swamp, or warehouse) will be more transient, taking the shape dictated by the consumer? (Note that I worked on several projects along those lines.)

Usages

What can you do with the data you gather? This is the definition of usage.

What stays

Transactional data processing is here to stay forever. Does anyone need an explanation? Those bank transactions, cashier (or cashier-less) interactions at your grocery stores, and so many more are here to stay.

What is of interest

What can I do with my data? Analytics! This field is constantly growing and ML (machine learning) brings a new dimension of possible exploitation of the data.

What is hot?

2) AI (artificial intelligence) is promising. I think it will deliver in 2019. Of course, general AI, with robots and human-like behavior, is not for next year, but narrow, isolated, and focused AI will bring new business insights. Data scientists will be better guided through their design and modeling by assistants, like Clippy (the paperclip) in Microsoft Word. I suspect that the ability to suggest data correlations or assist in building models will come to data science platforms (like IBM Watson Studio, Databricks…).

3) In the past years, we lost track a little of unstructured data, or so it seems. Now, one of AI’s hot fields of practice is bots. Yep, we’re, once more, not too far from Clippy. Those bots are usually visible on webpages, trying to provide basic help before you realize you really need a human. They have been around for some time now, but they are getting better. Definitely something to watch, and if you are running an ecommerce website, look closer.

4) Predictive models are also becoming hotter and hotter. If you are not familiar with models, see them as complex functions: you send data in and get intelligible, predicted data back (hence the name predictive model). Models use algorithms, such as linear regression. Often the model is the result of the work of data scientists. Those models will need management (in data governance), better control of hyperparameters (also called the magic behind AI), and a better understanding of the implicit bias they carry.
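
To show what "a function where you send data in and get predicted data back" can look like, here is a minimal sketch of a linear regression in plain Python, using ordinary least squares on a single variable. The months and sales figures are made up for the example, and real projects would rely on a data science library rather than this hand-written version.

```python
def fit_linear_model(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares; return the model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x

    def predict(x):
        """The model itself: data goes in, a prediction comes out."""
        return intercept + slope * x

    return predict


# Made-up training data: sales observed over five months.
months = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]

model = fit_linear_model(months, sales)
print(model(6))  # 22.0, the predicted sales for month 6
```

Note that the prediction is only as good as the data used for fitting, which is where bias creeps in.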

5) Notebooks allow interactive work on data, along with note taking, explanations, graphs, embedded apps, and more. Open source products include Apache Zeppelin and Jupyter. Between Databricks, IBM, Microsoft, and others, there is already active competition for hosted and collaborative notebook solutions. Expect more in 2019. Notebooks will definitely ease the use of AI models.

6) The usage around blockchains will intensify. What can we do with a secure ledger? Having worked in healthcare for most of the year, I can see wonderful use cases for blockchain in securing data trails, while still being compliant with HIPAA (Health Insurance Portability and Accountability Act) and the coming CCPA (California Consumer Privacy Act). Blockchain developer shows the strongest growth in LinkedIn’s recently published list of emerging jobs for 2018.

What’s going away?

Big data is going away. The term does not mean anything anymore. It has been tarnished by too much marketing and false promises. Be careful, in 2020, it could sadly be AI.

Self-service has never delivered on the promise it was selling: allowing anyone in an organization to play with the data. This has increasingly been the work of data scientists, with preparation from data stewards (data curators).

Governance

New worldwide requirements around data and more scandals about data breaches show the necessity for more control, or governance, around data.

What is of interest

Data cataloging will continue to attract more interest, probably thanks to more automated solutions, more interoperability between solutions, and more automated discovery. There is a considerable need here and current solutions seem complex, to say the least. Maybe projects like ODPi will help (unless they are too Hadoop-centric)? Cataloging needs to be extended to predictive models and their parameters.

Lineage is the tracking of where and how data is transformed, so that one can use it as close to the source as possible, audit its modifications, and more. Solutions exist; they will continue their development in 2019, hopefully with less complexity and more flexibility.
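
As an illustration only, here is a minimal Python sketch of what recording lineage can mean: every transformation leaves a trace of its source, its operation, and its result. The Dataset class, the transform helper, and the sample rows are hypothetical and do not come from any lineage product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)  # ordered trail of transformations


def transform(dataset, new_name, operation, func):
    """Apply func to the rows and record where the result came from."""
    step = {"source": dataset.name, "operation": operation, "result": new_name}
    return Dataset(new_name, func(dataset.rows), dataset.lineage + [step])


# Made-up data: visits where one record is missing the age.
raw = Dataset("raw_visits", [{"age": 42}, {"age": None}, {"age": 37}])
clean = transform(
    raw,
    "clean_visits",
    "drop rows without age",
    lambda rows: [r for r in rows if r["age"] is not None],
)
ages = transform(
    clean,
    "ages",
    "keep only the age column",
    lambda rows: [r["age"] for r in rows],
)

# The trail can be audited back to the source.
for step in ages.lineage:
    print(f'{step["source"]} -> {step["result"]}: {step["operation"]}')
```

An auditor reading the trail of ages can walk back to raw_visits and see each modification on the way.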

Automated data quality is a bit of a unicorn. Everybody wants it, nobody has it. I am pretty sure that AI will help in this field. Maybe predictive models will help/rescue data quality?

What is hot?

7) Data governance still needs to get out of its niche and become more mainstream. The benefits of data governance will appear more clearly to upper management and, thanks to newer tools, more interoperability, and more methodology, incremental approaches will generate more success stories.

People

People is my favorite continent in this list of predictions: without people and organization, data is stale.

What I have a hard time with

CDOs (chief data officers) have been among the hottest jobs in data for years, and yet, they are still not (very) popular (at all). Maybe the CDO will never really exist and it will be a responsibility of the CEO? Data is so important anyway that it might just fall on their plate.

What stays

Data stewards are confirming their role as a cornerstone of data in 2019. We could see a shift in their name to data curator, which is really what they are doing now.

What is of interest

I barely remember the last time I met a database administrator, the good ol’ DBA. Data engineers have taken over, and the next iteration might simply be DataOps engineers. More and more responsibilities are given to data engineers, such as compliance, staging, productionizing, and distributing the data in more shapes, along with concerns about security (not only access, but also more and more regulations around privacy).

What is hot?

8) The hottest job remains the data scientist. Yeah, there is a surge in hotness for blockchain developers, but that is growth in percentage, not in absolute numbers. Everybody wants a data scientist in their data team, but can everybody provide them with the tools and clean data to work with? Data scientists may become less hot when tools are able to better guide business analysts, but, in the meantime, we probably need more data scientists to build the tools…

Debate

These predictions are based on many discussions, conferences, articles, and feedback I have had/attended/given/read/written during the year. I may be off for some, but I do not think I will be completely off.

Nevertheless, feel free to start the discussion and let’s meet in a year to see how accurate these predictions were.