Botany is the scientific study of plants; data science does not have a funky name yet. Maybe datumlogy or datumnomy will appear?

Before thinking about the outcome of data science, maybe I should take the two seconds I think it takes to define it. As for how to define data science, I would quote Dr. Murtaza Haider: data science is something that data scientists do.

Data science is something that data scientists do.
– Dr. Murtaza Haider

But you’d agree that the definition is not complete if we do not focus on the people performing the science: the data scientists. The data scientists are the Marco Polo and Magellan of the 21st century. They explore uncharted territories, sailing on seas and oceans of data, hiking the trails of relations between data points…

A data scientist must be curious, have some humor, and be able to tell a great story. Just like those European explorers returning to Venice or Seville, data scientists must communicate their findings to their sponsor. This is probably where the analogy stops, as the data science team is more likely to report to a CIO (chief information officer) than to the doge of Venice or the king of Spain.

A good data scientist is knowledgeable about a specific field. Coming back to Haider’s experience, he used his skills in civil engineering to gain knowledge about real estate. This is how he discovered patterns in the Toronto housing market.

Technology skills are important, but less so than curiosity, a sense of humor, being argumentative, or even judgmental. Technical skills can be taught; social and soft skills are more difficult to teach.

An applied case of data science and a lot of questions

During my entire career, I have been developing software and managing teams of engineers (and scientists) to build software for other software engineers. Call them frameworks, compilers, libraries, toolkits, or whatever you like, but my teams have developed hundreds (literally) of those. The goal was always the same: enhance the productivity of other software developers (and our own).

I would love to embark on a journey of collecting software engineering data and, of course, analyzing it. You can easily imagine some of the datasets and their data points: how many commits per developer in Git? What is the frequency? Of course, this would only be a first step, as the analysis could look at questions like: is there a relation between the number of commits and the number of bugs? Does the age of the engineer matter? Does the number of story points in Jira correlate with the number of bugs six months later? Does this apply across all programming languages?
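To make that first step concrete, here is a minimal sketch in Python of what counting commits per developer and checking their relation to bug counts could look like. Everything in it is an assumption for illustration: the developer names, the numbers, and the helper names `commits_per_developer` and `pearson` are made up, and real data would have to be extracted from Git and from the bug tracker.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample data: one entry per commit, holding the author's name.
# In practice this list would come from the Git history.
commit_authors = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Hypothetical bug counts attributed to each developer (made-up numbers).
bugs_by_developer = {"alice": 5, "bob": 1, "carol": 2}


def commits_per_developer(authors):
    """Count how many commits each author made."""
    return Counter(authors)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


counts = commits_per_developer(commit_authors)
developers = sorted(counts)  # fixed order so both lists line up
commit_counts = [counts[d] for d in developers]
bug_counts = [bugs_by_developer[d] for d in developers]

r = pearson(commit_counts, bug_counts)
print(f"commits per developer: {dict(counts)}")
print(f"correlation between commits and bugs: {r:.2f}")
```

With three developers, the coefficient means very little, and even on a real dataset a correlation would only be the start of the discussion, not a proof that commits cause bugs.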

The final step would be to discover and establish some fact-based recommendations: should a junior developer be placed on a team with two engineers and three senior engineers? What is the ideal team size? What is the best team topology to avoid prohibitive maintenance costs? Agile defines the ideal team size by the number of pizzas (note that the German and Alsatian version of the ideal team size is the size of the keg of beer), but is it intuition or is it science?

Building up expectations

But I digress. I am often asked what to expect from a data science project. I still think that the ideal first outcome is a good report.

The eleven elements of a good report are:

  1. A cover page reminds the reader of what the report is about with the title, the authors, and most importantly, the date (or, even better, the period covered).
  2. A table of contents provides a map to the document. Documents under five pages can omit the table of contents: this rule is not exclusively for data science…
  3. An executive summary (in the business sector) or an abstract (more common in the academic sector) distills the quintessence of your study.
  4. An introduction provides the context of your report.
  5. The methodology describes the methods and data sources you have used.
  6. The results section brings your findings to the document, in tables, graphics, and models. In the business sector, your results are expected to be more consumable; tables can be in the appendices, like the yearly Deloitte United States Economic Forecast.
  7. The discussion is where your storytelling skills will be useful: you will craft your arguments using the data you collected and shared in the results section.
  8. As with any document, the conclusion is an essential part where the reader will see the points you are making. In a lot of business reports, only the executive summary and the conclusion are read.
  9. References enhance the credibility of your work. Along with acknowledgments and appendices, references are part of good housekeeping.
  10. Acknowledgments are always a good way to flatter the ones who helped you. Ego is a powerful force. More seriously, they are polite and respectful towards the ones who helped you. They also increase your credibility.
  11. Appendices are here to store the details, reference parts, support, and all other elements you think will help your reader (like a glossary).

In my experience, I like those results to be a combination of a deck and a wiki page. The deck (PowerPoint or Keynote) should provide a synthesis, while the wiki page provides the ancillary and background material. In this regard, you can think of such a deck as the data scientist’s version of the Guy Kawasaki 10-slide deck for investors.

As data science matures, the delivery will evolve to provide more tangible (digital) assets like code, a model, a process, a (set of) notebook(s), but this initial report will still be needed.