Process is critical to scaling an org, and we’ve gotten the processes for analytics wrong.
A critical aspect of scaling organizations is process. Process allows you to normalize and incorporate best practices to ensure things work smoothly and scalably, even when no one is minding the controls. But process is frequently overlooked in analytics organizations, and too often we default to the same processes that engineering abides by: using git and related patterns to share and store analytics work.
For some parts of data science / analytics, such a transference of engineering behaviors is suitable: analytics infra, analytics engineering, deployment of machine learning models, libraries — all of these workflows are inherently code-based and benefit from the rigorous test + PR culture familiar in engineering organizations. But for the remainder of analytics work — the kind that occurs daily in SQL IDEs and Jupyter notebooks — the fit is poor. 90% of our job, by the nature of our work as analysts and data scientists, is exploratory. And here, unfortunately, engineering practices not only fall short, but can be detrimental to the org. Why?
Blindly enforcing version control and code review as a gatekeeper for sharing exploratory work leads to unshared exploratory work.
So I’d argue we need a different process. And to understand what that process should be, we first need to establish the objectives of analytics organizations. In engineering, maintainability, reliability, and scalability are the objectives that underpin practices like version control, code reviews, code coverage, and validation testing. In analytics work, the underlying objectives are necessarily different: reliability, maintainability, and scalability still matter, but they manifest differently. Let’s ditch the emperor’s clothes and replace these concepts with what we really want: discoverability and reproducibility. In other words, we need to put the “science” back into data science (and analytics).
With these concepts in mind, I’ll discuss the following in this article:
- Why discoverability and reproducibility are of the utmost importance in analytics and data science organizations.
- How to orient processes towards those ends.
The First Pillar: Discoverability
Why we need discoverability: Code is inherently discoverable. Analytics work is not.
This is an oversimplified engineering code base, where arrows indicate imports.
Anyone who’s spent any time in a modern IDE knows that traversing this graph is easy. Every modern IDE has “jump to” functionality that takes you straight to an object’s references. Start in `def pet()`, and you can easily jump to the definition of `class Llama`, then follow the breadcrumb trail all the way back to the parent class.
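As a toy sketch of that breadcrumb trail (the class and function names here are hypothetical, chosen only to mirror the example above):

```python
# Hypothetical code base mirroring the example above: in an IDE, each
# of these references is one "jump to definition" away.

class Animal:
    """The parent class -- the end of the breadcrumb trail."""
    def __init__(self, name: str):
        self.name = name


class Llama(Animal):
    """Jumping from pet() lands here; jumping again lands on Animal."""
    def greet(self) -> str:
        return f"{self.name} hums softly"


def pet(llama: Llama) -> str:
    # Start here: "jump to definition" on Llama goes to the class,
    # and from there to its parent class, Animal.
    return llama.greet()


print(pet(Llama("Dalai")))
```

Each hop in the chain is mechanical, which is exactly why engineers can navigate even enormous code bases quickly.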
Your analytics code base, on the other hand, looks a bit different:
There is still interconnectivity (through the data itself), but these connections are not discoverable through your IDE (hence the dotted lines). This makes it substantially harder to see, say, table references the way engineers can see function references. So what’s the solution?
How to get discoverability: use a low-friction, networked system to do (or at least document) query work.
Before diving into that, though, let’s talk about the go-to solution for analytics organizations appropriating engineering best practices: git. Many organizations turn to git to track any sort of query that drives an insight. This is reasonable, and it allows you to search past work by, say, table name, but in my experience, git presents a few inconveniences:
- Validation is cumbersome/blind. For SQL, you have to copy and paste and re-execute queries to validate work in a repo. It’s a small inconvenience, but because results aren’t generally tracked in git, it’s difficult to follow the flow of logic from query to query without re-running queries yourself. And for code-based notebooks, this gets even worse — you need to replicate virtual environments on top of pulling in requisite data. Unlike in engineering, outputs in data work are nearly as important as the code itself, so if you’re just looking at the code, you’re missing half the story. Git simply does not lend itself to a review process that traces the logic.
- Work often lacks sufficient context. In particular, for raw SQL, git provides little context, and queries are only as good as the context in which they were written. Sure you can add a README to go along with your query, but what about all the assumptions and snippets you wrote to validate those assumptions along the way? Did you get those in there? And if so, are you comfortable version-controlling them all?
- Work in git is searchable, but not qualifiable. Sifting through past work is crude: git provides no indication of the robustness or re-usability of a query or analysis — only that it’s been written before.
- Code reviews are great in an ideal world, but reviews cannot keep up with the pace of analytics. Code should still be checked, but the speed of business decisions often necessitates that this happen after the fact and/or on a case-by-case basis, rather than through mandatory peer reviews. The usefulness of insights often decays exponentially once they’re uncovered, so blocking the business with a formal review cycle is not a good idea. Sure, you can use git without code reviews, but there’s always a latent pressure to fall back into a formal review process — and it’s probably a bad idea to get in the habit of merging code without reviews anyway. 😉
My controversial conclusion from the above points:
Git is the wrong place for sharing most analytics work.
At the end of the day, these inconveniences are just that — inconveniences — but, perhaps surprisingly, they are often enough to cause people to NOT share their work. I’ve seen this happen first-hand at Airbnb, Wayfair, and handfuls of other companies that attempted to enforce the same. Some amount of very polished work gets shared (e.g. through Airbnb’s knowledge-repo), but this captures only ~1% of the work done. The remainder lives in tabs in SQL IDEs, local files, and local Jupyter notebooks, and thus stands to be duplicated unnecessarily, over and over, by different analysts and data scientists.
The solution? Make the work discoverable at all costs. A reasonable way to do this is to share your work in a non-git-backed place. We’ve built hyperquery.ai expressly for this sort of thing, but I’ve seen reasonable success with more general note-taking solutions like Notion, Confluence, or Library (h/t Brittany Bennett) being used, if you don’t mind keeping your IDE and your query-sharing environment separate. For Jupyter/R markdown, git may unfortunately still be the least of all evils. But this brings me to my second objective: reproducibility.
The Second Pillar: Reproducibility
Reproducibility is important because analytics is science.
Analytics / data science is science. And as such, it needs to be reproducible. Any insight you produce is two things: the insight itself and the steps you took to obtain that insight.
Reproducibility is particularly important because I’ve told you not to peer review in a hosted git platform. While it’s important to avoid blocking insight consumption with a mandatory peer review process, it is still, nonetheless, very important to get critical, decision-driving pieces of work checked by colleagues, and prioritizing reproducibility enables this. Moreover, reproducibility fosters careful validation of results much better than code reviews, which, as I’ve mentioned, are generally opaque when it comes to data outputs.
Reproducibility is most easily attained with SQL, not Python/R.
Why? Because SQL reduces the friction to reproduce work. As with discoverability, reducing friction here is important, or no one will go through the pains of faithfully reproducing your efforts and validating your conclusions. As the saying goes: open a 5-line pull request and you’ll get 50 requested changes; open a 500-line one and you’ll get an LGTM. Requiring users to set up virtual environments and wrestle with the hidden state rife in code-based notebooks means that others will simply not try to reproduce your work. And there’s nothing worse than pushing the business in the wrong direction because you didn’t get your work double-checked.
So now we arrive at my second controversial statement of this article:
Try to use SQL where possible, not Python/R.
Python/R aficionados: before you close your browser and block me forever, hear me out. I love Python and R, and was one of the earliest users of IPython and Jupyter (I spent the entirety of grad school studying rivers in Python). I’ve even released several open-source libraries in Python.
That said, you have to admit: Jupyter notebooks and R Markdown are not the best place to store reproducible work. Hidden states, lack of easy executability, requirements files, and cached data provide numerous points of failure. At the end of the day, data in motion breaks, and SQL use minimizes the motion.
Of course, let’s say you don’t buy this — you are the master of making perfectly reproducible notebooks. This is entirely possible, if you have an odbc-based library that pulls data in directly from your warehouse, and if you make sure to always re-execute all of your cells from the top before sharing your code, and if you have a good system for sharing these notebooks without accidentally storing massive amounts of data (oops, you executed `df` and listed out several hundred MBs of data in plaintext). But even then, if you get all those things right, there’s a level of opaqueness (inaccessibility, particularly for stakeholders) to Jupyter notebooks that will inevitably degrade trust in your analyses. SQL snippets, on the other hand, are self-contained and work out-of-the-box. Stakeholders can (and will) execute SQL if they’re in doubt, and reproducibility comes for free as long as your organization has reasonable data hygiene.
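To make the contrast concrete, here’s a minimal sketch of what “self-contained” means for a SQL snippet, using Python’s built-in sqlite3 as a stand-in for a warehouse connection (the `bookings` table and its contents are invented for illustration):

```python
import sqlite3

# A self-contained SQL snippet needs only a warehouse connection to
# reproduce: no virtualenv, no cached notebook state. sqlite3 stands
# in for the warehouse here; the bookings table is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (city TEXT, nights INTEGER);
    INSERT INTO bookings VALUES ('Lisbon', 3), ('Lisbon', 2), ('Oslo', 4);
""")

# The entire analysis is this one statement.
query = """
    SELECT city, SUM(nights) AS total_nights
    FROM bookings
    GROUP BY city
    ORDER BY total_nights DESC
"""
for city, total_nights in conn.execute(query):
    print(city, total_nights)
```

Anyone in doubt can paste the statement into their own SQL IDE and re-run it against the real table — that is the whole reproduction procedure.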
Processes to implement: (1) share knowledge and (2) SQL where possible
This may already be clear to you at this point in the article, but in my opinion, there are only two necessary processes that, when implemented, will drive discoverability and reproducibility forward by leaps and bounds. I’ll put them in a superfluous table to help you internalize them:

| Objective | Process to implement |
| --- | --- |
| Discoverability | Share all query work in a low-friction, networked system |
| Reproducibility | Use SQL where possible |
These suggestions are hopefully not as drastic as my earlier points might have primed you to expect:
- For discoverability, I’m not even saying to get rid of any git-based version-controlled analytics you have — just set up something where queries and ad-hoc work can be shared and discovered more readily. If you’re asked to do something and your org is big enough, guaranteed someone has done it before.
- For reproducibility, urge analysts and DS to use SQL where possible, rather than Python, R, or Excel. It’s obviously fine to dive into pandas or the tidyverse when you’re working on an obscure research problem, but make that the exception, not the rule. When your DS are running basic aggregations in pandas, they are inherently sacrificing visibility and reproducibility for whatever benefit they perceive (in my case, it was often simply familiarity).
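As an illustration of the trade-off (the `sales` table is hypothetical, and the pandas one-liner in the comment is just the notebook-bound equivalent), here is such a basic aggregation written as shareable SQL, again using sqlite3 as a stand-in for a warehouse:

```python
import sqlite3

# The same basic aggregation you might otherwise bury in a notebook as
# df.groupby("team")["revenue"].sum() -- written instead as SQL that
# anyone with warehouse access can re-run. sqlite3 and the sales table
# are stand-ins for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (team TEXT, revenue REAL);
    INSERT INTO sales VALUES ('east', 100.0), ('west', 250.0), ('east', 50.0);
""")

rows = conn.execute("""
    SELECT team, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY team
    ORDER BY team
""").fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The SQL version carries its own context — the table it reads and the aggregation it performs are plainly visible to anyone who finds it later.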
The changes I’m suggesting are small. Explicitly state “SQL where possible” in your analytics onboarding docs, and set up a SQL-writing environment that enables knowledge sharing (see hyperquery.ai). But beyond that, there’s not much else you need to change.
Give it a try, and let me know how these adjustments work out for you (or if you’ve had success driving discoverability and reproducibility forward by other means). These sorts of small changes can mean the difference between a frustrated, overworked data team and one that streamlines away much of that workload through re-usable, freely shared SQL.
Tweet @imrobertyi / @hyperquery to say hi.👋
Follow us on LinkedIn. 🙂
To learn more about Hyperquery (and to sign up for our private beta), visit Hyperquery.ai.
(And if you’re interested in helping us build, we’re hiring — check out our open positions.)