The Startup Founder’s Guide to the Modern Data Stack

Robert Yi

January 17, 2022

You need analytics. And while no-code solutions like Google Analytics will get you quite far, around your Series A, you’ll realize there are far too many questions you can’t answer, and you’ll start Googling “how to set up analytics properly” to jerry-rig a solution. You’ll swim through a sea of enterprise, all-in-one promises and will nearly get pulled in. But in this search, you’ll likely hear hopeful whispers of the promised land — the Modern Data Stack.

Unfortunately, what the stack is is ambiguous, and you’ll likely come across more thought leadership than solutions. For while what the Modern Data Stack peddles is compelling — the right way to set up analytics for your org — the tools are rarely enumerated, largely because the ground is still shifting and no one really knows.

But in the midst of that uncertainty, you still have a job to do, and you need clear direction.

This post is here to help. In what follows, I’ll give you:

a gentle introduction to the modern data stack so you can grok the buzz
suggestions for tools that’ll get you set up for scalable analytics quickly.

Before we get started, a disclaimer: I do have a horse in this race, so if you are hunting for the bias, it’ll be localized to mentions of Hyperquery. But outside of that, this is my measured take of the space after having spent the last year setting up and used each and every tool listed below.

What is the Modern Data Stack, and why should you care?

The data space in the last decade has undergone a massive refactor. In particular, cloud data warehouses overhauled how corporate data is handled by making it cheaper to store, faster to access, and cloud-first. They rewrote the economics of data, and in doing so, they rewrote the behaviors as well — how we think about storage, infrastructure, and data access, giving rise to a number of tools that capitalized on these changes to up-end not only how data is stored, but how it subsequently acquired and used.

At its core, the “modern data stack” is an ambiguous designation for the class of tools that have enabled this refactor in the data space. Why should you care? Because these tools have made it easier, cheaper, and faster than ever to get and use high-quality data. They also tend to be composable, meaning you’ll be able to swap vendors more easily than if you stick with a full end-to-end bundled solution.

What this looks like in practice

While the term “modern data stack” is often used (erroneously, in my opinion) to loosely refer to all modern tools in data, in my estimation there are 4 key processes that have become the immutable, atomic elements of the modern data stack: getting data, storing data, transforming data, and analyzing data.

The core components of the Modern Data Stack. (Image by author)

Let’s talk through each of these steps and the tools that I’ve personally found most promising in these sectors, particularly for startups and scale-ups:

🪓 getting data

ETL: ❤️ Airbyte, Hevo, Portable, Fivetran, Stitch, Meltano
Event tracking: Segment, Rudderstack

Airbyte, Hevo, or Portable for ETL -- either will cost you nothing to set up. If you're looking for a managed solution, Hevo is fantastic, and you'll be able to do your ETL, pre-business logic transformations, and reverse ETL in a single place. If you trust yourself to do some basic infra setup, Airbyte is a great open-source solution. Because it's managing pipelines not storage, an outage here will lead to broken pipelines at worst, but likely no data loss — an acceptable risk at early stages. If you’re looking for something obscure (but still well-maintained), Portable is fantastic.

Get something for event tracking too. We went with Segment, but my major issue here is that ad-blockers recognize Segment event tracking too easily, so you’ll likely miss events on half of your users. Not sure how well lesser-known tools are caught, but may be worth a try. Alternatively, you could track events through API calls, but I’m not an engineer so that may be a terrible long-term idea. There’s probably a better solution here, so I’ll refrain from making any real recommendations.

📦 storing data (warehouses)

Snowflake, BigQuery, Firebolt, Azure, Postgres, Redshift, Data lakes

If you want cheap to start, go BigQuery (especially if you’re a startup, you can get a ton of GCP credits). If you want cost transparency later, go Snowflake. Don’t get Redshift or you’ll hate managing the infra, and only go data lake if you really need things to be so unstructured and not because you’re just being lazy. You can also just read replica your live Postgres instance, but know that you’ll have to migrate everything later and you’ll probably get stuck with Redshift b/c of the Postgres compatibility. I’ve heard great things of the Microsoft stack as well, but I have not given that a try.

🪚 transforming data

❤️ dbt, dataform, Looker

Ever written a SQL transformation to answer a question, then written it again and again and again? Then you make a view that you always forget about. dbt is a place to version control and schedule these transformations in plain SQL so you can DRY better. Don’t worry about researching other options at this stage — there aren’t really any. dbt has the most vibrant community, the most active development, and it’s open source (completely free). Don’t mess up and do something like build a million Looker PDTs instead.

📕 analyzing data

❤️ Hyperquery, Noteable, Deepnote, Hex, Mode (data workspaces)
Metabase, Superset, Tableau, Google Data Studio, Power BI (dashboards)

At early stages, you don’t need separate tools for dashboarding, SQL munging, your report-building, and documentation: you need a single workspace to analyze + visualize data in SQL, then subsequently write up, share, and centralize findings — Hyperquery is perfect for this. It’s a query-enabled doc workspace where you can do all your basic question answering and exploration, without having to deal with a million contextless dashboards or IDE tabs. You also shouldn’t be doing things in Python until you’ve done all you can in SQL, as setting Python as the norm early on will set a poor precedent for your org at scale.

For all of these tools, I’d say actually don’t go for self-hosting an open source solution right now. You don’t want the infra to break and lose all your work.

🪜 [optional] operationalize data (reverse ETL)

Hightouch, Census, Grouparoo

You probably don’t need this off the bat, but reverse ETL lets you pipe your warehouse data into other tools, like Salesforce, Intercom, even Notion, so it can superpower operational work. Hightouch has more connectors in this space. Census is solid as well. Or use both if you’re chaotic evil.

Final comments

Critics will likely lament that I’ve limited the scope of the modern data stack too heavily, but the bolded asterisk behind all of this is that this is intended to get you started, not provide some minimalistic solution to all your data problems. Certainly, think about metrics and monitoring in particular as things scale and break. But you need to have data before it can be wrong. Start with the above — it’ll get you 95% of the way there, and one person can get it all going in a couple days.

‍

Additional references

Modern Data Stack Explained: Use Cases & Components

Tweet @imrobertyi / @hyperquery to say hi.👋
Follow us on LinkedIn. 🙂
To learn more about hyperquery, visit hyperquery.ai.‍