Dear Analyst #129: How to scale self-serve analytics tools to thousands of users at Datadog with Jean-Mathieu Saponaro

When your organization is small, a centralized data team can take care of all the internal data tooling, reporting, and requests for all departments. As the team grows from 100 to thousands of people, a centralized data team simply cannot handle the number of requests and doesn’t have the domain knowledge of all the departments. Jean-Mathieu Saponaro (JM) has experienced this transformation at Datadog. He first joined Datadog in 2015 as a research engineer. He was part of the inaugural data analytics team, which now supports 6,000+ employees. In this episode, he discusses scaling a self-serve analytics tool, moving from ETL to ELT data pipelines, and structuring the data team in a hybrid data mesh model.

Building a data catalog for data discovery

According to JM, creating a data catalog is not that hard (when your organization is small). I’ve seen data catalogs done in a shared Google Doc where everyone knows what all the tables and columns mean. When the data warehouse grows to hundreds of tables, that’s when you’ll need a proper data cataloging solution to store all the metadata about your data assets. This is when you move to something like Excel (just kidding)! In all seriousness, a shared Google Sheet isn’t a terrible solution if your data warehouse isn’t that large and the data structure isn’t very complicated.

JM discussed a few strategies that helped them scale their internal data discovery tool:

Strong naming conventions

A pretty common pattern for data warehouses containing “business” data is using dim and fact tables. All table names in the data warehouse have to be prefixed with dim or fact so that it’s clear what data is stored in the table. There are also consistent naming conventions for the properties in the table. Finally, the “display” name for the table should be closely related to the actual table name itself. For instance, if the table is dim_customers, the display name for the table would just be customers.
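
A convention like this is easy to check automatically. Here is a minimal sketch of what that could look like, not Datadog’s actual implementation. The function names is_valid_table_name and display_name are hypothetical, and the rule that names are lowercase with an underscore after the prefix is an assumption based on the dim_customers example.

```python
import re

# Assumed convention: lowercase snake_case names prefixed with "dim_" or "fact_".
TABLE_NAME_PATTERN = re.compile(r"^(dim|fact)_[a-z][a-z0-9_]*$")


def is_valid_table_name(table_name: str) -> bool:
    """Return True if the name follows the assumed dim_/fact_ convention."""
    return TABLE_NAME_PATTERN.fullmatch(table_name) is not None


def display_name(table_name: str) -> str:
    """Derive the display name by dropping the dim_/fact_ prefix."""
    if not is_valid_table_name(table_name):
        raise ValueError(f"{table_name!r} does not follow the naming convention")
    # Split only on the first underscore so the rest of the name stays intact.
    return table_name.split("_", 1)[1]
```

A check like this could run whenever someone proposes a new table, so a name like customers gets flagged before it reaches the warehouse.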

Snowflake schema

Another common pattern is using a snowflake schema to structure the relationship between tables. This structure makes it easy to do business intelligence (e.g. reports in Excel) later on.
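
To make the structure concrete, here is a small sketch that uses Python’s built-in sqlite3 module as a stand-in for a warehouse. All the table and column names (fact_orders, dim_customers, dim_regions) and the sample rows are made up for illustration. The idea is that a fact table points at a dimension table, and that dimension is itself normalized into a further dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sub-dimension: customers are normalized out to regions (the "snowflake" part).
    CREATE TABLE dim_regions (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_customers (
        customer_id INTEGER PRIMARY KEY,
        customer_name TEXT,
        region_id INTEGER REFERENCES dim_regions(region_id)
    );
    -- Fact table: one row per order, keyed to the customer dimension.
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customers(customer_id),
        amount REAL
    );
    INSERT INTO dim_regions VALUES (1, 'EMEA'), (2, 'AMER');
    INSERT INTO dim_customers VALUES (10, 'Acme', 1), (11, 'Globex', 2), (12, 'Initech', 2);
    INSERT INTO fact_orders VALUES (100, 10, 50.0), (101, 11, 20.0), (102, 12, 30.0), (103, 11, 5.0);
    """
)

# A typical BI query: walk fact -> dimension -> sub-dimension and aggregate.
revenue_by_region = conn.execute(
    """
    SELECT r.region_name, SUM(f.amount) AS revenue
    FROM fact_orders AS f
    JOIN dim_customers AS c ON c.customer_id = f.customer_id
    JOIN dim_regions AS r ON r.region_id = c.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
    """
).fetchall()
print(revenue_by_region)
```

Because every join follows a key from one table to the next, a report like revenue by region is a predictable chain of joins rather than a guessing game.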

Customizing the data discovery experience

Datadog switched BI tools a few years ago so that the tool could be used by technical and non-technical users alike. They ended up going with Metabase because it didn’t feel as “advanced” as Tableau.

In terms of their data catalog, one of the key decisions going into picking a tool was being able to quickly answer the question: where do I start? Where do I go to learn about our customer data? Product data? This is where the discovery experience is important. JM said the entry point to their catalog is still just a list of 800+ tables but they are working on a custom home page.

JM’s team thought about the classic build vs. buy decision for their data cataloging tool. Given the size of their organization, they went with building the tool internally. If the number of users was smaller, it would’ve been fine to go with an off-the-shelf SaaS tool. JM’s team set a goal to build the tool in a few months and it took them 3.5 months exactly. Building the tool internally also meant they could design and re-use custom UI components. This resulted in a consistent user experience for every step of the data discovery process.

Should you migrate data pipelines from ETL to ELT?

When JM joined Datadog, he found that all the ETL data pipelines were done in Spark and Scala. If you were to ask me a year ago what “ETL,” “data pipeline,” and tools like Spark and Scala mean, I would’ve thought you were speaking a different language. But once you hear the same terms over and over again from various analysts and data engineers, you’ll start to understand how these different data tools and architectures work together. If you are new to Apache Spark, this is a quick intro video that I found useful:

As Datadog grew, so did the number of data pipelines. JM saw the number of data pipelines grow from 50 to hundreds and Spark didn’t make sense as a data processing framework anymore. Every time you wanted to add a new field to a table or change a workflow, it required an engineer to submit a pull request and deploy the change to the data warehouse.

Eventually tools like dbt came onto the scene, which removed the need to rely on engineers to make changes to the data pipeline. Analysts who are not on the core data engineering team could develop and test data pipelines by writing SQL. One might say dbt is like the no-code/low-code data processing framework democratizing who can create data pipelines. As the data team scaled, their data pipelines migrated from ETL to its cousin ELT. The team uses Airbyte for the “extraction” step and dbt does all the data cataloging.
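
Here is a rough sketch of the ELT idea, again using Python’s built-in sqlite3 module as a stand-in warehouse. It does not use Airbyte or dbt and is not how Datadog’s pipelines are written. The raw_signups and fact_signups tables and the sample records are hypothetical. The point is the order of operations: land the raw data first, then transform it with plain SQL inside the warehouse.

```python
import sqlite3

# Hypothetical raw records, as an extraction tool might deliver them.
raw_signups = [
    ("2024-01-03", " Alice@Example.com ", "pro"),
    ("2024-01-03", "bob@example.com", "free"),
    ("2024-01-04", "carol@example.com", "PRO"),
]

warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the data untouched in a raw table.
warehouse.execute("CREATE TABLE raw_signups (signup_date TEXT, email TEXT, plan TEXT)")
warehouse.executemany("INSERT INTO raw_signups VALUES (?, ?, ?)", raw_signups)

# Transform: plain SQL run inside the warehouse, the kind of step an analyst could own.
warehouse.execute(
    """
    CREATE TABLE fact_signups AS
    SELECT signup_date, LOWER(TRIM(email)) AS email, LOWER(plan) AS plan
    FROM raw_signups
    """
)

pro_signups = warehouse.execute(
    "SELECT email FROM fact_signups WHERE plan = 'pro' ORDER BY email"
).fetchall()
print(pro_signups)
```

Since the cleanup lives in a SQL statement rather than in a Spark job, changing it means editing a query instead of asking an engineer to redeploy a pipeline.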

Since dbt opened up the data pipeline development process to more people outside the data team, it became even more important to enforce best practices for naming conventions for tables and fields.

Pros and cons of data meshing

Another term I didn’t learn about until a few years ago: data mesh. A definition from AWS:

A data mesh is an architectural framework that solves advanced data security challenges through distributed, decentralized ownership. Organizations have multiple data sources from different lines of business that must be integrated for analytics. A data mesh architecture effectively unites the disparate data sources and links them together through centrally managed data sharing and governance guidelines.

When JM first started working at Datadog, there was a central data team that did everything for every department at Datadog. The data team did the extraction, ingestion, dashboarding, and even recruiting. This is totally reasonable when you’re a small organization.

As the organization grew to thousands of people, it became harder for this centralized data team to cater to all the departments (who were also growing in size and complexity). The data team simply didn’t have the domain knowledge and expertise of these business units.

If Datadog were to go full-on data mesh, silos would form in each department. This is one of those situations where the data mesh sounds good in theory, but in practice, Datadog hired and structured their data teams to meet the needs of their data consumers. Having each team manage their own data extraction and ingestion would lead to a big mess, according to JM.

Start with one team to prove the semi-data mesh model

JM’s team started with the recruiting team to prove this data mesh hybrid model would work. The recruiting team started hiring its own data analysts who understood the business priorities of the team. The analysts would help clean and process the data. An example of domain-specific data for the recruiting team might be engineering interview data. The analysts helped make sure that interviews were properly distributed among engineers so that no engineer was overloaded.

To see JM’s journey in more detail, take a look at this talk he gave in 2023 at the Compass Tech Summit:

Find something you like and have fun

People have given all types of great advice on this podcast in terms of how to switch to a career in data. Sometimes the advice goes beyond data and applies to life in general. Perhaps this is getting too touchy-feely but JM’s advice for aspiring data professionals is to “find a domain that you like and have fun.” Life is finite, after all. This is one of my favorite visualizations when I need to remind myself about the finiteness of life (read the full blog post from Wait But Why):

JM also talked about the ability to be flexible once you’re in the data world because tooling changes a lot. The definitions of roles change a lot, and new roles pop up every year that redefine what working in “data” means. Case in point is the analytics engineer role. JM’s advice is that you should feel empowered to follow wherever the industry may go.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!
