Data Products Aren't Exempt From Good Engineering Practices

Aug 12, 2022

I got a fascinating email a few weeks ago, the essence of which said “the reason all these data products don’t end up working is most don’t take engineering design into account.” I’ll be honest, my first instinct was to say “nah, you just don’t understand”, but over the past 2 weeks I’ve asked some staff engineers at various companies about their opinion on this issue. I heard comments about architecture, system design, scalability, latency, a lot of other terms ending in “y”. I came away with a few new ideas that pushed me to think about design of data products.

Let’s put aside the idea of persona, user needs, understanding the business problem and getting stakeholder buy-in. While those are hugely important for creating a useful data product, I’ve previously written about the pitfalls of not accounting for these components in a careful, methodical way. Instead, here are a few situations where the design of the product itself (more backend than frontend for now) is the bottleneck for growth and adoption of the product.

Latency of the data sources/pipelines on which the product relies. This was a hot topic of conversation. Namely, that if end users have to wait a long time or think the data isn’t relevant/fresh enough, the likelihood they engage with it later on is much lower than it would be with a low-latency product. Put another way, users have limited patience and have high expectations about the recency of the data in the product. Sure, in some cases it’s fine to have a one day delay. Yet in other cases, even 5 minutes of delay might be a significant problem. Ever waited 3 minutes for a dashboard to load as the tables underlying it refresh? Me too. It isn’t fun.
Scalability and system design aren’t problems initially for most data products, but they create problems down the road. If you’ve never read about system design, that is okay. Coming from the statistics side of the world, I wasn’t familiar with it until I started managing more technical teams in business. Two of the engineers I spoke with said that even the most technical data scientists in their business were able to create prototypes that worked well for tens of users, but broke down when that grew to hundreds or thousands in production-facing environments. As I asked around about this, it became apparent that what is taken for granted in engineering (design docs and architecture review) is less common even in technical data science environments. To be sure, it does happen (especially in companies where it’s a norm), but not paying the cost upfront creates scalability issues down the road.
Lack of understanding of consequences for changing data sources and pipelines. In summation, “people aren’t often sure what’s going to break when we make a change.” While this is part of the architecture and design point I mentioned above, there is also a documentation aspect to this issue. We build a lot of data products that rely on the data as it exists in that moment, but don’t think about how that design would need to change if the data sources, pipelines, or models underlying it changed. I’ve been fortunate to work with some incredibly talented people who think about these things, so I’m certain that I have a bias in thinking it’s a smaller problem than it is. So it might be worth asking yourself “if the tables/sources my product relied on changed tomorrow, do I know what I would do?”
Redundancy of “the builders”. Maybe the most common issue I heard about is the people who built the product moving on to other companies. There’s a notice period and poof, they are gone. Sometimes they leave good documentation in their wake, but at other times they leave open questions about how things actually work. To me, this is the thorniest problem, but it is connected to the issues above. High quality design, documentation, understanding of dependencies on data sources - each of these can play a role in mitigating the risk/loss from having people move on to other roles internally or externally. But honestly, this is still really difficult to overcome. There is often no replacement that understands things as well as the person who wrote the foundations. How we account for this cost of maintenance/development after the builder leaves is a critical decision point.
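
The question from the third point, “what is going to break when we make a change?”, becomes much easier to answer once dependencies are written down somewhere a program can read them. Here is a minimal sketch in Python, assuming a hand-maintained map of which assets read from which. Every table, model and dashboard name below is hypothetical, and in a real setup this map would more likely come from a lineage or orchestration tool than from a dictionary.

```python
from collections import deque

# Hypothetical lineage map: each key is a data asset,
# each value lists the assets that read directly from it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "ml.churn_features"],
    "marts.revenue": ["dashboard.exec_kpis"],
    "ml.churn_features": ["model.churn"],
    "model.churn": ["dashboard.retention"],
}


def impacted_by(changed, downstream=DOWNSTREAM):
    """Return every asset that directly or indirectly reads from `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:  # breadth-first walk over the lineage map
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# What would be touched if staging.orders changed tomorrow?
print(impacted_by("staging.orders"))
```

Even a rough map like this doubles as documentation that outlives the person who built the product, which speaks to the fourth point as well.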

I’m not going to pretend to be an engineer, but I am actively trying to understand the right way to balance data science, engineering and product to create effective, sustainable and maintainable data products. I’m excited to keep learning, and hope you will share your own perspective and insights along the way.

Nick Zervoudis

Great post. It might be because the data products I’ve managed have always been in their first stages (and in orgs that didn’t have a pre-existing data product culture), but another one I’d add to your list is what I’ve come to call “productisation”: often the data product wasn’t just unscalable in the sense that 10x more users might break it (sometimes true, sometimes not), but it also wasn’t generalised enough to suit new users’ use cases.

I’m not talking about new features here - just that the (in my mind) standard practice of parameterising inputs wasn’t something the data scientists who’d built the first or second iteration of the data product were accustomed to, so a lot of logic was hardcoded rather than flexible/extensible/generalised. Sometimes this was an easy fix (replace “2021” with “param_year”), other times it required a big refactor. What I think is noteworthy here is that building it the “productised” way wasn’t much more work than the other (so it’s not your typical case of tech debt in the name of getting an MVP out asap). Instead, the challenge was just that the devs didn’t have the software background that would’ve made that decision obvious (and for one reason or another those who did weren’t listened to by DS or leadership).
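
To make the “2021” versus “param_year” example concrete, here is a small Python sketch of the same calculation written both ways. The row layout, the function names and the figures are made up for illustration; only the idea of swapping a hardcoded year for a parameter comes from the comment above.

```python
from datetime import date

# Made-up sample rows standing in for an orders table.
ORDERS = [
    {"order_date": date(2021, 3, 1), "amount": 100.0},
    {"order_date": date(2021, 11, 15), "amount": 50.0},
    {"order_date": date(2022, 1, 20), "amount": 75.0},
]


def revenue_hardcoded(rows):
    """First-iteration style: the year is baked into the logic."""
    return sum(r["amount"] for r in rows if r["order_date"].year == 2021)


def revenue_for_year(rows, param_year):
    """Productised style: same logic, with the year passed in by the caller."""
    return sum(r["amount"] for r in rows if r["order_date"].year == param_year)


print(revenue_hardcoded(ORDERS))       # 2021 only, forever
print(revenue_for_year(ORDERS, 2021))  # same answer as the hardcoded version
print(revenue_for_year(ORDERS, 2022))  # a new use case with no code change
```

The two functions are nearly the same length, which is the point: the parameterised version costs almost nothing extra when written that way from the start.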


From Data to Product
