Golden Datasets: An Example of Evaluating Data Product Investments
“So the assumptions we’ve been operating off of to understand user engagement were different in all three data sources?” “Yes…” So began one of the most painful conversations I’ve had since I joined the data world. When datasets work, they feel great. When they conflict, chaos reigns. While that moment struck me as unusual then, I’ve learned that datasets are both an incredible data asset and also a major risk. When we don’t treat datasets as a product, with an owner, subject matter expert and defined view of customers, we open ourselves up to confusion, frustration, bad decisions and ultimately, a loss of trust in data.
I think it is helpful to use a concrete example: a user-level table with various dimensions and measures about individual users. For example, we might have a user named Meta Analysis, with dimensions like location, sign-up date, paid/free user, and many inferred dimensions (actions last 30 days, likelihood to transact in next 7 days). A dataset like this is extremely common. After all, it feels great to have a source of truth about your users that everyone can query and engage with. People get used to querying this dataset and it becomes the go-to for many dashboards and monitoring processes across the company.
Then time passes. Dimensions and measures are added to this dataset. Some old dimensions stop working. Some measures are out of date. A couple of years in, there are hundreds of details about a user connected to thousands of dashboards across the company. Yet no one knows if the dimensions and measures are still correct. Then someone figures out that the underlying data for user actions in the last 30 days is stale. Pretty quickly, trust in the data degrades and people do what they naturally do: they find a workaround. Now there is not just a central user table, but also a bunch of queries floating around that may or may not be monitored.
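To make the staleness problem concrete, here is a minimal sketch of a freshness check. Everything in it is an assumption for illustration: the column names, the refresh timestamps, and the 7-day threshold are hypothetical, and a real check would read refresh times from wherever your pipeline records them.

```python
from datetime import datetime, timedelta

def stale_columns(last_refreshed, max_age, now):
    """Return the names of columns whose last refresh is older than max_age."""
    return sorted(col for col, ts in last_refreshed.items() if now - ts > max_age)

# Hypothetical refresh log for two columns of the user table.
now = datetime(2024, 1, 31)
last_refreshed = {
    "location": datetime(2024, 1, 30),
    "actions_last_30_days": datetime(2023, 11, 2),
}

# Flag anything not refreshed in the last 7 days (an assumed threshold).
print(stale_columns(last_refreshed, timedelta(days=7), now))
```

A check like this does not fix anything on its own, but it surfaces the problem before someone stumbles on it in a dashboard.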
This is why datasets, particularly those intended to persist and be highly visible across the company, need to be treated like products, just as we treat external products. When we stop understanding what our users need, let the product get stale, stop thinking about new use cases and generally don’t invest in a data asset, trust and usefulness degrade. Practically speaking though, we can’t invest in everything or every dataset. Here are some ideas about how to prioritize your data product investment by “scoring” a number of different dimensions.
Define the immediate value unlock. The vast majority of data products we build solve some problem that is in front of us (or very close to it). Not every investment needs to pay off right away, but most data investments need a clear value proposition that solves pain customers are feeling right now. A product can deliver long-term value too, but you’ll get a much better response when it also addresses present pain. In the case of a dataset, how might it affect people’s ability to know where to go when building core dashboards?
Map out the full impact of your data product. I’ve noticed that we often think about data products too narrowly, focusing on just the one or two customers that are most impacted. But the story of each data product is often about just how far-reaching it can be in how it affects product, engineering, finance, marketing and a host of other areas. While it might seem like the most important thing to define is your primary customer, sometimes the best thing you can do is help other customers become aware that they are affected, especially when they probably don’t know they are. In the case of a dataset, it might be useful to product managers and analysts, but it might also matter to finance, marketing and other teams.
Define the risk of not investing in your product for the long term. One thing I’ve learned is that we all have much shorter attention spans than we’d like to believe. Myself included. That means we take in a ton of information and filter it in extremely useful, but also extremely biased, ways. Downside tends to resonate with most people more than upside. “If we don’t do this, this is the pain we will feel” is often more powerful than “if we do this, there’s a ton of potential gain”. When it comes to datasets, you might draw on the long-term pain created for large analyst and data communities if you don’t have a central user table. You might also show the sheer amount of compute saved when there is a central table rather than everyone building their own.
Be clear about what it really takes to maintain your product. Generally speaking, we take on more than we can possibly handle when it comes to data responsibilities. We generally underestimate the effort, people and issues that arise in managing anything data related, particularly core datasets for a company. While in theory it might sound nice to say “this analyst is responsible for this dataset”, that doesn’t tell me much about what it takes to keep the product running when problems hit. What kinds of engineering resources are needed? Where would we pull other data science team members from to support an issue? Naming who is responsible does not tell you about investment level. It just creates a false sense of security.
Lastly, be clear about the long-term value proposition. The goal here is not to guess at revenue created or money saved (though that’s what we often default to). Instead, the focus should be on what value is created for the customer that cannot easily be created by another product or approach. Value created is a mushy concept in the general case, and only matters once you can get specific. For example, if you’re going to build a core user table, there is short-term pain that you address. You probably make people happy who are frustrated right now. But what is the long-term value proposition? How might it transform how people build dashboards and create insights? How might it be a competitive advantage for the company? These are key issues to tackle.
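The five dimensions above can be turned into a rough score. What follows is a sketch under stated assumptions, not a recommended formula: the 1-5 scale, the equal weighting, the field names and the example scores are all made up for illustration. The one deliberate choice is that maintenance cost counts against a product, so it is inverted before being added.

```python
from dataclasses import dataclass

@dataclass
class InvestmentScore:
    """Hypothetical 1-5 scores for one data product, one field per dimension."""
    name: str
    immediate_value: int    # pain solved for customers right now
    breadth_of_impact: int  # how many teams are affected
    risk_of_neglect: int    # downside of not investing long term
    maintenance_cost: int   # effort to keep it running (higher = costlier)
    long_term_value: int    # value that is hard to create another way

    def total(self) -> int:
        # Invert maintenance cost on the 1-5 scale so that cheap-to-run scores high.
        return (self.immediate_value + self.breadth_of_impact
                + self.risk_of_neglect + (6 - self.maintenance_cost)
                + self.long_term_value)

def rank(candidates):
    """Sort candidate investments from highest to lowest total score."""
    return sorted(candidates, key=lambda c: c.total(), reverse=True)

# Illustrative scores only; in practice these come from talking to customers.
candidates = [
    InvestmentScore("one-off churn dashboard", 4, 2, 2, 2, 2),
    InvestmentScore("central user table", 5, 5, 4, 4, 5),
]
for c in rank(candidates):
    print(c.name, c.total())
```

The number itself matters less than the conversation it forces. Scoring each dimension separately makes it harder to skip the uncomfortable ones, like what maintenance really costs.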
This newsletter is about datasets, but it is also about principles by which we can evaluate the investment in different data products. I used datasets as an example because they are something we can all understand and have experienced before. But that doesn’t mean these principles only work for datasets. They apply to nearly any data product investment a company might make. Defining a short- and long-term value proposition, being realistic about what it takes to succeed, and speaking to both local and global impact are important for all of us, regardless of the product.