20 questions for data observability readiness
Is your company ready to implement data observability? How will you do it? Get started by asking yourself these 20 important questions that will help you orient yourself in the data observability universe.
The data basics
1. Where is your data?
Is your data in a data warehouse, a production database, spreadsheets, an S3 bucket, or Kafka? Most data observability tools, like Bigeye, monitor data at rest in the data warehouse. They also support traditional databases like Postgres and MySQL. Make sure your data storage is compatible with the data observability tools on the market.
2. How much data do you have?
Understand how much data your production applications generate every day, and what kind of data it is (for example: images, text, or video).
3. Is your volume of data increasing steadily?
As the volume of data increases, querying it gets harder: queries take longer and cost more compute. It also becomes harder to keep track of the data's state and quality. If your data is steadily growing, you may be ready for data observability investments.
4. How big is your data team?
Your data team size will help determine the optimal way you tackle data quality. If you have a data team of one, it probably makes more sense to add some basic data checks with open-source tools (and leverage their communities) instead of buying something off the shelf. Once you have a small data team (5-7 people), you should begin to think about testing risky assumptions with SQL, setting up continuous integration, and implementing basic data monitoring.
5. What are the most commonly accessed datasets?
There are probably a few key tables upon which your business is heavily reliant. Identify them as candidates for more granular monitoring. For other tables that are less frequently accessed, you might just monitor a few metrics. This bifurcation is referred to as T-shaped monitoring.
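To make the idea concrete, here is a rough sketch of what a T-shaped monitoring plan could look like in code. The table names and metric lists are hypothetical placeholders, not configuration for any particular tool; the point is simply that a few critical tables get deep, column-level coverage while everything else gets a thin baseline.

```python
# Hypothetical T-shaped monitoring plan: deep coverage on a few critical
# tables, a thin layer of basic checks everywhere else.
MONITORING_PLAN = {
    # Heavily used tables get granular, column-level metrics.
    "analytics.orders": ["freshness", "row_count", "null_rate", "duplicate_rate", "revenue_sum"],
    "analytics.users": ["freshness", "row_count", "null_rate", "distinct_count"],
    # Everything else gets just the basics.
    "__default__": ["freshness", "row_count"],
}

def metrics_for(table: str) -> list[str]:
    """Return the metrics to collect for a given table."""
    return MONITORING_PLAN.get(table, MONITORING_PLAN["__default__"])
```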
6. Who are the major consumers of the data and what are they using it for?
Understand who is consuming the different tables in your dataset, whether that's an analytics dashboard for executives or a machine learning model for fraud detection. Identifying these consumers will also help you answer the next set of questions.
The cost of bad data
7. Do executives and engineers currently trust the data? Do you have an NPS score for that?
To quantify the general trust and user perception of the data, send out a single-question survey to engineers and executives to measure NPS (Net Promoter Score).
The question to ask is likely some variation of: "How easily and confidently can you answer your questions using company data?"
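As a reminder of the arithmetic, NPS is the percentage of promoters (scores of 9-10) minus the percentage of detractors (scores of 0-6). A minimal sketch:

```python
def nps(scores: list[int]) -> float:
    """Net Promoter Score on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

# Example: 12 responses from engineers and executives.
print(nps([10, 9, 9, 8, 7, 7, 6, 5, 9, 10, 4, 8]))  # 16.7
```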
8. How many data outages have you had in the last quarter? What was their cost?
If you’ve had data incidents in the past quarter, write those down and determine their user-facing cost.
9. How sensitive are your machine learning models to out-of-date data?
Talk to your machine learning engineers and figure out how sensitive the models are to out-of-date or missing data. The less robust the model is, the more important it is to make sure it’s given solid inputs.
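A lightweight way to start is a freshness guard in front of training or scoring. This is only a sketch: the warehouse URL, the features table, and its updated_at column are assumptions to swap for your own schema.

```python
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

# Hypothetical warehouse URL, table, and column names.
engine = create_engine("postgresql://user:password@warehouse:5432/analytics")
MAX_STALENESS_HOURS = 6  # tune to how robust the model is to stale inputs

def assert_features_fresh() -> None:
    """Refuse to train or score if the feature table has not been updated recently."""
    with engine.connect() as conn:
        last_update = conn.execute(text("SELECT MAX(updated_at) FROM features")).scalar()
    if last_update is None:
        raise RuntimeError("features table is empty")
    if last_update.tzinfo is None:  # assume timestamps are stored in UTC
        last_update = last_update.replace(tzinfo=timezone.utc)
    age_hours = (datetime.now(timezone.utc) - last_update).total_seconds() / 3600
    if age_hours > MAX_STALENESS_HOURS:
        raise RuntimeError(f"features are {age_hours:.1f}h old; refusing to use stale inputs")
```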
10. Are you planning on an imminent IPO or exit?
If the company is planning to IPO or exit soon, this often serves as a catalyst to get its data in better shape. There's a general understanding that mistakes in top-line metric reporting that are tolerated at a startup are not tolerated in the public markets, and may even have legal ramifications.
11. Are you doing a data migration soon?
Similarly, investments in data observability are often triggered by large engineering initiatives, like a migration from one data warehouse to another. If you are planning one soon, now may be a good time to invest in data observability, to lower the risk of accidentally deleting historical data or migrating tables incorrectly.
The data observability status quo
12. Have you set up data tests?
Simple data tests, such as dbt tests or Great Expectations suites, are easy to set up and pay big dividends. They should be among your top data observability priorities.
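Even before adopting a framework, the same idea can be expressed as a few lines of SQL run from a script. The table and column names below are made up; dbt's built-in not_null and unique tests codify exactly these kinds of checks.

```python
import sqlite3

# Stand-in for your warehouse connection; table and column names are hypothetical.
conn = sqlite3.connect("warehouse.db")

def run_basic_data_tests() -> None:
    # Not-null test: every order should have a user_id.
    nulls = conn.execute("SELECT COUNT(*) FROM orders WHERE user_id IS NULL").fetchone()[0]
    assert nulls == 0, f"{nulls} orders have a NULL user_id"

    # Uniqueness test: order_id should never be duplicated.
    dupes = conn.execute(
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    assert dupes == 0, f"{dupes} duplicated order_id values"
```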
13. Do you have change management and CI around your data?
Data tests should be paired with CI/CD to run every time a change is made to the data pipeline.
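One lightweight pattern, assuming your pipeline code lives in a repository with CI already set up, is to express the checks as pytest tests so that a failing check fails the build and blocks the merge. The table names here are hypothetical.

```python
# test_data_quality.py -- run by the CI job (e.g. `pytest`) on every pipeline change.
import sqlite3
import pytest

@pytest.fixture
def conn():
    # Stand-in for a warehouse connection; swap in your own driver.
    connection = sqlite3.connect("warehouse.db")
    yield connection
    connection.close()

def test_orders_have_no_null_user_ids(conn):
    nulls = conn.execute("SELECT COUNT(*) FROM orders WHERE user_id IS NULL").fetchone()[0]
    assert nulls == 0, f"{nulls} orders have a NULL user_id"

def test_orders_arrived_today(conn):
    todays_rows = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE DATE(created_at) = DATE('now')"
    ).fetchone()[0]
    assert todays_rows > 0, "no orders loaded today; the pipeline may be stuck"
```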
14. Do you have a staging environment for your data? How consistent is it with your production environment?
Along with the CI and the data tests, you should have a separate staging environment where data engineers can test queries before having others review them. This prevents breaking changes from polluting the production database.
15. Do you find yourself trying to extend/schedule your data tests?
If you find yourself trying to extend your data tests, and build frameworks and visualization platforms on top of them, you may actually be ready for data observability. Data tests and data observability differ in that tests only check what you've already asserted to be correct, while observability is about knowing the complete state of the system.
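Concretely, the shift is from pass/fail assertions to metrics collected on a schedule, whose history you can chart and alert on. The sketch below shows the kind of collection an observability tool automates for you; the warehouse, metrics table, and column names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical warehouse and metrics store; an observability tool automates this
# collection plus the anomaly detection and alerting on top of it.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS table_metrics "
    "(collected_at TEXT, table_name TEXT, metric TEXT, value REAL)"
)

def collect_metrics(table: str) -> None:
    """Record row count and null rate so drift shows up over time, not just pass/fail."""
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE user_id IS NULL").fetchone()[0]
    null_rate = nulls / row_count if row_count else 0.0
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO table_metrics VALUES (?, ?, ?, ?)",
        [(now, table, "row_count", row_count), (now, table, "null_rate", null_rate)],
    )
    conn.commit()

# Schedule this with your orchestrator (for example, hourly) for every monitored table.
```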
16. How long does it take for the data scientists to find the data they need?
When data scientists begin a new project, how long does it take for them to find where the necessary data is? Are the datasets well documented, or is it mostly institutional knowledge that requires a number of Slack messages and emails to get to?
17. Are you able to definitively answer questions about key metrics at the company in under ten minutes?
For topline metrics like revenue, orders, website traffic, and search traffic, how long does it take a product manager to get a number they trust and are willing to base product and business decisions on? If it's more than ten minutes, something is wrong.
18. Do your engineers understand how data flows through the system and how it is transformed?
If a new data engineer joins the team, would each of the tenured engineers be able to draw an architecture diagram and explain how data flows through the system, from the production boxes to the data warehouse? If not, why not? If the reason is that the data flows are too complicated, it may be time to look into data lineage solutions.
Specific data observability decisions
19. Do you want to monitor at the source or at the destination?
We generally recommend that you monitor for data quality at the destination (in the data warehouse, once all the transformations have occurred) rather than the source (the web applications where the data is actually being generated). However, one reason to monitor at the source is that you will catch problems before they percolate downstream.
20. Is data observability technology core to your business?
Once you’ve decided to invest in data observability, document the business case. Unless data observability technology is actually core to your business, we generally recommend buying something off the shelf rather than building it yourself.