Automatically quarantining bad data with Bigeye and dbt
Let’s look at how I built an automatic data quarantining workflow by leveraging Bigeye’s automatic anomaly detection and dbt’s data transformation capabilities.
At Bigeye, we’re always working with the tools that our customers use and love in order to find new ways to bring data observability right into their most important workflows. Normally, we don’t reveal these capabilities until they are production-ready, but with dbt’s Coalesce Conference just around the corner, I wanted to preview an exciting new workflow I’ve built by connecting Bigeye to our internal dbt Cloud instance.
For the uninitiated, dbt is a popular data transformation tool that lets data analysts build their own data pipelines using SQL, and Bigeye is a powerful data observability tool that automatically identifies stale, damaged, and unexpected data. While these tools mostly focus on different portions of the data stack — dbt performing transformations inside the data warehouse and Bigeye sitting on top of the data warehouse to monitor data quality — they’re definitely interoperable.
Let’s look at how I built an automatic data quarantining workflow by leveraging Bigeye’s automatic anomaly detection and dbt’s data transformation capabilities.
Bigeye → dbt
Here's a common scenario: you're a product manager about to go into an important meeting with an executive when you realize that the numbers on your BI dashboard don't make sense. Maybe there's six hours yesterday with no data, or there are suddenly four subscription categories instead of three, with the fourth one being a misspelling.
Bigeye automatically detects these types of data issues by monitoring for 60+ data quality metrics, such as the percentage of nulls in a given column. Bigeye intelligently learns thresholds for these metrics based on historical patterns of the data and alerts you when the data is behaving anomalously. No more getting caught off guard by data quality issues, your data team will know about issues before your stakeholders are affected.
But what happens once an issue has been detected? The next step is figuring out the root cause of the bad or missing data.
Because all Bigeye data quality metrics are computed using dynamic queries generated at runtime, we can also dynamically generate a follow-up query to run when anomalies are detected. These follow-up queries can identify specific rows that contributed to the anomaly, and we show previews of these rows in the UI to speed up root cause analysis for our users.
But what if we could not only surface the anomalous rows for debugging purposes—what if we could also quarantine those rows so they don’t impact our stakeholder’s dashboard?
To find out, I connected Bigeye to our dbt Cloud account and wired it up to take our dynamically generated query, and produce a job within dbt to pull all affected rows out into a side table. This keeps them out of the stakeholder's dashboard and makes it easy for the data engineer to inspect them, all automatically—and crucially, in a way that can be easily reversed.
If I don’t like the rows Bigeye vacuumed out of our source table, I can easily generate another job to run the process in reverse and put the same rows back.
Next Steps for Bigeye <> DBT
The beauty of dbt is that it's a complete programming environment for data modeling. This opens up a number of possibilities for further Bigeye-dbt integrations.
We rarely share projects before they are in production, but we just couldn’t help ourselves in this case! If you get just as excited about these kinds of projects as we do, then we’d love to talk to you. Or if you are interested in leveraging Bigeye and dbt to automatically quarantine bad data, send me a note to [email protected].
Monitoring
Schema change detection
Lineage monitoring