Defining data quality with SLAs
What does data quality even mean anyway?
Almost anyone working in or around software engineering or infrastructure will have heard about SLAs (Service Level Agreements) at some point. But despite that, they're still often not fully understood. In this post, we'll break down the SLA concept and show how it can be applied to data quality.
At Bigeye, we believe SLAs can help answer a really big question for both data teams and the data consumers who depend on them: what does “data quality” mean exactly?
“When you say the data is ‘wrong’… what do you mean?”
—your friendly neighborhood data engineer
Seeing eye-to-eye on data quality
One of the challenges of keeping data quality high is deceptively hard: agreeing on what high-quality data means in the first place.
Data consumers often have an intuitive sense of what “good” data is. But that intuition is rarely quantified and almost never documented. On top of that, the rate of change modern data platforms enable makes data quality a moving target.
Whatever “perfect” data quality might mean in theory isn’t achievable in practice.
Let’s leave data behind for a moment and say we’re selling fancy mechanical keyboards. When we buy switches from the keyboard-switch-making company — in batches of 10,000 — we expect a certain level of clickiness from pretty much all of the switches we paid for.
But how clicky is too clicky? How mushy is too mushy? If only a couple keys out of 10,000 are off, is that okay? Or do we get a refund?
We don’t expect our supplier to magically know what perfect clickiness means to us, and we probably don’t expect every last switch in a batch of 10,000 to be perfect. But we can agree on what’s good enough to do business and what isn’t. These definitions and tolerances can be captured in an SLA.

Just like the keyboard-switch-makers and the fancy-keyboard-makers, data engineers and data consumers need to agree on a practical definition of data quality for a given use case, such as a dashboard, a scheduled export, or an ML model. That definition should crisply define how data quality will be measured, and what will happen if the standard isn’t met. But that’s easier said than done.

Without a crisp definition of quality, ambiguity can create tension and disagreement between the data team and data consumers.

For example, an analyst might ask, “how fast can the transactions data be ready for querying?” But what they really meant was, “how soon will it be correct for this analysis I’m doing on recent order behavior?” This distinction matters when the dataset happens to have late-arriving updates, like cancellations of orders that have already been placed. In that case, is the number for last month’s revenue usable before the 30-day window for merchandise returns has closed?

This ambiguity can lead to analysis that isn’t correct, because the data consumer didn’t ask exactly the right question and the data engineer didn’t have the context to understand their real need. If somebody points out a discrepancy after the analysis has been shared around, it can cause headaches for both the analyst and the data engineer.

This is where the SLA comes in.
Service Level Agreements
Telecom operators started using SLAs in the 1980s, and now they can be found in lots of places, like Google’s Site Reliability Engineering function. They were designed to clarify and document expectations between a service provider and their users.

For example, the Google Cloud Platform SLA states that monthly uptime should be >= 99.95% for standard storage, meaning users should expect their storage to be available for at least 43,178 of the roughly 43,200 minutes in a 30-day month, and should be prepared for it to be unavailable for up to about 22 minutes. If the uptime target isn’t met, the customer is eligible for a credit on their bill. The SLA assures the customer a level of quality in very concrete terms.

Similarly, the data team is effectively a service provider to both internal and external data consumers, and SLAs can bring the same level of accountability to that relationship. The only difference for internal SLAs is that there isn’t a fine or refund for missed targets, since internal teams rarely pay one another directly.

The most commonly used SLAs I saw on Uber’s data platform were for freshness. The data in a given table might be guaranteed to be no more than 48 hours delayed, for example. These were often paired with a completeness metric (a longer topic to save for a dedicated future post). This gave consumers a clear expectation for how recent the data in their queries and dashboards should be, and if they needed it fresher, it prompted a conversation with data engineering.
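To make that arithmetic concrete, here’s a minimal Python sketch (purely illustrative, not any provider’s actual billing logic) that converts an uptime SLO into an allowed-downtime budget:

```python
# Convert an uptime SLO into the downtime it allows per month.
# Illustrative only; real SLAs define their own measurement windows.

def allowed_downtime_minutes(slo_percent: float, days_in_month: float = 30) -> float:
    """Minutes a service can be down while still meeting its uptime SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

print(allowed_downtime_minutes(99.95))                      # ~21.6 min in a 30-day month
print(allowed_downtime_minutes(99.9, days_in_month=30.44))  # ~43.8 min in an average month
```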
How are SLAs structured?
SLIs and SLOs and SLAs, oh my!

Service Level Indicators (SLIs) measure specific aspects of performance. In software engineering, SLIs might be “web page response time in milliseconds” or “gigs of storage capacity remaining.” For a data engineer, SLIs might be “hours since dataset refreshed” or “percentage of values that match a UUID regex.” When building an SLA for a specific use case, the SLIs should be chosen based on what data the use case relies on. If an ML model can tolerate some null IDs but not too many, the rate of null IDs is a great SLI to include in the SLA for that model.

Service Level Objectives (SLOs) give each SLI a target range. In software engineering, it could be “99% of pages in the last 7 days returned in under 90ms.” Going back to our data engineering SLIs, the relevant SLOs could be “less than 6 hours since the dataset refreshed” or “at least 99.9% of values match a UUID regex.” As long as an SLI is within the range set by its SLO, it’s considered acceptable for the target use case, and that aspect of its parent SLA is being met.

Service Level Agreements (SLAs) put several SLOs together into an overall agreement, and set a budget for how long any included SLO can be out of bounds before the team has failed their end of the commitment. An SLA with a 99.9% uptime guarantee (or a 0.1% error budget) allows for 43 minutes and 50 seconds of downtime each month. This overall document gives the end user a concrete way to know whether the data meets their expectations, and what level of reliability they should expect week to week or month to month.

For higher-impact use cases, the SLA may need to contain more SLIs, stricter SLOs, and/or a smaller error budget. That increased reliability comes at a cost in data engineering time, but in exchange it ensures the business can depend on that data, which allows it to be put into more impactful (and profitable) use cases.
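To illustrate how the three layers fit together, here’s a small Python sketch. The SLI values, names, and thresholds are hypothetical stand-ins for measurements you’d pull from your own warehouse:

```python
# A minimal sketch of evaluating data-quality SLIs against their SLOs.
# All names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str                         # human-readable SLI description
    value: float                      # latest measured SLI value
    lower: float = float("-inf")      # acceptable range for the SLI
    upper: float = float("inf")

    def met(self) -> bool:
        return self.lower <= self.value <= self.upper

# Two SLIs from the text: freshness and UUID validity.
slos = [
    SLO("hours since dataset refreshed", value=4.5, upper=6.0),
    SLO("% of values matching UUID regex", value=99.95, lower=99.9),
]

# The SLA is met right now only if every SLO is within bounds.
for s in slos:
    print(f"{s.name}: {s.value} -> {'OK' if s.met() else 'OUT OF BOUNDS'}")
print("SLA currently met:", all(s.met() for s in slos))
```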
How SLAs help:
Data engineers
- SLAs improve communication between engineers and stakeholders by defining the scope of data quality and making it crystal clear what “okay” and “not okay” mean. Without this definition, a tug of war can erupt over what does and doesn’t deserve attention.
- SLAs also help engineers decide how much time should be spent on new projects and how much on making existing projects more durable, all thanks to the “error budget.” If a 99.9% SLA allows for 43 minutes and 50 seconds of downtime each month, and the data engineer has had only 10 minutes of downtime by the 21st, then they can confidently prioritize new projects: it’s clear that they are meeting their obligations (see the sketch after this list).
- After all that work has been done, SLAs help data engineers quantify the impact of that work over time — something that’s easily forgotten by the business when things are running smoothly! If error budgets are being met, or even becoming tighter each quarter, the team has clear metrics to show their success. Data consumers can clearly see the effort that has been put into improving reliability.
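As a sketch of the error-budget check described above, with the numbers taken from the bullet’s example and the average-month figure assumed:

```python
# Error-budget check for a 99.9% monthly SLO (~43.8 min in an average month).
MONTH_MINUTES = 30.44 * 24 * 60        # average calendar month
budget = MONTH_MINUTES * (1 - 0.999)   # ~43.8 minutes of allowed downtime

downtime_so_far = 10.0                 # minutes of downtime by the 21st
remaining = budget - downtime_so_far

if remaining > 0:
    print(f"{remaining:.1f} min of budget left: safe to prioritize new projects")
else:
    print("Budget exhausted: shift engineering time to reliability work")
```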
Data consumers
- SLAs give stakeholders confidence in the data they’re using, often a dashboard or a query output. The agreement clarifies the difference between “operating as expected” and “something is broken.” With the right tools, they can even see a real-time status of the SLA for their dashboard or query without needing to ask anyone whether the data is healthy.
- They get the confidence that when something breaks, data engineering will get it fixed. There is no debating whether something needs attention or not. Everything has already been agreed to. If something does need attention, engineering’s error budget clock is ticking.
- They get better engineering! The error budget helps the data engineering team prioritize reliability vs. more features for the data platform, like new pipelines. In the short term, this benefit is less visible to the consumer than checking the SLA status for their dashboard. But in the long run, they’ll feel the impact when less engineering time is spent fixing things that aren’t broken and more is spent on new data platform capabilities.
Data team leaders
- The error budget is the key to helping data team leads prioritize the efforts of their team. Should they be focusing more on new improvements or on reliability and maintenance? The amount of time dedicated to each should be directly related to which error budgets are being exhausted and which aren’t. By reviewing the state of all their error budgets, a team lead can easily see where to invest in reliability projects and where they shouldn’t waste their time.
- At the end of the quarter or year, SLAs also provide team leaders with a clear measurement of the return on their investment in reliability improvements. If error budgets were met consistently, then more time can be budgeted for extending the data platform with new pipelines or tools. Or the budgets can be tightened to provide even stricter reliability guarantees to the business! And if they aren’t being met, it becomes dead simple to defend the prioritization of data quality engineering work until they are.
Getting started with SLAs
We see six major steps to putting SLAs to work on data quality:
- Identify the applications that are worth applying an SLA to. A key executive-facing dashboard or a core machine-learning model would be great candidates; a seldom-used table… not so much.
- Engineers and stakeholders need to identify the correct SLIs and SLOs for the query, dashboard, app feature, or machine learning model. This should be a conversation led by how the data will be used by the stakeholder and backstopped by what can and can’t be measured in practice by engineering.
- The SLI definitions, SLO levels, and error budget need to be documented to craft the SLA itself (a minimal example follows this list). The definitions should be clear enough for stakeholders and include the actual queries used to determine them. The error budget should be stated both in 9’s (e.g. 99.9% is “three 9’s”), which are easy to say, and in minutes or hours, which explain what it means in terms of users’ time.
- Tracking needs to be implemented on the SLIs and their SLOs. In the best case, the SLOs are tracked over time in one place together with the error budget.
- Alerting needs to be in place to make sure both engineers and stakeholders know where to look when an SLA has been violated, and where they can see historical performance.
- The error budget consumption must be tracked, allowing the data team to allocate their time to either shoring up data quality or building new improvements to the data platform.
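To show what the documentation step might look like in practice, here’s a hypothetical SLA captured as version-controlled config; the dataset, queries, and thresholds are all invented for illustration:

```python
# A hypothetical SLA document captured as version-controlled config.
# Dataset names, queries, and thresholds are invented for illustration.
orders_dashboard_sla = {
    "name": "Executive orders dashboard",
    "owner": "data-engineering",
    "error_budget": {"nines": "99.9%", "minutes_per_month": 43.8},
    "slos": [
        {
            "sli": "hours since orders table refreshed",
            "query": "SELECT EXTRACT(EPOCH FROM now() - max(updated_at)) / 3600 FROM orders",
            "objective": "< 6",
        },
        {
            "sli": "% of order_id values matching UUID regex",
            "query": "SELECT 100.0 * count(*) FILTER (WHERE order_id ~ '^[0-9a-f-]{36}$') / count(*) FROM orders",
            "objective": ">= 99.9",
        },
    ],
}
```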
Building SLAs at Bigeye
Implementing SLAs can sometimes require cultural change. No software can magically create alignment between data engineers and users, but we’re trying to make it easier by adding support for creating SLAs right inside Bigeye. Removing the manual effort needed to create and track them in spreadsheets takes away one big barrier to adoption that could prevent a data team from getting started.
Data quality engineering is our mission at Bigeye, and enabling SLAs is a key part of that mission. To learn more, shoot us an email (hello@bigeye.com) or book a demo with us here.