Thought leadership
August 30, 2023

20 data reliability use cases from real-life teams

How do modern data teams build and enforce data reliability? We researched 20 use cases from real-life data teams in order to find out.

Liz Elfman

Modern, data-driven companies need reliable data. But on the path to building trustworthy analytics, ML models, and data products, your data team is bound to hit some roadblocks. In theory, data reliability is straightforward. But when it comes to the messy business of actually implementing it, what do real-life teams do?

Organizations like Lyft, Walmart, and LinkedIn have applied data reliability techniques to solve their data challenges. In this post, we highlight 20 of those real-world examples.

1. Monitoring data freshness/staleness

LinkedIn built a system called Data Health Monitor (DHM) that automatically monitors the freshness and staleness of datasets. With this system, they can detect issues like pipelines unintentionally using an older dataset version.
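
LinkedIn hasn’t published DHM’s internals in detail, but the core idea is straightforward. Here’s a minimal sketch (all names hypothetical, not DHM’s API): compare a dataset’s last-updated timestamp against the maximum age you’re willing to tolerate.

```python
from datetime import datetime, timezone, timedelta

def check_freshness(last_updated: datetime, max_age: timedelta) -> bool:
    """Return True if the dataset is fresh, False if it is stale."""
    age = datetime.now(timezone.utc) - last_updated
    return age <= max_age

# Example: a daily-partitioned table should never be more than 26 hours old.
is_fresh = check_freshness(
    last_updated=datetime(2023, 8, 29, 6, 0, tzinfo=timezone.utc),
    max_age=timedelta(hours=26),
)
```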

2. Monitoring data volume changes

LinkedIn’s DHM also monitors for sudden drops or increases in data volume, which can indicate partial data or insufficient resources, respectively. Being aware of volume changes helps LinkedIn maintain pipeline and data quality.
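
A minimal volume check can be as simple as flagging row counts that deviate sharply from recent history. The sketch below is a generic z-score heuristic, not DHM’s actual logic:

```python
from statistics import mean, stdev

def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than z_threshold
    standard deviations from the recent historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# A sudden drop (partial data) and a sudden spike both trip the same check.
print(volume_anomaly([1_000_000, 980_000, 1_020_000, 1_010_000], today=400_000))  # True
```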

3. Monitoring the quality of offline data

Lyft built Verity, a check-based system to monitor the quality of offline data. It allows users to define checks that run queries to validate expectations - for example, checking for null values in a column. Verity checks can be configured to run automatically on a schedule or as part of data pipelines. The check results are stored to enable debugging when failures occur.
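
Verity’s actual check definition language isn’t reproduced here, but the pattern, a metric query paired with a condition on its result, looks roughly like this sketch (using SQLite and hypothetical table names purely for illustration):

```python
import sqlite3

def null_fraction_check(conn, table: str, column: str, max_null_fraction: float) -> bool:
    """Run a metric query, then validate the result against an expectation."""
    query = f"""
        SELECT 1.0 * SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) / COUNT(*)
        FROM {table}
    """
    (fraction,) = conn.execute(query).fetchone()
    return fraction <= max_null_fraction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (driver_id INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?)", [(1,), (None,), (3,), (4,)])
print(null_fraction_check(conn, "rides", "driver_id", max_null_fraction=0.1))  # False: 25% null
```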

4. Bringing objectivity to quality

Walmart built the Data Quality Assessment Framework (DQAF), the company’s platform for continuous data quality. DQAF enables stakeholders to define objective thresholds for quality metrics based on what "good quality" means to them, which makes quality less subjective. For example, a business user can set a threshold that a critical column must be 95-100% complete.
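
A threshold configuration in this spirit might look like the following sketch; the tables, columns, and numbers are hypothetical, not Walmart’s:

```python
# Stakeholders declare what "good" means per column; the framework
# scores measured metrics against those declarations.
thresholds = {
    ("orders", "order_id"):    {"completeness": 1.00},  # must never be null
    ("orders", "coupon_code"): {"completeness": 0.95},  # 95-100% complete is acceptable
}

measured = {("orders", "order_id"): 1.00, ("orders", "coupon_code"): 0.91}

for key, rules in thresholds.items():
    for metric, minimum in rules.items():
        status = "PASS" if measured[key] >= minimum else "FAIL"
        print(f"{key[0]}.{key[1]} {metric}: {measured[key]:.0%} (min {minimum:.0%}) {status}")
```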

5. Clarifying ownership

Walmart’s framework also assigns ownership of quality scores for different data domains to the relevant teams. So for example, the "Orders" team owns order data quality scores. This functionality delineates responsibilities across siloed teams.

6. Tracking improvements

By storing quality scores over time, Walmart can quantify improvements as data stewards fix issues. If the completeness score for a column goes from 80% to 95% after fixes, it demonstrates the business impact of quality efforts.

7. Depicting interconnectedness

Even though teams own data in silos, quality scores in Walmart’s framework reveal interdependencies between datasets. For example, the Orders team’s data quality depends on the Customer team’s data quality.

8. Enabling custom algorithms

At Walmart, teams can define custom data quality algorithms tailored to their specific data needs. For example, the Orders team could create a validity check unique to a column in the Orders table.
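
One common way to support this is a plug-in registry where each team contributes its own check functions. The sketch below is hypothetical, not Walmart’s implementation:

```python
import re

CUSTOM_CHECKS = {}

def register_check(table: str, column: str):
    """Decorator that registers a team-owned validity check for a column."""
    def wrap(fn):
        CUSTOM_CHECKS[(table, column)] = fn
        return fn
    return wrap

@register_check("orders", "order_id")
def valid_order_id(value: str) -> bool:
    # Orders-specific format: "ORD-" followed by 10 digits.
    return bool(re.fullmatch(r"ORD-\d{10}", value))

print(valid_order_id("ORD-0000012345"))  # True
```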

9. Answering questions about data quality

Through Walmart's quality score tracking, analysts can also answer questions like "Why did data quality dip last month?" These analyses provide data-driven narratives around quality that can be surfaced to executives.

10. Making data discoverable

LinkedIn built "Super Tables", which are centralized, well-documented datasets that have been pre-computed and normalized. They aim to be the “go-to datasets” for certain domains, e.g.:

  • JOBS Super Table: Consolidates data from 57+ different job-related data sources into a single table with 158 columns. Provides precomputed information commonly needed for job analytics and insights.
  • Ad Events Super Table: Consolidates data from 7 different ad-related tables, including ad impressions, clicks, video views, etc. Joins in campaign and advertiser dimensions. Provides 150+ columns for ad analytics and reporting.

The goal of both Super Tables is to simplify data discovery, reduce redundant joins and storage, and precompute commonly used data for downstream analytics.
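
In SQL terms, a Super Table is essentially a centrally maintained, pre-joined wide table. The sketch below is illustrative only (the real JOBS Super Table draws on 57+ sources, and these table and column names are invented):

```python
# Precompute the joins once, centrally, instead of in every downstream pipeline.
SUPER_TABLE_SQL = """
CREATE TABLE jobs_super AS
SELECT
    p.job_id,
    p.title,
    p.company_id,
    a.application_count,   -- precomputed aggregate
    v.view_count           -- precomputed aggregate
FROM job_postings p
LEFT JOIN job_applications_agg a USING (job_id)
LEFT JOIN job_views_agg v USING (job_id)
"""
```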

11. Guaranteeing table availability

LinkedIn’s Super Tables also have well-defined service level agreements (SLAs) that specify availability, supportability, and change management commitments.

For availability, the goal is to achieve 99%+ uptime. For a daily Super Table flow, this translates to about one SLA miss per quarter. To improve availability, Super Tables can be materialized in multiple clusters with active-active configurations. This provides redundancy in case of failures.
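
The arithmetic behind that claim is simple:

```python
# A 99% success rate on a daily flow works out to roughly one SLA miss per quarter.
runs_per_quarter = 91          # ~91 daily runs in a quarter
availability = 0.99
expected_misses = runs_per_quarter * (1 - availability)
print(f"{expected_misses:.1f} expected SLA misses per quarter")  # ~0.9
```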

Upstream data sources must also commit to SLAs that enable the Super Table to meet its own SLA. The SLAs of upstream sources are tracked and monitored.

12. Managing schema changes in upstream sources

By default, schema changes (additions, deletions, etc.) in upstream source data do not automatically affect the Super Table schema.

If a new column is added in a source, it does not appear in the Super Table. If a source column is deleted, its value is nullified in the Super Table.

The Super Table governance body is notified of source schema changes that could potentially impact the table. All planned schema changes to the Super Table itself are documented and communicated to downstream consumers, and there is a monthly release cadence for accepting schema change requests to the Super Table.
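
The decoupling rule can be thought of as projecting each source row onto a fixed Super Table schema. A minimal sketch, with hypothetical column names:

```python
# The Super Table schema is the contract; each source row is projected onto it.
SUPER_TABLE_COLUMNS = ["job_id", "title", "company_id"]

def project(source_row: dict) -> dict:
    """New source columns are dropped; deleted source columns come
    through as None (nullified) rather than breaking the table."""
    return {col: source_row.get(col) for col in SUPER_TABLE_COLUMNS}

# Source added "recruiter_id" (ignored) and dropped "company_id" (nullified):
print(project({"job_id": 7, "title": "Data Engineer", "recruiter_id": 42}))
# {'job_id': 7, 'title': 'Data Engineer', 'company_id': None}
```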

13. Reducing alert fatigue

Uber used tiering to classify and prioritize its various data assets, such as tables, pipelines, machine learning models, and dashboards. By assigning different tiers to these assets, Uber is able to manage its resources more efficiently and ensure that alerts fire only for its most important data:

  • Tier 0: These are the most critical data assets that are foundational for the business to operate. Any disruption in these assets could have severe consequences. Kafka as a service, for example, falls under this category.  
  • Tier 1: Extremely important datasets that could be essential for decision-making, analytics, or operational aspects. These could be things like user data, transaction data, etc.
  • Tier 2: Important but not critical datasets. These could be important for some departments or features but aren't as universally crucial.
  • Tiers 3, 4: Less critical data that may still be useful for specific analyses or features.
  • Tier 5: These are individually owned datasets, often generated in staging or test environments. They have no guarantees of quality or availability and are the least prioritized.

By identifying just 2,500 Tier 1 and Tier 2 tables out of over 130,000 tables, Uber focused its efforts on a manageable but critically important subset of its data, allowing for better quality, reliability, and resource allocation.
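
One way to encode such a tiering scheme is a simple asset-to-tier mapping that the alerting logic consults. The following sketch is hypothetical, not Uber’s implementation:

```python
from enum import IntEnum

class Tier(IntEnum):
    T0 = 0  # business-critical infrastructure (e.g., Kafka as a service)
    T1 = 1  # extremely important datasets
    T2 = 2  # important but not critical
    T3 = 3
    T4 = 4
    T5 = 5  # individually owned, no guarantees

ASSET_TIERS = {"kafka_service": Tier.T0, "trips_table": Tier.T1, "scratch.tmp_v2": Tier.T5}

def should_page(asset: str) -> bool:
    # Only Tier 0-2 assets page the on-call; everything else at most logs a warning.
    return ASSET_TIERS.get(asset, Tier.T5) <= Tier.T2
```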

14. Reducing manual data issue debugging

Stripe built a centralized observability platform and internal UI that allowed users to select different runs of a data job and compare metrics like runtime, data volume processed, and logs across runs.

Based on a job’s current runtime progression and its historical runtimes, the UI would also predict the estimated completion time of running jobs, which helped answer stakeholder questions about when data would be ready.
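
Stripe hasn’t published its prediction model, but a naive version scales a typical historical runtime by the job’s current progress. A sketch:

```python
from statistics import median

def estimate_remaining_minutes(historical_runtimes: list[float],
                               elapsed: float, progress: float) -> float:
    """progress is the fraction of work completed so far (0 < progress <= 1)."""
    typical_total = median(historical_runtimes)
    projected_total = max(typical_total, elapsed / progress)  # trust the slower signal
    return projected_total - elapsed

print(estimate_remaining_minutes([60, 62, 58, 65], elapsed=30, progress=0.4))
# elapsed/progress projects 75 minutes total, above the 61-minute median, so ~45 remain
```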

Finally, users could configure standardized fallback behaviors for different failure cases, as well as data tests, through the UI.

15. On-call training

Playbooks and runbooks are documents that outline the steps for responding to specific types of issues/incidents. In the context of running a data organization, they ensure that everyone involved has a shared understanding of the plan of action. More specifically, they provide a checklist of action items so that nothing is forgotten. This checklist can also be used to train new staff on data issue response.

16. Data producer-consumer alignment

Convoy pioneered data contracts. These are API-based agreements between the software engineers who own services and business-focused data consumers, with the goal of generating well-modeled, high-quality, trusted data. They allow a service to define the entities and application-level events it owns, along with their schema and semantics.

Data contracts ensure that production-grade data pipelines are treated as part of the product, with clear SLAs and ownership. They also orient everyone in the same direction so that problem-solving work is effective.
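
As a rough illustration only (Convoy’s contracts are API-based, and these names are hypothetical), a contract for a single event might declare its schema and semantics like this:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ShipmentBookedEvent:
    """Emitted exactly once when a shipper books a shipment.
    shipment_id is globally unique; booked_at is UTC."""
    shipment_id: str
    shipper_id: str
    booked_at: datetime
    price_usd: float  # total agreed price, never negative
```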

17. Preventing degradation in machine learning model performance

At Lyft, input features to models are validated in real time against valid value ranges. This catches issues like incorrect units or data types being passed to models.

They also monitor distributions of model score outputs with time-series alerts, and analyze historical logs of features and predictions to catch unusual statistical deviations that could imply model degradation. If upstream feature changes or data drift are detected, they automatically retrain models to prevent performance from declining.
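
Range validation on input features might look like the following sketch (the feature names and bounds are hypothetical, not Lyft’s):

```python
# Each model input has a declared valid range; out-of-range values are
# rejected before scoring.
FEATURE_RANGES = {
    "trip_distance_km": (0.0, 500.0),  # catches miles-vs-km unit bugs
    "rider_rating":     (1.0, 5.0),
}

def validate_features(features: dict) -> list[str]:
    errors = []
    for name, value in features.items():
        lo, hi = FEATURE_RANGES[name]
        if not isinstance(value, (int, float)) or not lo <= value <= hi:
            errors.append(f"{name}={value!r} outside [{lo}, {hi}]")
    return errors

print(validate_features({"trip_distance_km": 3100.0, "rider_rating": 4.8}))
# ['trip_distance_km=3100.0 outside [0.0, 500.0]']
```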

18. Making it easier for business users to answer data questions

Pinterest built Querybook, an open-source data collaboration platform for sharing SQL queries, datasets, and insights. Querybook also has a ChatGPT-like interface to automatically generate and execute SQL queries from plain text questions. For example, users can ask natural language questions like "How many daily active users in the past month?" and it will generate the appropriate SQL query.
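
For a question like that, the generated query might look something like this (the table and column names are hypothetical, not Querybook’s actual output):

```python
GENERATED_SQL = """
SELECT DATE(event_ts) AS day, COUNT(DISTINCT user_id) AS daily_active_users
FROM user_events
WHERE event_ts >= DATE('now', '-30 days')
GROUP BY day
ORDER BY day
"""
```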

19. Making data incidents less stressful

Following the principles of data reliability should mean you face fewer data incidents, and that when incidents do occur, they’re less stressful.

You can apply standard incident response frameworks to data incidents, too: a defined response process (incident detection, response, root cause analysis, resolution, and a blameless post-mortem) and a defined response team (incident leader, subject matter expert, liaison, and scribe). Together, they give you a tried-and-true plan of attack.

20. Encouraging data-driven business decisions

Ultimately, you’re not collecting and analyzing all this data at a company for fun: it should be in service of making business or product decisions. Data reliability principles ensure that analyses and reports are accurate, that metrics and trends can be tracked over time, and that key financial information is always up-to-date and correct for compliance reasons.

Final thoughts

Modern data stacks enable tremendous analytical capabilities, but they also introduce reliability challenges from complexity and scale. Companies like Lyft, LinkedIn, Uber, Walmart, and Pinterest apply data reliability principles to build trust and confidence in their data products and to make better business decisions.

