Data Quality Score: One Score to Rule Them All

by Bruno Marquié, December 20th, 2023

Too Long; Didn't Read

Take this article as your entry door to the Airbnb data quality journey, and then be prepared to become addicted and dive deep into their data engineering practices!

One score to rule them all, one score to find them, one score to bring them all, and in the data's clarity bind them.



I found Airbnb's article on its Data Quality Score very interesting because it describes two different approaches used at Airbnb to boost data quality internally.


Airbnb is big on the data scene, with lots of open source action, including Apache Airflow and Apache Superset, and a massive daily data flow. Yet, as their data game got more intricate, serving up precise and high-quality data to a diverse crowd posed a bit of a challenge.


To tackle this, Airbnb initially rolled out a certification process (Midas) for critical datasets and metrics, aiming to rebuild trust among data consumers. Despite the intention, the certification strategy proved non-scalable, encountered resistance, and placed a disproportionate burden on the frontline data team. This observation is a mix of my past experiences with such processes and what they actually say in the article.


As a result, there's been a move away from a simple "certified or not" system to a more teamwork-oriented model. In this refined approach, all teams, including both producers and consumers, play an active role in boosting data quality.


Airbnb changed how it handles data quality, shifting from a strict enforcement strategy to one focused on incentives with the introduction of the Data Quality Score (DQ Score). This score serves as a solid measure for assessing the quality of data assets. Its purpose is to naturally drive data producers to collaborate with data consumers, motivating them to actively participate in enhancing the quality of the data they provide.


While reading about the second approach, the DQ Score, described in the article, I was surprised to learn that it followed the first one, not the other way around. As you will see, the score is particularly polished, but bringing observability first to your stack is usually a good start.



First, let's take a look at what Midas was - their existing data quality certification approach.

I quickly skimmed through the document introducing the process.


In 2019, an internal customer survey revealed that Airbnb's data scientists were finding it increasingly difficult to navigate the growing warehouse and had trouble identifying which data sources met the high quality bar required for their work.


So, the target here is really to help data consumption inside Airbnb and restore trust from data practitioners in Airbnb datasets.


For a data quality guarantee to be relevant for many of the most important data use cases, we needed to guarantee quality for both data tables and the individual metrics derived from them.


An important point they make is that the guarantee covers not only the datasets themselves but also the metrics derived from them. They ran into issues with consistency, timeliness,…


Midas certification does not come without challenges. In particular, quality takes time. Requirements for documentation, reviews, and input from a broad set of stakeholders mean building a data model to Midas standards is much slower than building uncertified data. Re-architecting data models at scale also requires substantial staffing from data and analytics engineering experts (we’re hiring!), and entails costs for teams to migrate to the new data sources.


There are four distinct reviews in the Midas process:


Spec Review: Review the proposed design spec for the data model, before implementation begins.


Data Review: Review the pipeline’s data quality checks and validation report.


Code Review: Review the code used to generate the data pipeline.


Minerva Review: Review the source of truth metric definitions implemented in Minerva, Airbnb’s metrics service.


I don't doubt that the quality is there; I hope it is. However, it seems like a hefty cost to bear, particularly on the human aspect.


Issues often arise in the implementation of certification or standardization processes. External reviews can disincentivize team members, hindering adoption and collaboration, and they may be perceived as a waste of time and an impediment. Collaboration and compromises are crucial. Baby steps and incremental approaches, rather than one-time sacred validations, prove effective. Inclusion is key, and genuine peer-to-peer communication, instead of top-down enforcement and hierarchy, helps to smooth the process.


The worst part is that it also puts a lot of stress and responsibility on the validation team, because, at the end of the day, it's not a scalable approach. One team, however good, cannot review and validate the contributions of all the other teams. As complaints and the backlog grow, so does the stress, and although more resources could help, that investment is rarely made when most voices in the organization are complaining about the process.


We made the decision that we could no longer rely on enforcement to scale data quality at Airbnb, and we instead needed to rely on incentivization of both the data producer and consumer.


I have been on the side of certification/standardization teams, and I know that adoption is key.

To achieve that, you have to relinquish control and act only as a facilitator, an orchestrator of a process that is distributed, supported, and owned by everybody.


This second approach is a lot more sound!



Now that we have a feel for the flaws of this enforcement approach, let's look at how they tackled them, using incentives to get the org to improve quality and boost adoption.


With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb's growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners.


The sheer volume of transactions and data points logged per guest arrival is truly mind-boggling. Spread over a single year, 1.4 billion arrivals would come to roughly 4 million per day; the figure is cumulative, so the real daily rate is lower, but the scale is still huge. Now, picture the significant implications for the storage and computing infrastructure. 😍


And that's not all… guest arrival is only one of the many activities managed by Airbnb; you probably also have various datasets like listing details, user reviews, financial transactions, and operational metrics… all catering to different data practitioners.


The goal is to maintain or enhance data quality even in the face of significant data growth.


So, how did they improve the situation?


In 2022, we began exploring ideas for scaling data quality beyond Midas certification. Data producers were requesting a lighter-weight process that could provide some of the quality guardrails of Midas, but with less rigor and time investment.

To fully enable this incentivization approach, we believed it would be paramount to introduce the concept of a data quality score directly tied to data assets.


While everyone would agree that data quality is a prerequisite for any meaningful work, concerns may arise about the related costs, particularly if the established process tends to relieve individuals of their responsibilities.


Tracking and improving this score is, in itself, the incentive.



You're all familiar with the saying: 'If you can't measure it, you can't manage it,' attributed to Peter Drucker.


In their case, it would be more fitting to state: 'If you can't measure it, you can't understand the need to address it, you can't identify the issues, you can't track progress, you can't ensure observability, you can't set up effective alerting, you can't drive improvements, you can't convince stakeholders, and you can't foster adoption.'


With the input of a cross-functional group of data practitioners, we aligned on these guiding principles:


Full coverage — score can be applied to any in-scope data warehouse data asset


Automated — collection of inputs that determine the score is 100% automated


Actionable — score is easy to discover and actionable for both producers and consumers


Multi-dimensional — score can be decomposed into pillars of data quality


Evolvable — scoring criteria and their definitions can change over time


And so, an important point for me is that, because the score is governed by these five principles, it stays up to date once it has been defined (or scripted, as a kind of "score as code"). That opens up many uses, such as monitoring, alerting, observability, and explainability, at no additional cost to the producers.
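To make the "score as code" idea more concrete, here is a minimal Python sketch of how scoring criteria could be declared once and then evaluated automatically. Everything in it is my own assumption: the `Criterion` class, the criterion names, and the metadata fields (`owner`, `description`, `landing_delay_hours`, `sla_hours`) are hypothetical, and the Airbnb article does not describe its implementation at this level.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical asset metadata, e.g. pulled from a data catalog.
Metadata = dict[str, Any]


@dataclass(frozen=True)
class Criterion:
    name: str                          # identifier shown to producers and consumers
    dimension: str                     # pillar the criterion rolls up into
    check: Callable[[Metadata], bool]  # automated test, no human input


def landed_on_time(meta: Metadata) -> bool:
    # Passes only when both values are present and the delay is within the SLA.
    delay, sla = meta.get("landing_delay_hours"), meta.get("sla_hours")
    return delay is not None and sla is not None and delay <= sla


# Evolving the score means editing this list, not re-reviewing every asset.
CRITERIA = [
    Criterion("has_owner", "Stewardship", lambda m: bool(m.get("owner"))),
    Criterion("has_description", "Usability", lambda m: bool(m.get("description"))),
    Criterion("landed_on_time", "Reliability", landed_on_time),
]


def evaluate(meta: Metadata) -> dict[str, bool]:
    """Run every criterion against one asset's metadata."""
    return {c.name: c.check(meta) for c in CRITERIA}
```

Because the criteria live in one declarative list, the same definitions can feed a dashboard, an alert, or a report, which is how I read the "automated" and "evolvable" principles.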


Even better, they own it, and it becomes a living thing that they can fully visualize and nurture.


This is why I was saying that it should be a prerequisite, a guiding light, even useful in the context of Midas.



Let's wrap up this reading by delving into some details about this score.

The article gives a comprehensive view of the criteria that make up the score and of how to present it.


Guided by our principles, we eventually settled on having four dimensions of data quality: Accuracy, Reliability (Timeliness), Stewardship, and Usability.


We could also weigh each dimension according to our perception of its importance in determining quality. We considered 1) how many scoring components belonged to each dimension, 2) enabling quick mental math, and 3) which elements our practitioners care about most to allocate 100 total points across the dimensions:

That's neat, but even if we understand that these individual scores are supposed to be computed automatically by some kind of pipeline, the dimensions alone leave too much room for interpretation by each individual team. That would likely lead to large disparities between scores from different teams, making the score of little use to data consumers.


However, by delving a little deeper, they provided enough guidance to achieve a good level of homogeneity and still avoid the pitfalls of the overarching previous certification approach.

Much better. Now, we understand the great work done by the team of selected data practitioners to come up with these score definitions.


It provides enough guidance to any team to create something that will be homogeneous across teams and still gives them enough flexibility and ownership to implement it.


We can see that this is not a one-time effort but something that can evolve and is regularly and automatically updated.
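As a rough illustration of how the weighted dimensions could combine into a single number, here is a small Python sketch. The weights below are values I assumed for the example (they only need to sum to 100), and the idea of feeding in a per-dimension "pass rate" is my simplification; refer to the Airbnb article for the actual allocation and criteria.

```python
# Illustrative weights that sum to 100; an assumption made for this sketch.
WEIGHTS = {"Accuracy": 30, "Reliability": 30, "Stewardship": 20, "Usability": 20}


def dimension_points(pass_rates: dict[str, float]) -> dict[str, float]:
    """Convert each dimension's pass rate (0.0 to 1.0) into points."""
    if set(pass_rates) != set(WEIGHTS):
        raise ValueError("expected exactly one pass rate per dimension")
    for name, rate in pass_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"pass rate for {name} must be between 0 and 1")
    return {name: WEIGHTS[name] * pass_rates[name] for name in WEIGHTS}


def total_score(pass_rates: dict[str, float]) -> float:
    """Overall score on a 0-100 scale: the sum of the dimension points."""
    return sum(dimension_points(pass_rates).values())


rates = {"Accuracy": 1.0, "Reliability": 0.5, "Stewardship": 1.0, "Usability": 0.5}
print(dimension_points(rates))  # points earned per dimension
print(total_score(rates))       # 30 + 15 + 20 + 10 = 75.0
```

Keeping the per-dimension points around, rather than only the total, is what later makes the dimensional view possible.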


Our final design presents data quality in three ways, each with a different use case in mind:


A single, high-level score from 0–100. We assigned categorical thresholds of “Poor”, “Okay”, “Good”, and “Great”. Best for quick, high-level assessment of a dataset’s overall quality.


Dimensional scores, where an asset can score perfectly on Accuracy but low on Reliability. Useful when a particular area of deficiency is not problematic (e.g., the consumer wants the data to be very accurate but is not worried about it landing quickly every day).


Full score detail + Steps to improve, where data consumers can see exactly where an asset falls short and data producers can take action to improve an asset’s quality.


And then it provides enough detail for a data practitioner to have a bird's-eye view or delve deeper to fully trust the quality of the dataset.
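The three views can be pictured as one small report built from the same inputs. The Python sketch below is only my interpretation: the numeric cutoffs for "Poor", "Okay", "Good", and "Great" are hypothetical (the quoted text does not give them), and the function and field names are made up for the example.

```python
# Hypothetical cutoffs, checked from highest to lowest.
THRESHOLDS = [(90, "Great"), (70, "Good"), (40, "Okay"), (0, "Poor")]


def label(score: float) -> str:
    """Map a 0-100 score to a categorical label."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for cutoff, name in THRESHOLDS:
        if score >= cutoff:
            return name
    raise AssertionError("unreachable: the last cutoff is 0")


def summarize(earned: dict[str, float], maximum: dict[str, float],
              failed_criteria: list[str]) -> dict:
    """Build the three views: overall score, dimensional scores, full detail."""
    overall = sum(earned.values())
    return {
        "overall": overall,
        "label": label(overall),
        "dimensions": {d: f"{earned[d]:g}/{maximum[d]:g}" for d in maximum},
        "steps_to_improve": sorted(failed_criteria),
    }
```

A consumer in a hurry reads only `label`, someone with a specific need looks at `dimensions`, and a producer works through `steps_to_improve`.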


Note that Midas is still present for their more crucial data points:

The DQ Score has not replaced certification (e.g., only Midas-certified data can achieve a DQ Score > 90).
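One simple way to picture that rule is a cap applied to uncertified assets. The quote only states the outcome (a score above 90 requires Midas certification), so the mechanism below is my assumption; it could just as well be implemented by making certification worth the last points of the score. The function and constant names are hypothetical.

```python
# Highest score an asset can reach without Midas certification (from the quote).
UNCERTIFIED_CAP = 90


def apply_certification_cap(raw_score: float, midas_certified: bool) -> float:
    """Cap the score of uncertified assets; certified assets keep their raw score."""
    if midas_certified:
        return raw_score
    return min(raw_score, UNCERTIFIED_CAP)


print(apply_certification_cap(95, midas_certified=False))  # 90
print(apply_certification_cap(95, midas_certified=True))   # 95
```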


I'd question the necessity of keeping Midas. I get that abandoning it after investing so much in the process would likely be a tough decision. However, if they had opted to introduce the DQ Score first, would there have been a need for a subsequent certification process?

I'm not entirely convinced, but then again, I wasn't there to witness the challenges of the pre-Midas era.


Driving up quality through observation is a smart move.


Navigating blindly without radar is definitely not a wise approach. That's why having a solid observation backbone in place is a very good start to guide and improve. And that's exactly what this score represents: your backbone, your radar.


This score introduces a more nuanced grading approach, going beyond the binary certified/uncertified distinction and making it visible to everyone. At this point, we figure assigning this score becomes a local team thing, and it's supposed to be easier to pull off than getting a certification.


Initially, their strategy involved compelling data teams to provide reliable and high-quality data to data practitioners. To achieve this objective, they implemented a certification process as a means of ensuring compliance and maintaining data integrity.


However, with the second approach, they found a way to scale and, more importantly, federate everyone around the goal of improving data quality while retaining ownership, accountability, and fostering adoption. This was achieved by understanding all the intricacies and maintaining the full liberty to build on top of it.


