As a data consultancy, a significant portion of our work involves dealing with the varying degrees of data quality our clients provide: we need to turn data ore into numeric gold. Most of our clients still rely on legacy systems, which don’t provide the same degree of standardization or unification as modern connected systems.
One of our clients had grown through acquisitions and relied on a set of disparate point-of-sale systems, varying from shop to shop, some dating as far back as the 70s and 80s. The client wanted to be able to run a set of standardized user journeys across markets.
Some of the challenges we faced included missing fields and values, duplicated primary keys, consent flags sent for the entire user base as either consented or unconsented, data being pushed to the wrong account, mishandled special characters, and invalid emails.
These data quality issues had a direct business impact. For instance, special characters in names caused some of our SMS messages to be broken into multiple SMS, resulting in a poor customer experience and tripling our bill, while the number of invalid emails on customer records caused the sending domain to be flagged as a spam account and the emails sent to be classified as spam. Missing fields or values also severely impacted the original business case, with 60% of the user base being uncontactable due to missing contact details or valid consent.
Notwithstanding these direct business aspects, and given the importance of some of these attributes in driving customer deduplication and our journeys, the level of data quality issues mandated implementing a series of controls before onboarding the data onto the client’s data platform and marketing ecosystem.
Data Quality & Availability Assessment (DQAA)
Given that data is the fuel that powers a lot of digital marketing initiatives, and in order to avoid some of these surprises, we normally kick off the scoping phase with a set of data quality & availability assessments. These help us gauge where to concentrate the efforts, better understand which use cases can be achieved and to what extent, and estimate the level of effort and data cleansing initiatives needed to realize the use case value.
We normally split this work into phases, or DQAA levels, to help inform some of these decisions. Typically we split it into 1 (self-reported) + 4 levels of DQAAs:
- DQAA Level 0: Getting a first sense of the overall data quality and availability through self-assessment by, or interviews of, the people responsible for the data (whether they are data owners/stewards or custodians).
- DQAA Level 1: A shallow but wide initial assessment providing a first scan of data availability and quality, focusing on high-level metrics (e.g. total counts, percentage of non-null values) and the presence of specific entities or attributes in the data model (a minimal sketch of this level follows after this list).
- DQAA Level 2: Implements a narrow but deep profiling of the available data, looking at the distribution of records as well as patterns and trends in the data.
- DQAA Level 3: Provides a detailed profiling of the data, implementing business rules to assess data quality, looking to identify root causes of data discrepancies, and implementing corrective measures.
- DQAA Level 4: Represents a continuous process for managing data quality, including the implementation of predictive anomaly detection and machine learning to improve data quality on an ongoing basis.
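To make Level 1 a bit more concrete, here is a minimal sketch of such a shallow-but-wide scan, assuming the extract has been loaded into a pandas DataFrame. The file name and the exact metrics are purely illustrative rather than the exact tooling we use:

```python
import pandas as pd


def level_1_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Shallow, wide profiling: one row of high-level metrics per column."""
    total = len(df)
    profile = pd.DataFrame({
        "column": df.columns,
        "non_null_count": [int(df[c].notna().sum()) for c in df.columns],
        "distinct_count": [int(df[c].nunique(dropna=True)) for c in df.columns],
    })
    profile["non_null_pct"] = (100 * profile["non_null_count"] / max(total, 1)).round(2)
    profile["total_records"] = total
    return profile


# Hypothetical extract from one of the point-of-sale systems
pos_extract = pd.read_csv("pos_customers.csv")
print(level_1_profile(pos_extract))
```

Even this simple pass is usually enough to highlight which entities and attributes are missing or too sparsely populated to support a given use case.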
Working with our clients, we typically help them define a minimal set of data needed to power some of their use cases and define quality metrics around it, to track their progress in bringing data quality to a level adequate to power their operational processes.
Building a Level 4 DQAA process
Implementing a DQA process
Faced with the sometimes lacking quality of the input data meant to drive the client’s business processes, we needed to set up a series of controls in order for these processes to be effective.
We decided to create a first-pass data quality assurance process leveraging Great Expectations (GE). Great Expectations is an open-source data quality tool that can be used to check the data quality of input files or analytical tables. It supports high-level schema checks as well as specific business logic or lineage checks. GE further allows setting acceptable thresholds for data quality and registering the statistics of the different data quality runs in a database.
Great Expectations also provides, as part of its standard offering, the ability to build HTML reports based on the checks implemented.
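As a rough sketch of what such checks can look like with the classic pandas-backed Great Expectations API (the exact methods have evolved across GE versions, and the file and column names here are purely illustrative):

```python
import great_expectations as ge

# Load one of the onboarded files as a pandas-backed GE dataset
customers = ge.read_csv("onboarded_customers.csv")

# High-level schema and completeness checks, with acceptable thresholds via "mostly"
customers.expect_column_to_exist("customer_id")
customers.expect_column_values_to_be_unique("customer_id")
customers.expect_column_values_to_not_be_null("email", mostly=0.95)

# A business-rule style check: emails should at least look like emails
customers.expect_column_values_to_match_regex(
    "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$", mostly=0.98
)

# Run all registered expectations and inspect the aggregated result
results = customers.validate()
print(results["success"])
```

The `mostly` argument is what lets a check tolerate a configurable share of bad records instead of failing on the first exception, which is how acceptable thresholds get expressed.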
In the span of two weeks, we implemented a series of initial data quality checks on the onboarded files and built an initial data quality assurance process using the client’s existing workflow system (Airflow), notifying the sender of the data upon data quality failures.
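A simplified sketch of how such a pipeline can be wired together in Airflow follows. The DAG, task, file path, and email address are hypothetical stand-ins, and the real notification went through the client’s own channels rather than a plain email:

```python
from datetime import datetime

import great_expectations as ge
from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator


def validate_onboarded_file(**_):
    """Fail the task (and thus trigger the notification) if the checks do not pass."""
    customers = ge.read_csv("/data/incoming/onboarded_customers.csv")
    customers.expect_column_values_to_be_unique("customer_id")
    customers.expect_column_values_to_not_be_null("email", mostly=0.95)
    results = customers.validate()
    if not results["success"]:
        raise ValueError(f"Data quality checks failed: {results}")


with DAG(
    dag_id="onboarded_file_dq_checks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_checks = PythonOperator(
        task_id="run_dq_checks",
        python_callable=validate_onboarded_file,
    )

    notify_sender = EmailOperator(
        task_id="notify_sender_on_failure",
        to="data-sender@example.com",
        subject="Data quality checks failed for the onboarded customer file",
        html_content="One or more Great Expectations checks failed; see the Airflow logs.",
        trigger_rule="one_failed",
    )

    run_checks >> notify_sender
```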
Building a monitoring setup
Once we had implemented the initial set of pipelines and key datasets, the need for data quality observability became more pressing. Changes made to the data at the source and through data engineering code created both improvements and regressions that needed to be monitored beyond a simple pass-or-fail threshold.
We decided to build a dashboard on the observability tooling already available at our client (Grafana), fed by the statistics of every run registered in the metadata database.
The base statistics were however not enough for our increased observability needs, and we decided to increase the logging of metrics in the metadata database by introducing column-level statistics such as unexpected_count or unexpected_percent. We coupled this with a periodic run of our Great Expectations checkpoints in order to better spot data quality trends over time.
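The general idea is sketched below, assuming the classic Great Expectations validation result format (dictionary-style access; the exact accessors vary by version) and a local SQLite table standing in for the client’s metadata database that Grafana reads from:

```python
import sqlite3
from datetime import datetime, timezone

import great_expectations as ge

# A local SQLite file stands in for the client's metadata database in this sketch
conn = sqlite3.connect("dq_metadata.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS dq_run_stats (
           run_ts TEXT, expectation TEXT, column_name TEXT,
           success INTEGER, unexpected_count INTEGER, unexpected_percent REAL
       )"""
)

customers = ge.read_csv("onboarded_customers.csv")
customers.expect_column_values_to_not_be_null("email", mostly=0.95)
customers.expect_column_values_to_match_regex("phone", r"^\+?[0-9 ]+$", mostly=0.9)

validation = customers.validate(result_format="SUMMARY")
# Newer classic versions return a result object; older ones return a plain dict
payload = validation.to_json_dict() if hasattr(validation, "to_json_dict") else validation

run_ts = datetime.now(timezone.utc).isoformat()
for res in payload["results"]:
    cfg = res["expectation_config"]
    conn.execute(
        "INSERT INTO dq_run_stats VALUES (?, ?, ?, ?, ?, ?)",
        (
            run_ts,
            cfg["expectation_type"],
            cfg["kwargs"].get("column"),
            int(res["success"]),
            res["result"].get("unexpected_count"),
            res["result"].get("unexpected_percent"),
        ),
    )
conn.commit()
```

With each run appending a timestamped row per check, the dashboard can chart unexpected_percent per column over time rather than just the latest pass or fail.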
Together with this initial implementation, we introduced an operational process to review the results of this dashboard and decide on the introduction of new data quality rules (i.e. new expectations).
Data Quality Classification and Alerting
In the next step of our journey, we focused on improving the process for managing the different checks implemented: with hundreds of checks in place, it became difficult to keep track of the different failures and understand what was relevant and what needed to be tackled right away.
To this end, we focused on classifying the different checks in terms of importance, such as general priority and whether a check failure should raise an alert or just a general warning. We also worked with our stakeholders to determine appropriate thresholds for each check, allowing us to get a firmer grasp on data quality and create a more robust process.
We implemented specific alerting on top of our data quality pipelines to let the engineering team know about the most severe data quality failures.
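One way to carry such a classification is through the meta field Great Expectations allows on each expectation. The sketch below uses hypothetical severity labels and a placeholder alerting function (not the client’s actual alerting channel), raising an alert only for failures tagged as critical and logging the rest as warnings:

```python
import logging

import great_expectations as ge

logger = logging.getLogger("dq_alerts")


def send_alert(message: str) -> None:
    """Placeholder for the real alerting channel (e.g. a chat webhook or pager)."""
    logger.error("ALERT: %s", message)


customers = ge.read_csv("onboarded_customers.csv")

# Severity is carried in each expectation's free-form meta field
customers.expect_column_values_to_be_unique(
    "customer_id", meta={"severity": "critical"}
)
customers.expect_column_values_to_not_be_null(
    "email", mostly=0.95, meta={"severity": "warning"}
)

validation = customers.validate()
payload = validation.to_json_dict() if hasattr(validation, "to_json_dict") else validation

for res in payload["results"]:
    if res["success"]:
        continue
    severity = res["expectation_config"].get("meta", {}).get("severity", "warning")
    expectation = res["expectation_config"]["expectation_type"]
    if severity == "critical":
        send_alert(f"Critical data quality failure: {expectation}")
    else:
        logger.warning("Data quality warning: %s", expectation)
```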
Enhancing full end-to-end traceability
Besides these typical spot checks on individual datasets, we also wanted to get a better overview of data quality across a data pipeline’s ETL lineage, and to be able to use the results and counts from some of our data quality checks to control and reconcile our data end-to-end. We implemented specific checks leveraging Great Expectations evaluation parameters to that end.
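As a rough illustration of how evaluation parameters support this kind of reconciliation (the file names and upstream count below are hypothetical), the expected row count of a downstream dataset is not hard-coded but injected at validation time from the result of an upstream step:

```python
import great_expectations as ge

# Upstream step: count the records coming out of the raw layer
raw_customers = ge.read_csv("raw_customers.csv")
upstream_row_count = len(raw_customers)

# Downstream step: the curated output should carry the same number of records
curated_customers = ge.read_csv("curated_customers.csv")
curated_customers.expect_table_row_count_to_equal(
    value={"$PARAMETER": "upstream_row_count"}
)

# The parameter is only resolved at validation time, from the upstream result
results = curated_customers.validate(
    evaluation_parameters={"upstream_row_count": upstream_row_count}
)
print(results["success"])
```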
Another feature we started to implement within our pipelines was record triaging: essentially parking failed records in specific tables or views so that they could be inspected and excluded from some of the downstream integration processes, for instance keeping them out of our Customer Data Platform’s customer deduplication process and avoiding cases where personal data might be sent to a john.doe@gmail.com. We couldn’t leverage our base data quality framework for this, as no built-in feature supported that use case; we instead opted to leverage Delta Live Tables expectations to implement it on top of our Databricks setup, while still leveraging Great Expectations for monitoring data quality.
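A stripped-down sketch of that triaging pattern with Delta Live Tables is shown below. It assumes it runs inside a Databricks DLT pipeline; the table names, source view, and the single email rule are hypothetical, and the real pipeline carried far more rules. Valid records keep flowing towards the CDP feed while failing ones are parked in a quarantine table for inspection:

```python
import dlt
from pyspark.sql import functions as F

# Simple validity rule, reused both to keep good records and to park bad ones
VALID_EMAIL = "email IS NOT NULL AND email RLIKE '^[^@ ]+@[^@ ]+[.][^@ ]+$'"


@dlt.table(comment="Customer records passing the validity rules, fed to the CDP")
@dlt.expect_or_drop("valid_email", VALID_EMAIL)
def customers_clean():
    return dlt.read("customers_raw")


@dlt.table(comment="Parked records failing the validity rules, kept for manual review")
def customers_quarantine():
    return dlt.read("customers_raw").where(F.expr(f"NOT ({VALID_EMAIL})"))
```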
Building a Data Quality Assurance application
With records parked and not integrated into the CDP, we ended up with a lot of potentially valuable data that would need to be corrected in order to become usable. Given that we were getting a regular feed of data from diverse source systems, any changes we needed to make to the data had to be applied at the source in order not to be overwritten.
Some of the source records needed to be investigated manually in order to understand what should be done in terms of data quality. Issues such as typos in email addresses, wrong country prefixes in phone numbers, or addresses with the city placed in the addressline1 field instead of the city field could easily be corrected, but other issues, such as email addresses without valid MX records, might need some involvement from the customer to be fully corrected and could require a visit to one of the stores.
In order to address the client’s need for better data quality, we looked to build a dedicated application that goes through the different parked records and allows the client’s Data Stewardship/Ops team to manually review these exceptions and decide how to treat them. The treated records ended up in a separate data store before being integrated back into the source systems.
Conclusion
Unlocking the power of data is a long and difficult journey, in which the quality of the “raw” materials needs to be understood before they can be cut and polished into gemstones. It requires a focus not just on analytics, but on data strategy, processes, and technical implementation in order to bring out data’s full potential and value.
At WiseAnalytics.io, we help our clients leverage data in an operational way, bringing their data to a state of control and improving upon it.