Data quality is a journey; it does not come in one day, and the focus should be on continuously improving data quality rather than on having it right on day one.
A more thorough approach looks at the different areas of planning, validation, cleansing, surfacing, and documentation of the various data objects.
Having a clear plan for what information should be collected, and how, is the first step towards good data quality.
When planning for data, it is crucial to define what to collect and where to collect it from, identifying the different sources of information.
Data layer: A data layer is a software component that provides simplified access to stored data. The term is often used to refer to a front-end component that integrates with Google Tag Manager (GTM) and external tag components.
Part of defining the data is setting up the attributes and event definitions that we want to collect and mapping these to what the different tags/systems are expecting.
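As a rough sketch of what such a definition could look like, an event can be described once and mapped to what each downstream tag or system expects; the event name, attributes, and destination systems below are illustrative assumptions, not a specific standard.

```python
# Minimal sketch of an event definition mapped to downstream systems.
# Event name, attributes, and destinations are illustrative assumptions.
ADD_TO_CART = {
    "event": "add_to_cart",
    "attributes": {
        "product_id": "string",   # internal product identifier
        "price": "float",         # unit price, in the site currency
        "currency": "string",     # ISO 4217 code, e.g. "USD"
        "quantity": "integer",
    },
    # How each attribute maps to what the different tags/systems expect.
    "destinations": {
        "google_analytics": {"product_id": "item_id", "price": "price"},
        "warehouse_events": {"product_id": "product_id", "price": "unit_price"},
    },
}
```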
Naming Conventions: Naming conventions help make clear what specific data is for. They can be used to define table names, where prefixes and suffixes give insight into the origin of a table, for example a table from an operational data store (e.g., O_), a staging table (e.g., STG_), or an aggregate table (e.g., A_). They are also useful for defining field names and can be used to standardize source input, such as UTM parameters.
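A small helper can enforce such conventions, here building table names from the prefixes mentioned above and normalizing UTM parameters; the functions themselves are just an illustrative sketch, not part of any particular tool.

```python
# Illustrative sketch of naming-convention helpers (not a specific library).
TABLE_PREFIXES = {"operational": "O_", "staging": "STG_", "aggregate": "A_"}

def table_name(layer: str, entity: str) -> str:
    """Build a table name from its layer prefix, e.g. STG_orders."""
    return f"{TABLE_PREFIXES[layer]}{entity.lower()}"

def standardize_utm(params: dict) -> dict:
    """Lowercase and trim UTM parameters so 'Email ' and 'email' match."""
    return {
        key.lower(): value.strip().lower()
        for key, value in params.items()
        if key.lower().startswith("utm_")
    }

print(table_name("staging", "Orders"))                         # STG_orders
print(standardize_utm({"utm_source": "Email ", "utm_medium": "CRM"}))
```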
Format & Standardization: Data feeding into the systems should abide by a specific format defined beforehand and should already be standardized to some extent. ISO standards exist for quite a few data points, such as country codes (ISO 3166), gender (ISO 5218), latitude/longitude (ISO 6709), language codes (ISO 639), currencies (ISO 4217), and dates (ISO 8601).
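Standardization can be as simple as mapping free-form inputs to their ISO representation before they enter the pipeline; the sketch below uses a made-up lookup of country labels and ISO 8601 timestamps as examples.

```python
from datetime import datetime, timezone

# Example lookup from free-form country labels to ISO 3166-1 alpha-2 codes.
COUNTRY_TO_ISO3166 = {"united states": "US", "u.s.": "US", "france": "FR"}

def standardize_country(raw: str) -> str:
    """Map a free-form country label to its ISO 3166 code."""
    return COUNTRY_TO_ISO3166.get(raw.strip().lower(), "UNKNOWN")

def standardize_timestamp(dt: datetime) -> str:
    """Serialize a datetime as an ISO 8601 string in UTC."""
    return dt.astimezone(timezone.utc).isoformat()

print(standardize_country(" France "))   # FR
print(standardize_timestamp(datetime(2020, 5, 1, 12, 30, tzinfo=timezone.utc)))
```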
Logging and Data Structures: Logging frameworks complement standard event logging with additional data in a standard format, and allow these events to be integrated into existing data pipelines.
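A minimal sketch of such structured event logging, using Python's standard logging module and JSON output; the event and attribute names are assumptions made for the example.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object, easy to ingest downstream."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
            # Extra structured attributes attached via the `extra` argument.
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("events")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("video_view", extra={"attributes": {"device": "ios", "video_id": "abc123"}})
```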
There is also a need to define how the data structures should store the events. There are trade-offs between generic data structures and specialized data structures.
A generic data structure makes it easy to leverage one data set along with others. Think, for example, of a contacts table that encompasses every interaction with a customer, be it SMS, email, direct mail, or another source of information. Having the data in one generic data structure makes it very easy to consume these sets of information together. A generic data structure, however, does not make it as easy to consume the more specialized information. In the case of the contacts table mentioned above, a field such as email open rate would likely sit within a nested object.
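To make the trade-off concrete, here is a sketch of the same email interaction in a generic contacts record versus a specialized email table; the field names are assumptions made for illustration.

```python
# Generic contacts record: every channel fits, but channel-specific details
# such as email opens end up nested and harder to query directly.
generic_contact = {
    "customer_id": 42,
    "channel": "email",
    "timestamp": "2020-05-01T12:30:00+00:00",
    "details": {"campaign_id": "spring_sale", "opened": True, "clicked": False},
}

# Specialized email table: flat columns make open-rate calculations trivial,
# but the structure cannot hold SMS or direct-mail interactions.
specialized_email_row = {
    "customer_id": 42,
    "campaign_id": "spring_sale",
    "sent_at": "2020-05-01T12:30:00+00:00",
    "opened": True,
    "clicked": False,
}
```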
There are quite a few checks that can be performed to see whether the data received matches expectations. From data structure checks and value checks to lifecycle checks and referential checks, all these validation processes help ensure that the data is up to spec; a sketch of these checks in code is given after the descriptions below.
Data Structure checks: Check that the incoming data conforms to the expected data structure, for example the number of columns present and their datatypes.
Value Checks: Check that the values in the datasets match what would be expected from incoming data. Think, for example, of product prices. It is not normally expected to find a product price with a negative value.
Lifecycle checks: Can show if there might be missing data in the dataset. Think about a purchase on an e-commerce site. For a purchase to happen, a couple of actions need to be performed before it, such as browsing the website, clicking an add to cart or purchase button, and flowing through the checkout. A lifecycle check verifies that these preceding events are indeed present.
Referential integrity checks: Validate that references to other data objects, also called “foreign keys,” exist. Going back to the e-commerce purchase example, a referential integrity check could look at the products present in the orders and verify that they also exist in the product master data.
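The sketch below illustrates how these four types of checks could be expressed with pandas; the column names, the expected schema, and the sample data are all assumptions made for the example.

```python
import pandas as pd

# Illustrative event and reference data; column names are assumptions.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": ["A", "B", "Z"],
    "price": [10.0, -5.0, 20.0],
})
products = pd.DataFrame({"product_id": ["A", "B", "C"]})
checkouts = pd.DataFrame({"order_id": [1, 2]})

# Data structure check: expected columns and datatypes are present.
expected_columns = {"order_id", "product_id", "price"}
structure_ok = expected_columns.issubset(orders.columns) and orders["price"].dtype == "float64"

# Value check: product prices are not expected to be negative.
bad_prices = orders[orders["price"] < 0]

# Lifecycle check: every purchase should have a preceding checkout event.
missing_checkout = orders[~orders["order_id"].isin(checkouts["order_id"])]

# Referential integrity check: every ordered product exists in master data.
missing_reference = orders[~orders["product_id"].isin(products["product_id"])]

print(structure_ok, len(bad_prices), len(missing_checkout), len(missing_reference))
```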
Monitoring and alerting can be set up on top of these validation processes to warn about data quality issues, such as failing structure, value, lifecycle, or referential checks.
There are multiple steps or approaches to data cleansing. The first decision is whether or not to let data flow through the system when it does not pass validation. The second decision, if you do let it flow through, is how to treat it.
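One possible way to implement that decision, sketched below with assumed column names, is to either route records that fail validation to a quarantine dataset or flag them and let them flow through.

```python
import pandas as pd

def route_invalid(records: pd.DataFrame, valid_mask: pd.Series, block: bool):
    """Either stop invalid records (quarantine) or flag them and let them flow.

    `block` corresponds to the first decision: whether invalid data flows at all.
    """
    if block:
        return records[valid_mask], records[~valid_mask]
    flagged = records.copy()
    flagged["dq_valid"] = valid_mask  # downstream consumers decide how to treat it
    return flagged, pd.DataFrame()

orders = pd.DataFrame({"order_id": [1, 2], "price": [10.0, -5.0]})
passed, quarantined = route_invalid(orders, orders["price"] >= 0, block=True)
```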
Reformatting of data from different systems: Reformatting the data is possible in some cases, for instance when data types can be cast, or when strings can be harmonized through replacement, regexp matching, or a lookup table.
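A sketch of those three reformatting techniques, using made-up source values:

```python
import re

COUNTRY_LOOKUP = {"usa": "US", "u.s.a.": "US", "united states": "US"}

def reformat_price(raw: str) -> float:
    """Cast a price string such as '1,234.50 ' to a float."""
    return float(raw.replace(",", "").strip())

def reformat_phone(raw: str) -> str:
    """Harmonize phone numbers by stripping everything but digits via regexp."""
    return re.sub(r"\D", "", raw)

def reformat_country(raw: str) -> str:
    """Harmonize country labels through a lookup table."""
    return COUNTRY_LOOKUP.get(raw.strip().lower(), raw)

print(reformat_price("1,234.50 "), reformat_phone("+1 (555) 010-2030"), reformat_country("USA"))
```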
Values: Specific values can be inferred or capped to help with the interpretation of the data downstream; product prices, for instance, can be constrained to be higher than 0.
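For instance, a sketch of capping product prices so they stay above 0; the floor value is an assumption.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "price": [10.0, -5.0]})
# Cap prices at a small positive floor so downstream metrics stay interpretable.
orders["price"] = orders["price"].clip(lower=0.01)
```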
Missing Lifecycle data: In cases where lifecycle checks fail, it might be possible to infer some of the values for the lifecycle. Taking the example of the product purchase touched on previously, it can be possible to reconstruct the checkout event for customers who made a purchase but for whom no checkout event is present. Some of the data in a checkout event, such as the products and prices, can be directly inferred from the purchase event; other fields, such as the timing of the event, are less certain.
It is worth noting that the way the data within the event is inferred will have an impact on the downstream metrics being calculated.
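A minimal sketch of such an inference, with hypothetical column names, where the checkout is reconstructed from the purchase and its timing is approximated from the purchase time:

```python
import pandas as pd

purchases = pd.DataFrame({
    "order_id": [1, 2],
    "product_id": ["A", "B"],
    "price": [10.0, 20.0],
    "purchased_at": pd.to_datetime(["2020-05-01 12:30", "2020-05-01 13:00"]),
})
checkouts = pd.DataFrame({"order_id": [1]})

# Reconstruct checkout events for purchases that have no checkout recorded.
missing = purchases[~purchases["order_id"].isin(checkouts["order_id"])].copy()
# Products and prices can be copied directly from the purchase event;
# the timing is less certain, here assumed to be shortly before the purchase.
missing["checked_out_at"] = missing["purchased_at"] - pd.Timedelta(minutes=1)
missing["inferred"] = True  # keep track of inferred rows for downstream metrics
inferred_checkouts = missing[["order_id", "product_id", "price", "checked_out_at", "inferred"]]
```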
Referential validation failing: A reference can be missing for multiple reasons; for example, the product data may have been deleted before an import could capture it, while a transaction based on this product had already happened. In this case, it can be useful to insert the information we do know about the product into the reference table.
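For example, a sketch, with assumed column names, of backfilling the product reference table with the few attributes known from the transaction:

```python
import pandas as pd

products = pd.DataFrame({"product_id": ["A", "B"], "name": ["Alpha", "Beta"]})
orders = pd.DataFrame({"order_id": [10], "product_id": ["Z"], "price": [15.0]})

# Find products referenced in orders but missing from the reference table.
missing_ids = orders.loc[~orders["product_id"].isin(products["product_id"])]

# Insert what we do know about them, flagged as placeholders for later review.
placeholders = pd.DataFrame({
    "product_id": missing_ids["product_id"],
    "name": "UNKNOWN (backfilled from orders)",
})
products = pd.concat([products, placeholders], ignore_index=True)
```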
Master Data management: Helps create a consistent view of your key data entities. It unifies and merges the different records that relate to the same entity and, based on the varying degrees of authority of the sources, promotes specific attributes to the “golden record.”
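A sketch of that promotion logic, assuming a simple source-authority ranking and hypothetical record fields:

```python
# Assumed authority ranking: higher means more trusted for a given field.
SOURCE_AUTHORITY = {"crm": 3, "ecommerce": 2, "newsletter_signup": 1}

records = [
    {"source": "newsletter_signup", "email": "jane@example.com", "name": "J. Doe"},
    {"source": "crm", "email": "jane@example.com", "name": "Jane Doe", "phone": None},
    {"source": "ecommerce", "email": "jane@example.com", "phone": "5550102030"},
]

def golden_record(records: list[dict]) -> dict:
    """Merge records for one entity, promoting fields from the most authoritative source."""
    golden = {}
    for record in sorted(records, key=lambda r: SOURCE_AUTHORITY[r["source"]]):
        for field, value in record.items():
            if field != "source" and value is not None:
                golden[field] = value  # higher-authority records overwrite earlier ones
    return golden

print(golden_record(records))
```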
One of the challenges in the cleansing process is that it is very much dependent on the cause of the issue and on the application of the data. Excluding records whose fields do not pass a DQ check might not be acceptable for reporting, but it might be worth doing when attempting to build a predictive model.
Surfacing data to multiple stakeholders, through dashboards or datasets, helps improve data quality by having multiple pairs of eyes on the data. Data that is surfaced and continuously monitored for business performance tends to be of the highest quality, as there is a business incentive to have anomalies identified and corrected.
Analyses following a metric deviation can help identify data facing issues, for instance the wrongful assignment of a given event to a category. Imagine the assignment of a video view to a device category: web, iOS, or Android.
If video views suddenly start getting assigned to a different category (web, say), chances are there is a wrongful assignment, most likely coming from a logging bug. Surfacing this data in a dashboard makes it very easy to identify this behavior and puts some organizational pressure on fixing the root causes of these anomalies.
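As an illustrative sketch with made-up daily counts, computing the share of views per device and flagging a sudden shift is enough to surface this kind of anomaly; the 20-point threshold is an assumption.

```python
import pandas as pd

# Hypothetical daily video-view counts by device category.
views = pd.DataFrame({
    "date": ["2020-05-01"] * 3 + ["2020-05-02"] * 3,
    "device": ["web", "ios", "android"] * 2,
    "views": [100, 450, 450, 600, 200, 200],
})

# Share of views per device and day.
daily = views.pivot(index="date", columns="device", values="views")
share = daily.div(daily.sum(axis=1), axis=0)

# Flag categories whose share jumped by more than 20 points day over day.
jumped = (share.diff() > 0.20).any()
print(share.round(2))
print("suspicious categories:", list(jumped[jumped].index))
```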
Besides having clean data in the system, it is important to surface the necessary information related to the datasets, so that users can find the required information, use it effectively, and avoid making wrong assumptions as to its meaning. The concept of a data dictionary, a document providing information related to the data structures and describing the data’s content, is important to that endeavour.
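A minimal sketch of what a data dictionary entry could contain; the table, owner, and field descriptions are illustrative assumptions.

```python
# Illustrative data dictionary entry for a staging orders table.
DATA_DICTIONARY = {
    "STG_orders": {
        "description": "Staging table holding raw e-commerce orders before cleansing.",
        "owner": "data-engineering",
        "refresh": "hourly",
        "columns": {
            "order_id": "Unique identifier of the order (primary key).",
            "product_id": "Reference to the product master data (foreign key).",
            "price": "Unit price in the site currency (ISO 4217 code stored alongside).",
            "purchased_at": "Purchase timestamp, ISO 8601, UTC.",
        },
    },
}
```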
There are quite a few tools out there that can help in that endeavour.
Some companies go a bit beyond these tools and include an approval workflow, “Metric Governance,” to handle the lifecycle of a metric definition.
There are multiple steps to take to ensure data quality, and they need to be taken at all stages of the data usage funnel. Only with an incremental approach is it realistic to achieve the goal of having “clean data.”