Is there such a thing as too much data quality? Yes, actually! Data quality exists to deliver business value. If the project loses that focus, the data quality initiative becomes a self-serving project.
Companies need to keep in mind that every data quality rule incurs costs: the direct cost of defining the rule and the indirect cost when the business process must deal with a rule violation. In a typical project, a data quality initiative is started with good intentions. The most important rules are defined first, e.g., if the quantity is greater than 1000, then trigger an alert for somebody to validate the sales order before shipping a container full of pencils instead of just the usual box. A simple rule like this can save a lot of money by reducing the probability of returns. A rule to standardize the spelling of a city name can have a huge business impact as well, e.g., when somebody looking at the analytics application decides to close the San Jose shop, not realizing that most of its revenue is booked under San José. But the more fine-grained the rules are, the lower the overall business value becomes. At some point, it is better to draw the line.
Another interesting question is the workflow around a rule violation. If we use the above address data as an example, one argument would be that standardizing the city name is only required for the analytical application. Following that train of thought, the data are extracted from the ERP system and loaded into the data warehouse, and this process adds a standardized city name. It is fully automatic and needs no user intervention.
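To make the idea concrete, here is a minimal sketch of such a load step. It assumes a small reference list of canonical city spellings; the names `CANONICAL_CITIES`, `add_standardized_city`, and the `city_std` column are made up for illustration and do not come from any particular tool. The original city value is left untouched and the standardized name is added next to it.

```python
import unicodedata


def _key(name: str) -> str:
    """Build a comparison key: strip accents, collapse whitespace, ignore case."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split()).casefold()


# Hypothetical reference list of canonical spellings (an assumption for this sketch).
CANONICAL_CITIES = ["San José", "München"]
CANONICAL_BY_KEY = {_key(city): city for city in CANONICAL_CITIES}


def add_standardized_city(row: dict) -> dict:
    """Return a copy of the row with an extra city_std column.

    Unknown cities fall back to the value as entered in the source system.
    """
    out = dict(row)
    out["city_std"] = CANONICAL_BY_KEY.get(_key(row["city"]), row["city"])
    return out


# Rows as they might come out of the ERP extract (made-up data).
erp_rows = [
    {"shop": "A", "city": "San Jose", "revenue": 100},
    {"shop": "B", "city": "San José", "revenue": 900},
    {"shop": "C", "city": "san  jose", "revenue": 50},
]

warehouse_rows = [add_standardized_city(row) for row in erp_rows]

# Revenue is now reported under one city name instead of three spellings.
revenue_by_city: dict = {}
for row in warehouse_rows:
    city = row["city_std"]
    revenue_by_city[city] = revenue_by_city.get(city, 0) + row["revenue"]
```

A real address cleansing step would do far more than accent and case folding, but the shape is the same: the source data stay as they are, and the warehouse gets an additional, standardized column.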
On the other hand, if this process uncovers an inconsistency (maybe the city name is simply wrong, but the postal code and street name point to an exact address hit), it makes sense to update the ERP system to help the shipping process. A workflow like this would come to mind: If the address correction has a confidence level greater than 90 percent, update the data in the ERP system automatically; otherwise, send the suggested address to the ERP system and ask somebody to confirm or discard it.
The costs and the benefits of the two workflows will be different, and the goal must be to find the right balance.
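The routing decision in that workflow could be sketched as follows. This is only an illustration: the `AddressCorrection` record, the `route` function, and the way confidence is expressed (a number between 0 and 1) are assumptions, and the 90 percent threshold is simply the figure from the example above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_UPDATE = "auto_update"      # write the correction back to the ERP system
    MANUAL_REVIEW = "manual_review"  # ask somebody to confirm or discard it


# Threshold from the example in the text; in practice this would be configurable.
AUTO_UPDATE_THRESHOLD = 0.90


@dataclass
class AddressCorrection:
    customer_id: str
    original_city: str
    suggested_city: str
    confidence: float  # 0.0 to 1.0, reported by a hypothetical cleansing step


def route(correction: AddressCorrection,
          threshold: float = AUTO_UPDATE_THRESHOLD) -> Action:
    """Decide what to do with a suggested correction based on its confidence."""
    if correction.confidence > threshold:
        return Action.AUTO_UPDATE
    return Action.MANUAL_REVIEW


sure = AddressCorrection("C-1", "San Jose", "San José", confidence=0.97)
unsure = AddressCorrection("C-2", "Sant Jose", "San José", confidence=0.62)

decisions = {c.customer_id: route(c) for c in (sure, unsure)}
```

Where the threshold sits is exactly the cost question: a lower value means fewer manual reviews but more wrong automatic updates, a higher value the opposite.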
Rule result history
Another interesting question is what happens when a rule changes. Again, I’d like to illustrate this with two extreme examples.
In one case, the rule marked 5 percent of the sales orders as warnings over the last months and years because they had an invalid combination of shipping method and distribution channel. Just now, the company figures out that the rule was wrong: It ignored one exceptional case where the combination it usually flags is valid. So, it changes the rule, and the error rate drops to 1 percent. In such a case, there must be a method to rerun the rule for past data and update the history.
The other extreme would be that a company is told that from today onwards, one shipping channel is no longer allowed. If this new rule is added, it should not mark all the perfectly fine orders of the past as invalid and push the error rate up to 50 percent. It is a new rule, valid from today onwards, and it should not be applied to past orders. The alternative is to add the time component to the rule test itself, e.g., if the shipping date is later than the date the rule was introduced, then mark rows with this distribution channel as invalid. Then we could apply the rule to the millions of past orders, but that would just be a huge waste of processing power.
The point is that there is no clear argument for one or the other. Whether a rule applies to the past or not is a property of the rule. But the important point is that it should be possible to add rules and say, “This rule should have been there from the beginning, we just did not have it back then.”
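One way to picture this is to give every rule an optional start date. The sketch below is an assumption about how that could look, not a description of any product: `Rule`, `Order`, `valid_from`, and `evaluate` are hypothetical names, and the order date stands in for whatever date the business considers relevant. A rule without a start date is treated as "should have been there from the beginning" and is evaluated against all orders, so its history can be rebuilt; a rule with a start date skips older orders without testing them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class Order:
    order_id: str
    order_date: date
    shipping_method: str
    channel: str


@dataclass
class Rule:
    name: str
    is_violation: Callable[[Order], bool]
    # None means the rule applies to all history and can be rerun on past data.
    valid_from: Optional[date] = None

    def applies_to(self, order: Order) -> bool:
        return self.valid_from is None or order.order_date >= self.valid_from


def evaluate(rule: Rule, orders: List[Order]) -> List[str]:
    """Return the ids of violating orders, skipping orders before valid_from."""
    return [o.order_id for o in orders
            if rule.applies_to(o) and rule.is_violation(o)]


orders = [
    Order("O1", date(2023, 1, 10), "air", "retail"),
    Order("O2", date(2023, 6, 1), "sea", "online"),
    Order("O3", date(2024, 3, 1), "sea", "online"),
]

# Corrected rule: no start date, so it is rerun over the full history.
corrected_rule = Rule(
    "invalid_method_channel_combination",
    lambda o: o.shipping_method == "air" and o.channel == "retail",
)

# New rule: sea shipping is no longer allowed, but only from 2024 onwards.
new_rule = Rule(
    "sea_shipping_discontinued",
    lambda o: o.shipping_method == "sea",
    valid_from=date(2024, 1, 1),
)

corrected_hits = evaluate(corrected_rule, orders)
new_hits = evaluate(new_rule, orders)
```

In a database-backed implementation, the start date would more likely become a filter in the query, so that past orders are never even read for a rule that does not apply to them.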
In this four-part series, I have collected what I have learned over the last 15 years of working in the data quality area, from a technical, product strategy, and user requirement point of view. I am not aware of any tool that comes even close to competing in all the areas that I’ve mentioned, only tools that claim to be the one-stop solution for everything but fall short of their promises. I hope you find the information in this series helpful in validating your data quality initiative and in making a conscious decision on how to implement the project: what can be done, and what can be done but maybe should not.
This is the fourth and last article in a series.