Some interesting discussion on my last post got me thinking about the meaning of data quality. What does quality data really mean? Is it free from errors and omissions? Are defined terms used consistently throughout the data set? Or is it something even larger than that: does the data accurately reflect the reality of a business?
Errors and Omissions
This is the most obvious category of data quality problems. Errors made at the source carry through to any business intelligence system uncorrected, but they also appear plainly in the source system. An order to the wrong customer or for the wrong quantity will simply show as it was entered. Omissions are worse, because they prevent data rollups from functioning as they should, and the source system may not even be concerned with the missing data items. Is an omitted field a “zero” or an “unknown”? This is a critical question because of the basic arithmetic behind it. 1 plus 0 equals 1. But 1 plus unknown equals unknown. Throw in a few thousand unknowns and your summarization is invalid. Proper handling of unknowns requires input from your business analysts.
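The arithmetic of unknowns can be made concrete with a small Python sketch. It assumes, purely for illustration, that an omitted value is represented as `None`; the function names `strict_sum` and `sum_known_only` are hypothetical and not taken from any particular tool.

```python
from typing import Optional, Sequence

def strict_sum(values: Sequence[Optional[int]]) -> Optional[int]:
    """Sum that propagates unknowns: a single None makes the total unknown."""
    total = 0
    for value in values:
        if value is None:
            return None  # 1 plus unknown equals unknown
        total += value
    return total

def sum_known_only(values: Sequence[Optional[int]]) -> int:
    """Sum that silently treats every unknown as zero."""
    return sum(value for value in values if value is not None)

quantities = [1, None, 4]  # hypothetical order quantities, one omitted

print(strict_sum([1, 0]))          # 1: a real zero is harmless
print(strict_sum(quantities))      # None: the rollup is unknown
print(sum_known_only(quantities))  # 5: looks valid, but hides the omission
```

Which of the two behaviours is right depends on what the omitted field means to the business, which is why the analysts need to be consulted.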
If enough unknowns pollute your data set, you may have major data quality problems – to the point that your data set may not be useable. Early reports from Canada’s ongoing 2011 voluntary “census” indicate that this will be a problem, one that was widely anticipated by data professionals. Data profiling can help determine the extent of errors and omissions in a data set.
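As a rough illustration of what a profiling pass might look like, the sketch below counts how often each field is missing. It assumes that rows are plain dictionaries and that a missing value is `None`, an empty string, or an absent key; the field names and the function `profile_missing` are made up for the example.

```python
def profile_missing(rows, fields):
    """Return, for each field, the fraction of rows where it is missing."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            # An absent key, None, or an empty string counts as missing.
            # A genuine zero does not.
            if row.get(field) in (None, ""):
                counts[field] += 1
    n = len(rows)
    return {field: (counts[field] / n if n else 0.0) for field in fields}

# Hypothetical order rows
rows = [
    {"customer": "A", "qty": 3},
    {"customer": "", "qty": None},
    {"customer": "B"},
    {"customer": "C", "qty": 0},
]

report = profile_missing(rows, ["customer", "qty"])
print(report)
```

A report like this does not fix anything, but it shows at a glance which fields have enough gaps to threaten a rollup.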
In my line of work, inconsistent use of terms is the most common data quality problem I see. This is where one group or department uses a term for something, and another group or department uses a similar or identical term for something else. When these groups get together to discuss data, they invariably accuse each other of having “bad” or “invalid” data. Differences can arise from simple terms or codes, dates (sales date, payment date, shipping date, etc.), or even periods of time (fiscal dates, annual dates, etc.). This is essentially a problem of communication, not of the data itself. Thomas Redman writes extensively on this issue in his book Data Driven.
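To see how two departments can both be “right” with different numbers, consider this small sketch. The orders, amounts and field names are invented for illustration; the only point is that “March sales” changes depending on which date defines a sale.

```python
from datetime import date

# Hypothetical orders, each with two candidate dates for "when the sale happened"
orders = [
    {"amount": 100, "sales_date": date(2011, 3, 30), "payment_date": date(2011, 4, 2)},
    {"amount": 250, "sales_date": date(2011, 3, 15), "payment_date": date(2011, 3, 20)},
    {"amount": 75, "sales_date": date(2011, 2, 27), "payment_date": date(2011, 3, 1)},
]

def monthly_total(orders, date_field, year, month):
    """Total order amount for one month, using the chosen date field."""
    return sum(
        order["amount"]
        for order in orders
        if order[date_field].year == year and order[date_field].month == month
    )

march_by_sales_date = monthly_total(orders, "sales_date", 2011, 3)
march_by_payment_date = monthly_total(orders, "payment_date", 2011, 3)

print(march_by_sales_date)    # 350
print(march_by_payment_date)  # 325
```

Neither figure is bad data. The two totals simply answer different questions, and the fix is an agreed definition rather than a data cleanup.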
But what about Business Reality?
What managers are really looking for in their data is a picture of reality. Sometimes the data set does not capture this. A good example of this happened to me a few years ago when I was dealing with the inventory system of a large corporate client. They were taking daily snapshots of inventory from their warehouses, but one of their warehouses was off-line each and every weekend. In the case of the off-line warehouse, inventory changes were reported each Monday, but the Friday, Saturday and Sunday midnight snapshots were not correct. Even though their business was running on a state-of-the-art SAP system, their inventory data did not reflect reality. To solve this conundrum, we resorted to reading transactions from the General Ledger and inserting adjustments into the inventory snapshots. It was a roundabout solution, but it worked. Of course, one could have suggested that the warehouse be on-line 24/7 like all the others, but often data processes have to accommodate the business, not the other way around.
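The general idea of that workaround can be sketched in a few lines of Python. This is not the client's actual implementation: the data shapes, the quantities and the function `adjust_snapshots` are all assumptions made for illustration. It starts from the last trustworthy snapshot and applies the ledger movements up to each stale snapshot date.

```python
from datetime import date

def adjust_snapshots(last_good_qty, last_good_date, stale_dates, ledger):
    """Rebuild stale snapshot quantities from ledger movements.

    ledger is a list of (transaction_date, quantity_delta) pairs.
    Returns {snapshot_date: corrected_quantity}.
    """
    corrected = {}
    for snapshot_date in sorted(stale_dates):
        # Net movement after the last good snapshot, up to and including this date
        delta = sum(
            qty for (txn_date, qty) in ledger
            if last_good_date < txn_date <= snapshot_date
        )
        corrected[snapshot_date] = last_good_qty + delta
    return corrected

# Hypothetical data: Thursday's snapshot is trusted, Friday to Sunday are stale
thursday = date(2011, 3, 3)
weekend = [date(2011, 3, 4), date(2011, 3, 5), date(2011, 3, 6)]
ledger = [
    (date(2011, 3, 4), -20),  # goods issued on Friday
    (date(2011, 3, 5), 50),   # goods received on Saturday
    (date(2011, 3, 6), -5),   # goods issued on Sunday
]

fixed = adjust_snapshots(500, thursday, weekend, ledger)
for snapshot_date, qty in fixed.items():
    print(snapshot_date, qty)
```

In this sketch the corrected quantities are cumulative, so each stale snapshot reflects every ledger movement since the last good one.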
My point on this subject is that data quality can be bigger than the data set itself, and sometimes a larger view needs to be taken to see the reality of a business situation. Even perfectly clean and correctly recorded data can be wrong if it doesn’t match the business’s reality or meet the business’s needs.