Posts Tagged Dimensions
When Good Data Goes Bad
Posted by Scott Andrews in Data Quality on May 18, 2010
How many times have you heard the mother of a wayward son say “He’s not bad. He’s just misunderstood.” So it goes with data as well. For how can you tell bad data? Did that sale really happen? Does this employee really exist? Well, 99% of the time, yes. In my experience bad data isn’t bad, it’s just misunderstood.
Nowhere is this clearer than when you consider dates. A single transaction can and often is associated with a plethora of dates. When was it ordered? When was it promised? When was it shipped? When did the customer take delivery? When was it posted to the general ledger? If any business users are not talking about the same date at any point in their analysis, there will be trouble. “Why don’t our March sales numbers match??” My first question is “What is the date that drives those numbers?”
A measure’s definition can come into play. When measuring sales, do you include returns or not? Are sales gross or net? When counting employees, do you count contractors and temps? How are measures defined precisely?
Likewise, a dimension’s definition can also muddy the water. How are your sales broken down? How do you define your list of employees at headquarters or in a particular region? How are dimensions defined precisely?
When confronting a data discrepancy, my approach is to find the differences, find out what the “correct” result should be and then apply the rule change across the data set. Employee X was showing up on report Y because they were defined by postal code, not by reporting manager. What is the correct rule? Your decision makers will have to make that call. The decision must then be incorporated into the data warehouse design and also be communicated out to the business analysts for future consistency in reporting. Communication is a key component to ensuring data quality across an enterprise.
