Posts Tagged Star schema
Handling Dirty Data
Posted by Scott Andrews in Data Quality, ETL on February 18, 2010
A business user recently expressed surprise that my data warehouse loading process would reject dirty data. His process didn’t and our numbers didn’t match. But he was running a simple query – no check for code integrity against control tables – so his data set included invalid results.
In a Kimball-style star schema using surrogate keys, data integrity is forced by lookups. In fact you can’t really fudge it without making up new dimension keys out of thin air. And isn’t that worse? I say it is better to reject data and report it in another location than to force non-conforming data to conform. Unless specific business rules have been provided on how to clean dirty data, your best choice is to reject and report.