I had the opportunity to hear Bill Inmon speak this week on a variety of subjects, including the proverbial divide in the business intelligence community between the Inmon and Kimball data warehouse architectures. It was a very informative discussion, and even though I had already read Inmon’s latest book, I came away with a better understanding of his data warehouse philosophy and a greater appreciation for what his approach has to offer.
First and foremost, I found Inmon to be an engaging and down-to-earth speaker. I couldn’t help but notice that Inmon referred to Kimball by his first name, giving the sense that they are colleagues and perhaps even friends who respect each other’s work. He did say that some Kimballites he had spoken to over the years were very dogmatic and dismissive of his approach, but this did not seem to be a grudge he held personally against Ralph Kimball.
That said, Inmon highlighted the differences between the architectures by first pointing out why each had been developed. Inmon noted that his architecture was designed to deal primarily with data integration across an enterprise, commonly referred to as the single version of the truth. Kimball’s architecture was designed to make reporting faster and easier, as indeed it does. Inmon pointed out that the Kimball architecture tends to deliver a series of fact table-based data marts, joined by conformed dimensions to give a “data warehouse”. His own approach is more holistic in the sense that an integrated data warehouse is built first, and then data marts may follow.
A common misperception of Inmon’s architecture is that a data warehouse must be built in its entirety first. He said this is not so. An Inmon data warehouse can be built over time. He likened it to the growth of a city – you start out with certain districts and services and as the city grows the architecture of the city grows with it. You certainly don’t go out to build a complete city overnight; likewise with an enterprise data warehouse.
While it is true that Kimball’s approach is more practical and hands-on (and perhaps because of this many vendors have built data warehouse tools with Kimball architecture “baked-in”), Inmon did raise many valid and interesting points. His approach struck me as more enterprise-integration oriented as opposed to the almost ad-hoc nature of Kimball’s. I also found that I have followed some of Inmon’s approach without even realizing it – if you are archiving historical data, using an integrated staging area or enforcing “a single version of the truth”, you are to a certain degree following Inmon already. But of course if you have slowly-changing dimensions, star schemas and surrogate keys you are following Kimball too. Ultimately, Inmon said a hybrid approach is certainly a valid and viable option. Perhaps one day instead of seeing each as a competing architecture we will see each as a “tool set” we can draw upon, and the Kimball versus Inmon debate will finally be put to rest.
I know that a junk dimension is standard Kimball theory, but I still find it to be unfortunately named. When talking to other BI professionals, it is OK to refer to a junk dimension – but when it comes up in conversation with business users, I find that they start to register alarm. “Why are we storing junk?!” I can read on their faces. What is worse is when you actually name your junk dimension D_Junk. Now you are just asking a DBA to delete your table!
Don’t get me wrong – a junk dimension plays a very important role in a data warehouse. It allows a single physical table to store small groups of codes or descriptions that may otherwise require numerous additional tables. It is a central repository of groups of codes that don’t fit anywhere else and are too small to need their own table structure. But I would never name the table D_Junk, just as I don’t use the prefix x_ on any table I don’t want deleted. I prefer to name my junk dimensions D_Other or some variation like that.
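For illustration, here is a minimal sketch of how a junk dimension might be populated from the cross product of a few low-cardinality flags, and how those flags then collapse into a single foreign key during the fact load. The table and column names (D_Other, order_channel and so on) are invented for the example rather than taken from any real model.

```python
# Minimal sketch of a junk dimension built from the cross product of a few
# low-cardinality flags. All names and values here are hypothetical.
from itertools import product

order_channels = ["WEB", "PHONE", "STORE"]
payment_flags = ["PAID", "UNPAID", "PARTIAL"]
rush_flags = ["Y", "N"]

# Build every combination once and assign each a surrogate key.
d_other = {}          # (channel, payment, rush) -> surrogate key
d_other_rows = []     # rows that would be loaded into the D_Other table
for key, combo in enumerate(product(order_channels, payment_flags, rush_flags), start=1):
    d_other[combo] = key
    d_other_rows.append({"other_key": key,
                         "order_channel": combo[0],
                         "payment_flag": combo[1],
                         "rush_flag": combo[2]})

# During the fact load, the three small codes collapse into a single foreign key,
# so no extra dimension tables are needed for them.
incoming = {"order_channel": "WEB", "payment_flag": "PAID", "rush_flag": "N"}
fact_row = {"other_key": d_other[(incoming["order_channel"],
                                  incoming["payment_flag"],
                                  incoming["rush_flag"])],
            "sales_amount": 125.00}
print(fact_row)
```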
Perhaps the term junk was chosen for the very reason that it is a word that would not be commonly used to describe a standard dimension. But it should still be a term you use with care. Make sure your audience knows what you are talking about before you start throwing this term around.
I have not yet found an ideal solution for dealing with data quality, and find that the circumstances around the particular issue will decide how to approach it. My approaches fall into one of the following categories (a small sketch after the list shows how a few of them might look inside an ETL step):
a) Reject bad data
The data modeler in me says to reject bad data. It doesn’t fit, so don’t accept it. A control table has been established for a reason – if the code or entry doesn’t match anything in it, it should be rejected. Rejected records can be reported in a myriad of ways, but the hazard of this approach is that these rejects can be easily ignored – that is, until someone decides the numbers “don’t match”. Then there is much consternation that the ETL process is “dropping” records “arbitrarily”, which is not at all true of course. But this approach does force correction in your source system, and your data warehouse contains only business-defined legitimate data.
b) Load bad data anyway
The business analyst in me says to load the bad data anyway – the data is there, so they should see it. This is problematic for a Kimball-style dimensional table, because some amount of fudging must go on. Inferred dimension records can be created, or a broad and unholy category of “unknown” must be added. Both approaches can mask the problem, the latter much more so. Bad data can reside in the data warehouse for months or even years before it is addressed, hidden away under a miscellaneous category or placeholder surrogate keys.
c) Cleanse the data
This is only possible if specific business rules or fuzzy logic can be applied to the data in question, which I find is not usually the case. If it were easy to fix, they would do it at the source themselves.
d) Flag the data
Flagging the data is another approach, but also has its pitfalls – the largest being that the business users will simply ignore the flag. Surrogate key placeholders will still be required in the dimensional table. This is really just a minor variation of choice b. But it can also be useful for identifying data that is not rejected but still outside the business process norms (such as a sale price below cost or a hired employee above or below a certain age).
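Here is the sketch mentioned before the list: a rough illustration of how the reject, load-anyway and flag options might look inside a single fact-load step. The control table, the -1 “unknown” placeholder key and the quality_flag column are assumptions made for the example, not a prescribed design.

```python
# Rough sketch of options a, b and d from the list above.
# The dimension contents, placeholder key and flag values are illustrative only.
product_dim = {"A100": 1, "B200": 2}   # product_code -> surrogate key (the control table)
UNKNOWN_KEY = -1                       # placeholder "unknown" member in the dimension

def load_fact(record, mode):
    """Return (fact_row, reject_row) depending on the chosen approach."""
    surrogate = product_dim.get(record["product_code"])
    if surrogate is not None:
        return ({"product_key": surrogate, "amount": record["amount"],
                 "quality_flag": "OK"}, None)
    if mode == "reject":        # option a: keep it out of the warehouse, report it elsewhere
        return (None, record)
    if mode == "load_anyway":   # option b: map it to the broad "unknown" member
        return ({"product_key": UNKNOWN_KEY, "amount": record["amount"],
                 "quality_flag": "OK"}, None)
    if mode == "flag":          # option d: load it, but mark it as suspect
        return ({"product_key": UNKNOWN_KEY, "amount": record["amount"],
                 "quality_flag": "SUSPECT"}, None)

bad = {"product_code": "Z999", "amount": 50.0}   # code not in the control table
for mode in ("reject", "load_anyway", "flag"):
    print(mode, load_fact(bad, mode))
```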
Identifying the source of the problem is another issue – was the data entered incorrectly, or was a business process not followed, leading to data outside the limits of acceptance? The difference between the two is huge and can only be determined by the business side, often only through considerable time and investment on their part. The effort they spend on this will depend on how large a problem they perceive the data in question to be.
What is of paramount importance to me is to recognize and respect your business users’ ownership of the data. This is doubly important as an outside consultant. Your users know their data and can often spot a problem instinctively. Don’t tell them a report was built to spec if they complain that the data is wrong. Even if it was built to spec, the specs could be wrong – and often are. Find the source of the problem. It can be anything from an incorrect data source or calculation to a faulty assumption or misunderstanding.
Sometimes feats of creativity are required to deal with specific problems. Once I had to deal with an inventory system that went offline on weekends for certain locations, and the only source of reliable inventory data turned out to be back-dated general ledger (GL) entries. We had to insert adjustments into our weekend inventory snapshots after the fact, based on these accounting journal entries, because the official inventory count was not correct. Only your business users will know the best way to handle situations like this. The heart of the issue is that the data warehouse needs to reflect the reality of the business, and if it doesn’t, you have a data quality problem.
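As a simplified illustration only (the structures, locations and quantities are invented), an adjustment pass like the one below captures the idea of applying back-dated GL entries to weekend snapshots that have already been loaded:

```python
# Simplified sketch: apply back-dated accounting adjustments to already-loaded
# weekend inventory snapshots. All data here is made up for the example.
from datetime import date

# Weekend inventory snapshots already in the warehouse: (location, date) -> on-hand qty
snapshots = {("STORE_07", date(2010, 3, 13)): 480,
             ("STORE_07", date(2010, 3, 14)): 480}

# Back-dated GL entries received later, expressed as quantity adjustments
gl_adjustments = [
    {"location": "STORE_07", "effective_date": date(2010, 3, 13), "qty_delta": -35},
    {"location": "STORE_07", "effective_date": date(2010, 3, 14), "qty_delta": -35},
]

# Apply each adjustment to the snapshot for its effective date
for adj in gl_adjustments:
    key = (adj["location"], adj["effective_date"])
    if key in snapshots:
        snapshots[key] += adj["qty_delta"]

print(snapshots)  # corrected weekend on-hand quantities
```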
I finally read a book by Inmon (DW 2.0, by Inmon et al.) and must say I found it very interesting. As a long-time adherent of Kimball’s star schema-based data warehouse, I must admit to some bias when approaching Inmon’s work. But I did find that Inmon’s philosophy gave me pause for thought. Inmon’s critique of the star schema-based data warehouse does have valid points. It can be “brittle”, as he puts it, resistant to business change. Unfortunately I did not find that he offered an alternative data modeling paradigm, at least not in this book, that would rectify this shortcoming.
Overall, to put it in terms of the proverbial forest and the trees, Kimball’s approach tells you how to properly fell a tree. Inmon’s approach attempts to explain forest management; it is vast and complex, intended to deal with vast and complex data. He describes the four sectors of the data warehouse, which are meant to manage data volumes as efficiently as possible based on the statistical likelihood of the data being called upon. It is difficult to envision this in action, but if it could be done it would be a fast and efficient system.
Kimball’s appeal to me is its simplicity. The star schema approach, like the SQL statement, is a simple concept that can be expanded to encompass vast complexity. The star schema also fits so perfectly with OLAP modeling that it seems like they were made for each other. I think this has led to the wide adoption of the Kimball philosophy.
I have found that business changes can be incorporated easily within a star schema-based data warehouse, as long as they are new dimension elements or measures. Difficulty ensues when new fact tables are required over and over again, especially when granularity becomes an issue. Like an ever-expanding puzzle, Kimball’s approach allows additional facts to fit into the model as needed, so additional business requirements can be accommodated. However, if particularly voluminous data, unwieldy data structures or enterprise-wide standard measures across a huge corporation are concerns, Inmon’s approach may certainly be helpful.
A business user recently expressed surprise that my data warehouse loading process would reject dirty data. His process didn’t, and our numbers didn’t match. But he was running a simple query – with no check of code integrity against control tables – so his data set included invalid records.
In a Kimball-style star schema using surrogate keys, data integrity is enforced by lookups. In fact, you can’t really fudge it without making up new dimension keys out of thin air. And isn’t that worse? I say it is better to reject data and report it in another location than to force non-conforming data to conform. Unless specific business rules have been provided on how to cleanse dirty data, your best choice is to reject and report.
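A tiny sketch of why the numbers diverge: a naive sum over the source rows counts everything, while the surrogate-key lookup keeps only rows whose codes exist in the dimension and sets the rest aside to be reported. The codes and amounts below are invented for the example.

```python
# Sketch of the mismatch between a simple source query and an integrity-enforced load.
# All codes and amounts are hypothetical.
source_rows = [{"product_code": "A100", "amount": 100.0},
               {"product_code": "B200", "amount": 200.0},
               {"product_code": "Z999", "amount": 50.0}]   # Z999 is not a valid code

product_dim = {"A100": 1, "B200": 2}                        # valid codes only

naive_total = sum(r["amount"] for r in source_rows)          # the business user's simple query
loaded = [r for r in source_rows if r["product_code"] in product_dim]
rejected = [r for r in source_rows if r["product_code"] not in product_dim]

print(naive_total)                          # 350.0 -- includes the invalid record
print(sum(r["amount"] for r in loaded))     # 300.0 -- the warehouse total
print(rejected)                             # report these elsewhere, don't just drop them
```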