Posts Tagged Inmon
I had the opportunity to hear Bill Inmon speak this week on a variety of subjects, including the proverbial divide in the business intelligence community between Inmon and Kimball data warehouse architectures. It was a very informative discussion, and even though I had read Inmon’s latest book I felt I came away with a better understanding of Inmon’s data warehouse philosophy and a better appreciation for what Inmon’s approach has to offer.
First and foremost, I found Inmon to be an engaging and down-to-earth speaker. I couldn’t help but notice that Inmon referred to Kimball by his first name, giving the sense that they are colleagues and perhaps even friends who respect each other’s work. He did say that some Kimballites he had spoken to over the years were very dogmatic and dismissive of his approach, but this did not seem to be a grudge he held personally against Ralph Kimball.
That said, Inmon highlighted the differences between the architectures first by pointing out the reason why each had been developed. Inmon noted that his architecture was designed to deal primarily with data integration across an enterprise, commonly referred to as the single version of the truth. Kimball’s architecture was designed to make reporting faster and easier, as indeed it does. Inmon pointed out that Kimball architecture tends to deliver a series of fact table-based data marts, joined by conformed dimensions to give a “data warehouse”. His approach is more holistic in the sense that an integrated data warehouse is built, and then data marts may follow.
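To make the idea of fact-table marts joined by conformed dimensions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and figures (dim_product, fact_sales, fact_inventory) are invented for illustration and are not taken from Inmon's talk or Kimball's books. Two separate fact tables share one product dimension, and that shared dimension is what lets a single query report across both.

```python
import sqlite3

# All table names, columns and numbers below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO fact_inventory VALUES (1, 40), (2, 3);
""")

# Aggregate each fact table to the grain of the shared dimension first,
# then join both results through that dimension.
DRILL_ACROSS = """
SELECT d.product_name, s.units_sold, i.units_on_hand
FROM dim_product AS d
JOIN (SELECT product_key, SUM(units_sold) AS units_sold
      FROM fact_sales GROUP BY product_key) AS s
  ON s.product_key = d.product_key
JOIN (SELECT product_key, SUM(units_on_hand) AS units_on_hand
      FROM fact_inventory GROUP BY product_key) AS i
  ON i.product_key = d.product_key
ORDER BY d.product_name
"""

rows = conn.execute(DRILL_ACROSS).fetchall()
print(rows)  # [('Gadget', 7, 3), ('Widget', 15, 40)]
```

Each fact table is summed before the join so that the two rows of Widget sales do not multiply the inventory figure.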
A common misperception of Inmon’s architecture is that a data warehouse must be built in its entirety first. He said this is not so. An Inmon data warehouse can be built over time. He likened it to the growth of a city – you start out with certain districts and services and as the city grows the architecture of the city grows with it. You certainly don’t go out to build a complete city overnight; likewise with an enterprise data warehouse.
While it is true that Kimball’s approach is more practical and hands-on (and perhaps because of this many vendors have built data warehouse tools with Kimball architecture “baked-in”), Inmon did raise many valid and interesting points. His approach struck me as more enterprise-integration oriented as opposed to the almost ad-hoc nature of Kimball’s. I also found that I have followed some of Inmon’s approach without even realizing it – if you are archiving historical data, using an integrated staging area or enforcing “a single version of the truth”, you are to a certain degree following Inmon already. But of course if you have slowly-changing dimensions, star schemas and surrogate keys you are following Kimball too. Ultimately, Inmon said a hybrid approach is certainly a valid and viable option. Perhaps one day instead of seeing each as a competing architecture we will see each as a “tool set” we can draw upon, and the Kimball versus Inmon debate will finally be put to rest.
On October 1, 2012, legend/guru Bill Inmon spoke to the Ottawa data warehousing and BI community at an event organized by the local chapter of DAMA in conjunction with Coradix. Among other subjects, Mr. Inmon spoke at length on the idea of “Textual ETL”, a method for bringing semi-structured and unstructured data into the data warehouse and making it available for analysis using conventional BI tools.
Mr. Inmon estimated that at least 80% of the data in an enterprise exists in this form – as emails, Word documents, PDFs, etc. – and he has spent almost a decade on the problem of organizing this data into a form that can be queried. The result is what he calls Textual ETL.
In essence this refers to a process for integrating the attributes of a text document (such as a contract) into a database structure that then enables query-based analysis. In the case of a contract, the document might contain certain keywords that can be interpreted as significant, such as “Value” or “Royalties”. Rather than simply indexing the document, the Textual ETL process (which can contain over 160 different transformations) is designed to take unstructured documents and produce database tables that enable the user to create “SELECT”-style queries. For a contract-type document, such queries might answer questions such as “find all the contracts that are of a value between X and Y that refer to product Z”.
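As a rough illustration of the end result, the sketch below assumes the extraction has already happened and produced one row per contract. The table name contract_facts, its columns and the sample rows are all hypothetical. Mr. Inmon did not show an actual schema, so this only demonstrates the kind of “SELECT”-style question that becomes possible once the text has been turned into a table.

```python
import sqlite3

# Hypothetical output of a Textual ETL run: one row per contract, with
# attributes pulled from the text. Ids, products and values are invented.
EXTRACTED = [
    ("C-001", "Z", 120000.0),
    ("C-002", "Z", 480000.0),
    ("C-003", "W", 150000.0),
    ("C-004", "Z", 95000.0),
]


def load_contracts(rows):
    """Load extracted contract attributes into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contract_facts ("
        "contract_id TEXT PRIMARY KEY, product TEXT, contract_value REAL)"
    )
    conn.executemany("INSERT INTO contract_facts VALUES (?, ?, ?)", rows)
    return conn


def contracts_in_range(conn, product, low, high):
    """Find contracts for a product whose value lies between low and high."""
    cur = conn.execute(
        "SELECT contract_id FROM contract_facts "
        "WHERE product = ? AND contract_value BETWEEN ? AND ? "
        "ORDER BY contract_id",
        (product, low, high),
    )
    return [row[0] for row in cur.fetchall()]


conn = load_contracts(EXTRACTED)
print(contracts_in_range(conn, "Z", 100000, 500000))  # ['C-001', 'C-002']
```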
A user with a system to manage such documents might have already added attributes such as “product” and “contract value” to the management system, thus already enabling such queries. The beauty of Textual ETL is that it applies taxonomies to documents to resolve the meanings of the texts themselves, which extends to resolving synonyms. Mr. Inmon gave the example of texts (emails, for example) that refer to different brands of cars – Porsche, Ford, and GM, say – or perhaps use the word “automobile”, but never use the word “car” explicitly. A well-designed Textual ETL process would produce tables that allow a search for emails that refer to cars. It would do this by matching the brands of cars, or the word “automobile”, to the word “car”, in effect appending “car” to the brands listed.
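The synonym example can be sketched in a few lines. This is only my own toy version of the idea, not Mr. Inmon's implementation: the TAXONOMY dictionary, the function names and the sample emails are all made up, and a real taxonomy would be far larger and would handle multi-word terms.

```python
import re

# Hypothetical taxonomy: each specific term maps to a broader category.
TAXONOMY = {
    "porsche": "car",
    "ford": "car",
    "gm": "car",
    "automobile": "car",
}


def tag_document(text, taxonomy=TAXONOMY):
    """Return the set of categories implied by the words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {taxonomy[w] for w in words if w in taxonomy}


def search(emails, category):
    """Return ids of emails whose text implies the category."""
    return [eid for eid, body in emails.items() if category in tag_document(body)]


# Invented sample emails; none of them contains the word "car".
EMAILS = {
    "e1": "The Porsche is back from the shop.",
    "e2": "Lunch on Friday?",
    "e3": "Every automobile in the fleet needs new tires.",
}

print(search(EMAILS, "car"))  # ['e1', 'e3']
```

The tag set plays the role of “appending car to the brands listed”: the original text is left alone and the broader term is stored alongside it.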
The process can be extended to dealing with documents where the same expression might mean very different things. Doctors may use similar, short expressions that mean different things depending on context. The application of Textual ETL to these kinds of documents would (must!) resolve these to different meanings.
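One simple way to picture this is a lookup keyed on both the expression and its context. The sketch below is an assumption on my part rather than anything shown in the talk: the GLOSSARY contents and the idea of using the medical specialty as the context are purely illustrative.

```python
# Hypothetical context-sensitive glossary: (abbreviation, context) -> meaning.
GLOSSARY = {
    ("ms", "neurology"): "multiple sclerosis",
    ("ms", "cardiology"): "mitral stenosis",
}


def resolve(term, context, glossary=GLOSSARY):
    """Expand a term for the given context, or return it unchanged if unknown."""
    return glossary.get((term.lower(), context.lower()), term)


print(resolve("MS", "neurology"))    # multiple sclerosis
print(resolve("MS", "cardiology"))   # mitral stenosis
print(resolve("MS", "dermatology"))  # MS
```

Returning the term unchanged when no entry matches keeps an unresolved expression visible instead of guessing at its meaning.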
The problems of implementing Textual ETL don’t seem trivial, and Mr. Inmon only presented a bare outline of how it is done. However, the implications for organizations that produce or deal with huge amounts of unstructured but critical texts – which is almost any organization of any size – could be considerable. In theory Textual ETL enables items that are thought of as not part of the normal domain of data warehousing to be brought into the data warehouse and subjected to the same kinds of analysis normally applied to such things as inventory levels, sales records and so forth.
I finally read a book by Inmon (DW 2.0, by Inmon et al.) and must say I found it very interesting. As a longtime adherent of Kimball’s star-schema-based data warehouse, I must admit to some bias when approaching Inmon’s work. But Inmon’s philosophy did give me pause for thought. His critique of the star-schema-based data warehouse does have valid points. It can be “brittle”, as he puts it, meaning resistant to business change. Unfortunately I did not find that he offered an alternative data modeling paradigm, at least not in this book, that would rectify this shortcoming.
Overall, to put it in terms of the proverbial forest and the trees, Kimball’s approach tells you how to properly fell a tree. Inmon’s approach attempts to explain forest management; it is vast and complex, intending to deal with vast and complex data. He describes the four sectors of the data warehouse, which are intended to manage data volumes as efficiently as possible, based on the statistical likelihood of the data being called upon. It is difficult to envision this in action, but if it could be done it would be a fast and efficient system.
Kimball’s appeal to me is the simplicity. The star schema approach, like the SQL statement, is a simple concept that can be expanded to include vast complexity. The star schema also fits so perfectly with OLAP modeling it seems like they were made for each other. I think this has led to the wide adoption of the Kimball philosophy.
I have found that business changes can be incorporated easily within a star schema based data warehouse, as long as they are new dimension elements or measures. Difficulty ensues when new fact tables are required, over and over again, especially when granularity becomes an issue. Like an ever expanding puzzle, Kimball’s approach allows additional facts to fit in the model as needed, so additional business requirements can be accommodated. However, if particularly voluminous data, unwieldy data structure or enterprise-wide standard measures across a huge corporation are concerns, Inmon’s approach may certainly be helpful.