On October 1, 2012, legend/guru Bill Inmon spoke to the Ottawa data warehousing and BI community at an event organized by the local chapter of DAMA in conjunction with Coradix. Among other subjects, Mr. Inmon spoke at length on the idea of “Textual ETL”, a method for bringing semi-structured and unstructured data into the data warehouse and making it available for analysis using conventional BI tools.
Mr. Inmon estimated that at least 80% of the data in an enterprise exists in this form – as emails, Word documents, PDFs, etc. – and he has spent almost a decade on the problem of organizing this data into a form that can be queried. The result is what he calls Textual ETL.
In essence, this refers to a process for integrating the attributes of a text document (such as a contract) into a database structure that then enables query-based analysis. In the case of a contract, the document might contain certain keywords that can be interpreted as significant, such as “Value” or “Royalties”. Rather than simply indexing the document, the Textual ETL process (which can contain over 160 different transformations) is designed to take unstructured documents and produce database tables against which the user can write “SELECT”-style queries. For a contract-type document, such queries might answer questions such as “find all the contracts that are of a value between X and Y that refer to product Z”.
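To make the contract question concrete, here is a minimal sketch of the kind of query such a table would support. It uses Python's built-in `sqlite3` module. The table name `contract_facts`, its columns, and the sample rows are all invented for illustration; they are not taken from Mr. Inmon's talk or from any Textual ETL product.

```python
import sqlite3

def find_contracts(conn, low, high, product):
    """Return ids of contracts valued between low and high that refer to product."""
    rows = conn.execute(
        "SELECT doc_id FROM contract_facts "
        "WHERE contract_value BETWEEN ? AND ? AND product = ? "
        "ORDER BY doc_id",
        (low, high, product),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical output of a text-extraction step: one row per contract document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contract_facts (doc_id TEXT, product TEXT, contract_value REAL)"
)
conn.executemany(
    "INSERT INTO contract_facts VALUES (?, ?, ?)",
    [
        ("c-001", "widget", 50000.0),
        ("c-002", "widget", 250000.0),
        ("c-003", "gadget", 60000.0),
    ],
)

# "Find all the contracts of a value between X and Y that refer to product Z."
matches = find_contracts(conn, 40000, 100000, "widget")
print(matches)
```

The hard part, of course, is the step this sketch skips: getting from the raw contract text to the rows in the table.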
A user with a system to manage such documents might have already added attributes such as “product” and “contract value” to the management system, thus already enabling such queries. The beauty of Textual ETL, however, is that it applies taxonomies to documents to resolve the meanings of the texts themselves. This can extend to the resolution of synonyms. Mr. Inmon gave the example of texts (emails, for example) that refer to different brands of cars – Porsche, Ford, and GM, say – or perhaps use the word “automobile”, but never use the word “car” explicitly. A well-designed Textual ETL process would result in tables that allow a search for emails that refer to cars. It would do this by matching the brands of cars, or the word “automobile”, to the word “car”, in effect appending “car” to the brands listed.
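The car example can be sketched in a few lines of Python. This is only a toy illustration of the taxonomy idea, assuming a simple word-to-category lookup; the `TAXONOMY` dictionary, the function names, and the sample emails are hypothetical, and a real Textual ETL process would presumably be far more elaborate.

```python
import re

# Hypothetical taxonomy: each specific term maps to a broader category.
TAXONOMY = {
    "porsche": "car",
    "ford": "car",
    "gm": "car",
    "automobile": "car",
}

def tag_categories(text, taxonomy=TAXONOMY):
    """Return the sorted categories implied by the words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({taxonomy[w] for w in words if w in taxonomy})

def search(emails, category):
    """Return ids of emails whose text maps to the given category."""
    return [eid for eid, body in emails.items() if category in tag_categories(body)]

# Invented sample emails; none of them contains the word "car".
emails = {
    "e1": "Picked up the new Porsche yesterday.",
    "e2": "The quarterly budget is attached.",
    "e3": "My automobile needs servicing.",
}

car_emails = search(emails, "car")
print(car_emails)
```

A plain word lookup like this would also tag an email about a person named Ford as being about cars, which hints at why context matters.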
The process can be extended to deal with documents where the same expression might mean very different things. Doctors may use similar, short expressions that mean different things depending on context. The application of Textual ETL to these kinds of documents would (must!) resolve these to different meanings.
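One simple way to picture this kind of resolution is a lookup keyed on both the expression and its context. The sketch below assumes the context is a medical specialty attached to each note; the abbreviation, the specialties, and the expansions are invented for illustration and were not part of the talk as summarized here.

```python
# Hypothetical context table: (abbreviation, context) -> resolved meaning.
EXPANSIONS = {
    ("ha", "cardiology"): "heart attack",
    ("ha", "general practice"): "headache",
}

def resolve(term, context, expansions=EXPANSIONS):
    """Return the context-specific meaning, or the term unchanged if unknown."""
    return expansions.get((term.lower(), context), term)

# Invented notes: (note id, specialty of the author, abbreviation used).
notes = [
    ("n1", "cardiology", "ha"),
    ("n2", "general practice", "ha"),
]

resolved = {note_id: resolve(term, ctx) for note_id, ctx, term in notes}
print(resolved)
```

The same two letters end up as different values in the resulting table, which is the behaviour the paragraph above calls for.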
The problems of implementing Textual ETL don’t seem trivial, and Mr. Inmon only presented a bare outline of how it is done. However, the implications for organizations that produce or deal with huge amounts of unstructured but critical texts – which is almost any organization of any size – could be considerable. In theory Textual ETL enables items that are thought of as not part of the normal domain of data warehousing to be brought into the data warehouse and subjected to the same kinds of analysis normally applied to such things as inventory levels, sales records and so forth.