Archive for category ETL
I recently ran into an old problem while working on an SSIS package that manages inferred members in a Slowly Changing Dimension (SCD) structure. The symptom: SSIS complains that your table column is not of type DT_BOOL (Boolean), yet when you check the table in question the column is defined correctly. So what is going on?
The problem has to do with the data set, not the table structure. SSIS is really complaining about null values in the table itself. Correct your data set and the problem will be resolved.
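If you want to confirm that NULLs are the culprit before touching the SSIS package, a quick query against the source will tell you. This is only a sketch; the table and column names below are placeholders for your own dimension table and inferred-member indicator (bit) column.

```sql
-- Hypothetical names: substitute your own dimension table and flag column.

-- 1. Confirm the suspicion: are there NULLs in the Boolean column?
SELECT COUNT(*) AS null_flag_rows
FROM   dbo.DimCustomer
WHERE  IsInferredMember IS NULL;

-- 2. Correct the data set by giving those rows an explicit value.
UPDATE dbo.DimCustomer
SET    IsInferredMember = 0
WHERE  IsInferredMember IS NULL;
```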
A client of mine recently upgraded from Cognos Series 7 to Cognos 10. As part of the upgrade, the Cognos DecisionStream catalog was migrated to Cognos Data Manager. However, when my client went to schedule the Cognos Data Manager jobs, she noticed that Cognos Scheduler is not part of the Cognos 10 environment.
Cognos Scheduler was an extra tool you could use to schedule a variety of Cognos 7 related tasks, including DecisionStream jobs. While it was a handy feature, virtually none of my past clients ever used it; usually a corporate scheduling tool or the operating system’s scheduler was used instead. So this had never been an issue for any of my clients until now.
With Cognos Scheduler retired and report scheduling now simply a function of the Cognos portal, there is still a service in the Cognos 10 environment that lets you run Cognos Data Manager jobs through IBM Cognos Connection: the Data Movement Service. Cognos Data Manager must be installed on your Cognos 10 server for this to work, and you publish the jobs you want to schedule from Cognos Data Manager to the Cognos portal.
That said, most of my clients still use the Windows Task Scheduler or Unix cron jobs to schedule events like this.
Last night I attended a presentation hosted by OttawaSQL.net and presented by Matt Masson on the newly released SQL Server 2012. I was principally interested in the new SQL Server service Data Quality Services (DQS), the latest addition to the SQL Server lineup.
Boiled down to its simplest form, DQS acts as a data “spell checker” that can correct data statistically, against a user-defined knowledge base, or through a third-party web service. In the DQS interface, users define Term-Based Relations (TBR) rules that can be applied against the data set. While correcting data, DQS will also generate its own list of rules, which you can then validate.
Another relative newcomer to the SQL Server lineup is Master Data Services (MDS), which first appeared in SQL Server 2008 R2. This serves as a central repository of “golden” records, the single source of validated truth. It allows for a Master Data model to be generated and kept. This then serves as a lookup source for Integration Services packages.
SQL Server Integration Services (SSIS) is still the linchpin of data movement in the SQL Server environment, and it integrates seamlessly with the DQS and MDS services. It has also seen feature enhancements such as a catalog feature (for ease of configuration, security and management), change data capture, and built-in reports for troubleshooting and logging.
Anyone interested in a deeper understanding of SQL Server 2012 should note that Microsoft will be hosting a free day-long workshop on June 22 at their Ottawa office.
As an ETL developer, I cut my teeth on Microsoft Data Transformation Services (DTS) in 1999. This experience led to engagements in Cognos DecisionStream, and later Microsoft SQL Server Integration Services (SSIS), Cognos Data Manager and more recently IBM InfoSphere DataStage. I also have experience in Oracle SQL Loader and Unix scripting for data loading purposes. In all, I have been working in ETL for more than 12 years.
But never mind all that. When it comes to contract opportunities, tools rule.
Wanted: Truck Driver – must have 5 years’ experience driving a Ford
Most ads read “X years of experience in Y tool,” usually at least two to five. When you are new to a tool, even with a decade of experience in others, you can be dropped from consideration. This approach is very limiting, both to me as a contractor and to the clients who are hiring: it shrinks their pool of applicants and eliminates an excellent resource like me!
Being a self-taught computer techie in Delphi, VB.Net, Crystal Reports, Gupta databases, and to a certain extent SQL Server itself (I am formally trained in Cognos and Oracle), I am capable of ramping up on new products and technologies and delivering professional-grade, quality results in a short amount of time. But I often find myself pigeon-holed into a small group of tools, and recent experience in new products is often heavily discounted, if not written off completely.
ETL development itself is very conceptual. It is about more than the tool, the database, or even how the tool and database work together. I recently reduced the runtime of an ETL process by 80% using database techniques that were completely unrelated to the ETL tool in question; the same approach would have worked with almost any database or ETL tool. As such, ETL solutions are not as tool-driven as managers or procurement might have you believe.
This is not to say that ETL tools do not have differences, sometimes even major ones. But at the end of the day, like a truck, they get your data from point A to point B. And just like any quality, experienced truck driver, ETL developers should be able to change their vehicle of choice and drive it home.
Have you ever wondered why the only way you can read SAP data is through SAP connection plug-ins for your ETL tool-of-choice? Why can you not simply read SAP tables as is from the source database? Why is everything data transfer-wise just a little bit harder with SAP?
Well, for years I just assumed that SAP (being a proprietary system) liked to keep a tight grip on its data. But as I was looking into a situation for a client, I learned the truth was a bit more complicated.
It all started when I wanted to join the transaction header table (BKPF) with the transaction detail table (BSEG). Couldn’t this be done at the SAP source instead of in the data warehouse staging area? If this were a straight-up relational database, joining these tables would be a no-brainer.
Now anyone familiar with SAP knows about the cryptic table and column names, and probably has at least a passing awareness of the ABAP language and the BAPI/IDoc interfaces used to read SAP data. So this alphabet soup is where we start. But the cryptic names aren’t the half of it.
The first thing you need to understand is that SAP is not what you would consider a garden-variety relational database. SAP actually maintains three different types of tables, two of which are unreadable outside of SAP’s programming environment. The first type is the cluster table: a logical table whose rows are stored, together with those of other tables, in a single physical table cluster. BSEG is just such a table, and is therefore not readable by straight SQL commands. The second type is the pooled table, which is used to store program parameters and other control data. The third and final type is the transparent table, which is what most SQL programmers would call “normal”: it is physically and logically the same table, and can therefore in theory be read by standard SQL commands (although in practice virtually all SAP interfacing is done through SAP programming anyway).
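For illustration, here is the join I had in mind, written as plain SQL. Treat it as a sketch of intent only: because BSEG is a cluster table, this statement cannot actually be run against the underlying database, and the data has to be extracted through SAP (ABAP, a BAPI, or an extractor) first. The columns shown are the standard document keys, but verify them against your own system.

```sql
-- The join you would write if BKPF (header) and BSEG (detail) were both
-- ordinary transparent tables. BSEG is a cluster table, so in reality this
-- has to happen after the data is extracted from SAP.
SELECT h.BUKRS,   -- company code
       h.BELNR,   -- accounting document number
       h.GJAHR,   -- fiscal year
       h.BUDAT,   -- posting date
       d.BUZEI,   -- line item number
       d.WRBTR    -- amount in document currency
FROM   BKPF h
JOIN   BSEG d
  ON   d.BUKRS = h.BUKRS
 AND   d.BELNR = h.BELNR
 AND   d.GJAHR = h.GJAHR;
```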
Knowing this helps SAP’s world make a little more sense.
Early in my data management career, dealing with dates was a royal headache. Every database system and flat file seemed to have its very own date format, which was typically completely incompatible with all other date formats. This often meant reading dates in as strings and converting them, an often ugly and painful process.
Fortunately, dates seem to be easier nowadays. Maybe it’s just because my data sources are generally limited to Oracle, Microsoft SQL Server or SAP systems. But even here dates can be problematic.
The first thing to note when reading Oracle dates into a SQL Server database is that the date ranges do not match. Oracle dates range from Jan 1, 4712 BC to Dec 31, 9999 AD. SQL Server datetime values range from Jan 1, 1753 to Dec 31, 9999. You might reply, “Who cares? I’m time-stamping general ledger entries, not archaeological digs.” But this is where data quality comes into play. Say your business user enters the year 99. Oracle may store this date as 1099, and when you read it into SQL Server the record will be rejected automatically because it falls outside the valid SQL Server date range (yes, this has happened to me). In this scenario it may be readily apparent that the user meant 1999, and data cleansing can be performed. But that may not always be the case. Is this a birth date? It could conceivably be 1899 or 1999. Make sure your data cleansing assumptions are valid before you proceed.
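One cheap defence is to profile the Oracle source for dates that SQL Server’s datetime type cannot hold before the load runs. A minimal sketch, assuming a query against the source table; the table and column names are made up for the example.

```sql
-- Oracle-side check: find rows whose date would be rejected by a
-- SQL Server datetime column (minimum value 1753-01-01).
-- gl_entries and posting_date are placeholder names.
SELECT *
FROM   gl_entries
WHERE  posting_date < DATE '1753-01-01';
```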
Another interesting date format that I often deal with is the SAP Julian Date: the whole number counts days and the fraction counts the part of a day. Don’t confuse the SAP Julian Date with the classic Julian Date (JD), which astronomers have used for centuries. The classic Julian Date starts at noon Greenwich time on Jan 1, 4713 BC (they chose noon because historically astronomers didn’t make observations during the day). SAP’s Julian Date is actually the Modified Julian Date (MJD), which was introduced in the 1950s at the dawn of the computer age. Its day starts and ends at midnight (not noon) and its whole-number component is reduced significantly (MJD = JD – 2,400,000.5, which puts MJD day 0 at Nov 17, 1858). Typically there is no auto-convert feature available for SAP Julian Dates. If you are dealing with them in Cognos, I highly recommend writing a standard function to convert the dates for you; handling SAP Julian Dates is easy as long as you get this function right. There is plenty of help online for Julian date conversion, plus many Julian calendars and converter calculators, which are very helpful for debugging.
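I do this in Cognos with a standard function, but the same idea works wherever you can call a date-arithmetic function. As an illustrative sketch only, here is what the conversion could look like on the SQL Server side, assuming the incoming value is a whole-day MJD number (the function name is my own).

```sql
-- Minimal sketch: convert a whole-day Modified Julian Date number to a date.
-- MJD day 0 corresponds to 17 Nov 1858, so the conversion is simple day arithmetic.
CREATE FUNCTION dbo.MJDToDate (@mjd INT)
RETURNS DATE
AS
BEGIN
    RETURN DATEADD(DAY, @mjd, CAST('1858-11-17' AS DATE));
END;
GO

-- Example: SELECT dbo.MJDToDate(51544);  -- returns 1 Jan 2000
```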
Incidentally, Microsoft Excel also stores dates as a whole number plus a fractional component. These are neither JD nor MJD but rather serial dates, starting at 1 on Jan 1, 1900 (dates entered prior to this will not be recognized as dates). Standard Excel dates can be read into SQL Server automatically without any problem.
A business user recently expressed surprise that my data warehouse loading process would reject dirty data. His process didn’t, and our numbers didn’t match. But he was running a simple query – with no check for code integrity against control tables – so his data set included invalid results.
In a Kimball-style star schema using surrogate keys, data integrity is enforced by the lookups. In fact, you can’t really fudge it without making up new dimension keys out of thin air, and isn’t that worse? I say it is better to reject data and report it in another location than to force non-conforming data to conform. Unless specific business rules have been provided on how to clean dirty data, your best choice is to reject and report.
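To make “reject and report” concrete, here is a minimal sketch of the pattern in SQL. All of the table and column names are invented for the example; the point is simply that a failed surrogate-key lookup routes the row to a reject table instead of forcing it into the fact table.

```sql
-- Rows whose product code has no matching dimension member are rejected and
-- reported, not loaded with a fabricated key. Names are illustrative only.
INSERT INTO etl.RejectedSales (source_id, product_code, reject_reason, rejected_at)
SELECT  s.source_id,
        s.product_code,
        'No matching row in DimProduct',
        GETDATE()
FROM    stg.Sales AS s
LEFT JOIN dw.DimProduct AS d
       ON d.ProductCode = s.product_code
WHERE   d.ProductKey IS NULL;
```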