Archive for category ETL

Making Sense of SAP Databases

Have you ever wondered why the only way you can read SAP data is through SAP connection plug-ins for your ETL tool-of-choice?  Why can you not simply read SAP tables as is from the source database?  Why is everything data transfer-wise just a little bit harder with SAP?

Well, for years I just assumed that SAP (being a proprietary system) liked a tight grip on its data systems.  But as I was looking into a situation for a client, I learned the truth was a bit more complicated.

It all started when I wanted to join the transaction header table (BKPF) with the transaction detail table (BSEG).  Couldn’t this be done at the SAP source, instead of in the data warehouse staging area?  If this was a straight-up relational database, joining these tables would be a no-brainer.

Now anyone familiar with SAP knows about the cryptic table names and columns, and probably has at least a passing awareness of the ABAP/BAPI/IDOC programming languages that read SAP data.  So this alphabet soup is where we start.  But the cryptic names aren’t the half of it.

The first thing you need to understand is that SAP is not what you would consider to be a garden-variety relational database.  SAP actually maintains three different types of tables, 2 of which are unreadable outside of SAP’s programming environment.  One of these table types is called a cluster table.  It is a control table of sorts that reads from many different physical tables.  BSEG is just such a table, and is therefore not readable by straight SQL commands.  The second table type is called a pooled table, which is used to store program parameters.  The third and final table type is called transparent, and this is what most SQL programmers would call “normal”.  It is physically and logically the same table, and can therefore theoretically be read by standard SQL commands (although virtually all SAP interfacing is done through SAP programming in any case).

Knowing this helps SAP’s world make a little more sense.

, , ,

No Comments

The Hazards of Dates

Early in my data management career, dealing with dates was a royal headache. Every database system and flat file seemed to have its very own date format, which was typically completely incompatible with all other date formats. This often meant reading dates in as strings and converting them, an often ugly and painful process.

Fortunately, dates seem to be easier nowadays. Maybe it’s just because my data sources are generally limited to Oracle, Microsoft SQL Server or SAP systems. But even here dates can be problematic.

The first thing to note when reading Oracle dates into a SQL Server database is that the date ranges do not match. Oracle dates range from Jan 1, 4712 BC to Dec 31, 4712 AD. SQL Server dates range from Jan 1, 1753 to Dec 31, 9999. You might reply “Who cares? I’m time-stamping general ledger entries, not archaeological digs.” But this is where data quality comes into play. Say your business user enters the year 99. Oracle may store this date as 1099. And when you read this date into SQL Server, the record will be automatically rejected because it is outside the valid SQL Server date range (Yes, this has happened to me). In this scenario it may be readily apparent that the user meant to enter 1999, and data cleansing can be performed. But that may not always be the case. Is this a birth date? It could conceivably be 1899 or 1999. Make sure your data cleansing assumptions are valid before you proceed.

Another interesting date that I often deal with is the SAP Julian Date. Whole numbers count days and fractions count part of a day. Don’t confuse the SAP Julian Date with the classic Julian Date (JD) which has been used by astronomers for centuries. The classic Julian Date starts at Jan 1, 4713 BC at noon Greenwich Time (they chose noon because historically astronomers didn’t make observations during the day). SAP’s Julian Date is actually Modified Julian Date (MJD) which was invented in the 1950’s at the dawn of the computer age. Its fractional component starts and ends at midnight (not noon) and its whole number component is reduced significantly (MJD = JD – 2400000.5). Typically there is no auto-convert feature available with SAP Julian Dates. If you are dealing with SAP Julian Dates in Cognos, I highly recommend making a standard function to convert your dates for you. Handling SAP Julian dates is easy as long as you get this function right. There is help online for Julian date conversion, plus many Julian calendars and converter calculators – these are very helpful for debugging purposes.

Incidentally, Microsoft Excel also stores dates as a whole number plus a fractional component. These are neither JD nor MJD but rather serial dates, starting at 1 on Jan 1, 1900 (dates entered prior to this will not be recognized as dates). Standard Excel dates can be read into SQL Server automatically without any problem.

, , , , ,

No Comments

Handling Dirty Data

A business user recently expressed surprise that my data warehouse loading process would reject dirty data. His process didn’t and our numbers didn’t match. But he was running a simple query – no check for code integrity against control tables – so his data set included invalid results.

In a Kimball-style star schema using surrogate keys, data integrity is forced by lookups. In fact you can’t really fudge it without making up new dimension keys out of thin air. And isn’t that worse? I say it is better to reject data and report it in another location than to force non-conforming data to conform. Unless specific business rules have been provided on how to clean dirty data, your best choice is to reject and report.

, ,

No Comments