Archive for category Data Quality
Goldman Sachs and the Lessons of History
Posted by Scott Andrews in Data Quality on April 17, 2010
Recent business history has shown that any corporation accused of committing fraud invariably invokes the “we knew nothing” response from senior management. Pleading incompetence in place of malicious intent is hardly comforting in my opinion. We have yet to see how Goldman Sachs will respond to the allegations levered by the SEC but I will not be surprised if it follows this well-worn script.
I’ve already written that I believe bad management is at the heart of deliberately fraudulent data. But what can an organization do to prevent rogue employees or poor quality suppliers from ruining their data quality? Put simply, due diligence. Without due diligence, you have corporations that don’t know what’s going on. Limited or poor due diligence may give senior executives probable deniability, but it also comes across as bad management.
A number of years ago, there was a major fraud in Canada’s mining sector known as the Bre-X scandal. On the face of it, Bre-X appeared to have discovered a mammoth gold find in Indonesia. Their stock went up exponentially and then collapsed when it was a proven fraud. Although it passed the stock market listing requirements in Canada and the United States, there were early indications of problems that were ignored, including:
- No independent testing or drilling for gold was conducted for more than 2 years. Drilling from Freeport McMoran eventually proved the claim false.
- Standard industry procedures were ignored. Core samples were crushed entirely instead of being split for independent analysis by partners.
- An Australian company that had previously held the claim and had tested the site was always sceptical of the Bre-X’s find.
- The gold flakes extracted from the core fell easily from the sample and were more like river-extracted gold than mined gold.
- A fire at Bre-X’s field office destroyed many sampling records.
Bre-X management always denied knowledge of the fraud, but they ought to have been in tune with these issues. That they were not suggests either complicity or gross incompetence. I suspect that the legal odyssey now unfolding with Goldman Sachs will be another lesson in due diligence gone wrong.
Where Bad Data Stops and Bad Management Begins
Posted by Scott Andrews in Data Quality on April 15, 2010
A fish rots from the head down. - Greek Proverb
Data quality is an issue in every organization. But there is a point in time, in my opinion, that data quality stops being the problem. This is when management implicitly or explicitly allows, facilitates or encourages bad data, employing wilful blindness or even conspiracy. This is what brought down Enron. This continues to play itself out to some extent in the world financial crisis.
Sometimes those in trouble attempt to fix data in their favour. This can be at a small scale when a line manager applies hand-picked statistics to back up a previously made decision. This can also be at a corporate or national scale where excessive debts are off-loaded into secret accounts to make balance sheets more palatable. But do not mistake these actions for poor data quality management or even bad BI. This is bad management, pure and simple.
The real danger here is that management might start believing their own lies, or worse yet, get everyone else to. Like a rogue financial advisor spinning a Ponsi scheme, this might work for awhile but a day of reckoning will come. Once lost, hard-earned trust can be gone for good.
Business Intelligence Coming Home
Posted by Scott Andrews in Business Intelligence, Data Quality on April 11, 2010
Remember all those promises of household appliances that would work in sync and communicate with each other? The refrigerator that can manage its own inventory. The house that can manage its power consumption. Although still in the realm of science fiction, I have read two articles recently that indicate we may be moving in this direction. The first of these is an article in Wired Magazine which speaks of Mark Hamblin of Touch Revolution. He is creating a platform where home appliances can communicate with one another which should be available later this year. The second article speaks of development work at IBM which is focused on the “internet of things”. Specifically this involves devices that communicate with each other to gather and share information, with the ultimate objective of increased efficiency in such areas as energy consumption or traffic patterns.
What does this have to do with Business Intelligence? A lot, really. Consider a typical manufacturing facility with disparate machines of various technological capabilities. Unless machines are able to communicate with one another (or at minimum create machine readable logs of their processes), then you are dependent on machine operators to gather your manufacturing business intelligence. These human operators are fallible. They may forget to take readings or mark down information at the right time. They may even skew their stats to make their performance look better than it is by not recording spoilage or downtime.
But if machines can manage their own data collection, these particular problems should disappear. The more automated these processes can become, the cleaner the statistics should be coming out of them. So if we are moving towards greater integration of devices, whether in the home or otherwise, this should be good news for both business intelligence and data quality.
Getting The “Right” Data
Posted by Scott Andrews in Data Quality on April 8, 2010
I have been reading Thomas Redman’s quite excellent book Data Driven. Dr. Redman is one of the foremost leading experts on data quality. He has a Ph.D. in statistics and created and led the Data Quality Lab at AT&T Bell Laboratories from 1987 to 1995. He is widely known as the “Data Doc”.
To put it simply, Dr. Redman defines data quality as the “right” data at the “right” time, with a distinction between getting the data “right” and getting the “right” data. On the face of it this may appear obvious and almost trite, much like the Monty Python skit in which an actor explains it’s not just a matter of memorizing words but also about getting them in the right order. But it belies an important distinction. To get data right, one must debug, analysis, look for errors and such. But to get the right data, one must communicate effectively with the one asking the questions. In fact, Dr. Redman explains that often technological issues are confused with communication issues with respect to data quality. Definitions of business terms and assumptions must be made explicitly among business intelligence practitioners. Communication is thus a key component to effective data quality.
The Hazards of Dates
Posted by Scott Andrews in Data Quality, ETL on March 19, 2010
Early in my data management career, dealing with dates was a royal headache. Every database system and flat file seemed to have its very own date format, which was typically completely incompatible with all other date formats. This often meant reading dates in as strings and converting them, an often ugly and painful process.
Fortunately, dates seem to be easier nowadays. Maybe it’s just because my data sources are generally limited to Oracle, Microsoft SQL Server or SAP systems. But even here dates can be problematic.
The first thing to note when reading Oracle dates into a SQL Server database is that the date ranges do not match. Oracle dates range from Jan 1, 4712 BC to Dec 31, 4712 AD. SQL Server dates range from Jan 1, 1753 to Dec 31, 9999. You might reply “Who cares? I’m time-stamping general ledger entries, not archaeological digs.” But this is where data quality comes into play. Say your business user enters the year 99. Oracle may store this date as 1099. And when you read this date into SQL Server, the record will be automatically rejected because it is outside the valid SQL Server date range (Yes, this has happened to me). In this scenario it may be readily apparent that the user meant to enter 1999, and data cleansing can be performed. But that may not always be the case. Is this a birth date? It could conceivably be 1899 or 1999. Make sure your data cleansing assumptions are valid before you proceed.
Another interesting date that I often deal with is the SAP Julian Date. Whole numbers count days and fractions count part of a day. Don’t confuse the SAP Julian Date with the classic Julian Date (JD) which has been used by astronomers for centuries. The classic Julian Date starts at Jan 1, 4713 BC at noon Greenwich Time (they chose noon because historically astronomers didn’t make observations during the day). SAP’s Julian Date is actually Modified Julian Date (MJD) which was invented in the 1950’s at the dawn of the computer age. Its fractional component starts and ends at midnight (not noon) and its whole number component is reduced significantly (MJD = JD – 2400000.5). Typically there is no auto-convert feature available with SAP Julian Dates. If you are dealing with SAP Julian Dates in Cognos, I highly recommend making a standard function to convert your dates for you. Handling SAP Julian dates is easy as long as you get this function right. There is help online for Julian date conversion, plus many Julian calendars and converter calculators – these are very helpful for debugging purposes.
Incidentally, Microsoft Excel also stores dates as a whole number plus a fractional component. These are neither JD nor MJD but rather serial dates, starting at 1 on Jan 1, 1900 (dates entered prior to this will not be recognized as dates). Standard Excel dates can be read into SQL Server automatically without any problem.
Data Quality Revisited
Posted by Scott Andrews in Data Quality on March 17, 2010
I have not yet found an ideal solution in dealing with data quality, and find that the circumstances around the particular issue will decide how to approach it. My approaches fall into one of the following categories:
a) Reject bad data
The data modeler in me says to reject bad data. It doesn’t fit, so don’t accept it. A control table has been established for a reason – if the code or entry doesn’t match anything in it, it should be rejected. Rejected records can be reported in a myriad of ways, but the hazard of this approach is that these rejects can be easily ignored – that is, until someone decides the numbers “don’t match”. Then there is much consternation that the ETL process is “dropping” records “arbitrarily”, which is not at all true of course. But this approach does force correction in your source system, and your data warehouse contains only business-defined legitimate data.
b) Load bad data anyway
The business analyst in me says to load the bad data anyway – the data is there, so they should see it. This is problematic for a Kimball-style dimensional table, for essentially some fudging must go on. Inferred dimensional records can be created or a broad and unholy category of “unknown” must be added. Both approaches can mask the problem, the latter much more so. Bad data can reside in the data warehouse for months or even years before it is addressed, hidden away under a miscellaneous category or placeholder surrogate keys.
c) Cleanse the data
This is only possible if specific business rules or fuzzy logic can be applied to the data in question, which I find is not usually the case. If it was easy to fix, they would do it at source themselves.
d) Flag the data
Flagging the data is another approach, but also has its pitfalls – the largest being that the business users will simply ignore the flag. Surrogate key placeholders will still be required in the dimensional table. This is really just a minor variation of choice b. But it can also be useful for identifying data that is not rejected but still outside the business process norms (such as a sale price below cost or a hired employee above or below a certain age).
Identifying the source of the problem is another issue – was the data entered incorrectly or was a business process not followed that led to data outside the limits of acceptance? The difference between the two is huge and can only be determined by the business side, often only through considerable time and investment on their part. The effort they spend on this will depend on how large a problem they perceive the data in question to be.
What is of paramount importance to me is to recognize and respect your business users’ ownership of the data. This is doubly important as an outside consultant. Your users know their data and can often tell a problem instinctually. Don’t tell them a report was built to spec if they complain that data is wrong. Even if it was built to spec, the specs could be wrong – and often are. Find the source of the problem. It can be anything from an incorrect data source or calculation to a faulty assumption or misunderstanding.
Sometimes feats of creativity are required to deal with specific problems. Once I had to deal with an inventory system that went offline on weekends for certain locations, and the only source of reliable inventory data turned out to be back-dated GL entries. We had to insert adjustments into our weekend inventory snapshots after the fact based on these accounting journal entries because the official inventory count was not correct. Only your business users will know the best way to handle situations like this. The heart of the issue is that the data warehouse needs to reflect the reality of the business, and if it doesn’t you have a data quality problem.
Handling Dirty Data
Posted by Scott Andrews in Data Quality, ETL on February 18, 2010
A business user recently expressed surprise that my data warehouse loading process would reject dirty data. His process didn’t and our numbers didn’t match. But he was running a simple query – no check for code integrity against control tables – so his data set included invalid results.
In a Kimball-style star schema using surrogate keys, data integrity is forced by lookups. In fact you can’t really fudge it without making up new dimension keys out of thin air. And isn’t that worse? I say it is better to reject data and report it in another location than to force non-conforming data to conform. Unless specific business rules have been provided on how to clean dirty data, your best choice is to reject and report.


