Archive for August, 2011
A number of years ago, I worked at a national print production centre that supported three regional printing centres. When I started with them, statistics were largely collected in spreadsheets and emailed around. Not only was this system highly error-prone, it was open to blatant manipulation. For instance, one print centre refused to provide spoilage counts. This centre had had managerial problems and a very ugly strike previously, so front-end workers saw no need to report any statistics that could paint them in a bad light. The other production centres dutifully reported their spoilage and resented the one that did not. The statistics that came out of this exercise were a mug’s game and did not really help management understand what was happening in the business.
Eventually we were able to collect spoilage statistics directly from the printers in machine-readable logs and the situation was resolved. But there are still times when you will be entirely dependent on end users to provide accurate data. And when these measures include things like returns or complaints, the end user may attempt to shield him or herself from blame. Even if it is not intended to be so, the end user may see data collection as a punitive exercise.
So how do you deal with such situations?
- Data should not be used to blame or punish. Data is meant to show a picture and reveal opportunities for improvement. Managers and end users need to understand that.
- Make data collection automated wherever possible. This will not only make end users’ jobs easier and substantially reduce errors; it will also prevent rogue employees from fixing the stats in their favour.
- Collection of data cannot be the duty of an employee if that data reveals the performance measures of that same employee. Any employee may be tempted to play up the good and hide the bad.
- At the end of the day, data will reveal only symptoms, not diseases. Collaborate with your employees to determine the root causes of business problems and you will go a long way to winning their trust.
Some interesting discussion on my last post got me thinking on the meaning of data quality. What does quality data really mean? Is it free from errors and omissions? Are defined terms used consistently throughout the data set? Or is it something even larger than that – does the data accurately reflect the reality of a business?
Errors and Omissions
This is the most obvious category of data quality problems. Errors at the source will never be corrected by any business intelligence system downstream; they appear just as plainly in the source system itself. An order to the wrong customer or for the wrong quantity will simply show as it is. Worse are omissions – for these prevent data rollups from functioning as they should, and the source system may not even be concerned about these data items. Is an omitted field a “zero” or an “unknown”? This is a critical question because of the basic arithmetic behind it. 1 plus 0 equals 1. But 1 plus unknown equals unknown. Throw in a few thousand unknowns and your summarization is invalid. Proper handling of unknowns requires input from your business analysts.
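A minimal Python sketch (my own illustration, not from the original post) of how the zero-versus-unknown decision changes a rollup. `None` stands in for an omitted field, the way NULL does in SQL:

```python
# Hypothetical order quantities; None marks an omitted field.
values = [1, 0, None, 5]

# Treating omissions as zero always yields a number -- possibly the wrong one.
as_zero = sum(v if v is not None else 0 for v in values)

# Propagating the unknown, as SQL's NULL does in "1 + NULL", yields no answer at all.
as_unknown = None if any(v is None for v in values) else sum(values)

print(as_zero)     # 6
print(as_unknown)  # None -- the whole rollup is unknown
```

Neither answer is automatically right; which treatment is correct for a given field is exactly the question your business analysts need to settle.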
If enough unknowns pollute your data set, you may have major data quality problems – to the point that your data set may not be useable. Early reports from Canada’s ongoing 2011 voluntary “census” indicate that this will be a problem, one that was widely anticipated by data professionals. Data profiling can help determine the extent of errors and omissions in a data set.
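The simplest form of data profiling is just counting how many records omit each field. A small sketch, using hypothetical records in place of a real source system extract:

```python
from collections import Counter

# Hypothetical records; a real profile would read from the source system.
records = [
    {"customer": "A", "qty": 10,   "region": "East"},
    {"customer": "B", "qty": None, "region": None},
    {"customer": "C", "qty": 4,    "region": "West"},
]

# Count omissions (None values) per field.
missing = Counter()
for rec in records:
    for field, value in rec.items():
        if value is None:
            missing[field] += 1

for field in sorted(missing):
    pct = 100 * missing[field] / len(records)
    print(f"{field}: {missing[field]} missing ({pct:.0f}%)")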
Consistent Use of Terms
In my line of work, this is the most common problem I see in data quality. This is where one group or department uses a term for something, and another group or department uses a similar or identical term for something else. When these groups get together to discuss data, they invariably accuse each other of having “bad” or “invalid” data. Differences can be attributed to simple terms or codes, dates (sales date, payment date, shipping date, etc.), or even periods of time (fiscal dates, annual dates, etc.). This is essentially a problem of communication, not the data itself. Thomas Redman writes extensively on this issue in his book Data Driven.
But what about Business Reality?
What managers are really looking for in their data is a picture of reality. Sometimes the data set does not capture this. A good example of this happened to me a few years ago when I was dealing with the inventory system of a large corporate client. They were taking daily snapshots of inventory from their warehouses, but one of their warehouses was off-line each and every weekend. In the case of the off-line warehouse, inventory changes were reported each Monday, but the Friday, Saturday and Sunday midnight snapshots were not correct. Even though their business was running on a state-of-the-art SAP system, their inventory data did not reflect reality. In order to solve this conundrum, we resorted to reading transactions from the General Ledger and inserting adjustments into the inventory snapshots. It was a roundabout solution but it worked. Of course, one could have suggested that the warehouse be on-line 24/7 like all the others, but often data processes have to accommodate the business, not the other way around.
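The adjustment logic can be sketched in a few lines of Python. All figures, dates, and movements here are hypothetical; the real work was done against GL transactions, but the principle is the same: roll the last good snapshot forward by applying each day's movements.

```python
from datetime import date, timedelta

# Last good snapshot before the warehouse went off-line (hypothetical figures).
snapshots = {date(2011, 8, 5): 1000}  # Friday midnight balance

# Inventory-affecting transactions recovered from the General Ledger,
# keyed by the day they occurred (also hypothetical).
gl_movements = {
    date(2011, 8, 6): -120,  # Saturday shipments
    date(2011, 8, 7): +300,  # Sunday receipt
}

# Roll the Friday balance forward to fill the missing weekend snapshots.
level = snapshots[date(2011, 8, 5)]
for day_offset in (1, 2):
    day = date(2011, 8, 5) + timedelta(days=day_offset)
    level += gl_movements.get(day, 0)  # no movement recorded -> carry forward
    snapshots[day] = level

print(snapshots)
```

The design choice worth noting is that the corrected snapshots are derived, not sourced: they exist only because a second system (the GL) happened to capture the reality the inventory system missed.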
My point on this subject is that data quality can be bigger than the data set itself, and sometimes a larger view needs to be taken to see the reality of a business situation. Even perfectly clean and correctly recorded data can be wrong if it doesn’t match the business’s reality or meet the business’s needs.
(The instructions below present setting up C10 for output to a file location on the network within the context of bursting reports, but there is no reason you can’t set up file output for the normal manual or scheduled execution of reports – PB)
Cognos 10 (like all versions of Cognos BI since ReportNet) has a fairly straightforward way of configuring a given report for “burst” output – that is, for generating a set of reports from a specific report specification, where the only difference between the reports is some selected value. Consider a generic sales report, where we have two different sales reps.
We might want to “burst” the report across the sales rep identifier, so we would get one report for each sales rep. We could then distribute each report to the appropriate rep.
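Conceptually, bursting is a group-by over the burst key followed by one output per group. A hypothetical Python sketch of the idea (Cognos does this internally; the rows and names here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical sales rows, as a report query might return them.
rows = [
    {"rep": "Jones", "customer": "Acme",      "amount": 500},
    {"rep": "Smith", "customer": "Bolt Co",   "amount": 750},
    {"rep": "Jones", "customer": "Crane Ltd", "amount": 250},
]

# Group rows by the burst key -- the sales rep identifier.
by_rep = defaultdict(list)
for row in rows:
    by_rep[row["rep"]].append(row)

# Emit one "report" per distinct key value.
for rep, rep_rows in sorted(by_rep.items()):
    total = sum(r["amount"] for r in rep_rows)
    print(f"Sales report for {rep}: {len(rep_rows)} orders, total {total}")
```

Each group becomes a complete report containing only that rep's rows, which is exactly what makes per-rep distribution possible.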
Setting a report up for bursting is performed in the Report Studio interface. Under File… Burst Options we set how the report will burst. We also have the option of selecting how the report will be distributed – either as an email or as a Cognos directory entry. The value for both the burst specification and the distribution must come from a query in the report.
However, it is quite possible that we might want the output to go to a file location instead. To set this up requires a little bit of configuration, but it is quite straightforward. In versions of Cognos BI prior to 8.3 this was a bit limiting – we essentially had only one destination we could output to. In even older versions, controlling the name of the output report was a pain as well – we needed secondary scripting to rename the report in the output file based on an associated XML file. This is no longer necessary.
First, we need to create a shared folder on our server. This can be any name, but should not be located in the installation directory. The user under which the C10 service runs must have full rights to the folder. In this case I’ve created a folder called CognosOutput.
Now I must start Cognos Configuration, and navigate to Actions… Edit Global Configuration:
Under General, I enter the value of my \\server\share combination, prefixed with file://
Click the Test button, and then OK.
Returning to the main configuration screen, select Data Access… Content Manager, and set the Save Report Outputs… value to True.
You are now set up for report output. IBM notes that it is very important that you not be running your Cognos installation as “localhost”, but rather under the name of the server the service is running on.
These steps have set up the top-level directory under which we can save report output. Within Cognos Connection we must now define what the actual destination output locations within this folder will be.
Open up IBM Cognos Administration from the Launch menu in Cognos Connection. Then navigate to the Configuration tab and select Dispatchers and Services, and in the upper right side of the screen select Define File System Locations:
Give the new location a name under the Name section, and (optionally) a description and screen tip. Finally, give it a location – this is where it will appear under the output file folder you set up above. You can use the “\” character to nest a folder beneath another folder. You do not declare the top level folder, so in this case NewOutput could be used as a location, but not CognosOutput\NewOutput.
Now you are ready to burst the report to the file system! Select Run with Options for the report in Cognos Connection, and under Delivery method select Save the Report. Then click Advanced Options and on the next page, select Save To the Filesystem, and select Edit the Options.
In this case I have selected “New Output”, which I have set up to output to NewOutput/NewOutput1 on my file system. I have also renamed the report to August_Sales_Reports.
Select OK, and select Burst The Reports from the radio button on the lower left side. Then click Run.
The reports will now be burst to the CognosOutput/NewOutput/NewOutput1 folder:
A few quirks: Cognos will append the language setting to the name of the report. It will also append the value by which the report was burst (useful for organizing the reports). Finally, it will output a second XML file that describes the report.