Archive for category Data Quality

Keeping End Users on the Straight and Narrow

A number of years ago, I worked at a national print production centre that supported three regional printing centres.  When I started with them, statistics were largely collected in spreadsheets and emailed around.  Not only was this system highly error-prone, it was open to blatant manipulation.  For instance, one print centre refused to provide spoilage counts.  This centre had had managerial problems and a very ugly strike previously, so front end workers saw no need to report any statistics that could paint them in a bad light.  The other production centres dutifully reported their spoilage and resented the one that did not.  The statistics that came out of this exercise was a mug’s game and did not really help management understand what was happening in the business.

Eventually we were able to collect spoilage statistics directly from the printers in machine readable logs and the situation was resolved.  But there are still times that you will be entirely dependent on end users to provide accurate data.  And when these measures include things like returns or complaints, the end user may attempt to shield him or herself from blame.  Even if it is not intended to be so, the end user may see it as a punitive exercise.

So how do you deal with such situations?

  • Data should not be used to blame or punish.  Data is meant to show a picture and reveal opportunities for improvement.  Managers and end users need to understand that.
  • Make data collection automated wherever possible.  This will not only make end users’ jobs easier and substantially reduce errors; it will also prevent rogue employees from fixing the stats in their favour.
  • Collection of data cannot be the duty of an employee if that data reveals the performance measures of the said employee.  Any employee may be tempted to trump up the good and hide the bad.
  • At the end of the day, data will reveal only symptoms, not diseases.  Collaborate with your employees to determine the root causes of business problems and you will go a long way to winning their trust.

 

No Comments

The Meaning of Data Quality

Some interesting discussion on my last post got me thinking on the meaning of data quality. What does quality data really mean? Is it free from errors and omissions? Are defined terms used consistently throughout the data set? Or is it something even larger than that – does the data accurately reflect the reality of a business?

Errors and Omissions

This doesn't look good - 2011-05-28
Creative Commons License photo credit: 4nitsirk

This is the most obvious category of data quality problems. Errors at source will never be accurately reflected in any business intelligence system, but these errors will appear plainly in the source system too. An order to the wrong customer or for the wrong quantity will simply show as it is. Worse are omissions – for these prevent data rollups from functioning as they should, and source system may not even necessarily be concerned about these data items. Is an omitted field a “zero” or “unknown”? This is a critical question because of the basic arithmetic behind it. 1 plus 0 equals 1. But 1 plus unknown equals unknown. Throw in a few thousand unknowns and your summarization is invalid. Proper handling of unknowns requires input from your business analysts.

If enough unknowns pollute your data set, you may have major data quality problems – to the point that your data set may not be useable. Early reports from Canada’s ongoing 2011 voluntary “census” indicate that this will be a problem, one that was widely anticipated by data professionals. Data profiling can help determine the extent of errors and omissions in a data set.

Defined Terms

In my line of work, this is the most common problem I see in data quality. This is where one group or department uses a term for something, and another group or department uses a similar or identical term for something else. When these groups get together to discuss data, they invariably accuse each other of having “bad” or “invalid” data. Differences can be attributable to simple terms or codes, dates (sales date, payment date, shipping date etc.), or even periods of time (fiscal dates, annual dates etc.) This is essentially a problem of communication, not the data itself. Thomas Redman writes extensively on this issue in his book Data Driven.

But what about Business Reality?

What managers are really looking for in their data is a picture of reality. Sometimes the data set does not capture this. A good example of this happened to me a few years ago when I was dealing with the inventory system of a large corporate client. They were taking daily snapshots of inventory from their warehouses, but one of their warehouses was off-line each and every weekend. In the case of the off-line warehouse, inventory changes were reported each Monday, but the Friday, Saturday and Sunday midnight snapshots were not correct. Even though their business was running on a state-of-the-art SAP system, their inventory data did not reflect reality. In order to solve this conundrum, we resorted to reading transactions from the General Ledger and inserting adjustments into the inventory snapshots. It was a round-about solution but it worked. Of course, one could have suggested that the warehouse be on-line 24/7 like all the others but often data processes have to accommodate the business, not the other way around.

My point on this subject is that data quality can be bigger than the data set itself, and sometimes a larger view needs to be taken to see the reality of a business situation. Even perfectly clean and correctly recorded data can be wrong if it doesn’t match the business’s reality or meet the business’s needs.

No Comments

The Information Continuum

Recently an American citizen was turned away from a Canadian border crossing for having a 20 year old conviction related to a high school prank. An immigration lawyer commented that he was not surprised as “the databases are getting deeper and deeper”. The Globe and Mail complained that this case and other similar ones fail to accomplish the goal of keeping Canada safe from real criminals and terrorists. While this decision appears frivolous or even silly, I think it highlights the differences between data, information and business intelligence.

The Information Continuum

Data – A Fact or a piece of information, or a series thereof

Information – Knowledge discerned from data

Business Intelligence – Information Management pertaining to an organization’s policy or decision-making, particularly when tied to strategic or operational objectives

So when hazy, ill-formed or poorly constructed objectives are tied to simple data access, bad business intelligence can be the result.  Let’s consider two strategic objectives:

“Keep Canada safe from real criminals and terrorists”

(as suggested by the Globe and Mail editorial)

- This leads to questions like

  • What is a real criminal or terrorist?
  • How can they be identified?

- This might lead to

  • Information sharing with FBI and Interpol

- Data sets might include

  • FBI’s most wanted list
  • Interpol’s wanted list

In this case, the Information Continuum is  approaching the Business Intelligence level.

“Keep convicted criminals out of Canada”

(the apparent objective applied in this case)
  • This is a blanket statement that will catch everyone ever convicted of any crime (assuming the data is available), regardless of circumstances
  • It invites no further questions
  • You will have false positives (prior convictions with no future criminal intent) and false negatives (active criminals with no prior convictions)
  • In this case, the Information Continuum does not move beyond the data level

Simple data access will not make better business intelligence – if your people don’t have the right resources and direction, it could make it worse.  Your analysts should be able to distinguish between real information and noise, and they should always be striving towards your organization’s clear and thoughtful objectives.

6 Comments

The Panic Factor

“Don’t Panic!” – The Hitchhikers Guide to the Galaxy

Panic
Creative Commons License photo credit: scott1723

I often find that when business users are faced with large, even colossal, discrepancies in data sets, their first response is sheer panic.  For me, the larger the discrepancy the calmer I am about it.  Why?  Because discovering source problems in data sets is almost always much easier the larger they are.  The truly bedevilling problems are the inconsistent, small, and seemingly random discrepancies that occur.

What do I find the number one cause of numbers being 5, 10 or even hundreds of times greater than they should be?  A time series snapshot that is summing all time periods.  This is remarkably easy to fix, and can happen easily in a data cube without proper current time period settings.  It is a common mistake for junior report developers, especially when they are expecting to see current period only.

Other common and simple problems include rounding errors or inconsistencies, improper unit of measure calculations, unexpected null values nixing a summarization total, or cube update failures.  Always check for obvious and easy solutions first, and work your way down to more complicated resolutions as necessary.  Note when a problem started and attempt to isolate what has changed since then.  Be methodical and don’t panic!

No Comments

Why Data Quality Matters

Why do we collect data?  What is it good for?  Do we even need it?  These are the questions that I see posed in the Canadian census debate.  As a data practitioner, I have seen my share of useless data, poor data, fudged data, and absolutely essential data.  Today’s corporations wade through masses of data to find nuggets of data gold.  Running a corporation today without data is like flying a modern aircraft without a functioning navigational system.  The same could be said of running a government.

Censuses have been conducted in all sophisticated societies in history, usually with the most up-to-date technology of the day.  The U.S. Census of 1890 employed the newly invented Hollerith tabulating machine.  Within decades tabulating machines were essential to major enterprises.  Following a merger in 1911, Hollerith’s company was renamed International Business Machines in 1924.

Census data collection has evolved since then, with some trail-blazing nations forgoing the census altogether.  But make no mistake: in place of mandatory long forms, there is a centralized registry of citizens complete with national ID numbers.  I think this is a good and efficient system but would libertarians ever agree to this?  Surely not if a 20% chance of filling out a form once every 5 years is too “invasive”.

Why doesn’t a voluntary form work?  Simply put, “responder bias”: your sample population is self-selecting or otherwise skewed.  In one of the most famous cases of responder bias in history, George Gallup correctly called the 1936 presidential re-election of Franklin D. Roosevelt when everyone else got it wrong.  Most other pollsters of the day sent mail-out ballots to potential voters based on phone numbers and car registries.  But in those days, millions of voters had neither telephones nor cars!  How did Gallup do it?  He sent pollsters to talk to people in person.  And hence the Gallup poll became a mainstay of politics.

198/365 - Quality
Creative Commons License photo credit: aithom2

Any census must deal with the question of data quality.  Much has been made of the “Jedi Knight” entries under “religion”.  How companies deal with data quality is by employing standards or business rules against a data set.  Certainly collecting data as close to its source as possible is a very good way to ensure quality data, as is automating data collection.  But what is proposed in Canada will weaken data quality, not strengthen it.  No superior alternative is being proposed.

Is the long form census perfect?  Not at all.  Is it 100% correct?  No.  Is it labour-intensive and quickly outdated?  Yes.  Could we collect data in a better way?  Yes.  But it is better by far than a voluntary form because a voluntary form will degrade data quality.

And why does data quality matter at the end of the day?  Because bad management starts with bad data.  Sometimes bad data is systemic, such as that which led to the global financial crash of 2009.  Sometimes bad data is deliberate, such as that which led to the rise and eventual demise of Enron.  But hiding or fudging data is dangerous and damaging – it will be discovered eventually and your reputation will show it.  Whether you are trying to hide toxic assets, off-balance sheet debt, shoddy manufacturing, unsafe products, poor employee performance, or entire segments of your population, you will be found out by independent researchers, international governance organizations, concerned consumers, outraged citizens or inside whistleblowers.  And the day of telling will not be pretty.

5 Comments

How Denmark does it: The Danish Census Model

Throughout the ongoing controversy in Canada over the end of the mandatory long form census, many have argued that Denmark (among other Scandinavian countries) no longer conducts a census.  I asked fellow data professional and blogger Henrik Liliendahl Sørensen to explain how his country manages population data as a guest contributor to BIProfessional.com:

Census Options: The Scandinavian Model

Dannebrog
Creative Commons License photo credit: @boetter

The Scandinavian model exemplified through the Danish variant does not require citizens to periodically fill out a census form.  Census information is extracted automatically when needed from administrative registers.

When a new Danish citizen is born (typically at a hospital) the child is assigned a national identification number within minutes. The ID is linked to the mother’s ID and, if she is married, also automatically to her husband as father as well. Otherwise the father’s ID (if possible) is obtained within a short time. In case of immigration, procedures exist for assigning national ID and collecting basic data. All information is kept in a centralized citizen registry.

The less romantic consequence of a marriage is that the two national IDs are linked in the citizen registry from that day forward. A divorce will result in a deactivation of the link.

All buildings, and if not a single family house, all the apartments within, are reflected in a centralized registry. When establishing a new house or apartment a lot of data is captured and if the residence is changed the data will be updated.

Your place of living is a relation between your national ID and the unique ID of the residence having the valid-from-date being the day you moved in until the day you move on is registered as the valid-to-date.  

Practically all events in the life of a citizen involving a public sector body are logged with the national ID. This also includes healthcare and interaction with financial services and employer relations where mandatory reporting exists.

The technical opportunities for compiling census information based on these registrations are plenty. However every case must be approved by a body within the authorities and wherever possible data must be made anonymous in the actual processing.

6 Comments

A Senseless Change to the Census in Canada

There has been considerable controversy brewing here in Canada since the government announced this month that the 2011 mandatory long form census will be dropped, to be replaced by a voluntary one.  Opposition has been fierce and on many fronts, from statisticians, politicians, business leaders and social advocacy groups.  Yesterday the chief statistician with the government quit his job in protest.  All agree that if returned on a voluntary basis, the results will be skewed and the data set will not be comparable with previous census data.

Incidentally, the U.S. government experimented with this idea in 2003 and quickly dropped it.  They found that the overall response rate dropped by one third, and the response among some demographic groups dropped to a mere 20%.  The Canadian government proposes to get around these shortcomings by sending out even more forms and presumably cleansing the degraded data, an approach the U.S. government rejected on cost considerations.

Putting aside political arguments, what if we were to look at this strictly from a data quality point of view?  Clearly a fundamental loosening of the rules around data collection will have profound consequences on the data collected.  Imagine if you instructed your sales staff that they could enter their sales data voluntarily.  Some may continue to enter data as before.  Some may stop entering data at all.  And still others may enter data if they get around to it, perhaps fudging or guessing.  Maybe they will deem the exercise “optional”, or perhaps “unimportant” or even “useless”.  Worse yet, maybe they will attempt to “game” the system in their favour (yes, I have seen this happen).  With missing, incomplete or false data, your data quality is wholly compromised.  Now who is your best sales person?  Your worst sales person?  Can’t tell anymore?  Data cleansing might be able help, but it is labour-intensive, error-prone and very expensive.

Those of us who work with data professionally know that data integrity is determined by business rules.  When you change the rules you change the results.  And when you change the results you change the quality.  It is very common to see organizations attempt to improve data quality through stricter or more explicit business rules, but it is quite bizarre to see our government choose to do the opposite.

5 Comments

How to rapidly resolve a problem data set

Step 1:  Verify what you have been told

A user has brought a data problem to your attention.  Be sure to verify that everything your user has told you is in fact correct.  Your user may not fully understand the data set or may be operating under false assumptions.  Make sure you can see the problem data for yourself and that you understand the problem as presented.  This will save you hours from investigating non-existent problems that were merely data misinterpretation.

Step 2:  Reduce the data set

Focus like a laser beam on the smallest possible set of data that exhibits the data set problem.  Ideally you’d like this to be less than 10 if possible, but anything fewer than 100 is easily manageable.  This will allow you to speak to your user in concrete terms possibly about a single transaction, and will also allow you to rapidly run ETL or OLAP cubes quickly and repeatedly on this reduced data set.

If your data is in an OLAP cube, pulling a reduced data set can be simply a matter of navigating the hierarchies of a dimension for a small record set.  Otherwise, your user may do this for you up front by providing a specific example.  Ask for an example from your business user if necessary.

Step 3:  Isolate the environment

Where is the problem?  In the ETL code?  In the OLAP cube delivery?  In the report?  In a report’s individual query?  This will take some investigative work.  Be sure to identify the root cause, not merely a symptom.  The quicker you can isolate the environment of your problem the sooner you will be able to find the problem source.

Step 4:  Find and resolve

The simplest explanation is usually the first place to check.  Look for recent changes to an environment.  In a report, check prompts, filters and calculations.  In ETL, check logic against business rules and ensure results are as expected.  In all cases, check query results closely.  Individual queries, calculations and other derivatives are a common source of problems.

Step 5:  Follow Through

Ensure that you follow through with the user.  You may have resolved an issue, but it might not necessarily have been the problem that the user had in mind.  Do not assume your user will magically know the problem is fixed.  Confirm with the user that the issue is in fact resolved.

If multiple issues exist, this cycle may need to be repeated until all issues are addressed.

1 Comment

The Curious Case of Google and the Utah Jaywalker

A woman struck by a car crossing a highway is suing Google for bad directions.  This case raises some interesting points between data providers and data consumers.

Caution Signs
Creative Commons License photo credit: DJ-Dwayne

Firstly, while the data itself was technically correct, it was obviously faulty but taken on faith to be an optimal solution.  How many times a day does this happen in corporate life?  The numbers are computer-generated and appear in a web portal, so they must be correct and the best we have!  This is not always so, and common sense should not be ignored.  Just because you have a state of the art dashboard doesn’t mean your organization can be set to permanent auto-pilot.  Anomalies need to be investigated.  Data that doesn’t make sense needs to be questioned.  Savvy business people often have a sixth-sense for bad information, even if they can’t immediately prove it.  Be skeptical.  Question your process and your numbers.  Look for ways to continuously improve.  Business intelligence is about supplementing your own intelligence, not replacing it.

Secondly, where does responsibility lie?  I remember a law class I took years ago referred constantly to the proverbial “reasonable person”.  Would a “reasonable person” take directions to walk into highway traffic literally?  While Google clearly generated a bad result in this case, is it their fault entirely that someone blindly followed those directions?  It reminds me of the defense of the war-crime-accused “I was just following orders”.  Sometimes following orders isn’t good enough.  Responsibility is shared between those who create data and those who consume it.  This is not to let Google or bad BI off the hook but to put data consumers on their guard.

Thirdly, who was at the greatest risk in this situation?  Obviously the woman following the Google directions.  While it is true that Google led her to the highway, she herself chose to step into the busy traffic – a life-and-death decision for her!  Think about this the next time you take a major decision based on third party data:  What do you have to lose?  What do they?  Who is going to have the greatest incentive to ensure data quality?  He or she who carries the greatest risk.

To conclude, if you know in your gut that something is wrong, don’t proceed!  Computers are fallible and business intelligence is not always correct.  BI is only as good as the assumptions used to build its underlying data models.  Computer models are only a best-effort attempt to replicate reality in numbers, not a virtual alternate all-knowing reality in themselves.  Good business intelligence comes as a result of computers and smart business people working together.  Don’t check your brain at the door – you’ll need it.

, , ,

4 Comments

When Good Data Goes Bad

scream and shout
Creative Commons License photo credit: mdanys

How many times have you heard the mother of a wayward son say “He’s not bad.  He’s just misunderstood.”  So it goes with data as well.  For how can you tell bad data?  Did that sale really happen?  Does this employee really exist?  Well, 99% of the time, yes.  In my experience bad data isn’t bad, it’s just misunderstood.

Nowhere is this clearer than when you consider dates.  A single transaction can and often is associated with a plethora of dates.  When was it ordered?  When was it promised?  When was it shipped?  When did the customer take delivery?  When was it posted to the general ledger?  If any business users are not talking about the same date at any point in their analysis, there will be trouble.  “Why don’t our March sales numbers match??”  My first question is “What is the date that drives those numbers?”

A measure’s definition can come into play.  When measuring sales, do you include returns or not?  Are sales gross or net?  When counting employees, do you count contractors and temps?  How are measures defined precisely?

Likewise, a dimension’s definition can also muddy the water.  How are your sales broken down?  How do you define your list of employees at headquarters or in a particular region?  How are dimensions defined precisely?

When confronting a data discrepancy, my approach is to find the differences, find out what the “correct” result should be and then apply the rule change across the data set.  Employee X was showing up on report Y because they were defined by postal code, not by reporting manager.  What is the correct rule?  Your decision makers will have to make that call.  The decision must then be incorporated into the data warehouse design and also be communicated out to the business analysts for future consistency in reporting.  Communication is a key component to ensuring data quality across an enterprise.

, , ,

No Comments