Why do we collect data? What is it good for? Do we even need it? These are the questions that I see posed in the Canadian census debate. As a data practitioner, I have seen my share of useless data, poor data, fudged data, and absolutely essential data. Today’s corporations wade through masses of data to find nuggets of data gold. Running a corporation today without data is like flying a modern aircraft without a functioning navigational system. The same could be said of running a government.
Censuses have been conducted in all sophisticated societies in history, usually with the most up-to-date technology of the day. The U.S. Census of 1890 employed the newly invented Hollerith tabulating machine. Within decades tabulating machines were essential to major enterprises. Following a merger in 1911, Hollerith’s company was renamed International Business Machines in 1924.
Census data collection has evolved since then, with some trail-blazing nations forgoing the census altogether. But make no mistake: in place of mandatory long forms, those nations keep a centralized registry of citizens, complete with national ID numbers. I think this is a good and efficient system, but would libertarians ever agree to it? Surely not if a 20% chance of filling out a form once every 5 years is too “invasive”.
Why doesn’t a voluntary form work? Simply put, “responder bias” (what statisticians usually call selection bias): your sample population is self-selecting or otherwise skewed. In one of the most famous cases in history, George Gallup correctly called the 1936 presidential re-election of Franklin D. Roosevelt when the best-known poll of the day, run by the Literary Digest, got it wrong. The Digest sent mail-out ballots to millions of potential voters drawn largely from telephone directories and car registries. But in those days, millions of voters had neither telephones nor cars! How did Gallup do it? He sent interviewers to talk to people in person. And hence the Gallup poll became a mainstay of politics.
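To make the point concrete, here is a minimal simulation sketch of a skewed sampling frame. Every number in it is invented for illustration (the 40% phone-ownership share, the support rates, the sample sizes), and the function name `simulate_poll` is hypothetical. It is not a reconstruction of the actual 1936 polls. It only shows that a large sample drawn from a skewed list can miss badly while a much smaller random sample lands close to the truth.

```python
import random


def simulate_poll(n_voters=100_000, seed=1936):
    """Compare a big sample from a skewed frame with a small random sample.

    All rates below are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_voters):
        # Assumed: 40% of voters appear on phone/car lists.
        on_list = rng.random() < 0.40
        # Assumed: listed voters support the incumbent less often.
        p_support = 0.40 if on_list else 0.70
        supports = rng.random() < p_support
        population.append((on_list, supports))

    # True support across the whole electorate.
    true_share = sum(s for _, s in population) / n_voters

    # "Mail-out" poll: 10,000 ballots, but only to voters on the lists.
    frame = [s for on_list, s in population if on_list]
    mail_sample = rng.sample(frame, 10_000)
    mail_share = sum(mail_sample) / len(mail_sample)

    # "In-person" poll: only 1,000 voters, drawn from everyone.
    random_sample = rng.sample(population, 1_000)
    random_share = sum(s for _, s in random_sample) / len(random_sample)

    return true_share, mail_share, random_share


if __name__ == "__main__":
    true_share, mail_share, random_share = simulate_poll()
    print(f"true support:        {true_share:.3f}")
    print(f"skewed-list sample:  {mail_share:.3f}")
    print(f"small random sample: {random_share:.3f}")
```

Under these assumed rates, the skewed-list sample predicts a loss for the incumbent even though a clear majority of the simulated electorate supports him. Making the skewed sample larger does not help, because the error comes from who is on the list, not from how many are asked.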
Any census must deal with the question of data quality. Much has been made of the “Jedi Knight” entries under “religion”. Companies deal with data quality by applying standards or business rules to a data set. Certainly, collecting data as close to its source as possible is a very good way to ensure quality, as is automating data collection. But what is proposed in Canada will weaken data quality, not strengthen it. No superior alternative is being proposed.
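As a rough sketch of what a business rule against a data set can look like, the snippet below checks each record against two simple rules. The field names, the reference list `RECOGNIZED`, the age range, and the function `check_record` are all hypothetical, made up for illustration; they are not how any statistics agency actually validates its forms.

```python
# Hypothetical reference list; a real one would come from a maintained standard.
RECOGNIZED = {"buddhist", "christian", "hindu", "jewish", "muslim", "sikh", "none"}


def check_record(record):
    """Return a list of rule violations for one record (empty if clean)."""
    problems = []

    # Rule 1 (assumed range): age must be a whole number from 0 to 120.
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        problems.append("age out of range")

    # Rule 2: religion must match the reference list, ignoring case and spaces.
    religion = str(record.get("religion", "")).strip().lower()
    if religion not in RECOGNIZED:
        problems.append("religion not in reference list")

    return problems


if __name__ == "__main__":
    records = [
        {"age": 34, "religion": "Sikh"},
        {"age": 34, "religion": "Jedi Knight"},
        {"age": 150, "religion": "None"},
    ]
    for record in records:
        print(record, check_record(record))
```

A rule like this flags a record for review rather than silently discarding it, so a person can decide what to do with the unusual answer.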
Is the long form census perfect? Not at all. Is it 100% correct? No. Is it labour-intensive and quickly outdated? Yes. Could we collect data in a better way? Yes. But it is better by far than a voluntary form because a voluntary form will degrade data quality.
And why does data quality matter at the end of the day? Because bad management starts with bad data. Sometimes bad data is systemic, such as that which led to the global financial crash of 2008. Sometimes bad data is deliberate, such as that which led to the rise and eventual demise of Enron. But hiding or fudging data is dangerous and damaging – it will be discovered eventually and your reputation will show it. Whether you are trying to hide toxic assets, off-balance sheet debt, shoddy manufacturing, unsafe products, poor employee performance, or entire segments of your population, you will be found out by independent researchers, international governance organizations, concerned consumers, outraged citizens or inside whistleblowers. And the day of telling will not be pretty.


#1 by Julian Schwarzenbach on July 27, 2010 - 3:52 am
Scott,
A similar debate has recently started in the UK with the government committed to the 2011 Census in its current form, but suggesting scrapping/changing the approach for the 2021 Census.
In a former role in a water utility, we used the census returns to forecast changes in consumption as part of long-term capacity planning. Whilst not perfect, the 1991 census information (updated with ‘mid-year’ estimates) was a very valuable starting point. On top of this, planned developments were overlaid to forecast how consumption would change.
However, the 2001 census was far less useful for this purpose. This was not due to poorer methodology, but due to the fact that an opt-out ‘tick box’ had been added in order to prevent direct marketers using census information to target potential customers. This had the side effect of restricting the granularity of the data at post code level (whilst it was still usable at the far larger post code sector level).
In order to overcome these shortcomings, we explored whether a credit rating agency (who boasted how good their economic data was) could provide a viable alternative. They were using similar source data but were unable to provide the required level of granularity without large amounts of synthesis and estimation.
I agree with you that the census process tends to be expensive and still has errors, so in the current economic climate, it is correct to review how it is undertaken. It is essential, though, that any alternative should be assessed not just on the basis of costs, but also on data quality.
Julian
#2 by Dylan Jones on July 27, 2010 - 5:30 am
“…bad management starts with bad data…” – that’s really the crux of this argument, I guess: the more information your nation collects, the more balanced and informed its actions should be.
Great post. I think Julian raises some good points too about what happens when we opt out.
#3 by Annie Pettit on July 27, 2010 - 2:14 pm
Hear, hear, for data quality. It is a very technical and complicated process but passionate folks like you help keep us on track.
#4 by John Owens on July 27, 2010 - 6:56 pm
Great post.
“… you will be found out …… and the day of telling will not be pretty.”
So very true. However, in far too many enterprises management are willing to ignore the long-term costs, which they believe, or hope, will never arise, in order to reap what they (mistakenly) think are the short-term benefits.
In enterprises that have never experienced having good data, management often have no idea of the real benefits it brings. It is difficult for many of them to envisage how great these are.
One passionate data quality practitioner I know says, “The benefits of good data is like good sex. Until you have experienced it you have no idea that such a thing exists. But once you have experienced it you will never settle for anything less.”
There might be a good marketing message in there for Data Quality providers!
Regards
John