Mon Oct 15th, 2007 at 10:05:42 AM EST
A cause of increasing concern to me is the way numbers and statistics are handled in the public political debate. On the one hand, opinion shapers of various kinds cling to studies and statistics that they consider favourable to their cause, however questionable such statistics may be, with all the desperate fervour of a vampire hunter clinging to his crucifix and wooden stake. On the other hand, there is an impression among most people that 'statistics can be made to say anything,' an impression that allows partisan hacks to get away with sweeping dismissal of entirely valid studies underpinned by solid statistics.
Especially irritating is, of course, the tendency of those same partisan hacks to switch between these two views of statistics, frequently within the same interview or column, relying on the unfortunately short half-life of public memory to conceal this rhetorical two-step.
In the Ideal World(TM), reporters, politicians, pundits and - most importantly - the general public would have a sufficiently solid schooling in mathematics and statistics to render this sort of abuse of statistics a swift form of political suicide. In the real world, unfortunately, this is not the case, and most people have to rely on deconstructions, such as the ones frequently posted here on ET.
Promoted by Colman
But however regular and excellent those deconstructions are, it is impossible to cover the entire span of the public debate. This is partially because those who post them are doing their work pro bono and have to attend to a day job as well, while the political hacks and professional liars in various think tanks are paid lavish salaries to purposefully muddy the waters of public discourse. But it is mostly because people unburdened by principles or honesty have one great advantage: It takes far less time to cobble together a phony 'study' (or to tell an outright lie, for that matter) than it does to provide a convincing and comprehensive refutation.
With that fact in mind, and with all possible regard for the excellent and needed work of those who spend time deconstructing hack job statistics here and elsewhere, I propose a different approach: Arming people who do not necessarily have formal schooling in math, science or economics with sufficiently sensitive BS detectors to spot irregular use of numbers in their daily newspaper (I'll leave it as an exercise to the reader to determine whether it is the honest or the dishonest treatment of numbers that can be said to be 'regular' in their local paper).
To that end, I am going to present what will hopefully become a series of examples of shoddy or outright dishonest use of statistics. The main point will not be to deconstruct them, however, although of course that will be part of the exercise. The main point will be to attempt to extract some more general warning signs that one is dealing with sub-par number handling, and hopefully thereby equip the non-mathematically inclined reader with a set of red flags that will serve more general use than the deconstruction of these specific examples.
Picking on the Swedes
Bar graphs, highlighting and correlations
About a year and a half ago, a friend of mine sent me a breathless e-mail with an attached study (pdf) which he claimed 'disproved' the Danish welfare model. Such strong language naturally set my mental antennae twitching, so I decided to take a look at what this study of his actually said. Needless to say, I was underwhelmed.
Those readers who are Swedish-challenged need not worry. Understanding what they write in the text part of their report is not actually necessary in order to follow this deconstruction or the following extraction of indicators for your BS detector. Suffice it to say that the text itself is actually rather moderate. They couch their report in language that is very careful not to actually state any inflammatory conclusions in so many words, instead relying on the reader to draw the wrong conclusions based on their dishonest presentation of data.
So let's take a look at their graphs.
First off, they show a bar graph comparing the GINI scores of a number of countries (GINI is a measure of income inequality - a high GINI means an unequal income distribution). Notice that Sweden is highlighted in red:
Not being an economist, I cannot speak to the suitability of using GINI as a measure in this case. In the following, we shall assume that it is, indeed, an appropriate measure, as the report claims.
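For the curious, the GINI score itself is not mysterious: it is half the mean absolute difference between all pairs of incomes, relative to the mean income. A minimal sketch in Python follows - the sample incomes are made up purely for illustration, not taken from the report:

```python
def gini(incomes):
    """GINI coefficient: 0 means perfect equality, values near 1 mean
    one person holds nearly all the income.

    Uses the sorted-rank formula, equivalent to half the relative
    mean absolute difference between all pairs of incomes."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    weighted = sum((2 * i - n + 1) * x for i, x in enumerate(xs))
    return weighted / (n * total)

print(round(gini([10, 10, 10, 10]), 3))  # 0.0  - everyone earns the same
print(round(gini([1, 1, 1, 97]), 3))     # 0.72 - one person has nearly everything
```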
Then, in rapid succession, come another three bar graphs, showing the disposable income of the first, the second through fourth, and the fifth quintiles, respectively (I've swapped the last two relative to the original report, hence the funny numbering).
(For the non-mathematically inclined: a 'quintile' is one fifth of the population you're looking at - in other words, the first quintile of the income distribution is the fifth of the population with the lowest incomes, while the fifth quintile is the fifth with the highest incomes.)
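The definition above can be sketched in a couple of lines of Python - again with made-up incomes, just to show the mechanics of carving a sorted population into fifths:

```python
# Ten made-up incomes; sorting first so the slices go from poorest to richest.
incomes = sorted([12, 45, 23, 9, 67, 34, 51, 18, 72, 28])
n = len(incomes)

# Slice the sorted list into five equal-sized groups (quintiles).
quintiles = [incomes[i * n // 5:(i + 1) * n // 5] for i in range(5)]

print(quintiles[0])  # [9, 12]  - the poorest fifth
print(quintiles[4])  # [67, 72] - the richest fifth
```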
Disposable income for the poorest fifth of the population. Sweden is highlighted.
Disposable income for the middle three-fifths of the population. Sweden is highlighted.
Disposable income for the richest fifth of the population. Sweden is highlighted.
Lastly, they have a graph showing growth in the disposable income for the first quintile. Again, Sweden is highlighted in red:
Looking at these graphs, it's quite clear that:
a) Sweden has very low income inequality.
b) The disposable income in Sweden isn't all that great compared to the other countries in the study. Not even for the poorest fifth of its population.
c) The growth in the disposable income for the poorest fifth of the Swedish population is very small.
The casual reader may be forgiven for concluding that these data show that income equality makes society poorer across the board - including the poor people it was supposed to help! And, conversely, that income inequality makes society - including the poor - richer across the board (the second conclusion does not, in fact, follow from the first, but I digress). This is, at least, the conclusion that my friend came to, based on his presumably cursory examination of the study. The casual reader would be wrong, however. [NB: This paragraph was revisited on 14/10-07 - Jake]
To illustrate what's wrong with making that conclusion based on the highlighting of Sweden, let's repost the images. But now I've made a slight modification. I've highlighted Norway in bright green as well as leaving the original red highlight of Sweden:
Looks kinda different now, doesn't it?
The reason it does, of course, is that highlighting a single country out of a set in a bar graph is entirely the wrong way to go about this kind of data analysis. What they should have done was compute a real measure of statistical correlation, or at the very least present scatterplots instead of bar graphs.
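To make "a real measure of statistical correlation" concrete, here is a minimal sketch of the standard Pearson correlation coefficient in Python. The (GINI, income) pairs are invented for illustration - they are NOT the data from the report - but they show the kind of calculation the report's authors never performed:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples.
    Returns a value in [-1, 1]; values near 0 mean no linear relationship."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (GINI, disposable income) pairs, purely for illustration.
gini_scores = [0.23, 0.25, 0.28, 0.31, 0.33, 0.36]
incomes     = [14.1, 15.0, 13.2, 14.8, 13.9, 14.5]

r = pearson(gini_scores, incomes)
print(round(r, 2))  # near zero: no meaningful relationship in this toy data
```

The point is that a single number like this (ideally alongside a scatterplot) either supports a claimed relationship or it doesn't - whereas a bar graph with one country painted red supports nothing at all.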
That means, of course, that my own little highlighting stunt does not prove that income equality increases wealth, is correlated with increased wealth, or even that it doesn't hamper the increase of wealth. All it proves is that the analysis Svenskt Näringsliv provided is seriously flawed.
There are a couple of other things about the study that are worth consideration (such as the number of countries they have chosen to sample, the utter lack of transparency w.r.t. the method by which the countries in the sample were chosen, etc.), but I think I've presented enough to conclude this essay.
Here I hope to summarize the analysis above, and if possible extract the lessons that are most applicable for everyday use. In this case, I have two lessons that the reader is encouraged to take to heart:
Beware of bar graphs - if someone tells you that X causes Y and presents you with bar graphs, scrutinize them carefully. The proper graph to show correlation is in most cases a scatterplot.* If he's using something else, chances are he's trying to pull a fast one on you.
Especially beware of highlighting - I'm sure highlighting single data points has legitimate uses, but off the top of my head, I cannot think of a single one. A very good indication that Someone Is Up To No Good.
*As an aside, the current incarnation of the wikipedia page on scatterplots shows another thing to beware of: Inappropriately drawn regression curves. I hope to expand a bit on that topic in the future, for now it's sufficient to note that the straight line connecting the two clusters of points on the Old Faithful-graph on the wiki-page is meaningless. At best.
Quite apart from these lessons, there are two points that I wish to drive home:
First, notice that I managed to deconstruct this study purely by looking at the way it treated its data. I never once had to question the underlying assumptions or the validity of the data used to underpin their analysis. This showcases a very important point: It may be easy to lie with statistics, but if the reader is minimally numerate, it's agonizingly hard to lie convincingly.
Second, lest the reader assume that this is some idle intellectual exercise, or that I'm picking on an easy target, let me assure you that I have seen mainstream, serious news outlets from all over the political spectrum citing studies that were at least as questionable as this one as primary sources. My hope is that the reader should now be able to spot at least some of them as well. Because there's only one way to make them go away: Stop buying into shoddy studies, stop trusting the newsies that do buy into them, and make very sure that the newsies in question know why you don't trust them. If their subscriber base decides to demand integrity in their number-crunching, they might just wake up and smell the coffee.
An aside: I did plot the data from the figures above myself and did a little chi-by-eye analysis. If you're prepared to take my word on the issue, you may rest assured that the data does not in any way, shape or form support the notion that increased income disparity is beneficial. If you're not prepared to take my word for it, I am, of course, prepared to show my work in the comments upon request, but I've left it out here, as it really is beside the point of this diary.
Another aside: There was another figure in the study that I thought I'd include for the general amusement of the readership. I'll leave it as an exercise to the reader to figure out why it made me laugh my butt off:
(Hint: It has something to do with significant figures, sample size and goodness-of-fit.)
The Entire Series:
How To Lie With Numbers - a short guide to politics and other things - introduction - bar graphs - highlighting.
How To Lie With Numbers 1½ - more bar graphs - a cautionary note
How To Lie With Numbers 2 - Laffer Nonsense From The WSJ - scatterplots - fitting methods - data grouping