This graph is so manifestly absurd that I don't believe that it would actually fool any of my readers. Nevertheless, I will analyze it in some detail, to show why it's wrong. This is not only an exercise in poking fun at the WSJ, though there is certainly an element of that to it as well; the main point of the exercise will be to illustrate some general principles with regard to curve fitting.
First off, I should note that displaying a curve in the same plot as a data set can serve two general (honest) purposes: It can either be a model that is independent of the data displayed and is provided for comparison or evaluation, or a model that is fitted to the data. Examples of the former include such things as theoretical computations of expected experimental results.
In academic writing, it is considered highly sloppy not to identify the origin of curves, but apparently newsies - or at least WSJ editorial newsies - do not feel particularly constrained by this custom. When no indication of the curve's origin is given, it is usually assumed to be some kind of fit to the data at hand - which also seems to be the case here, judging from the fact that the data point for Norway is exactly on the curve.
But failure to correctly identify the nature and origin of curves is just sloppy - it does not nearly rise to the level of dishonesty that justifies using a graph as an example of dishonest treatment of data. No, what's wrong with this curve is that it gives the impression of being a fit to the data, while it is anything but! There are many ways to fit curves to data, but all of them have one thing in common: The resulting curve should be somewhere in the vicinity of the data that it is fitted against.
For most kinds of fitting the fit should be above roughly half of the data points and below the rest. Furthermore, the points that lie above or below the curve should be fairly uniformly distributed across the length of the sample (e.g. a curve is usually a bad fit if all the points in the left-hand side of the plot are below the curve and all the points in the right-hand side are above, even if half the points are below and half above in total). It should be noted that while some experiments are sufficiently precise to follow theory almost exactly, resulting in a great number of points that are precisely on the fitted line, this is rare even in physics and I have never seen any examples of this outside that field.
This curve manifestly does not obey any of these rules - in fact it is off by so much that I doubt that the bozos at the American Enterprise Institute who made it even used a fitting algorithm to draw it!
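The two rules of thumb above (roughly half the points above the curve, and the above/below points spread evenly along the x axis) can be checked mechanically. The following is only a minimal sketch: the function name residual_balance and the numbers are made up for illustration and are not the WSJ data.

```python
def residual_balance(xs, ys, curve):
    """Fraction of points above the curve: overall, left half, right half."""
    points = sorted(zip(xs, ys))                  # order the points along the x axis
    above = [y > curve(x) for x, y in points]     # True where the point lies above the curve
    mid = len(above) // 2

    def fraction(flags):
        return sum(flags) / len(flags)

    return fraction(above), fraction(above[:mid]), fraction(above[mid:])


# Made-up illustrative data (NOT the WSJ data): y is roughly 2*x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.3, 3.8, 6.1, 7.9, 10.2, 11.8, 14.1, 15.9]

good = residual_balance(xs, ys, lambda x: 2 * x)  # a sensible candidate curve
bad = residual_balance(xs, ys, lambda x: 9.0)     # a flat line through the middle

print(good)
print(bad)
```

The flat line has half the points above it in total, but all of them sit in the right-hand half of the plot, which is exactly the kind of lopsidedness that marks a bad fit.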
Alright, that was the bad, now let's move on to the ugly.
This guy apparently thinks that he's found a Laffer curve in the data after all. Now, this is supposed to be mainly a review of curve fitting techniques, not a fisking of the WSJ's dishonest graph, but as it happens, The Englishman illustrates both a legitimate data analysis technique and a less permissible way of treating data.
The basic idea behind this graph is to average the data in a number of intervals, in order to smooth out the plot. This is an entirely valid way of doing business, and can often be very useful.
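For readers who would rather not average by eye, here is a minimal sketch of the grouping step. The helper group_and_average, the interval width, and the numbers are all assumptions made for illustration; none of them come from the graphs discussed here.

```python
from collections import defaultdict


def group_and_average(xs, ys, width):
    """Group points into x intervals [k*width, (k+1)*width) and average y in each."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[int(x // width)].append(y)
    # One (interval centre, mean y, number of points) triple per non-empty interval.
    return [((k + 0.5) * width, sum(v) / len(v), len(v))
            for k, v in sorted(bins.items())]


# Made-up numbers for illustration only (NOT read off any real graph).
tax_rate = [12, 15, 18, 24, 25, 28, 30, 33, 35, 38]
revenue = [2.0, 3.0, 4.0, 3.0, 3.5, 4.0, 4.0, 2.0, 3.0, 3.0]

grouped = group_and_average(tax_rate, revenue, width=10)
for centre, mean_y, n in grouped:
    print(centre, mean_y, n)
```

Keeping the number of points per interval alongside each average is a deliberate choice: it is what you need later if you want to put error bars on the grouped points.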
So far so good. The averaging is done by eye, which is sloppy, but not inherently wrong. Unfortunately, he makes a major mistake when he connects the resulting averages by line segments and declares that there is a maximum. Grouping data is a way of simplifying your data set, but as a rule of thumb, if the kind of connect-the-dots exercise he's engaging in here is not justified before you compress the data, it's probably not justified afterwards either.
What he should have done was fit some curve to his new data. Unfortunately for him, it is not exactly clear what function describes the Laffer curve, so the best he could have done was try some low-order polynomials to at least get a handle on the behaviour of his data. I have done precisely that below, but that is rather beside the point of this diary.
When you see a curve plotted in the same figure as a set of data points, you should note the following:
Is it a theoretical curve or a fitted curve? - if it doesn't say, assume that it's a fit.*
Fits should have roughly half the data points above and half below - furthermore, the data points that are above and below the curve should be distributed in the same way along the x axis.
Grouping data into intervals of one variable and averaging the other is usually a valid way of simplifying your data set
Connect-the-dots is usually not a valid way of fitting data - this is one of the most commonly seen honest mistakes.
Don't get cocky - this guide is not by any stretch of the imagination a complete overview of data analysis or curve fitting - people can and do write entire books on this subject. What I have presented here is a few rules of thumb, but the reader is encouraged to exercise caution in applying them, even more so than for the other diaries in this series.
*Note that in some cases the curve will be the result of a combination of fit and simulation - such is the case for climate models, for instance, but this is usually noted in the nearby text.
Beware of bar graphs - if someone tells you that X causes Y and presents you with bar graphs, scrutinize them carefully. The proper graph to show correlation is in most cases a scatterplot. If he's using something else, chances are he's trying to pull a fast one on you.
Especially beware of highlighting - I'm sure highlighting single data points has legitimate uses, but off the top of my head, I cannot think of a single one. A very good indication that Someone Is Up To No Good.
Bar graphs are properly used to compare quantities - (naturally, such quantities as are compared must be comparable). This makes them particularly useful to present the results of polls, surveys and elections.
That someone isn't lying doesn't mean he isn't wrong - just because you can't catch someone red-handed in manipulating data is no excuse to disengage your other critical thinking processes.
The Entire Series:
How To Lie With Numbers - a short guide to politics and other things - introduction - bar graphs - highlighting.
How To Lie With Numbers 1½ - more bar graphs - a cautionary note
How To Lie With Numbers 2 - Laffer Nonsense From The WSJ - scatterplots - fitting methods - data grouping
An Aside: Giving the AEI the benefit of the doubt for the moment, I can actually think of one kind of fit that would create such a silly curve: If you have very strong theoretical (or, in the case of the AEI, political) reasons to expect that the maximum or minimum of the y variable has some known relationship with the x variable - in other words, if you think you know the general shape of the curve that all data points will lie either above or below, you may draw an envelope curve - a curve that obeys your prior knowledge of the function's shape and envelopes all the data points as closely as your assumed function permits.
Example of envelope curves. harmonic.dat is the simulated movement of a mass suspended from a spring. f and g are theoretical (i.e. not fitted) envelope curves.
There are two general problems with such curves, and three that are specific to this particular curve: The general problems are that such an envelope is highly sensitive to outliers, and that such an envelope will always be provisional (in precisely the same way that the 'oldest fossil of species X' will always be provisional - if an older fossil of species X shows up, the previous oldest fossil will immediately cease being the oldest fossil).
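The sensitivity to outliers is easy to demonstrate. The sketch below assumes a hypothetical shape function (an inverted parabola, chosen purely for illustration) and finds the smallest scale factor that keeps every point on or below the scaled curve. The data are made up.

```python
def tightest_upper_envelope_scale(xs, ys, shape):
    """Smallest c such that c*shape(x) >= y for every point (shape must be positive)."""
    return max(y / shape(x) for x, y in zip(xs, ys))


def shape(x):
    # Hypothetical assumed shape: an inverted parabola that is positive on (0, 100).
    return x * (100 - x)


# Made-up data for illustration only.
xs = [10, 30, 50, 70, 90]
ys = [0.9, 2.0, 2.5, 2.1, 0.8]

c = tightest_upper_envelope_scale(xs, ys, shape)

# Add one extreme point and recompute the envelope.
c_outlier = tightest_upper_envelope_scale(xs + [50], ys + [5.0], shape)

print(c, c_outlier)
```

A single extra point doubles the scale of the envelope here, while an ordinary least-squares fit to the same six points would move far less. That is also why any envelope is provisional: the next data point may move it.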
The specific problems with this particular envelope curve are, first of all, that it isn't one - some values that are clearly not outliers are clearly above the curve, and some values that are clearly not outliers are clearly below the curve. So if the curve is supposed to be an envelope curve, it's the worst example I've ever seen. The second problem is that as an envelope curve, it makes no sense, because it intersects with the x-axis at a point somewhere between 31 % and 32 %! In other words, if this were to be a theoretical maximum revenue curve, it would be impossible to collect revenue from a corporate tax rate above 32 percent! This is clearly nonsense.
The third problem with assuming that it is an envelope curve is that the AEI hack who drew it up tells us that it's supposed to be a Laffer curve - and the Laffer curve isn't supposed to be an envelope curve at all. At least it isn't in voodoo-economics-land - if it were, one could argue that one could increase revenue by moving up on the plot (i.e. by more efficient tax collection) as well as moving to the left (i.e. towards lower tax rates), and that would sort of blow the whole Reaganomics nonsense out of the water, wouldn't it... (Personally I suspect that the only way a Laffer curve makes sense is as an envelope, and even then I think it's overly simplistic.)
Another Aside: I decided, for the sake of the experiment, to take my own shot at curve fitting. To do that, I first read the coordinates of each data point off the original WSJ graph and turn them into a data file. This of course leads to a certain inaccuracy in the values, but as we shall shortly see that will not be a problem in the end.
After that I use GNUPLOT 4.2.2 to fit three different functions to the data:
The red curve is a second-order polynomial, the blue curve is a first-order polynomial, and the green curve is a first-order polynomial with forced intercept in (0,0).
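For readers without gnuplot, the same three kinds of function can be fitted with ordinary unweighted least squares in a few lines. This is only a sketch and not the gnuplot script used for the plots: the function names and the test data are made up for illustration, and gnuplot's own fit routine is iterative rather than closed-form.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope a for y = a*x (intercept forced to zero)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def fit_quadratic(xs, ys):
    """Least-squares (c0, c1, c2) for y = c0 + c1*x + c2*x**2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^0 .. y*x^2
    # Augmented 3x3 system, solved by Gauss-Jordan elimination with partial pivoting.
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Made-up test data lying exactly on y = 1 + 2x - 0.5x^2 (NOT the WSJ data).
xs = [0, 1, 2, 3, 4]
ys = [1.0, 2.5, 3.0, 2.5, 1.0]

print(fit_quadratic(xs, ys))
print(fit_line(xs, ys))
print(fit_through_origin(xs, ys))
```

Note how the straight-line fit to these symmetric points comes out flat: a low-order polynomial can only show you what its functional form allows, which is one reason to try more than one.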
Now, I don't know about you, but none of these babies looks particularly convincing to me, so let's take The Englishman's lead and fit to grouped data. In order to get a decent number of points in each group, I decided to use intervals of ten percentage points to group the data. The result:
Well, that didn't make us a whole lot wiser. It looks like the first-order polynomial with forced intercept is a bad fit, but let's see what happens when we add error bars to our data:
The error bars in this graph were computed by dividing the y value of each point by the square root of the number of data points that went into making it.
Now we see that all three functions are actually plausible (even if my error bar estimate was pessimistic by a factor of two - i.e. the error bars were twice as big as they should be - all three curves would still be within two error bars of all the data points). Back to square one.
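The quick error bar estimate and the "within two error bars" check described above can be sketched as follows. The function names, the grouped values, and the candidate curves are all made up for illustration; they are not the numbers behind the plots.

```python
import math


def quick_error_bars(ys, counts):
    """Rough error bar per grouped point: y divided by sqrt(number of points averaged)."""
    return [y / math.sqrt(n) for y, n in zip(ys, counts)]


def within_k_error_bars(xs, ys, errs, curve, k=2.0):
    """True if the curve passes within k error bars of every grouped point."""
    return all(abs(y - curve(x)) <= k * e for x, y, e in zip(xs, ys, errs))


# Made-up grouped data: interval centres, averaged y values, points per interval.
xs = [15.0, 25.0, 35.0]
ys = [3.0, 3.5, 3.0]
counts = [4, 9, 4]

errs = quick_error_bars(ys, counts)

flat_ok = within_k_error_bars(xs, ys, errs, lambda x: 3.2)
origin_ok = within_k_error_bars(xs, ys, errs, lambda x: 0.1 * x)
steep_ok = within_k_error_bars(xs, ys, errs, lambda x: 0.5 * x)

print(errs, flat_ok, origin_ok, steep_ok)
```

With error bars this wide, both a flat line and a line through the origin pass the check, and only a wildly steep line is ruled out. When very different curves are all plausible, the data simply cannot tell them apart.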
We could, of course, continue to fit increasingly more complicated functions to our data, but unless we have some sort of theoretical underpinning for doing that, it would be largely meaningless.
My personal conclusion is that the best fit to the data in question is the neo-Laffer curve.
[Edited to add:] The above approach to error bars is rather quick and dirty, but it usually gives at least a rough idea about whether you have enough data to say something meaningful. Being less fatigued today (and a bit concerned about the hypocrisy of lambasting other people for sloppiness while engaging in sloppy curve fitting myself) than when I originally published the diary yesterday, I decided to revisit these error bars and do them properly.
The data points in the new plot below are computed by averaging both the x value and the y value (yesterday I averaged only the y value) in the intervals. The error bars are computed by taking the standard deviation of the averaged numbers and dividing by the square root of the number of data points that went into averaging. While I was at it, I also reran the fitting in light of the new error estimates, and the curves you see in the graph are the result of the new fits. (This gives a better estimate of the error bars and the fit.)
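As a sketch of the computation just described, the snippet below produces one grouped point from the raw points in a single interval. I am assuming the sample standard deviation (the n-1 form) here, since the text above does not say which form was used; the helper name grouped_point and the numbers are made up.

```python
import math
import statistics


def grouped_point(xs, ys):
    """Mean x, mean y, and standard error of the mean of y for one interval."""
    n = len(ys)
    # Assumption: sample standard deviation (n-1 in the denominator), divided by sqrt(n).
    sem = statistics.stdev(ys) / math.sqrt(n)
    return statistics.mean(xs), statistics.mean(ys), sem


# Made-up raw points falling in one interval (NOT the WSJ data).
xs = [30, 32, 34, 36]
ys = [1.0, 5.0, 5.0, 5.0]

mean_x, mean_y, err = grouped_point(xs, ys)
print(mean_x, mean_y, err)
```

Unlike the quick estimate, this error bar depends on how much the points in the interval actually scatter, not on how large their average happens to be.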
As you see, the new approach, while it has the advantage of not being sloppy, does not substantially change the conclusions previously drawn.