The essay starts off in the worst possible way: not only is the title
very ambitious, but the introductory paragraph confirms it.
Taleb wants to tell us no less than the limits of human knowledge.
If statistics
is the core of knowledge, then knowing its limits is either trivial or
profound. It's a very risky way to convince people to read an essay.
Statistical and applied probabilistic knowledge is the core of
knowledge; statistics is what tells you if something is true, false,
or merely anecdotal; it is the "logic of science"; it is the
instrument of risk-taking; it is the applied tools of epistemology;
you can't be a modern intellectual and not think probabilistically
[...] (let's face it: use of probabilistic methods for the estimation
of risks did just blow up the banking system).
The most memorable part of the essay is the turkey metaphor. It's best
to get this out of the way early.
For a thousand days,
a turkey gets fed and all is well, yet on the 1001st,
something awful happens. RIP. See how statistics is wrong?
No amount of extrapolating
from the first thousand days can obtain the 1001st result.
But this is only the first part. In the second part, the graph is
relabeled, switching the turkey into the present economy. Suddenly,
the example is all too real.
The turkey story is offered as a spectacular failure of statistics.
But is it really? To obtain a failure, we first need to formulate
a problem that we're actually going to fail to solve, otherwise it's
just so much griping after the fact. In this case, the turkey story
serves to set up the problem: can the turkey predict the date of
its own demise? The answer is obviously no, and now comes the rhetorical
switcheroo: the turkey is really a magic rabbit masquerading as a
bear market.
By relabelling the graph while keeping its shape intact, Taleb makes
the reader transfer the turkey problem (what the goal of prediction is
and how well the solution works) to a financial time series. But wait,
what is the actual prior statistical problem to be solved here?
There isn't any particular one, it's just a graph.
But the reader doesn't realize it
because he's still thinking about turkeys. In this way, Taleb implies,
after the fact, that statistics was powerless to predict this year's
huge losses.
Yet it's easy to think of many other problems that might have been
posed regarding the same graph. For example, in 2005, could statistics
have predicted the next year's aggregate income rise? A simple
straight line fits reasonably well in all years except the last, so
the answer's clearly yes. What about 2002, predicting income for 2003?
If anybody had predicted a huge loss then, they would have failed
quicker than they could say margin call. Taleb was a trader for 20
years. Was his job all this time really to predict the single
date of the financial crash of 2008? If not, why is he talking about it?
Statistics is a descriptive science. It's a way of stating what a
dataset looks like for a particular purpose.
Sometimes that's easy (like in the turkey problem),
sometimes not (like in the turkey problem). Taleb's pièce de
résistance is The Map: a convenient two by two table containing
all the statistical problems in the world, with the fourth quadrant
containing the supposedly impossible problems. In keeping with the
grandiose claims, he fills the map with all the big problems of
humanity (after all statistics is the "core" of human knowledge). His
Tableau of Payoffs tells us the true place of Medicine, Gambling,
Insurance, Climate problems, Innovation, Epidemics, Terrorism,
etc. Who knew that one could learn so much as a computer trader, eh?
But what is The Map? In the first column, he puts the so-called light
tailed distributions. Intuitively, a distribution is a smooth
theoretical curve (or surface in higher dimensions) which describes
the frequency of occurrence of the values of some quantity (called a
random variable) which can be observed repeatedly. Light tailed
distributions (Mediocristan) fit variables with a limited natural
range. Heavy tailed distributions (Extremistan), which Taleb puts in
the second column, fit variables whose natural range is very wide.
Statistically speaking, a distribution representing a real quantity
can only be identified by looking at empirical data. When the number
of available data points starts to grow, they form clusters in the
light tail case, and also in the heavy tail case. But heavy tailed
data points can also spread out in much more unexpected places. This
is why heavy tailed distributions are traditionally used for modelling
extremes, like storms, floods, bankruptcies, etc. In one dimension,
there are only two unexpected places. Of course, the
problem is much more complicated in higher dimensions. In that case,
all distributions spread out widely because there is so much more
freedom of space.
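Back in one dimension, the contrast is easy to see in a minimal
simulation sketch (the distributions, tail index and sample size are
arbitrary choices of mine, with a Gaussian standing in for Mediocristan
and a Pareto for Extremistan):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Mediocristan: Gaussian draws cluster; the largest observation
    # is only a few times the typical one.
    light = np.abs(rng.normal(size=n))

    # Extremistan: Pareto draws with tail index 1.5; a single draw
    # can dwarf everything seen before it.
    heavy = 1 + rng.pareto(1.5, size=n)

    for name, x in [("gaussian", light), ("pareto", heavy)]:
        print(f"{name:9s} median {np.median(x):6.2f}   max {x.max():10.2f}")

The point is not the exact numbers but the ratio: the heavy tailed
maximum keeps growing as you collect more data, while the light tailed
one barely moves.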
The quadrants in Taleb's Map represent the difficulty of a decision
problem (decision problems maximize an expected payoff based on an assumed
statistical distribution). Taleb claims that the fourth quadrant
is hopeless.
Fourth Quadrant: Complex decisions in Extremistan: Welcome to
the Black Swan
domain. Here is where your limits are. Do not base your decisions on
statistically based claims. Or, alternatively, try to move your
exposure type
to make it third-quadrant style ("clipping tails").
I'm going to take a few paragraphs to explain what he means, but if you're
sharp, you might already wonder what all the fuss is about, when
all you need to escape
the fourth quadrant is clipping the tails...
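In case "clipping the tails" sounds mysterious, here is a toy sketch of
the idea (the cap level and the Pareto tail index are arbitrary choices
of mine): cap a heavy tailed payoff and it becomes a bounded, light
tailed variable, so the usual machinery applies again.

    import numpy as np

    rng = np.random.default_rng(1)

    # A heavy tailed "payoff" (Pareto, tail index 1.2) and the same
    # payoff clipped at an arbitrary cap of 100: the clipped version
    # is bounded, hence light tailed, and its mean is easy to estimate.
    payoff = 1 + rng.pareto(1.2, size=50_000)
    clipped = np.minimum(payoff, 100.0)

    print(f"raw:     mean {payoff.mean():8.2f}   max {payoff.max():10.1f}")
    print(f"clipped: mean {clipped.mean():8.2f}   max {clipped.max():10.1f}")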
The Swiss army knife of statistics is the Central Limit Theorem. It works
with all light tailed observations, and forms the basis for fitting the
parameters of theoretical distributions. Since the CLT applies to nearly
everything of interest in that context, much of statistical methodology is
concerned with
computing means and variances, which are the quantities that specify the
Gaussian limit distribution.
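To see what "works" buys you, here is a minimal check (the exponential
base distribution and the moment diagnostics are my own choices):
standardized sums of a light tailed variable drift toward the Gaussian
signature of zero skewness and zero excess kurtosis.

    import numpy as np

    rng = np.random.default_rng(2)

    # Sum n exponential(1) variables (light tailed, mean 1, variance 1),
    # standardize, and check two signatures of the Gaussian limit:
    # skewness and excess kurtosis both heading to zero.
    def standardized_sums(n, reps=50_000):
        s = rng.exponential(size=(reps, n)).sum(axis=1)
        return (s - n) / np.sqrt(n)

    for n in (1, 10, 100):
        z = standardized_sums(n)
        print(f"n={n:4d}  skewness {np.mean(z**3):+.2f}"
              f"  excess kurtosis {np.mean(z**4) - 3:+.2f}")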
But the CLT fails for heavy tailed distributions. This is the basis
for Taleb's claim. As the datapoints multiply, there always comes an
extreme point which is just large enough to seriously perturb the mean
and variance yet again. No single Gaussian limit appears, and the
usual techniques don't work in the long run. Yet extremes have
meaning. If the random variable is an earthquake magnitude, a high
value can ruin your whole day. The recent stock crash is an extreme
datapoint. People want to know how likely the extremes are, and maybe
even predict them (in advance if they're newbies).
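Here is a rough sketch of that "perturbed yet again" effect (the tail
index 1.1 and the sample size are arbitrary choices of mine; at that
index the variance is infinite):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000

    gauss = rng.normal(loc=1.0, size=n)
    heavy = 1 + rng.pareto(1.1, size=n)   # infinite variance

    for name, x in [("gaussian", gauss), ("pareto 1.1", heavy)]:
        running = np.cumsum(x) / np.arange(1, n + 1)
        late = running[n // 10:]          # ignore the first 10% as warm-up
        print(f"{name:11s} running mean still wanders between "
              f"{late.min():.2f} and {late.max():.2f}")

The Gaussian running mean settles almost immediately; the heavy tailed
one keeps getting kicked around by each new record observation.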
But extremes are rare. If they weren't, then we'd soon exhaust the
range of possibilities in a dataset, and then we'd really have a light
tailed distribution spread wide.
For example, the Dutch are worried about floods breaking their dykes,
which is an entirely different kind of Wall Street worry.
Sooner or later, a flood bigger than any previous one will come, so
it's hard to make the walls high enough by looking at historical
records. Taleb would think that
it's outright impossible (note: making the walls 1km high is not
a realistic option).
How can we fit an arbitrary distribution in the part of space where we have
no datapoints? It's obviously impossible!
Enter Extreme Value Theory, the branch of statistics which specializes
in this kind of problem. To understand what's going on, it might help
to review
the CLT first. If you have a bunch of datapoints and the CLT holds,
then plotting a histogram of frequencies will show a bell shaped
Gaussian distribution. If you've ever tried to do this for yourself,
you've come across the bandwidth problem. Just how wide do you make
the bins? If they're wide enough to contain a lot of observations
each, then you might see a Gaussian. But if the bins are so narrow
that some bins have only one observation, and most bins have none at
all, then the plot shows nothing usable! It seems crazy that
the CLT can tell us what's going on in the regions of space in
between the datapoints, yet plainly, it does. In fact, the
mathematical statement of the CLT does not talk about histograms or
bandwidths at
all. In an analogous way, Extreme Value Theory tells us what the tails
look like, even if we don't have datapoints throughout.
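That bandwidth problem, by the way, is easy to reproduce (the bin
counts below are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=500)

    # Same 500 Gaussian datapoints, two bin widths: a handful of wide
    # bins shows the bell shape in the counts, while thousands of
    # narrow bins are mostly empty and show nothing usable.
    for bins in (10, 5000):
        counts, _ = np.histogram(x, bins=bins)
        print(f"{bins:5d} bins: tallest bin {counts.max():3d}, "
              f"empty bins {(counts == 0).sum()} / {bins}")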
As many of you know, the CLT concerns the behaviour of sums of random
variables. In EVT, the fundamental theorem concerns the behaviour of
the maximum of a collection of random variables. The extremes
we care about are all maxima: order all the observations seen so far
in a row, then the rightmost is the maximum, and the leftmost is the
minimum. Flip the sign of every observation and the minimum becomes a
maximum, so it is enough to study maxima.
The CLT states that the only possible limit for a (suitably
scaled and shifted) sum of random
variables is Gaussian. The Gaussian family has two parameters, the
mean and variance, and statistics concerns the problem of extracting
the maximum information from all the variables so as to estimate the
asymptotic mean and variance.
The fundamental theorem of EVT states that the only possible limit for
a (suitably scaled and shifted) maximum of random variables is one of
three fixed distributions: the Gumbel, the Fréchet or the
negative Weibull. There is in fact a single formula for all three,
the (generalized) Extreme Value Distribution, which contains a single
shape parameter, called the tail index (plus the usual location and
scale).
Moreover, this limiting result holds regardless of whether the
random variables in
question have light or heavy tails, so is more general than the CLT.
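In practice, the block-maxima version of this is short enough to sketch
(the data are simulated, and note that scipy's genextreme parameterizes
the shape as c = -xi, so a heavy, Fréchet-type tail shows up as a
negative fitted c):

    import numpy as np
    from scipy.stats import genextreme

    rng = np.random.default_rng(5)

    # Simulated heavy tailed "daily" data: Pareto with tail index 2,
    # so the true EVT tail index is xi = 1/2.  Group into 100 blocks
    # of 250 days and keep only each block's maximum.
    daily = 1 + rng.pareto(2.0, size=(100, 250))
    block_max = daily.max(axis=1)

    # Fit the Extreme Value Distribution to the block maxima.
    c, loc, scale = genextreme.fit(block_max)
    print(f"fitted shape c = {c:+.2f}  (i.e. xi = {-c:.2f}, true 0.50)")

    # Level exceeded by a block maximum roughly once per 1000 blocks.
    print(f"1000-block return level: {genextreme.ppf(0.999, c, loc, scale):.0f}")

The fit won't hit the true tail index on the nose with only 100 block
maxima, but it lands in the right neighbourhood, which is exactly the
kind of statement the Dutch dyke engineers need.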
So what of Taleb's claims and mathematical appendix? In short, the
fourth quadrant is not as impossible as he leads us to believe. That's
not to say it's ever easy or routine. All worthwhile math problems are
hard, otherwise anybody could solve them for breakfast.
What bothers Taleb is the "robustness" of statistical methods near the
tails. The theory of robustness is another big field of statistics, which
is concerned with what happens to estimates when the datapoints are
shifted a little bit or a lot. In other words, it's about quantifying
the quality of the fit.
Here's a typical bogus argument, though:
For instance, if you move alpha from 2.3 to 2 in the publishing
business, the
sales of books in excess of 1 million copies triple!
What exactly does that mean? He's comparing two quantities, yet doesn't
tell us anything about their units. Alpha is merely a "parameter", yet we
are supposed to believe that a (presumably insignificant) difference of 0.3
causes a serious misestimation! And how do we
know it's a serious
misestimation? Oh, it's the book publishing business, so anything in
excess of 1 million is obviously a big deal.
Since Taleb is a former trader, one might have expected that his best
example would come from a banking-related business, although with the
kind of numbers
being talked about in banking nowadays,
1-3 million seems positively puny and
hardly worth bothering with. Much better to pick an ominous sounding
example, and claim this as a proof of unpredictability throughout
the Fourth Quadrant, no? To be sure, we also get a graph of alpha values
in some aggregated dataset from "40 thousand economic variables".
Are they all comparable? Are the estimation methods comparable?
What's their interpretation? Should we transfer the book publishing
insights to all of those economic quantities as-is? Do you smell
another turkey sandwich?
I should say at this point that there is in fact an underlying set of
mathematical ideas that concern this kind of issue. Physicists, engineers,
and statisticians know it well under various names such as well-posedness,
condition numbers, and robustness. In all cases however, the
mathematics must be tempered with the problem interpretation and units
used, especially at the infinite end of the real numbers, where log scales
play a major role. For instance, an abstract version of the book publishing
example simply states that
a small change of 0.3 in the parameter leads to a small shift in the
range of the observations in log scale, since log(3) ≈ 1.1.
That doesn't sound nearly so bad, does it?
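To put numbers on that, here is the same point spelled out under an
assumed Pareto tail for book sales, P(X > x) = (x / x_m)^(-alpha). The
minimum scale x_m is entirely hypothetical, chosen so that the shift
from 2.3 to 2 does roughly triple the count above one million:

    import numpy as np

    x_m = 25_000.0      # hypothetical minimum scale of the tail
    x = 1_000_000.0     # the "1 million copies" threshold

    for alpha in (2.3, 2.0):
        p = (x / x_m) ** (-alpha)
        print(f"alpha = {alpha}: P(sales > 1M) = {p:.2e}"
              f"   (log10 = {np.log10(p):+.2f})")

    ratio = (x / x_m) ** (2.3 - 2.0)
    print(f"ratio of the two tails: {ratio:.1f}, "
          f"a shift of log({ratio:.1f}) = {np.log(ratio):.2f} in log scale")

Dramatic as "triple the blockbusters" sounds, on the log scale where
these tail quantities naturally live it is a shift of about one unit.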
In the case of EVT, the simple functional form of the asymptotic
distribution is valuable for estimation. For example,
at the tail end, it can be shown that the last few datapoints
(order statistics) can, asymptotically, be transformed into a sample
from a Poisson point process.
This is useful for assessing the quality of the fit at the tail end,
and therefore plays a prominent role in questions of robustness.
Just as you would look for a bell shaped distribution when expecting
Gaussians, you might look at the spacings near the tail for confirmation
that you're on the right track.
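Here is a sketch of what such a check could look like (simulated Pareto
data, so the exact tail function is known; with real data you would
plug in the fitted tail instead):

    import numpy as np

    rng = np.random.default_rng(6)
    n, k = 100_000, 50

    # Classical Pareto sample with tail index 1.5: P(X > x) = x**(-1.5).
    x = (1 - rng.random(n)) ** (-1 / 1.5)

    # Map each of the k largest points through t -> n * P(X > t).  If
    # the tail is right, the mapped points behave like the first k
    # arrivals of a unit-rate Poisson process, so the spacings between
    # them should look like independent Exp(1) draws.
    top = np.sort(x)[-k:]
    arrivals = np.sort(n * top ** (-1.5))
    spacings = np.diff(np.concatenate(([0.0], arrivals)))

    print(f"mean spacing {spacings.mean():.2f} (Exp(1) would give 1.00)")
    print(f"std  spacing {spacings.std():.2f} (Exp(1) would give 1.00)")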
In fact, all the
usual statistical methods have some sort of counterpart in EVT,
such as maximum
likelihood fitting, etc.
You'll find textbooks on extreme value theory
in the usual places.
Taleb also often
has the wrong viewpoint on other things:
This absence of "typical" event in Extremistan is what makes
prediction markets ludicrous, as they make events look binary. "A war"
is meaningless: you need to estimate its damage, and no damage is
typical. Many predicted that the First War would occur, but nobody
predicted its magnitude. One of the reasons economics does not work is
that the literature is almost completely blind to the point.
The idea of the "typical" event is really a remnant of elementary
statistics.
In one dimensional statistics, the
location of the peak of a distribution has the highest likelihood of
occurrence, which is great for
Anschaulichkeit, i.e. the sense that we can actually see what's going
on. So this peak can be thought of as typical, which is significant
because it cuts out the complexity.
What's a typical point on the surface of a sphere, though? There
isn't one, they're all the same!
In higher dimensions, and most big problems are high dimensional, a
mode doesn't matter nearly so much. The observations are most likely
not near the mode. There is no single "typical" observation that's
easy to locate from looking at the distribution. That's true
regardless of the shape of the tails, so don't believe Taleb when
he says it's all about Extremistan.
For example, take the simplest light tailed distribution (just to make
life hard in the first column, where things are
supposed to be easy): the uniform
on the interval [0,1]. Every observational value is equally likely
throughout the interval. Now do the same in two dimensions (uniform on
the unit square), three, etc. You might think that in 12 dimensions,
the observations are spread out evenly in the corresponding hypercube,
but you'd be wrong. Once the CLT starts to work, the observations all
lie geometrically on a thin spherical shell with spikes, like a
hedgehog. Worried yet? Mutatis mutandis with other distributions.
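A quick simulation shows the shell forming (the dimensions and sample
size are my picks): the distance from the centre of the cube piles up
tightly around sqrt(d/12) as the dimension d grows.

    import numpy as np

    rng = np.random.default_rng(7)
    reps = 100_000

    # Uniform points in the unit hypercube [0,1]^d: their distance from
    # the centre concentrates in a thin shell around sqrt(d/12), instead
    # of filling the cube evenly as low dimensional intuition suggests.
    for d in (1, 2, 12, 100):
        pts = rng.random((reps, d))
        dist = np.linalg.norm(pts - 0.5, axis=1)
        print(f"d = {d:3d}: distance {dist.mean():.3f} +/- {dist.std():.3f}"
              f"   (sqrt(d/12) = {np.sqrt(d / 12):.3f})")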
I don't recommend that you read Taleb's mathematical appendix. It's
written in an elliptical lecture-notes style that's difficult to follow,
and since I haven't touched on some of his other ideas, such as his beliefs
about asymptotics, it's difficult to summarize. Presumably he's expounded
those ideas in more detail somewhere else, so I'll leave the review of
it to the relevant experts.
The essay ends with some free advice for quants faced with the fourth
quadrant. I find this somewhat ironic, given that a few paragraphs earlier, in
the section marked Beware the Charlatan, he writes
So all I am saying is, "What is it that we don't know", and my advice
is what to avoid, no more.
At the risk of repeating myself, I don't have an issue with
claims of difficulty or incompetence in the way that statistics is
used in finance. The same statement could be made in lots of other
fields. I do have an issue with Taleb's arguments though.
p.s. This rant was commissioned by Migeru. For those whose eyes haven't glazed over yet,
I added a paragraph or two on the issues raised
in the linked thread, so I won't be making a direct reply over there.