it still considers that unregulated high growth followed by massive crash is somehow better than a slower, steadier version
Suppose you're doing econometrics and you want to estimate the "average rate of growth". As everyone knows since their school days, one of the first things you do to a data set is remove the outliers. And it so happens that, to an economist, the crisis is an outlier...
You do outlier removal on a series of 5,000 business days, taking away 0.2% of the sample and what do you get? A 50% larger average rate of growth. A man of words and not of deeds is like a garden full of weeds; a man of deeds and not of words is like a garden full of turds — Anonymous
- Jake If you only spend 20 minutes of the rest of your life on economics, go spend them here.
Thinking about it, it occurs to me that you shouldn't do it to an ensemble either... In fact, you shouldn't do it at all.
If you can get your hands on lower-order data, you fit to that. Partly because of the foregoing, and partly because higher-order data is always more noisy. Subtracting two large numbers from each other (which you have to do to obtain the higher-order data) gives a very high relative uncertainty on the result.
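A rough sketch of that last point, using the standard error-propagation rule for independent errors (the uncertainty of a difference is the quadrature sum of the two uncertainties). The two index levels and the ±0.5 measurement error are made-up numbers for illustration, not real GDP data, and the function name is hypothetical.

```python
import math

def relative_uncertainty_of_difference(a, b, sigma_a, sigma_b):
    """Relative uncertainty of (a - b), assuming independent errors on a and b."""
    sigma_diff = math.hypot(sigma_a, sigma_b)  # sqrt(sigma_a**2 + sigma_b**2)
    return sigma_diff / abs(a - b)

# Hypothetical index levels, each known to +/- 0.5 (assumed, for illustration).
level_now, level_before, sigma = 104.0, 102.0, 0.5

rel_level = sigma / level_now
rel_diff = relative_uncertainty_of_difference(level_now, level_before, sigma, sigma)

print(f"relative uncertainty of the level:      {rel_level:.2%}")
print(f"relative uncertainty of the difference: {rel_diff:.2%}")
```

Under these assumed numbers the level is known to well under 1%, while the year-on-year change is only known to roughly a third of its own size.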
A linear fit of what? There is only one variable here, assuming a stationary model, and that is the return of the index. If you remove the outliers to do the fit you get the same effect Taleb and Mandelbrot are illustrating.
What your figure illustrates is what happens when you take the difference in GDP between successive measurements, remove all outliers from that set of differences, and then average. If you instead remove all outliers from the GDP data set itself and then run a linear fit, you remove fewer points and get a less noisy fit. What's not to like?
To illustrate: Suppose you have GDP numbers for twenty years, indexed to year 1 (indexing is merely a matter of units of measurement - it does not affect the behaviour of the data).
Year  GDP  Change
01    100
02    099  -1
03    101   2
04    102   1
05    103   1
06    104   1
07    103  -1
08    102  -1
09    103   1
10    102  -1
11    104   2
12    100  -4
13    102   2
14    102   0
15    104   2
16    105   1
17    107   2
18    108   1
19    109   1
20    109   0
I want to fit the second column to the first column, after removing any outliers (there are none in this case). Your sarcastic suggestion is that economists might want to average the third column, after removing the -4 (because it is "obviously" an outlier).
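To put numbers on the two approaches, here is a small sketch in plain Python. The data are the twenty index values listed above; `ols_slope` is just an illustrative helper name, and dropping the -4 by hand stands in for whatever outlier rule one might actually apply.

```python
from statistics import mean

# GDP index for years 1..20, copied from the example data above.
gdp = [100, 99, 101, 102, 103, 104, 103, 102, 103, 102,
       104, 100, 102, 102, 104, 105, 107, 108, 109, 109]
years = list(range(1, len(gdp) + 1))

def ols_slope(x, y):
    """Ordinary least-squares slope of y against x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Approach 1: fit the levels (second column) against the year (first column).
level_slope = ols_slope(years, gdp)

# Approach 2: average the year-on-year changes (third column).
diffs = [b - a for a, b in zip(gdp, gdp[1:])]
mean_diff = mean(diffs)

# Approach 2 after throwing away the -4 as an "obvious" outlier.
trimmed = [d for d in diffs if d != -4]
mean_diff_trimmed = mean(trimmed)

print(f"slope of fit to levels:        {level_slope:.3f}")
print(f"mean change, all years:        {mean_diff:.3f}")
print(f"mean change, -4 removed:       {mean_diff_trimmed:.3f}")
print(f"ratio trimmed / untrimmed:     {mean_diff_trimmed / mean_diff:.2f}")
```

On this toy series, dropping the single -4 raises the average change from about 0.47 to about 0.72 index points per year, roughly 50% more, while the fit to the levels gives about 0.39.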
To a first approximation you can assume the successive differences are uncorrelated, and then you could try to do a linear fit, which in this case means an average of the third column. And then you remove the outlier because it has a large Mahalanobis distance.
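For a single variable the Mahalanobis distance reduces to the absolute z-score, so the check can be sketched in a few lines. The cutoff of 2.5 is an arbitrary assumption for illustration, not a standard taken from this thread, and the data are again the twenty index values above.

```python
from statistics import mean, stdev

# GDP index for years 1..20, copied from the example data above.
gdp = [100, 99, 101, 102, 103, 104, 103, 102, 103, 102,
       104, 100, 102, 102, 104, 105, 107, 108, 109, 109]
diffs = [b - a for a, b in zip(gdp, gdp[1:])]

mu = mean(diffs)
sigma = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)

# In one dimension the Mahalanobis distance is |x - mu| / sigma.
distances = [abs(d - mu) / sigma for d in diffs]

CUTOFF = 2.5  # assumed threshold, chosen only for this illustration
flagged = [d for d, dist in zip(diffs, distances) if dist > CUTOFF]

print(f"largest distance: {max(distances):.2f}")
print(f"flagged as outliers: {flagged}")
```

With that cutoff only the -4 is flagged; every other year-on-year change sits within about one standard deviation of the mean.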
What you should try to do is filter the original series (taking successive differences is one such filter) until you get something that is presumably stationary, and then fit some sort of ad-hoc model. ARIMA models are ways to reduce the model to some linear regression or other, and you always have the issue of outlier rejection.
But then again, GDP growth isn't uncorrelated either...
"As everyone knows since their school days, one of the first things you do to a data set is remove the outliers."
It is true that outlier detection and rejection is taught in elementary statistics courses.
It is also true that it is an outrageously bad idea to simply reject outliers.
And it is still a bad idea to reject outliers with justification because there is a strong risk of bias and of fitting an explanation to the data just to get rid of inconvenient points.
In fact, it is possible that outlier rejection and detection shouldn't be taught at all in elementary statistics.