Shannon established the perspective that within a given message there is a fixed amount of (unknown) information. Extracting it requires not only finding it but recognizing it once found. Since no recovery method will be perfect, and since the noise will always corrupt the message in some fashion, there also needs to be a way to determine how close the recovered information is to the source signal and how much residual distortion remains.
In conventional signal processing, such as that used for telephony or storing music or video, there are physiological and psychological criteria that have been developed experimentally which determine acceptability. Many modern compression schemes deliberately throw away data, but the message is still found satisfactory due to the relatively low requirements of people when receiving these sorts of messages. For example, audio compression is based around the limits of human hearing; information which is determined to be aesthetically unimportant is discarded to save space.
In the case of scientific observational data, there are well-established rules that define acceptability when the "message" is extracted. A common example is curve fitting, where goodness of fit is determined by least squares or a similar test. At the same time, measures of confidence such as the variance are generated. The use to which the message will be put determines whether a given level of confidence is acceptable or not. In controlled experiments, differences between control and experimental data are the message, and depending on confidence in the strength of the signal (as measured by a variety of standard statistical tests), the hypothesis may or may not be confirmed at the acceptable level of confidence.
There are many methods used to extract the message. I'll just mention a few.
- Filtering. If the noise is known to be in one area while the signal is in another, filtering can be used to separate them. There are three popular types of filters: low pass, high pass, and band pass. Low pass filters are used in telephony to eliminate high-frequency hiss above the speech range. High pass filters are used to eliminate low-frequency hum from poorly shielded public address systems. Band pass filters are used to capture a single radio station along the dial. Digital signal processing has made the implementation of far more complex filters feasible; filters can also adapt to a changing signal in real time. Such systems are used to eliminate feedback at live concerts.
- Signal averaging. This is useful when the message is repeated. By gathering multiple copies of the message and averaging them, the signal/noise ratio is increased: the repeated signal reinforces itself while independent noise tends to cancel. Astronomical observation uses this, as does radar processing.
- Subtracting out noise. If the characteristics of noise can be fairly well determined the noise can be subtracted from the message. Modern digital cameras take a blank picture (which should only contain noise from the camera's sensors) and subtract it from the desired image.
- Predictive reconstruction. Many messages tend to vary slowly, so a lost part of the signal can be reconstructed from adjacent information, whether adjacent in space, time, frequency, or another dimension. Digital TV sets have frame buffers which compare one frame to the next. If there is a loss of signal, prior frames are used to estimate the missing information. Since the criteria for acceptability of a moving image are low, this works well as long as the interruptions are relatively short. Humans use this when processing speech. Much speech is highly redundant, and missing a word or two can usually be compensated for by comparing the message to expectations of what the words should have been.
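Of the four methods, signal averaging is the easiest to demonstrate numerically. The following is only a minimal sketch under stated assumptions: the "message" is a made-up sine wave, the noise is independent Gaussian noise of the same strength in every copy, and the function names are invented for illustration.

```python
import math
import random

def noisy_copy(signal, noise_sd, rng):
    """Return one received copy: the true signal plus independent Gaussian noise."""
    return [s + rng.gauss(0.0, noise_sd) for s in signal]

def average_copies(copies):
    """Average several received copies sample by sample."""
    n = len(copies)
    return [sum(column) / n for column in zip(*copies)]

def rms_error(estimate, signal):
    """Root-mean-square difference between an estimate and the true signal."""
    squared = [(e - s) ** 2 for e, s in zip(estimate, signal)]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical repeated message: a slow sine wave, 500 samples long.
signal = [math.sin(2 * math.pi * k / 50) for k in range(500)]

single = noisy_copy(signal, 1.0, rng)
averaged = average_copies([noisy_copy(signal, 1.0, rng) for _ in range(100)])

print(rms_error(single, signal))    # close to the noise level, about 1
print(rms_error(averaged, signal))  # roughly ten times smaller
```

With independent noise the residual error shrinks roughly as the square root of the number of copies, so 100 copies buy about a tenfold improvement. If the noise were shared between copies, averaging would not remove it.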
There is a fundamental difference between the first method and the others. In the first case, the technique is to remove information from the data set. It is hoped that more irrelevant information (noise) is removed than message information. For example, when filtering a particular radio station out of the electromagnetic spectrum, the result has less information than before, but filtering has made the signal more observable.
In the other cases, external information is fed into the system to make the data set larger. It is hoped that the extra information is relevant to the message. However, because of this addition, it is important to account for potential bias in the result.
Examining statistical methods through the lens of information theory
Let's look at some common statistical methods, especially as treated by econometrics. I can't cover all the techniques that have been developed, but once I have shown the pattern it should be possible to make the proper analogies between the two disciplines.
All techniques which subtract selected data are type 1, employing filtering to enhance the signal/noise ratio. A common approach is to use, say, a lagged, three-month average when looking at employment data or the like. This is a form of low pass filtering: short-term fluctuations are filtered out, leaving the desired long wave signal, the longer trend.
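A small sketch of the three-month average acting as a filter. The data are invented: a slow linear trend plus a fluctuation that repeats every three months, chosen so that a three-month window cancels it exactly. Real fluctuations would only be attenuated, not removed.

```python
import math

def trailing_average(series, window):
    """Mean of each value and the (window - 1) values before it."""
    out = []
    for i in range(window - 1, len(series)):
        chunk = series[i - window + 1 : i + 1]
        out.append(sum(chunk) / window)
    return out

# Hypothetical monthly series, 36 months long.
trend = [100 + 0.5 * m for m in range(36)]                       # slow signal
wiggle = [5 * math.sin(2 * math.pi * m / 3) for m in range(36)]  # fast fluctuation
observed = [t + w for t, w in zip(trend, wiggle)]

smoothed = trailing_average(observed, 3)

# The fast component is gone; what remains is the trend, delayed by one month.
print(smoothed[:3])
print(trend[1:4])
```

The one-month delay is the price of the filter: a trailing average reports the trend as it stood at the middle of the window, not at its end.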
Making seasonal adjustments to data is also a form of filtering, in this case closer to a notch (band-stop) filter that removes one narrow range of frequencies. The repetitive shifts throughout the year are like hum: they are periodic (or quasi-periodic) enough to be filtered out.
Sliding windows and sub-sampling are type 2, signal averaging. A new data set is collected which is similar to the prior one in the important respects and various averaging techniques are used to remove the differences leaving the desired information. Whether old samples are dropped off the end or the sample size is increased should depend upon some knowledge of how much the message is changing. Out-of-sample forecasting is also a type of signal averaging. The new samples can be considered a new instance of the message.
Techniques using dummy variables are type 3, analogous to subtracting out noise. There is "information" present which is known to be irrelevant and the characteristics are also reasonably well understood so that it can be well described. The dummy variables can be used to "subtract" this from the calculations.
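To make the "subtraction" concrete, here is a sketch on invented, noise-free data. It assumes a single 0/1 dummy and relies on a standard regression result (the Frisch-Waugh-Lovell theorem): fitting a slope alongside an intercept and a dummy gives the same slope as first subtracting each group's mean and then fitting. All variable names and numbers are hypothetical.

```python
def center(values):
    """Subtract the overall mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def group_demean(values, dummy):
    """Subtract from each value the mean of its own group (dummy = 0 or 1)."""
    means = {}
    for g in (0, 1):
        members = [v for v, d in zip(values, dummy) if d == g]
        means[g] = sum(members) / len(members)
    return [v - means[d] for v, d in zip(values, dummy)]

def slope(x, y):
    """Least-squares slope of y on x, for series that are already centered."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Made-up data: sales fall by 2 per unit of price, and a holiday month
# (dummy = 1) adds a known jump of +30 that is irrelevant to the price effect.
price = [10, 11, 12, 13, 14, 15, 16, 17]
holiday = [0, 0, 1, 0, 0, 1, 0, 1]
sales = [200 - 2 * p + 30 * h for p, h in zip(price, holiday)]

naive = slope(center(price), center(sales))
adjusted = slope(group_demean(price, holiday), group_demean(sales, holiday))

print(naive)     # about +0.5: the holiday jump masquerades as a price effect
print(adjusted)  # about -2: the true price effect once the jump is subtracted
```

The naive slope even has the wrong sign here because holidays happen to fall in higher-priced months. The dummy only helps to the extent that the irrelevant effect really is as simple as a fixed jump.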
Bayesian techniques are type 4, a variety of predictive reconstruction. The idea is that there is some information external to the data set which is known independently about the environment. Adding this in improves the signal/noise ratio. An event that occurs in the data may be strongly correlated with an event that is not; for example, a data set tracking the relation between river flooding and rainfall may not include riverbed construction events, but the correlation between the two makes it appropriate to introduce this additional information. Adding in the 'missing' information adds only a small amount of noise (the inverse of the correlation), and may improve the resulting signal. Similarly, using an independently derived model, whether hypothetical or based upon previous cases, adds information to the system. If the extracted message conforms closely to the model then it increases the probability that it is the "right" message. However, because the difference between the predictive model and reality can be difficult to estimate, expectations may distort the results.
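The simplest textbook case shows both the benefit and the risk. This sketch assumes a normal prior and a normal estimate from the data, each summarized by a mean and a variance; the function name and the numbers are invented for illustration.

```python
def combine(prior_mean, prior_var, data_mean, data_var):
    """Precision-weighted combination of a prior estimate and a data estimate.

    Standard result for a normal prior and normal data with known variances.
    """
    w_prior = 1.0 / prior_var
    w_data = 1.0 / data_var
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, post_var

# External information and data equally trusted: the result sits halfway
# between them and is more certain than either alone.
print(combine(10.0, 4.0, 14.0, 4.0))   # (12.0, 2.0)

# An overconfident prior (small variance) drags the answer toward the
# expectation, almost regardless of what the data say.
print(combine(10.0, 0.25, 14.0, 4.0))  # mean about 10.2
```

The second call is the distortion described above: the variance shrinks either way, so the result looks more certain even when the prior is wrong.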
As with signal processing, there is a limit to how much information can be extracted from any "message": the entropy of the system. Another analogy will be useful. Many people have seen the crime shows on TV where the detective turns to the technician and says about an image on the screen, "Can you enhance that?" Poof! The evildoer is revealed. In the real world there is a simple law of optics, analogous in form to Shannon's measure of information, which determines the resolution of a given image. Attempting to enhance images beyond this limit (the optical counterpart of the Nyquist limit in sampling) produces no additional information. A technique such as blowing the image up larger or sharpening it may make it easier to view, but it doesn't add any new information. In this case the absolute limit is determined by the diameter of the lens (assuming it has no other aberrations) and the wavelength of light. To get more detail you need to change one or the other. That's why electron microscopes don't use light to resolve fine detail, instead using electrons (which have a much smaller wavelength).
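The dependence on lens diameter and wavelength can be illustrated with the Rayleigh criterion, a standard approximation for an ideal circular aperture. The text above does not name a specific formula, so treat this as one common way of stating the limit; the numbers are arbitrary examples.

```python
def rayleigh_limit_radians(wavelength_m, aperture_m):
    """Smallest resolvable angle for an ideal circular aperture (Rayleigh criterion)."""
    return 1.22 * wavelength_m / aperture_m

green_light = 550e-9  # metres, a typical visible wavelength

small_lens = rayleigh_limit_radians(green_light, 0.05)  # 5 cm aperture
large_lens = rayleigh_limit_radians(green_light, 0.10)  # 10 cm aperture

print(small_lens / large_lens)  # doubling the diameter halves the limit
```

No amount of processing after the fact changes either input to this function, which is the point of the "enhance" example.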
The same thing is true with finite data sets. There is only so much information available within the data set. It is tempting to try to find patterns that conform to desired expectations, but this is done at the expense of certainty. A common case of this is when a small number of current data points are combined with model data to extrapolate future results, as is routinely done in, e.g., population predictions. Both of these methods (introducing a model and extrapolating) introduce uncertainty and the possibility of bias.
However, in many cases the claims made are not subjected to a rigorous enough error analysis, and low-confidence results are presented as truth. Is it adequate to say that the results may be so and so with only a 75% confidence level? This is an important consideration, and matters when defining policy, but is outside the scope of information theory.
One of the issues about a data set is whether it is time-based or not. While there is often a distinction made between them, time-based and non-time-based data sets are equally amenable to information analysis. From an information point of view, time is just another dimension. For example, one could be trying to make a prediction on the effect of change in consumption of some commodity over time compared with some behavioral characteristic, say chocolate consumption vs weight gain. Using time-based analysis one could use a sliding window to create a series of data sets which incorporate out-of-sample material. This analysis is conceptually no different than gathering a data set from a specific geographic region, and then considering the out-of-sample data as coming from a different region.
Curve fitting is a popular technique in the social sciences, for both interpolation and extrapolation of data. It can also be used as a form of band-pass filtering. Any finite signal can be decomposed into a sum of orthogonal functions. The most common sets are the polynomials and the trigonometric functions (sine or cosine). For example, a popular method of finding out the acoustical properties of a performance space is to capture a sharp sound, as from a gunshot, and then decompose it into a harmonic series using a Fast Fourier Transform (FFT). This produces a frequency response curve and a reverberation curve for the space depending on the method used.
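A sketch of the decomposition itself. For brevity it uses a direct (slow) discrete Fourier transform from the standard library; an FFT computes the same quantities faster. The test signal is invented, and the function name is only illustrative.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency component, by direct (slow) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total) / n)
    return mags

n = 64
# Hypothetical signal: two sine waves, at 4 and 10 cycles per record.
signal = [math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
mags = dft_magnitudes(signal)
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(strongest)  # [4, 10]

# A single sharp spike contains every frequency in equal measure, which is
# why a gunshot-like sound is a convenient probe of a room.
impulse = [1.0] + [0.0] * (n - 1)
flat = dft_magnitudes(impulse)
print(min(flat), max(flat))  # both 1/64
```

Keeping only the components in a chosen range of `k` and discarding the rest is the band-pass use of curve fitting mentioned above.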
This technique has been extended to the spatial domain, allowing for analysis of complex optical paths in lenses. Instead of the painstaking prior methods of establishing image quality, which required imaging a test chart of increasingly fine lines, a single slanted knife edge is used and the resulting image is scanned digitally. The knife edge plays the same role as the acoustic spike: the gunshot approximates an impulse and the knife edge a step, and both excite a wide range of frequencies at once. The resulting image is decomposed using FFT into a series of spatial frequencies which translate into resolving power and contrast at each frequency. I'm not aware of an econometric equivalent of impulse or step signal testing.
Recognizing the signal (or lack thereof)
Another issue is that the desired message may not be in the data set, or may not be complete. This is not the same as a simple data loss or a sampling error, but is like looking for the needle in the wrong haystack. To go back to my chocolate example, perhaps chocolate is a factor, but it is peanut butter which most influences weight. We have measured the wrong thing, and while we may get a correlation it may not be the most important. Recognizing the message is as vital as finding it.
I think this problem is much more common in the social sciences than is appreciated. With so many factors present in the real world, one has to make assumptions about what to measure or even what one can measure. This involves a bit of assuming the answer, and Bayesian statistics won't help if the possible choices are all bad ones.
Drug testing suffers from the needle in the haystack problem. Many new drugs only affect a small percentage of people with a given condition. In order to observe the effect the sample size must be large enough. A related measure is the number needed to treat (NNT), the number of patients who must take the drug for one of them to benefit. Suppose a given statin helps prevent heart attacks in three out of 100 people who take it (an NNT of about 33); then there is a real chance that a sample of only 100 will show no clear positive effect. The inverse case is even worse. Suppose, at the same time, the drug adversely affects one in 1000. The positive effect will be found in a sample of 1000, with a good degree of confidence, but the side effects may not be. An incorrect choice of sample size can introduce difficult-to-detect bias, for example when the sample is too small to be likely to exhibit rare but serious undesired effects. However, if the 'cost' of these rare events is sufficiently high, the results of the experiment may not lead to an acceptable real-world policy. This has led to serious problems such as with the recall of the drug Vioxx due to rare but potentially lethal side effects. In social sciences limited to observational data, the experimenter does not get to select the population size, and so this effect may occur involuntarily; the analyst may not even consider it a source of bias. The SETI project, which is looking for signs of extra-terrestrial life, uses advanced signal processing techniques but suffers from this problem.
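The side-effect case is simple arithmetic. This sketch assumes each person is affected independently at the same rate, which is a simplification; the function name is invented.

```python
def prob_no_events(rate, sample_size):
    """Chance that an effect with the given per-person rate never shows up in the sample."""
    return (1.0 - rate) ** sample_size

# A side effect that strikes one person in 1000:
print(prob_no_events(0.001, 1000))  # about 0.37
print(prob_no_events(0.001, 3000))  # about 0.05
```

So a trial of 1000 people has better than a one-in-three chance of never seeing the side effect at all, and roughly three times as many subjects are needed before a clean result is reasonably convincing.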
The problem of extracting a coherent "message" from social science data is not fundamentally different from the same task in other disciplines. As with many fields of study, insularity and the development of a unique terminology have made knowledge transference less efficient than it might be.
These techniques can be used to reveal the possibility of hidden bias in findings, in the form of signal added by the analyst. This hidden bias may be an inadvertent product of analysis, or may have been introduced consciously or unconsciously as a result of a specific agenda. Analysts in some sciences, such as physics, are held to a high standard of impartiality because of the paradoxical situation where those closest to the data are the most likely to possess an unconscious bias toward the outcome. This makes it very difficult for data analysts to see or believe that they may be introducing bias. To guard against this, practices such as clarity of methods, sharing of data, and independent verification of results are crucial to producing truly scientific results.
Identifying and eliminating bias in the social sciences is crucial. Because the results of analysis in the social sciences, especially economics and sociology, are used to understand civil and fiscal issues, and construct public policy, errors or bias in social science findings can affect the lives and fortunes of millions of people. However, the difficulty of identifying bias leads to abuses, as those with a predefined agenda misuse the techniques to obtain results which support their position. The most common misuses seem to involve biased selection of data, either in the type selected or in the range used for analysis. This is a type of filtering, and can completely distort or eliminate the original signal; but because of the nature of the data tampering, it is invisible unless observers have access to the original, unfiltered data set.
Some professions are diligent in monitoring this type of scientific misconduct, but this diligence is less visible, and perhaps even more important, in the social sciences. Because of the difficulty of running independent, confirmatory experiments, social science results must be scrutinized even more carefully than those in sciences more amenable to experimental exploration.
There have also been misuses of methodology, but these usually center on the choices one makes for various parameters. The fact that some ideologues claim more certainty than is warranted doesn't help either, and by association blackens the reputation of those who are more scrupulous. Maybe the fact that many of the ideologues are employed by organizations which profess the same viewpoints makes attempts at censure difficult to enforce, but this only highlights the importance of accurate, unbiased results in the social sciences.
Thus we have seen that econometrics and statistical analysis as used in the social sciences are really equivalent to other information processing techniques. Advanced modeling is simply a way to add in external information believed to be relevant, and the type and complexity of the model do not determine how much this additional signal improves the reliability of the analysis.