There are lies, damned lies and statistics and, if you're not careful, those who lie with statistics are likely to get away with it.
Lies, Damned Lies and Statistics
Fortunately, there are ways to ferret out falsehoods, even
those hidden in statistics. It isn't easy to generate data that
can fool someone who knows what to look for. Patterns of data
that are typical of fudging tell you that suspicion is in order.
If data have been 'massaged'(1)
to present a false picture, the data will bear telltale signs
of the deviation.

I had a college professor who was developing a technique for
determining whether data had been artificially generated. To
test his theory, he had half his students flip a coin 500 times;
the other half generated a list of 'Heads' and 'Tails' to imitate
a coin being flipped. These were graduate students in statistics,
so they had a pretty good idea of how to imitate such a sequence.
He was able to quickly identify which was which with 100%
accuracy by simply looking at the longest run (all heads or all
tails) reported. In 500 genuinely random flips, you will
typically get a run of 8 or more heads or tails in a row at some
point. Yet, in generating false data, no one thought to put in
such a lengthy run. It was a telltale sign of the falsity of
the data.

This example illustrates the first rule of spotting false
data: to recognize when data has been tampered with, you must
know what the genuine data should look like.

This also means that you must have access to the
raw data, not just summaries of it. Without the raw data, you
must depend on other people's statistical machinations
rather than your own. If you have any suspicions about the accuracy
of a statistical report, don't accept summaries - demand access
to the raw data.

Of course, you must keep in mind that tampering is not the
only reason for deviation from the expected. There may be other
possible explanations. But you can determine whether the data
shows any signs typical of tampering. Then you must
determine what the cause of the deviation actually was.

Finally, the person giving you false data may not be deliberately
trying to mislead you. They themselves may have no knowledge of
the alterations. Unless they collected the data themselves (rare),
it may well have been altered before they received it.

As British economist Sir Josiah Stamp put it, "The government
[is] very keen on amassing statistics...But you must never forget
that every one of these figures comes in the first instance from
the village watchman, who just puts down what he damn well pleases."
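
The coin-flip test described earlier is easy to try for yourself, and doing so shows what "knowing what the genuine data should look like" means in practice. The following is only a sketch, not the professor's actual technique: it simulates fair coin flips with Python's standard random module and reports the longest run in each sequence. The function names, the number of trials and the seed are illustrative choices, not part of the original story.

```python
import random


def longest_run(flips):
    """Return the length of the longest run of identical consecutive items."""
    longest = current = 0
    previous = None
    for flip in flips:
        # Extend the current run if the flip repeats, otherwise start a new run.
        current = current + 1 if flip == previous else 1
        longest = max(longest, current)
        previous = flip
    return longest


def simulate_longest_runs(n_flips=500, n_trials=2000, seed=1):
    """Longest run in each of n_trials simulated sequences of fair coin flips."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    return [
        longest_run(rng.choice("HT") for _ in range(n_flips))
        for _ in range(n_trials)
    ]


if __name__ == "__main__":
    runs = simulate_longest_runs()
    share = sum(r >= 7 for r in runs) / len(runs)
    print(f"share of simulated sequences with a run of 7 or more: {share:.0%}")
    # A made-up, hand-written sequence that alternates too often has short runs.
    print("longest run in a hand-written sample:", longest_run("HTHHTHTTHTHHTTHT"))
```

Comparing the longest run in a suspect sequence against the spread of values from the simulation gives a rough sense of whether the sequence looks like real coin flips.
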
People alter data for many different reasons. Fortunately, regardless
of the reason for the deception, the results are similar.

The first step in detecting false data is to draw its picture.
The best picture to start with is called a histogram. (Many computer
programs, including most spreadsheets, will easily produce a
histogram.) A histogram partitions the dataset into ranges (e.g.
1-10, 11-20, 21-30, 31-40, ...) and then counts how many individual
data points fall into each range. Figure 1 shows a histogram
of a normal distribution (the popular bell-shaped curve). The
data clusters around the center and decreases towards both ends.
Note that the histogram is not perfectly smooth - real-world
data never is - but the overall trend, the distribution, is clear.
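
If no spreadsheet is at hand, the counting behind a histogram can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data are whole numbers, the ranges are 10 wide and start at 1 (as in the 1-10, 11-20 example above), and the sample is randomly generated bell-shaped data rather than anything from this article.

```python
import random
from collections import Counter


def histogram(values, bin_width=10, start=1):
    """Count how many whole-number values fall into each range of bin_width."""
    counts = Counter((int(v) - start) // bin_width for v in values)
    if not counts:
        return {}
    # Include empty ranges between the lowest and highest so gaps stay visible.
    return {
        (start + b * bin_width, start + (b + 1) * bin_width - 1): counts[b]
        for b in range(min(counts), max(counts) + 1)
    }


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for repeatability only
    # Made-up sample: 500 roughly normal values centered on 50.
    sample = [round(rng.gauss(50, 15)) for _ in range(500)]
    for (low, high), count in histogram(sample).items():
        # One '#' per two data points keeps the bars short enough to read.
        print(f"{low:>4}-{high:<4} {'#' * (count // 2)}")
```

Printed sideways like this, the bars should bunch up in the middle ranges and thin out toward both ends, with some raggedness from one range to the next.
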
With a histogram, you can immediately identify several features
of your data.

The easiest type of alteration to spot is that of a truncated
dataset - one in which offending data points beyond a threshold
value have been removed.

Another way people sometimes deal with unwelcome values is
not to eliminate them, but rather to change them so that
they fall within the threshold value. Dr. W. Edwards Deming,
perhaps the most influential statistician of the twentieth century,
told of consulting with a manufacturer. The company had carefully
collected data on the percentage of defective product produced.
But the data, when charted on a control chart (figure 3)(2),
did not vary as expected. With an average defect rate of nearly 9%,
roughly one third of the data points plotted on the control chart
should fall outside the green lines, and one point in twenty should
fall outside the blue lines. But in fact, none of the charted data
fall very far from the average. This data is false, but to know
it, you first need to know what the honest data would have looked
like. (Control charts are easily developed and interpreted with
only a few hours' training.)

After much effort, and insisting that the data could not naturally
fall into such a pattern, Dr. Deming finally uncovered the culprit.
The woman charged with doing the final inspection labored under
the belief that the factory would be closed if the percent defective
ever went above 10%. She, and all her co-workers, would lose
their jobs. Needless to say, she never found more than 10% defective.
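
The kind of check described here can be sketched in code. What follows is an illustration, not Dr. Deming's procedure, and it rests on several assumptions: that the green and blue lines sit at one and two standard deviations from the average (which is what "one third" and "one in twenty" suggest), that each point is the fraction defective in a lot of a fixed size (200 is a made-up figure), and that the inspector's behavior can be crudely modeled as never reporting more than 10%.

```python
import math
import random


def share_outside(fractions, sample_size, k):
    """Share of points more than k binomial standard deviations from the mean."""
    p_bar = sum(fractions) / len(fractions)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    return sum(abs(f - p_bar) > k * sigma for f in fractions) / len(fractions)


def simulate_lots(n_lots, sample_size, defect_rate, cap=None, seed=0):
    """Simulated fraction defective per lot; cap mimics a reporting ceiling."""
    rng = random.Random(seed)  # same seed gives the same underlying lots
    fractions = []
    for _ in range(n_lots):
        defects = sum(rng.random() < defect_rate for _ in range(sample_size))
        fraction = defects / sample_size
        fractions.append(fraction if cap is None else min(fraction, cap))
    return fractions


if __name__ == "__main__":
    lot_size = 200  # hypothetical number of items inspected per lot
    honest = simulate_lots(400, lot_size, 0.09)
    capped = simulate_lots(400, lot_size, 0.09, cap=0.10)
    for label, data in (("honest", honest), ("capped at 10%", capped)):
        one = share_outside(data, lot_size, 1)
        two = share_outside(data, lot_size, 2)
        print(f"{label}: {one:.0%} beyond 1 sigma, {two:.0%} beyond 2 sigma")
```

In the simulated honest data, the shares beyond one and two standard deviations should land in the neighborhood of the "one third" and "one in twenty" figures, while the capped data should show noticeably fewer such points. A ceiling only removes the high values, so it does not reproduce the tight pattern in the original chart exactly, but it shows how too little variation can give altered data away.
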
Dr. Deming could look at the graph and detect the falsity of
the data because he knew that the pattern he saw was not what
would be realistically expected. Figure 4 shows what a control
chart with valid data might have looked like; note the much wider
spread of data between the upper and lower control limit lines.

These examples are certainly not the only kinds of false data, and I haven't even touched on ways to lie with statistics without altering the actual numbers, but they serve to illustrate the point. There are indeed lies, damned lies and statistics, but even when dealing with statistics, it's still possible to tell truth from lies.

1. 'Massaged' - a word used to indicate that the data has been altered to compensate for some perceived problem. This is sometimes appropriate due to problems with data collection, but it is sometimes done inappropriately in order to guarantee desired results.

2. This chart has been slightly modified for this article (the blue and green lines were added); it comes from W. Edwards Deming, "Quality, Productivity, and Competitive Position", 1982, Massachusetts Institute of Technology Center for Advanced Engineering Study, Cambridge, MA 02139, Page 209.