
There are lies, damned lies and statistics and, if you're not careful, those who lie with statistics are likely to get away with it.

 

Lies, Damned Lies and Statistics


Feb. 24, 2000

By Elizabeth Clarkson
Copyright ©2000, Elizabeth Clarkson

Mark Twain once said there were three kinds of lies: lies, damned lies, and statistics. However, he didn't mention how he arrived at that delineation. In my opinion it depends on how hard it is to prove a lie false. Simple lies are easily shown to be false, damned lies are difficult, and statistics seemingly impossible. Someone who lies with statistics is likely to get away with it.

Fortunately, there are ways to ferret out falsehoods, even those hidden in statistics. It isn't easy to generate data that can fool someone who knows what to look for. Patterns of data that are typical of fudging tell you that suspicion is in order. If data have been 'massaged'(1) to present a false picture, the data will bear telltale signs of the deviation.

I had a college professor who was developing a technique for determining whether data had been artificially generated. To test his theory, he had half his students flip a coin 500 times; the other half generated a list of 'Heads' and 'Tails' to imitate a coin being flipped. These were graduate students in statistics, so they had a pretty good idea of how to imitate such a sequence.

He was able to quickly identify which was which with 100% accuracy simply by looking at the longest run (all heads or all tails) reported. In randomly flipping a coin 500 times, at some point you are very likely to get a run of 8 or more heads or tails in a row, and a run of 10 or more is not unusual. Yet, in generating false data, no one thought to put in such a lengthy run. It was a telltale sign of the falsity of the data.
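The professor's observation is easy to check by simulation. The Python sketch below is illustrative only and is not his actual procedure: the function names, the 2,000 simulated sessions and the fixed seed are assumptions made for the example. It simulates many sessions of 500 fair flips and tallies the longest run found in each.

```python
import random
from collections import Counter


def longest_run(flips):
    """Return the length of the longest run of identical consecutive outcomes."""
    best = current = 0
    previous = None
    for flip in flips:
        # Extend the current run if this flip matches the last one, else restart.
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best


def simulate_longest_runs(n_flips=500, n_trials=2000, seed=0):
    """Longest run in each of n_trials simulated sessions of n_flips fair flips."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    return [
        longest_run([rng.choice("HT") for _ in range(n_flips)])
        for _ in range(n_trials)
    ]


if __name__ == "__main__":
    runs = simulate_longest_runs()
    tally = Counter(runs)
    for length in sorted(tally):
        print(f"longest run {length:2d}: {tally[length]:4d} sessions")
    share = sum(r >= 8 for r in runs) / len(runs)
    print(f"share of sessions with a run of 8 or more: {share:.2f}")
```

Applying `longest_run` to a hand-written sequence and comparing the result with the simulated tally gives a rough sense of whether the sequence is plausible as a record of real flips.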

This example illustrates the first rule of spotting false data: to recognize when data has been tampered with, you must know what the genuine data should look like.

This also means that you must have access to the raw data, not just summations of it. Without the raw data, you must depend on other people's statistical machinations rather than your own. If you have any suspicions about the accuracy of a statistical report, don't accept summations - demand access to the raw data.

Of course, you must keep in mind that tampering is not the only reason for deviation from the expected. There may be other possible explanations. But you can determine whether the data shows any signs typical of tampering. Then you must determine what the cause of the deviation actually was.

Finally, the person giving you false data may not be deliberately trying to mislead you. They, themselves, may have no knowledge of the alterations. Unless they collected the data themselves (rare), it may well have been altered before they received it.

As British economist Sir Josiah Stamp put it, "The government [is] very keen on amassing statistics...But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn well pleases." People alter data for many different reasons. Fortunately, regardless of the reason for the deception, the results are similar.

The first step in detecting false data is to draw its picture. The best picture to start with is called a histogram. (Many computer programs, including most spreadsheets, will easily produce a histogram.) A histogram partitions the dataset into ranges (e.g. 1-10, 11-20, 21-30, 31-40, ...) and then counts how many individual data points fall into each range. Figure 1 shows a histogram of a normal distribution (the popular bell-shaped curve). The data clusters around the center and decreases towards both ends. Note that the histogram is not perfectly smooth - real world data never is - but the overall trend, the distribution, is clear.
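For readers without a spreadsheet to hand, the binning step can be sketched in a few lines of Python. This is a minimal illustration that assumes whole-number data and uses the same ranges of ten described above; the function names and the made-up bell-shaped sample are hypothetical.

```python
import random
from collections import Counter


def histogram(values, width=10):
    """Count whole-number values in ranges 1-10, 11-20, ... (for width=10)."""
    if not values:
        return {}
    # Bin 0 holds 1..width, bin 1 holds width+1..2*width, and so on.
    counts = Counter((v - 1) // width for v in values)
    return {
        (i * width + 1, (i + 1) * width): counts[i]
        for i in range(min(counts), max(counts) + 1)  # keep empty ranges too
    }


def print_histogram(bins):
    """Draw one row of '#' marks per range."""
    for (low, high), count in bins.items():
        print(f"{low:>4}-{high:<4} {'#' * count}")


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up sample: 200 rounded draws from a bell-shaped distribution.
    sample = [round(rng.gauss(50, 15)) for _ in range(200)]
    print_histogram(histogram(sample))
```

Empty ranges are kept in the output deliberately, since a gap in the middle of a histogram is itself worth noticing.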



With a histogram, you can immediately identify several features of your data. First, where is it centered? Second, how spread out is the data? Next, are there any identifiable patterns in the data? Is it symmetrical - that is, does the left half of the graph mirror the right?
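The same three questions can also be put to the raw numbers. The sketch below is one simple way to do so, assuming the mean as the measure of center, the sample standard deviation as the measure of spread, and a basic moment-based skewness as a rough symmetry check (a value near zero suggests a symmetrical shape). The function name `describe` is hypothetical.

```python
import statistics


def describe(values):
    """Summarize center, spread and symmetry of a list of at least two numbers."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    # Average cubed standardized deviation: positive means a longer right tail,
    # negative means a longer left tail, near zero means roughly symmetrical.
    skewness = sum(((v - center) / spread) ** 3 for v in values) / len(values)
    return {"center": center, "spread": spread, "skewness": skewness}


if __name__ == "__main__":
    print(describe([1, 2, 3, 4, 5]))   # symmetrical sample
    print(describe([1, 1, 1, 2, 10]))  # one large value stretches the right tail
```

These numbers complement the picture rather than replace it; two very different histograms can share the same center and spread.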

The easiest type of alteration to spot is that of a truncated dataset - one in which offending data points beyond a threshold value have been removed. A truncated distribution looks like the graph in Figure 2. One end (or tail, as statisticians like to call it) is cut off; in this example, the right end. While there are legitimate reasons for a dataset to suddenly truncate at some value (for example, if the data represent the time lapse between two events, it can never fall below zero), it's also a popular method of tampering with data. People eliminate data points that could cause problems if others found out about them. By throwing out those values, they can postpone or eliminate having to deal with those problems. The dataset, however, will bear the telltale sign of such tampering.
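A cut-off tail can also be flagged numerically. The sketch below is a crude, hypothetical heuristic rather than a standard statistical test: it bins the data and compares the count in each outermost bin with the count in the tallest bin. In a full bell-shaped dataset both edges are nearly empty; after truncation, the cut edge is still crowded. The bin count of 15, the made-up data and the threshold of 110 are arbitrary choices for illustration.

```python
import random


def edge_ratios(values, n_bins=15):
    """Return (left, right): each edge bin's count divided by the tallest bin's count."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; no histogram shape to inspect")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    peak = max(counts)
    return counts[0] / peak, counts[-1] / peak


if __name__ == "__main__":
    rng = random.Random(0)
    full = [rng.gauss(100, 15) for _ in range(5000)]  # made-up honest data
    truncated = [v for v in full if v <= 110]         # values above 110 discarded
    print("full data      (left, right):", edge_ratios(full))
    print("truncated data (left, right):", edge_ratios(truncated))
```

A high ratio at one edge is only a prompt to ask questions. As noted above, some data truncate for perfectly legitimate reasons.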



Another way people sometimes deal with unwelcome values is not to eliminate them but to change them so that they fall within the threshold value. Dr. W. Edwards Deming, perhaps the most influential statistician of the twentieth century, told of consulting with a manufacturer. The company had carefully collected data on the percentage of defective product produced. But the data, when charted on a control chart (Figure 3)(2), did not vary as expected.

With an average defect rate of nearly 9%, roughly one third of the data points plotted on the control chart should fall outside the green lines, and one point in twenty should fall outside the blue lines. But in fact, none of the charted data fall very far from the average. This data is false, but to know it, you first need to know what the honest data would have looked like. (Control charts are easily developed and interpreted with only a few hours training. To learn about simple control charting, read more here.)

After much effort, and insisting that the data could not naturally fall into such a pattern, Dr. Deming finally uncovered the culprit. The woman charged with doing the final inspection labored under the belief that the factory would be closed if the percent defective ever went above 10%. She, and all her co-workers, would lose their jobs. Needless to say, she never found more than 10% defective. Dr. Deming could look at the graph and detect the falsity of the data because he knew that the pattern he saw was not what would be realistically expected. Figure 4 shows what a control chart with valid data might have looked like; note the much wider spread of data between the upper and lower control limit lines.
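The kind of check Dr. Deming applied can be sketched in Python. Several things here are assumptions, not facts from his account: the green and blue lines are taken to be one and two standard deviations from the average (which is consistent with the "one third" and "one in twenty" figures), the daily sample of 200 items and the 1,000 simulated days are hypothetical, and the inspector's behavior is simplified to capping any result above 10% at exactly 10%.

```python
import math
import random


def fraction_outside(proportions, sample_size, n_sigma):
    """Share of points more than n_sigma standard deviations from the average.

    Uses the binomial standard deviation of a proportion, sqrt(p(1-p)/n).
    """
    p_bar = sum(proportions) / len(proportions)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    outside = sum(abs(p - p_bar) > n_sigma * sigma for p in proportions)
    return outside / len(proportions)


def simulate_daily_rates(days, sample_size, defect_rate, rng):
    """Honest inspection: each item is defective with probability defect_rate."""
    return [
        sum(rng.random() < defect_rate for _ in range(sample_size)) / sample_size
        for _ in range(days)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200  # hypothetical number of items inspected per day
    honest = simulate_daily_rates(days=1000, sample_size=n, defect_rate=0.09, rng=rng)
    capped = [min(p, 0.10) for p in honest]  # simplified model of the inspector
    for name, data in (("honest", honest), ("capped", capped)):
        print(
            f"{name}: outside 1 sigma = {fraction_outside(data, n, 1):.2f}, "
            f"outside 2 sigma = {fraction_outside(data, n, 2):.2f}"
        )
```

The honest series scatters beyond the one-sigma lines far more often than the capped series does, and that shortfall is the signal. The real data in the story were evidently compressed even more tightly than this simple cap produces.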



These examples are certainly not the only kinds of false data, and I haven't even touched on ways to lie with statistics without altering the actual numbers, but they serve to illustrate the point. There are indeed lies, damned lies and statistics, but even when dealing with statistics, it's still possible to tell truth from lies.

1. 'Massaged' - a word used to indicate that the data has been altered to compensate for some perceived problem. This is sometimes appropriate due to problems with data collection, but it is sometimes done inappropriately in order to guarantee desired results.

2. This chart has been slightly modified for this article (the blue and green lines were added). It comes from W. Edwards Deming, "Quality, Productivity, and Competitive Position", 1982, Massachusetts Institute of Technology Center for Advanced Engineering Study, Cambridge, MA 02139, Page 209.
