Judging by some questions I received this week, the distinction I recently drew between statistics and probability was not clear. Statistics are the record of past events. Probability deals with future events. The statistics are what they are; that is not controversial. However, the way statistics are interpreted to derive a probability, and the accuracy claimed for the result, can be very controversial.

I like provocative examples because they show the real value of knowing how to make valid decisions based on observations. For instance, back in the seventies my family and I moved from California to what is commonly called the “Deep South.” We noted almost immediately that the rural black residents in our new community shared several highly correlated features. Compared to the urban population, they were poor, uneducated, and likely out of work. Those were statistics. To go from those observations to the interpretation that black people were inherently poor, unintelligent, and lazy is improper. But here’s the dangerous part: those statistics can be correctly used to predict that black youngsters would not do well in school, provided no other parameters in the environment were changed. You can see how that conclusion can be dangerous and can help propagate racial prejudice. Remember the old saying, “Figures don’t lie, but liars figure.”

Deriving probabilities from statistics is straightforward (although different algorithms suit different cases). Analyzing statistics for underlying causality is a different beast. In common speech, these two processes are often conflated.
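To make the "straightforward" half concrete, here is a minimal sketch in Python, assuming the simplest possible case: a record of yes/no outcomes from past trials. The function names are my own, chosen for illustration. The first estimate is the plain relative frequency; the second is Laplace's rule of succession, one of several possible estimators, which avoids claiming certainty from a small record. Note that neither one says anything about why the outcomes occurred.

```python
from fractions import Fraction


def relative_frequency(successes: int, trials: int) -> Fraction:
    """Estimate a probability as the observed fraction of successes."""
    if trials <= 0:
        raise ValueError("need at least one recorded trial")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return Fraction(successes, trials)


def laplace_estimate(successes: int, trials: int) -> Fraction:
    """Estimate a probability with Laplace's rule of succession.

    Adds one imaginary success and one imaginary failure to the record,
    so a short run of identical outcomes never yields exactly 0 or 1.
    """
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


if __name__ == "__main__":
    # Hypothetical record: 3 successes in 3 trials.
    print(relative_frequency(3, 3))  # the statistics alone suggest certainty
    print(laplace_estimate(3, 3))    # a more cautious prediction
```

With three successes in three trials, the relative frequency is 1 while the Laplace estimate is 4/5. The statistics are identical in both cases; only the rule for turning them into a prediction differs, and that choice is where the arguments start.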

The errors in logic made by people with whom we disagree are more obvious than the errors made by people we agree with. My example above seems to be obvious, but what about the converse? Assuming you have some consistent way to define race and human parameters, then what is the probability that any parameter you name does not correlate with race?

What about the genders? Very good statistics show that men have greater lung capacity on average than women. The genders differ in many other measurable physical parameters as well. No one will argue with that, but these differences can lead one to hypothesize that the genders also differ in psychological ways. In the past, this easy extrapolation from one set of measurements (statistics) to predictions of how women will behave (probability) was done inappropriately and led to gross unfairness in the way women were treated by society. However, that history does not justify the equally unsupported hypothesis that no psychological differences exist. (Whether any such differences originate in nature or nurture is a separate question, just as it is in the racial analysis.)

Reasonably good statistics exist correlating the ownership of firearms with illegal activity. The statistics on crime prevention by armed civilians are more anecdotal, but still voluminous. The strong emotions of those who support the right to bear arms and of those who want to ban arms (or highly regulate them) are serious impediments to a valid interpretation of the statistics, that is, to defining the probability of harmful activities given any change in the laws one way or the other. An article in Scientific American pointed out some years ago that much more money is spent arguing about gun control than studying its effects in a fair and honest way.

The bottom line is that statistics are ideally value-free data. That goal is seldom achieved because the process of gathering statistics and analyzing them is performed by human beings who always have an a priori mental model of what to expect (read: axe to grind). Overcoming preconceptions when faced with valid data is greatly facilitated by having a grasp of how rational decision theory works. All growth depends on overcoming preconceptions. Phrased that way, statistics is not so boring.

In response to the interest my original tutorial generated, I have completely rewritten and expanded it. Check for the tutorial's availability through Lockergnome. The new version is over 100 pages long, with chapters that alternate between discussion of the theoretical aspects and puzzles just for the fun of it. Puzzle lovers will be glad to know that I included an answers section that explains why each answer is correct and how it was obtained. Most of the material has appeared in these columns, but some is new, and most of the discussions are expanded from their original column format.

[tags]statistics, probability, decision theory[/tags]