Last week I mentioned a serious problem that has bothered philosophers, gamblers, and scientists: how can we derive the probability of an event (future actions) from the statistics of occurrence (past events)? We considered the example of evaluating generals by whether they had won five consecutive battles and decided that, for an underlying probability of 0.5, a statistical database of only five events was insufficient. Further, we have an intuitive feeling that the more events we include in the statistical database, the better the derived probability of future events will be. But is there a way to quantify the statistical requirements?
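To put a rough number on that intuition, here is a quick back-of-the-envelope sketch. The pool of 100 generals is my own illustrative assumption, not something from last week's column; the point is only how easily a five-battle winning streak can happen by pure luck.

```python
# How likely is a five-battle winning streak by pure luck?
# (The 100-general pool size is an invented illustration.)
p_win = 0.5                       # assume each battle is a 50/50 affair
p_streak = p_win ** 5             # chance one general wins 5 in a row by luck
print(f"Chance of a lucky 5-battle streak: {p_streak:.3f}")   # about 0.031

n_generals = 100
p_at_least_one = 1 - (1 - p_streak) ** n_generals
print(f"Chance at least one of {n_generals} mediocre generals has such a streak: "
      f"{p_at_least_one:.2f}")    # roughly 0.96
```

With a large enough pool, somebody is almost guaranteed to look brilliant by chance alone, which is why a database of only five events is too thin to support a confident transition from statistics to probability.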

If we do not have a method to make that transition, then Hume was correct in saying that we have no reason to believe the sun will rise tomorrow based on past observations. In fact, Hume was correct in the sense that the best we can do with a statistical analysis of past dawns is to derive a probability the sun will rise tomorrow and that derived probability will have an uncertainty. The sun could blow up this evening, but in that event, we would not likely be worried about arguing with Hume — for whom the sun does not rise anymore in any case.

For instance, if we toss a fair coin 1000 times, it might reasonably come up heads something like 498 times and tails 502. Does that mean the probability of heads is exactly 0.498? Of course not. [Note: as the number of statistical tosses increases, the inferred probability will get closer and closer to 0.5, but the probability of it being exactly 0.5 decreases to insignificance, a delightful paradox.]
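To see that paradox in numbers rather than words, here is a small sketch (my own illustration, not part of the column's argument) that computes two things for a fair coin: the exact binomial chance of getting heads exactly half the time, and the chance of the observed frequency landing within one percent of 0.5.

```python
from math import ceil, comb, floor

def p_exactly_half(n):
    """Chance a fair coin shows heads in exactly half of n tosses."""
    return comb(n, n // 2) / 2**n

def p_within_one_percent(n):
    """Chance the observed heads frequency lands between 0.49 and 0.51."""
    lo, hi = ceil(0.49 * n), floor(0.51 * n)
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2**n

for n in (10, 100, 1000, 10000):
    print(n, round(p_exactly_half(n), 4), round(p_within_one_percent(n), 4))
```

As the number of tosses grows, the exactly-one-half probability shrinks (from roughly 0.25 at 10 tosses to under 0.01 at 10,000), while the within-one-percent probability climbs from about 0.25 to over 0.95. The inferred probability gets arbitrarily close to 0.5 without ever being likely to equal it exactly.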

So is there a way to convert statistics to probability? Yes, there are many ways, and they give different answers. Part of the problem lies in the detailed and exact definition of the terms being used. Seemingly slight changes in the way the problem is posed can change the algorithms. I think it is fair to say that the book is closed on this problem only if you accept the point of view of the person closing the book. The reasons for this lack of closure can be found in many places, but try starting with Wikipedia.
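As a small illustration of how two respectable recipes can disagree, compare the naive frequency estimate with Laplace's rule of succession, a formula Laplace originally devised for the sunrise problem itself. Applying them to last week's generals is my own example, not something drawn from those sources.

```python
# Two textbook recipes applied to the same record: five wins in five battles.
wins, battles = 5, 5

# Frequency estimate: successes divided by trials.
frequency_estimate = wins / battles                 # 1.0, "certain to win"

# Laplace's rule of succession: (successes + 1) / (trials + 2).
laplace_estimate = (wins + 1) / (battles + 2)       # 6/7, about 0.857

print(frequency_estimate, round(laplace_estimate, 3))
```

With five wins in five battles, the frequency estimate declares victory certain, while the Laplace estimate hedges at about 0.86; with a thousand coin tosses behind them, the two recipes would agree to three decimal places.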

The first big division is between what is usually called logical probability and frequency probability. If you wade through the Wikipedia article, you will probably (!) begin to understand why professionals in this field do not always use data to test a prediction (the hypothesis) directly, but prefer to test whether the data supports the logical inverse (the null hypothesis). That is, if the data does not support the null hypothesis, then the actual hypothesis is deemed more likely to be true. Going beyond the coarse generalizations I use here to specific equations and results is beyond the scope of a short column. My main intent is to emphasize that different methods exist to derive reasonable probabilities from a database of past measurements. All methods will result in a probability that has some inherent uncertainty in it unless one has performed an infinite number of tests (and even that might not be good enough).
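To give the flavor of the null-hypothesis approach, here is a one-sided binomial test applied to both of our running examples. This is my own illustrative sketch of the general idea, not a specific procedure taken from the article.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: the chance of seeing at least this many successes
    if the null hypothesis (true success probability = p_null) holds."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

# The general: five wins in five battles against a "no better than a coin toss" null.
print(round(binomial_p_value(5, 5), 3))        # about 0.031, so the null looks shaky

# The coin: 502 tails in 1000 tosses against a "fair coin" null.
print(round(binomial_p_value(502, 1000), 3))   # about 0.46, no reason to doubt fairness
```

The general's perfect record would be surprising if he were really a coin-flip general, so the null hypothesis looks doubtful; the 502 tails, on the other hand, are thoroughly unsurprising for a fair coin.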

Does all this sound too esoteric to be of any practical use? In fact, these considerations are key to the effective operation of anti-spam filters, among many other things. So the next time you empty your junk email folder, thank the brave statisticians and mathematicians who laid the framework for your safe emailing — well, safer than it would be otherwise.
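For the curious, here is a toy sketch of the word-probability scoring that many spam filters build on. The word counts and the naive-Bayes framing are my own illustration; real filters are considerably more elaborate.

```python
# Toy word-frequency spam scoring (all counts are invented for illustration).
spam_counts = {"winner": 40, "free": 35, "meeting": 2}
ham_counts  = {"winner": 1,  "free": 5,  "meeting": 30}
spam_total, ham_total = 100, 100     # past messages of each kind in the training set

def spam_score(words):
    """Posterior probability of spam, assuming equal priors and
    independent words (the 'naive' in naive Bayes)."""
    p_spam = p_ham = 0.5
    for w in words:
        # Add-one smoothing so an unseen word does not zero out a probability.
        p_spam *= (spam_counts.get(w, 0) + 1) / (spam_total + 2)
        p_ham  *= (ham_counts.get(w, 0) + 1) / (ham_total + 2)
    return p_spam / (p_spam + p_ham)

print(round(spam_score(["free", "winner"]), 3))   # close to 1: looks like spam
print(round(spam_score(["meeting"]), 3))          # close to 0: looks legitimate
```

The filter never knows whether a message is spam; it only derives a probability from the statistics of past messages, which is our whole topic in miniature.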

In response to the interest my original tutorial generated, I have completely rewritten and expanded it. Check out the tutorial's availability through Lockergnome. The new version is over 100 pages long, with chapters that alternate between discussions of the theoretical aspects and puzzles just for the fun of it. Puzzle lovers will be glad to know that I included an answers section with discussions of why each answer is correct and how it was obtained. Most of the material has appeared in these columns, but some is new, and most of the discussions are expanded from their original column format.

[tags]statistics, probability[/tags]