Bayes Theorem Implications and Applications

Peter Dunne
School of Computing and Technology University of Sunderland


Base Rate Fallacy
Conditional Probability
Base Rate, Prosecutor’s and Other Fallacies
Application to IDS
Fun Digression: The Doomsday Argument
P-Values and Other Pitfalls
Doing Better
Bayes Theorem and the British Courts

Base Rate Fallacy

A Medical Test
You have undergone a medical test for a serious disease. The test is 99% accurate. Your doctor tells you the test was positive. How worried should you be? What is the probability you have the disease?

The Base Rate Fallacy
It is not possible to answer that last question with the information given. It could be anything from, say, 0.1 to 0.001. It all depends on the “base rate” – that is, the underlying rate of the disease in the population of which you are a member. This failure to give, or use, the base rate is often called the base rate fallacy. To be able to give a precise account of this we need the tools of conditional probability.

Conditional Probability

Probability Theory
Let Ω be the set of all possible outcomes of an experiment (or random situation, etc). Let F be the set of all¹ possible subsets of Ω. These are the events we are interested in. Eg suppose Ω = {1, 2, 3, 4, 5, 6} (we are modelling throwing a six-sided die). Then F = {∅, {1}, {2}, . . . , {6}, . . . , {1, 2, 3, 4, 5, 6}} and the event “an odd number was thrown” is represented by the set {1, 3, 5}. A probability measure is then a function P : F → [0, 1] satisfying certain conditions. The triple ⟨Ω, F, P⟩ is a probability model.
¹ If Ω is not a discrete or countable set “all” is too many – we will ignore this technicality.

The probability measure (or just probability) P satisfies:
1. P(∅) = 0
2. If A ∈ F then P(¬A) = 1 − P(A)
3. If events A and B are mutually exclusive (ie A ∩ B = ∅) then P(A ∪ B) = P(A) + P(B).
From these basic properties it follows that P(Ω) = 1. If A and B are not mutually exclusive then P(A ∪ B) = P(A) + P(B) − P(A ∩ B). If E = E1 ∪ E2 ∪ E3 ∪ · · · where the Ei are mutually disjoint then P(E) = Σi P(Ei).
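These properties can be checked mechanically on the die model. A minimal Python sketch (not from the slides; the uniform fair-die measure and the event names are my own illustration):

```python
from fractions import Fraction

# The fair-die model: Omega = {1,...,6}, P(A) = |A| / |Omega|.
OMEGA = frozenset(range(1, 7))

def P(event):
    """Probability of an event (a subset of OMEGA) under the uniform measure."""
    return Fraction(len(set(event) & OMEGA), len(OMEGA))

odd = {1, 3, 5}
low = {1, 2, 3}

assert P(set()) == 0                                  # property 1
assert P(OMEGA - odd) == 1 - P(odd)                   # property 2 (complement)
assert P({1}.union({2})) == P({1}) + P({2})           # property 3 (disjoint union)
assert P(OMEGA) == 1                                  # consequence
assert P(odd | low) == P(odd) + P(low) - P(odd & low) # inclusion-exclusion
print(P(odd))  # 1/2
```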

Probability Interpretation
There are a number of different ways of interpreting the above mathematical definition of probability. Among these are:
- Frequency interpretation: the probability of E is the (long-run) frequency of the event E in a number of repetitions of the experiment.
- Subjective probability interpretation: the probability of an event E is the chance that you think E will occur. In principle this could be determined by finding out how much you are prepared to bet on E happening.
- Kolmogorov complexity interpretation: this approach is based on how complicated a procedure is required to generate E.

The actual interpretation of what a probability is does not impact on the mathematical theory of probability. However it can have a major impact when we come to apply probability theory especially in the area of statistics. Hence by way of “full disclosure”:

(In so far as I am a statistician) I am a Bayesian.

Conditional Probability
Very often we are interested in knowing about the probability of some event A given that some other event B has occurred. We write P(A|B) (read as “the (conditional) probability of A given B”). It can be defined as

P(A|B) = P(A ∩ B) / P(B)

Note that if P(B) = 0 then this is undefined. We say the events A and B are (stochastically) independent if P(A ∩ B) = P(A)P(B) from which it follows that for independent events P(A|B) = P(A) and P(B|A) = P(B).
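As a quick illustration (an assumed example, not from the slides), both the definition and the independence criterion can be checked on the fair-die model:

```python
from fractions import Fraction

# Fair six-sided die: P(A) = |A| / 6.
OMEGA = frozenset(range(1, 7))

def P(event):
    return Fraction(len(set(event) & OMEGA), len(OMEGA))

def P_given(a, b):
    """P(A|B) = P(A & B) / P(B); undefined (raises) when P(B) = 0."""
    if P(b) == 0:
        raise ZeroDivisionError("P(A|B) is undefined when P(B) = 0")
    return P(set(a) & set(b)) / P(b)

odd = {1, 3, 5}
at_most_4 = {1, 2, 3, 4}

print(P_given(odd, at_most_4))  # 1/2
# These two events happen to be independent: P(A & B) = P(A)P(B),
# so conditioning on one does not change the probability of the other.
assert P(odd & at_most_4) == P(odd) * P(at_most_4)
assert P_given(odd, at_most_4) == P(odd)
```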

Law of Total Probability
An important consequence of the above definitions is the Law of Total Probability. If E1, E2, . . . , En are a set of mutually exclusive (ie Ei ∩ Ej = ∅ for all i ≠ j) and exhaustive (ie E1 ∪ E2 ∪ · · · ∪ En = Ω) events then for any event A we have

P(A) = Σi P(A|Ei)P(Ei)

This has a useful special case: for any events A and B

P(A) = P(A|B)P(B) + P(A|¬B)P(¬B)

Bayes Theorem
From the definition of conditional probability we can obtain the celebrated Bayes Theorem. Let A and B be events (with P(A) ≠ 0 and P(B) ≠ 0); then

P(B|A) = P(A|B)P(B) / P(A)

Let Ei be a set of mutually exclusive and exhaustive events; then combining Bayes Theorem with the Law of Total Probability we have

P(B|A) = P(A|B)P(B) / Σi P(A|Ei)P(Ei), where the sum runs over i = 1, . . . , n

Again with a special case:

P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|¬B)P(¬B))

Base Rate, Prosecutor’s and Other Fallacies

Medical Test Revisited
Let us return to the scenario with which we started. Let D be the event “you have the disease” and T the event “the test is positive”. What you want to know is P(D|T). Using the last equation above we can write

P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|¬D)P(¬D))

The test is 99% accurate, so we take P(T|D) = 0.99 and P(¬T|¬D) = 0.99. Also P(T|¬D) = 1 − P(¬T|¬D) = 0.01.

P(D|T )
The quantity P(D) is the only unknown. Note that before the test the probability that you have the disease is just P(D) – which (given you have no other information) is just the rate of the disease in the general population. If the underlying rate of the disease is P(D) = 0.0001 (ie 1 in 10,000 have the disease) then P(D|T ) = 0.0098 ≈ 0.01 whilst if it were 0.00001 (ie 1 in 100,000) then P(D|T ) = 0.00098 ≈ 0.001 etc
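The calculation above can be reproduced directly. A short sketch (the function name `posterior` is mine, not from the slides):

```python
# Posterior probability of disease given a positive result from a
# 99%-accurate test, for several base rates, via
# P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|~D)P(~D)).
def posterior(base_rate, sensitivity=0.99, specificity=0.99):
    p_t_given_d = sensitivity          # P(T|D)
    p_t_given_not_d = 1 - specificity  # P(T|~D)
    num = p_t_given_d * base_rate
    return num / (num + p_t_given_not_d * (1 - base_rate))

for rate in (0.0001, 0.00001):
    print(rate, round(posterior(rate), 5))
# base rate 1 in 10,000  -> P(D|T) ~ 0.0098
# base rate 1 in 100,000 -> P(D|T) ~ 0.001
```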

Prosecutor’s Fallacy
This type of faulty reasoning has become associated with careless (or unscrupulous) prosecution lawyers. A murder has been committed. DNA taken from the scene is found to match a sample from a man, stored some time before in a national database. The man is arrested and charged with the murder. At the trial the prosecution states that the probability is 1 in 1,000,000 that the two samples could have matched by chance. They then go on to claim that therefore the probability that the man is innocent is 1 in 1,000,000. This is a false claim. Let E be the evidence and I the event “the accused is innocent”. Then we are given that P(E|I) = 0.000001. But what we want is P(I|E).

Prosecutor’s Fallacy (cont.)
By Bayes’ Theorem we have

P(I|E) = P(E|I)P(I) / P(E)

This indicates that we need to assess both the a priori probability of innocence and the overall probability of the evidence before we can arrive at the needed estimate of P(I|E).

Defender’s Fallacy
The defence might argue that there are 60,000,000 samples in the database so we would expect on the order of 60 matches. Hence there is a 59 in 60 chance that the accused is innocent. Again this is ignoring other evidence.

Interrogator’s Fallacy
R. Matthews points out that for a confession to contribute to evidence of guilt we must have P(C |G ) > P(C |I ) where C is the event “a confession has been made”, G is the event “the accused is guilty” and I that “the accused is innocent”. There can be situations where the above inequality does not hold – eg a trained and committed terrorist is less likely to confess under pressure than a suggestible innocent. Assuming then that confession can never reduce the probability of guilt is the Interrogator’s Fallacy.

Robert Matthews on the Interrogator’s Fallacy: interro.htm
Ian Stewart has also discussed the Interrogator’s Fallacy at ∼030116/116/articles/mathrec1.htm

Intrusion Detection Systems

IDS Types
(What follows is based upon Axelsson 1999.) ID systems may be classified into two main types:
- Anomaly Detection Systems
- Misuse Detection Systems
Both categories need to consider: effectiveness, efficiency, ease of use, security, inter-operability, transparency. Effectiveness issues can be explored by means of the base-rate fallacy.

False Positive/False Negative
Suppose we have an IDS monitoring our system. There are four possibilities:

            Intrusion         No Intrusion
Alarm       True Positive     False Positive
No Alarm    False Negative    True Negative

Let A be the event “there is an alarm” and I that “there is an intrusion”.
P(A|I) – detection rate (the quantity obtained when testing the detector on known intrusions)
P(A|¬I) – false alarm rate (false positive)
P(¬A|I) – false negative rate
P(¬A|¬I) – true negative rate

What We Really Really Want
What we really want to know is:
- P(I|A) (sometimes called the “Bayesian detection rate”) – ie, given an alarm, have we really got an intrusion on our hands?
- P(¬I|¬A) – ie, if there is no alarm, can we be confident there is no intrusion?
We want both these quantities to be very large. Bayes Theorem allows us to write:

P(I|A) = P(A|I)P(I) / (P(A|I)P(I) + P(A|¬I)P(¬I))

and

P(¬I|¬A) = P(¬A|¬I)P(¬I) / (P(¬A|¬I)P(¬I) + P(¬A|I)P(I))

You Can’t Always Get What You Want
Suppose (using Axelsson’s example) we get 2 intrusions a day and each intrusion generates 10 audit trail records we can use to spot the intrusion. Suppose that the system generates 1,000,000 audit records a day. Then the probability that any given audit record shows an intrusion is

P(I) = 20/10⁶ = 2 × 10⁻⁵

Also P(¬I) = 1 − P(I) = 0.99998. Then

P(I|A) = (2 × 10⁻⁵ × P(A|I)) / (2 × 10⁻⁵ × P(A|I) + 0.99998 × P(A|¬I))

Then even with P(A|I ) = 1.0 to get, say, P(I |A) = 0.8 we will need P(A|¬I ) = 0.000005.
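Axelsson’s numbers can be plugged straight into the formula. A short sketch (the helper name is mine):

```python
# P(I|A) = P(A|I)P(I) / (P(A|I)P(I) + P(A|~I)P(~I)),
# with P(I) = 20 intrusion records out of 10^6 audit records per day.
def bayesian_detection_rate(p_i, detection_rate, false_alarm_rate):
    num = detection_rate * p_i
    return num / (num + false_alarm_rate * (1 - p_i))

P_I = 20 / 1_000_000  # 2e-5

# Even a perfect detector needs a tiny false alarm rate to reach 0.8:
print(round(bayesian_detection_rate(P_I, 1.0, 0.000005), 2))  # 0.8
# A "good-looking" 1% false alarm rate gives a useless detector:
print(round(bayesian_detection_rate(P_I, 1.0, 0.01), 4))  # ~0.002
```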

The Human Element
With more realistic detection and false alarm rates the Bayesian detection rate soon falls to 50% and below. The problem here is that even though the number of events is low, a low Bayesian detection rate will (almost certainly) have the effect of “teaching” the security officer to ignore all alarms! This is especially true if the staff member has other duties etc. Training might help to ameliorate this problem. Using the same figures in the equation for P(¬I |¬A) shows less cause for concern.

False Positive Nearly Zero is Good?
I believe that it is worth pointing out that, though a probability of a false positive close to zero is what we want to achieve, unless it is actually zero there can still be problems that must be addressed. In particular, if the false positive rate is near zero it will become easy to assume that it is in fact zero. However, since it actually is not zero, there will on occasion be times when someone is falsely accused. How is this possibility to be handled by your system?

Some General Points
The previous analysis of ID systems from the perspective of conditional probability and Bayes Theorem, paying attention to base rates, is applicable to any situation where we carry out tests that can fail. This applies to fingerprints, iris scans, DNA matching, etc. Note also that false negatives have not been considered much above. If the rate of false negatives is not very low there may also be problems!

Fingerprint Verification Competition Measures
In 2000, 2002, 2004 and 2006 a number of academic institutions organised and ran a Fingerprint Verification Competition. The basic performance measures were (from the Wikipedia entry), for each database and for each algorithm:
- Each sample in the subset A is matched against the remaining samples of the same finger to compute the False Non-Match Rate FNMR (also referred to as False Rejection Rate – FRR).
- The first sample of each finger in the subset A is matched against the first sample of the remaining fingers in A to compute the False Match Rate FMR (also referred to as False Acceptance Rate – FAR).
Then the following performance indicators were reported:
- REJENROLL (number of rejected fingerprints during enrollment)
- REJNGRA (number of rejected fingerprints during genuine matches)

- REJNIRA (number of rejected fingerprints during impostor matches)
- Impostor and genuine score distributions
- FMR(t)/FNMR(t) curves, where t is the acceptance threshold
- ROC(t) curve
- EER (equal error rate)
- EER* (the value that EER would take if the matching failures were excluded from the computation of FMR and FNMR)
- FMR100 (the lowest FNMR for FMR ≤ 1%)
- FMR1000 (the lowest FNMR for FMR ≤ 0.1%)
- ZeroFMR (the lowest FNMR for FMR = 0%)
- ZeroFNMR (the lowest FMR for FNMR = 0%)

- Average enrollment time
- Average matching time
- Average and maximum template size
- Maximum amount of memory allocated
The results can be found on the home page (linked to from the Wikipedia entry).

The Doomsday Argument

Doomsday Argument
(This treatment is taken from Nick Bostrom’s summary.) Let your birth rank be your position in the sequence of all humans who will ever have existed. This is approximately 60 billionth. Consider the following two hypotheses:
H1: There will have been a total of 200 billion humans.
H2: There will have been a total of 200 trillion humans.
Now suppose that, after considering the various events that could lead to the extinction of the human race, you assign probabilities to H1 and H2 as follows: P(H1) = 0.05, P(H2) = 0.95.

Doomsday Argument (cont)
However, it is more probable that your birth rank is 60 billionth if H1 is true than if H2 is true. In fact

P(R = 60B|H1) = 1/(200 billion)
P(R = 60B|H2) = 1/(200 trillion)

Hence we can apply Bayes Theorem to get

P(H1|R = 60B) = P(R = 60B|H1)P(H1) / (P(R = 60B|H1)P(H1) + P(R = 60B|H2)P(H2)) ≈ 0.98

That is the human race has a high probability of becoming extinct relatively soon. Just where this argument goes wrong (and everyone thinks it does somewhere) is the subject of much debate.
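Bostrom’s numbers can be checked directly. A short sketch (the function name is mine):

```python
# Posterior on H1 ("200 billion humans total") after conditioning on a
# birth rank of about 60 billion, starting from priors 0.05 / 0.95.
def doomsday_posterior(p_h1=0.05, n1=200e9, n2=200e12):
    # Likelihood of any particular birth rank under each hypothesis:
    like_h1, like_h2 = 1 / n1, 1 / n2
    p_h2 = 1 - p_h1
    num = like_h1 * p_h1
    return num / (num + like_h2 * p_h2)

print(round(doomsday_posterior(), 3))  # ~0.981
```

The prior of 5% on "doom soon" is inflated to roughly 98% simply because a rank of 60 billion is 1,000 times more likely under H1 than under H2.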

Visit Nick Bostrom’s home page for many resources about the Doomsday argument and related issues (including the intriguing question “Are you living in a Computer Simulation?”):

Be Careful There’s a P-value About

Psychology Papers
There is growing interest in the “Psychology of Security”. You may at some time be asked to decide if one product is better than another on the basis of some psychological study. It is important to understand that the currently standard way that psychologists (and others, in fact) report their findings is fraught with pitfalls. In particular many, many psychology research papers take the route of “Null Hypothesis Testing”. Here a so-called null hypothesis, H0, is formed, typically that two different tests/treatments have no difference in effect. Then an experiment is carried out, measurements taken and a p-value calculated. If p < 0.05, then H0 is said to be rejected at the 5% level and the result is “significant at the 5% level”. Sometimes 1% is used as the cut-off level.

p-values and Their Interpretation
A p-value is often interpreted as the probability that the null hypothesis is true. So if we have p < 0.05 then we would be saying that there is a 95% chance it is false. This interpretation is WRONG, WRONG, WRONG! (The falsity of this interpretation is pointed out time and again in the literature – but it continues to be made.) In fact a p-value is just (effectively) P(Data | H0). What we want is P(H0 | Data), and we know that to get this we have to use Bayes Theorem. Which means that we need to have a prior: P(H0). It can be shown that to conclude P(H0 | Data) = 0.05 from P(Data | H0) = 0.05 you would need P(H0) < 0.1. Ie, before the experiment you would need to be at least 90% confident that H0 was false!
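The dependence on the prior can be made concrete. In this sketch the likelihood of the data under the alternative, P(Data|H1) = 0.5, is my own illustrative assumption (it is not in the slides); the point is only that P(Data|H0) = 0.05 alone does not pin down P(H0|Data):

```python
# P(H0|Data) = P(Data|H0)P(H0) / (P(Data|H0)P(H0) + P(Data|H1)P(H1)),
# with P(Data|H1) = 0.5 assumed for illustration.
def posterior_h0(prior_h0, p_data_h0=0.05, p_data_h1=0.5):
    num = p_data_h0 * prior_h0
    return num / (num + p_data_h1 * (1 - prior_h0))

for prior in (0.5, 0.9, 0.99):
    print(prior, round(posterior_h0(prior), 3))
# With a prior of 0.9 the posterior is still ~0.47 -- nowhere near 0.05.
```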

There are other things problematic about p-values. Eg:
- significance levels are arbitrary – why 0.05 but not 0.049 or 0.051?
- p-values fail to convey information about the effect size
- p-values fail to convey information about the sample size
- statistical significance is not the same as clinical significance or scientific importance
Amusing example: G. M. Fitzsimons, T. Chartrand, G. J. Fitzsimons, Automatic Effects of Brand Exposure on Motivated Behaviour: How Apple Makes You “Think Different”, Journal of Consumer Research (in press). (The authors are at Duke University.) ∼gavan/GJF articles/brand exposure JCR inpress.pdf

Does Apple Make You More Creative?
This has been picked up by the popular press under the headline “Apple Logo improves creativity!”. Amongst other things the relevant experiment had a (fairly large) number of (psychology) students either exposed to the Apple logo or the IBM logo. (There was also a control group.) They were then asked to come up with a number of different uses for a brick. (Above summary is very high-level – the experiment seems well designed and conducted.) The null hypothesis was “There is no difference in the mean number of uses between the Apple and IBM groups”. The authors rejected the null hypothesis at the 1% level. The results of the study were thus “statistically significant”.

Since my prior belief in the null hypothesis would be of the order P(H0) = 0.99 or P(H0) = 0.9999, the p-value of 0.01 would not cause me to reject it! In fact it would hardly change my prior at all. The authors do discuss (in essence) why my prior might be considered too high, by considering earlier work on “priming effects”. Crudely – that prior exposure to Apple’s brand and to IBM’s predisposes students to behave in particular ways which come through in the experiment. Well, I’m not convinced – and I haven’t so far mentioned the actual difference in mean number of uses! It was just over 1 (for “no-delay” subjects) rising to a bit less than 2 (for “delay” subjects). Another criticism of p-value significance levels is that they are very prone to exaggerate the significance of small effects.

Confidence Intervals
Because of the problems with p-values some areas of science have moved to using “confidence intervals”. Eg the recent Cochrane review on vitamin supplements: G. Bjelakovic, D. Nikolova, L. L. Gluud, R. G. Simonetti, C. Gluud, Antioxidant supplements for prevention of mortality in healthy participants and patients with various diseases, Cochrane Database of Systematic Reviews 2008, Issue 2, clsysrev/articles/CD007176/frame.html: “found significantly increased mortality by vitamin A (RR 1.16, 95% CI 1.10 to 1.24), beta-carotene (RR 1.07, 95% CI 1.02 to 1.11), and vitamin E (RR 1.04, 95% CI 1.01 to 1.07), but no significant detrimental effect of vitamin C (RR 1.06, 95% CI 0.94 to 1.20).”

CIs are definitely better than p-values since they take into account effect size and sample size. However they are possibly even more prone to misinterpretation than p-values. The obvious interpretation of a statistic being given a 95% CI is that the probability that the true value lies in that interval is 95%. Again this is WRONG, WRONG, WRONG! Again this is pointed out in every statistics text known to man. The correct interpretation is (from Wikipedia): “.. a 95% CI for a population parameter is an interval that is calculated from a random sample of an underlying population such that, if the sampling was repeated numerous times and the confidence interval recalculated from each sample according to the same method, 95% of the confidence intervals would contain the population parameter in question.”
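That correct interpretation can be demonstrated by simulation: draw many samples, recompute the interval each time, and count how often it covers the true mean. A sketch (normally distributed data and the 1.96 normal-approximation interval are my assumptions, purely for illustration):

```python
import random
import statistics

# Repeated-sampling interpretation of a 95% CI: roughly 95% of the
# recomputed intervals should contain the true population mean.
random.seed(42)
TRUE_MEAN, SIGMA, N, TRIALS = 10.0, 2.0, 50, 2000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se  # normal-approximation 95% CI
    covered += lo <= TRUE_MEAN <= hi

print(covered / TRIALS)  # close to 0.95
```

Note that any single interval either contains the true mean or it does not; the 95% is a property of the procedure, not of one interval.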

Fisher vs Neyman-Pearson
The current standard practice in hypothesis testing is an “unholy” amalgam of approaches put forward by two separate groups of statisticians in the 1920s and 1930s. R. A. Fisher introduced the p-value and null hypothesis as an aid to inductive inference. J. Neyman and E. Pearson introduced hypothesis testing as an alternative to Fisher’s significance testing. The idea was to determine how often a rule for deciding between two hypotheses would lead a researcher astray. Significance testing:
- provides a measure of the strength of evidence
- does not say how often that measure might mislead

Hypothesis testing:
- error rates describe how often one might be misled into making a particular error
- does not measure the strength of evidence
Naturally researchers put the two together – much to the disgust of the originators. This has resulted in:
- an ad-hoc methodology
- irresolvable controversies such as those surrounding adjustment for multiple looks at accumulating data
- misinterpretation of key quantities (eg p-values as type I error rates)

Better Approaches

ROC Curves
A ROC (Receiver Operating Characteristic) curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity).

ROC curves for different predictors.

(both figures from the Wikipedia ROC entry)
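The construction behind such curves can be sketched in a few lines: sweep an acceptance threshold over the classifier’s scores and record the (false positive rate, true positive rate) pair at each step. The scores and labels below are made up purely for illustration:

```python
# Build the points of a ROC curve from scores and binary labels.
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs for a threshold at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(roc_points(scores, labels))
```

A perfect predictor hugs the top-left corner; a random one lies on the diagonal.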

More on ROC curves:
- Wikipedia entry on the ROC curve
- The Magnificent ROC – a web page with interactive examples
- T. Fawcett (2004) ROC Graphs: Notes and Practical Considerations for Researchers, Technical report, Palo Alto, USA: HP Laboratories, ∼tom.fawcett/public html/papers/ROC101.pdf

Full-On Bayes
As a Bayesian, I (and other Bayesians) would say that by far the best approach would be to drop all the above nonsense and carry out a proper Bayesian analysis, whereupon we could get the quantities that we are actually interested in, eg:
- the actual probability that a null hypothesis is true
- the actual probability that a statistic lies in a particular interval.
Well, we would say that, wouldn’t we? And there are some technical issues (let alone “hysterical raisins”) which can cause problems for a Bayesian. Perhaps some sort of half-way house is possible?

Odds Form of Bayes Theorem
Suppose you are trying to decide on P(G|E) (probability of guilt given some evidence) and/or P(¬G|E) (probability of innocence given some evidence). It is often considered better to use Bayes Theorem in the “odds” form in this context:

P(G|E) / P(¬G|E) = [P(E|G) / P(E|¬G)] × [P(G) / P(¬G)]

ie Posterior Odds = Likelihood Ratio × Prior Odds. This also applies to the case where we are considering the probability of an intrusion given an alarm. The Likelihood Ratio is also known as the Odds Ratio or the weight-of-evidence.
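The odds form is easy to compute with. In this sketch all the numbers (likelihood ratio, prior) are illustrative assumptions of mine, not figures from any real case:

```python
# Posterior odds = likelihood ratio x prior odds.
def posterior_odds(p_e_given_g, p_e_given_not_g, prior_g):
    likelihood_ratio = p_e_given_g / p_e_given_not_g
    prior_odds = prior_g / (1 - prior_g)
    return likelihood_ratio * prior_odds

# Strong evidence (LR = 10^6) against a weak prior (1 in 100,000):
odds = posterior_odds(1.0, 1e-6, 1e-5)
p_g = odds / (1 + odds)  # convert odds back to a probability
print(round(p_g, 3))  # ~0.909
```

Even with evidence a million times more likely under guilt than innocence, a sufficiently small prior leaves a non-trivial probability of innocence – which is exactly the prosecutor’s fallacy in reverse.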

Critical Prior Interval
Matthews suggests using something called the Critical Prior Interval (CPI) together with confidence intervals to help judge the true significance of a result. See R. A. J. Matthews, Why should clinicians care about Bayesian methods? (which also contains a good discussion of the pitfalls of p-values and CIs). Alternatively: R. A. J. Matthews, Methods for assessing the credibility of clinical trial outcomes, Drug Inf Journal 35 (4) 1469-1478.

Bayes Theorem and the British Courts

Bayes Theorem and British Courts
Regina v Denis John Adams
Adams was charged with rape when his DNA, collected in response to another offence, was found to match that stored in connection with the rape case. This was the only evidence offered by the prosecution. The defence offered a number of pieces of evidence that argued for his innocence. Because the prosecution presented the DNA match probability as a precise number (1 in 200 million – though the actual figure was challenged), the defence argued that the jury needed to utilise Bayes Theorem in order to combine the various pieces of evidence correctly. The defence were allowed to attempt to school the jury in the use of Bayes Theorem using a questionnaire. The jury found Adams guilty.

Regina v Denis John Adams (cont)
A number of appeals were made on the basis that the trial judges had not properly directed the jury. These all failed, with the appeals court eventually finding that “Bayesian inference is not a suitable method for juries to use to take non-scientific evidence into account.” The use of Bayesian arguments has, apparently, not been formally banned from court.

Regina vs Sally Clark
Sally Clark was accused of murdering her two young children. The defence argued that they died from SIDS (or other non-specified natural causes). A prosecution expert witness stated that the chance of two SIDS deaths in one family was 1 in 73 million. Clark was found guilty. It is widely assumed that the “1 in 73 million” figure was interpreted by the jury as giving the probability of Clark’s innocence. An initial appeal was rejected on the basis that the case against Clark was overwhelming.

Regina vs Sally Clark (cont)
There was widespread condemnation of the verdict from statisticians. First, the expert witness’s figure was based on an invalid calculation (he assumed independence between the deaths). Secondly, the figure was used inappropriately – (crudely) one needed to compare the probability of two SIDS deaths with the probability of two murders. Both very small numbers! Clark’s conviction was eventually overturned when undisclosed evidence of possible respiratory infection in the second child came to light.

Readings 1
Stefan Axelsson: The Base Rate Fallacy and its Implications for the Difficulty of Intrusion Detection, Proceedings of the 6th ACM Conference on Computer and Communications Security, pp. 1-7, November 1-4, 1999
Emilie Lundin, Erland Jonsson (2002): Survey of Intrusion Detection Research
Stefan Axelsson (2003): Intrusion Detection Systems: A Survey and Taxonomy

Readings 2
Linda Ackerman: Biometrics and Airport Security (a good summary of the problems)
Anil K. Jain, Sharath Pankanti: Biometrics Systems: Anatomy of Performance, Biometrics Research at MSU Technical Report 2000: MSU-CSE-00-20
Steven A. Sloman, Lila Slovak: Frequency Illusions and Other Fallacies
John K. Kruschke: The Role of Base Rates in Category Learning
