2. Bayes Decision Theory


       Prof. A.L. Yuille
     Stat 231. Fall 2004.
    Decisions with Uncertainty
• Bayes Decision Theory is a theory for how
  to make decisions in the presence of
  uncertainty.
• Input data x.
• Salmon y= +1, Sea Bass y=-1.
• Learn decision rule: f(x) taking values +1 or -1.
        Decision Rule for Fish.
• Classify fish as
  Salmon or Sea Bass
  by decision rule f(x).
          Basic Ingredients.
• Assume there are probability distributions
  for generating the data.
• P(x|y=1) and P(x|y=-1).
• Loss function L(f(x),y) specifies the loss of
  making decision f(x) when true state is y.
• Distribution P(y). Prior probability on y.
• Joint Distribution P(x,y) = P(x|y) P(y).
           Minimize the Risk
• The risk of a decision rule f(x) is:
  $R(f) = \sum_{x,y} L(f(x), y)\, P(x, y)$
• Bayes Decision Rule f*(x):
  $f^* = \arg\min_f R(f)$
• The Bayes Risk:
  $R(f^*) = \min_f R(f)$
           Minimize the Risk.
• Write P(x,y) = P(y|x) P(x).
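• Then we can write the Risk as:
  $R(f) = \sum_x P(x) \sum_y L(f(x), y)\, P(y \mid x)$
• The best decision for input x is f*(x):
  $f^*(x) = \arg\min_{f(x)} \sum_y L(f(x), y)\, P(y \mid x)$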
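
The minimization over f(x) can be illustrated on a small discrete problem. The following is a minimal sketch, not part of the original slides: the three input values, the joint probabilities and the choice of 0-1 loss are all made-up assumptions, and the function names are hypothetical.

```python
# Toy illustration only: the probabilities below are invented.
LABELS = (+1, -1)  # +1 = salmon, -1 = sea bass

# Hypothetical joint distribution P(x, y) over three discrete inputs.
P_JOINT = {
    ("short", +1): 0.30, ("short", -1): 0.05,
    ("medium", +1): 0.15, ("medium", -1): 0.15,
    ("long", +1): 0.05, ("long", -1): 0.30,
}

def loss(decision, y):
    """Assumed 0-1 loss L(f(x), y): 1 for a wrong decision, 0 otherwise."""
    return 0.0 if decision == y else 1.0

def posterior(x):
    """P(y | x) = P(x, y) / P(x), with P(x) = sum over y of P(x, y)."""
    p_x = sum(P_JOINT[(x, y)] for y in LABELS)
    return {y: P_JOINT[(x, y)] / p_x for y in LABELS}

def bayes_rule(x):
    """f*(x): the decision with the smallest expected loss under P(y | x)."""
    post = posterior(x)
    return min(LABELS, key=lambda d: sum(loss(d, y) * post[y] for y in LABELS))

def risk(f):
    """R(f) = sum over (x, y) of L(f(x), y) P(x, y)."""
    return sum(loss(f(x), y) * p for (x, y), p in P_JOINT.items())

bayes_risk = risk(bayes_rule)
always_salmon_risk = risk(lambda x: +1)  # a rule that ignores x
```

Under these assumed numbers the rule that ignores the input has a higher risk than the Bayes rule, which is the sense in which f*(x) is the best decision for each x.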
                 Bayes Rule.
• Posterior distribution P(y|x):
  $P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$, where $P(x) = \sum_y P(x \mid y)\, P(y)$.
• Likelihood function P(x|y)
• Prior P(y).

• Bayes Rule has been controversial (historically)
  because of the Prior P(y) (subjective?).
• But in Bayes Decision Theory, everything starts
  from the joint distribution P(x,y).
• The Risk is based on averaging over all
  possible x & y. Average Loss.
• Alternatively, can try to minimize the worst
  risk over x & y. Minimax Criterion.

• This course uses the Risk, or average loss.
   Generative & Discriminative.
• Generative methods aim to determine probability
  models P(x|y) & P(y).

• Discriminative methods aim directly at estimating
  the decision rule f(x).

• Vapnik argues for Discriminative Methods: Don’t
  solve a harder problem than you need to. Only
  care about the probabilities near the decision boundary.
       Discriminant Functions.
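• For the two-category case the Bayes decision rule
  depends on the discriminant function:
  $g(x) = \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}$
• The Bayes decision rule is of form:
  $f^*(x) = +1$ if $g(x) > T$, and $f^*(x) = -1$ otherwise.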
• Where T is a threshold, which is determined by
  the loss function.
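
As a sketch of how the threshold T can follow from the loss function, assume the discriminant is the log posterior ratio g(x) = log P(y=1|x) / P(y=-1|x). Comparing the expected losses of the two decisions then gives T = log[(L(+1,-1) - L(-1,-1)) / (L(-1,+1) - L(+1,+1))]. The loss tables and function names below are invented for illustration.

```python
import math

def threshold(loss):
    """T = log[(L(+1,-1) - L(-1,-1)) / (L(-1,+1) - L(+1,+1))].

    loss[(d, y)] is the loss of deciding d when the true state is y.
    """
    return math.log((loss[(+1, -1)] - loss[(-1, -1)]) /
                    (loss[(-1, +1)] - loss[(+1, +1)]))

def decide_by_threshold(p_pos, loss):
    """Decide +1 when the log posterior ratio g exceeds T (p_pos = P(y=+1 | x))."""
    g = math.log(p_pos / (1.0 - p_pos))
    return +1 if g > threshold(loss) else -1

def decide_by_expected_loss(p_pos, loss):
    """Decide by comparing the two expected losses directly."""
    post = {+1: p_pos, -1: 1.0 - p_pos}
    exp_loss = {d: sum(loss[(d, y)] * post[y] for y in (+1, -1))
                for d in (+1, -1)}
    return +1 if exp_loss[+1] < exp_loss[-1] else -1

# Hypothetical losses: 0-1 loss, and an asymmetric loss in which deciding
# +1 when the truth is -1 costs four times as much as the opposite mistake.
ZERO_ONE = {(+1, +1): 0.0, (-1, -1): 0.0, (+1, -1): 1.0, (-1, +1): 1.0}
ASYM = {(+1, +1): 0.0, (-1, -1): 0.0, (+1, -1): 4.0, (-1, +1): 1.0}
```

With 0-1 loss the threshold is T = 0, so the rule picks the more probable state; with the asymmetric loss a posterior of 0.7 for y = +1 is no longer enough to decide +1.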
         Two-State Case
• Detect “target” or “non-target”.

• Let loss function pay a penalty of 1 for misclassification,
  0 otherwise.

• Risk becomes Error. Bayes Risk becomes Bayes Error.

• Error is the sum of false positives F+ (non-targets
  classified as targets) and false negatives F- (targets
  classified as non-targets).
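
The decomposition of the error into F+ and F- can be checked numerically. This is a minimal sketch on an invented discrete joint distribution; the detector reading x, the probabilities and the family of threshold rules are assumptions made for illustration. F+ and F- are computed as joint probabilities so that their sum is the total error.

```python
# Invented joint distribution P(x, y) over a discrete detector reading
# x in {0, ..., 4}; y = +1 is "target", y = -1 is "non-target".
P_JOINT = {
    (0, -1): 0.20, (1, -1): 0.15, (2, -1): 0.08, (3, -1): 0.05, (4, -1): 0.02,
    (0, +1): 0.02, (1, +1): 0.05, (2, +1): 0.12, (3, +1): 0.15, (4, +1): 0.16,
}

def error_rates(rule):
    """Return (F+, F-) for a decision rule under 0-1 loss."""
    false_pos = sum(p for (x, y), p in P_JOINT.items()
                    if y == -1 and rule(x) == +1)
    false_neg = sum(p for (x, y), p in P_JOINT.items()
                    if y == +1 and rule(x) == -1)
    return false_pos, false_neg

def threshold_rule(t):
    """Decide "target" when the reading is at least t."""
    return lambda x: +1 if x >= t else -1

# Total error F+ + F- for each candidate threshold.
errors = {t: sum(error_rates(threshold_rule(t))) for t in range(6)}
best_t = min(errors, key=errors.get)
```

For these numbers the lowest error is reached at the threshold where P(x, y=+1) first exceeds P(x, y=-1), which matches what the Bayes rule under 0-1 loss would choose.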
            Gaussian Example: 1
• Is a bright light flashing?

• n is the number of photons emitted by the dim or bright light.
              Gaussian Example: 2
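• $P(n \mid \text{dim})$ and $P(n \mid \text{bright})$ are Gaussians with
  means $\mu_D$, $\mu_B$ and s.d. $\sigma_D$, $\sigma_B$.
• Bayes decision rule selects “dim” if $P(\text{dim} \mid n) > P(\text{bright} \mid n)$;
  otherwise it selects “bright”.
• Errors: “dim” classified as “bright”, and “bright” classified as “dim”.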
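
A worked instance of this example can be sketched as follows. The means, the common standard deviation and the equal priors are hypothetical values chosen for illustration, not taken from the slides. With equal s.d. and equal priors the two posteriors cross at the midpoint of the means, so the rule reduces to a threshold on n.

```python
import math

def normal_cdf(n, mu, sigma):
    """Cumulative distribution function of a Gaussian, via math.erf."""
    return 0.5 * (1.0 + math.erf((n - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical parameters: photon-count means and a shared s.d.
MU_DIM, MU_BRIGHT, SIGMA = 50.0, 70.0, 10.0
PRIOR_DIM = PRIOR_BRIGHT = 0.5

# Equal s.d. and equal priors: the posteriors are equal at the midpoint.
n_star = 0.5 * (MU_DIM + MU_BRIGHT)

def decide(n):
    """Bayes rule under 0-1 loss for the assumed parameters."""
    return "dim" if n < n_star else "bright"

# Treating "bright" as the target:
# false positives are dim lights classified as bright,
# false negatives are bright lights classified as dim.
false_pos = PRIOR_DIM * (1.0 - normal_cdf(n_star, MU_DIM, SIGMA))
false_neg = PRIOR_BRIGHT * normal_cdf(n_star, MU_BRIGHT, SIGMA)
bayes_error = false_pos + false_neg
```

Here the threshold sits one standard deviation from each mean, so the Bayes error is the Gaussian tail mass beyond one s.d., roughly 0.159.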
      Example: Multidimensional
       Gaussian Distributions.
• Suppose the two classes have Gaussian
  distributions for P(x|y).
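• Different means $\mu_{+1}$, $\mu_{-1}$
  but same covariance $\Sigma$.
• The discriminant function $g(x) = \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}$ is linear in x, so the decision boundary is a plane:
  $g(x) = (\mu_{+1} - \mu_{-1})^T \Sigma^{-1} x - \tfrac{1}{2}\left(\mu_{+1}^T \Sigma^{-1} \mu_{+1} - \mu_{-1}^T \Sigma^{-1} \mu_{-1}\right) + \log \frac{P(y=1)}{P(y=-1)}$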

• Alternatively, seek a planar decision rule without
  attempting to model the distributions.
• Only care about the data near the decision boundary.
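
The claim that equal covariances give a planar discriminant can be checked numerically. The sketch below works in two dimensions with invented means and covariance and assumes equal priors, so the log prior ratio vanishes and the discriminant is w·x + b with w = Σ⁻¹(μ₊ − μ₋) and b = −½(μ₊ᵀΣ⁻¹μ₊ − μ₋ᵀΣ⁻¹μ₋). All names and values are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Invented class means and shared covariance.
MU_POS, MU_NEG = [2.0, 1.0], [0.0, 0.0]
COV = [[2.0, 0.5], [0.5, 1.0]]
COV_INV = inv2(COV)

def log_gauss(x, mu):
    """Log density of a 2-D Gaussian with mean mu and covariance COV."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return (-0.5 * dot(d, matvec(COV_INV, d))
            - 0.5 * math.log((2.0 * math.pi) ** 2 * det2(COV)))

# Plane parameters from the closed form.
w = matvec(COV_INV, [MU_POS[0] - MU_NEG[0], MU_POS[1] - MU_NEG[1]])
b = -0.5 * (dot(MU_POS, matvec(COV_INV, MU_POS))
            - dot(MU_NEG, matvec(COV_INV, MU_NEG)))

def g_plane(x):
    """Linear discriminant w.x + b."""
    return dot(w, x) + b

def g_direct(x):
    """Log-likelihood ratio computed from the two densities."""
    return log_gauss(x, MU_POS) - log_gauss(x, MU_NEG)
```

The two functions agree at every point because the quadratic terms in x cancel when the covariances are equal; the midpoint between the means lies on the plane g(x) = 0.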
  Generative vs. Discriminant.
• The Generative approach will attempt to
  estimate the Gaussian distributions from
  data – and then derive the decision rule.

• The Discriminant approach will seek to
  estimate the decision rule directly by
  learning the discriminant plane.

• In practice, we will not know the form of the
  distributions or the form of the discriminant.
• Gaussian Case with unequal covariance.
 Discriminative Models & Features.
• In practice, the Discriminative methods are usually
  defined based on features extracted from the data. (E.g.
  length and brightness of fish).

• Calculate features z=h(x).

• Bayes Decision Theory says that this throws away information.
• Restrict to a sub-class of possible decision rules – those
  that can be expressed in terms of features z=h(x).
Bayes Decision Rule and Learning.
• Bayes Decision Theory assumes that we know,
  or can learn, the distributions P(x|y).
• This is often not practical, or extremely difficult.
• In real problems, you have a set of classified data.
• You can attempt to learn P(x|y=+1) & P(x|y=-1)
  from these (next few lectures).
• Parametric & Non-parametric approaches.
• Question: when do you have enough data to
  learn these probabilities accurately?
• Depends on the complexity of the model.
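
A parametric approach can be sketched as follows, assuming one-dimensional Gaussian class-conditional distributions. The "true" parameters, the sample size and the random seed are invented so that the example is reproducible; the estimates are the usual maximum-likelihood mean and standard deviation, plugged into the decision rule under 0-1 loss.

```python
import math
import random

def fit_gaussian(samples):
    """Maximum-likelihood mean and s.d. of a 1-D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    return mu, math.sqrt(var)

# Hypothetical data source: (mean, s.d.) for each class, used only to simulate.
TRUE = {+1: (70.0, 10.0), -1: (50.0, 10.0)}
rng = random.Random(0)
data = [(rng.gauss(*TRUE[y]), y) for y in (+1, -1) for _ in range(2000)]

# Estimate P(x | y) and P(y) from the classified data.
params = {y: fit_gaussian([x for x, lab in data if lab == y]) for y in (+1, -1)}
prior = {y: sum(1 for _, lab in data if lab == y) / len(data) for y in (+1, -1)}

def log_gauss(x, mu, sigma):
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def plug_in_rule(x):
    """Pick the class with the larger estimated log P(x | y) + log P(y)."""
    score = {y: log_gauss(x, *params[y]) + math.log(prior[y]) for y in (+1, -1)}
    return max(score, key=score.get)
```

With two parameters per class a few thousand samples are ample; a model with many more parameters would need correspondingly more data before the plug-in rule could be trusted.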
          Machine Learning.
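• Replace Risk by Empirical Risk:
  $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$, where $(x_i, y_i)$, $i = 1, \dots, N$, are the training data.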
• How does minimizing the empirical risk relate to
  minimizing the true risk?
• Key Issue: When can we generalize? Be
  confident that the decision rule we have learnt
  on the training data will yield good results on
  unseen data?
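
Empirical risk minimization and the generalization question can be illustrated with a small simulation. This is a sketch under invented assumptions: one-dimensional Gaussian data, 0-1 loss, and a hypothetical family of threshold rules. The rule is chosen on a training set and then evaluated on separate, unseen samples.

```python
import random

def empirical_risk(rule, data):
    """R_emp(f) = (1/N) * sum_i L(f(x_i), y_i), with 0-1 loss."""
    return sum(1 for x, y in data if rule(x) != y) / len(data)

def threshold_rule(t):
    """Hypothetical rule family: decide +1 when x exceeds t."""
    return lambda x: +1 if x > t else -1

rng = random.Random(1)

def sample(n):
    """Draw n labelled points from an invented two-class Gaussian source."""
    out = []
    for _ in range(n):
        y = rng.choice((+1, -1))
        out.append((rng.gauss(70.0 if y == +1 else 50.0, 10.0), y))
    return out

train, test = sample(500), sample(5000)

# Minimize the empirical risk over thresholds placed at the training points.
candidates = sorted(x for x, _ in train)
best_t = min(candidates, key=lambda t: empirical_risk(threshold_rule(t), train))

train_risk = empirical_risk(threshold_rule(best_t), train)
test_risk = empirical_risk(threshold_rule(best_t), test)
```

The training risk is optimistic by construction, since the threshold was tuned on those same points; the risk on the unseen set is the more honest estimate of the true risk.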
           Machine Learning
• Vapnik’s theory gives a mathematically elegant
  way of answering these issues.
• It assumes that the data is sampled from an
  unknown distribution.
• Vapnik’s theory gives bounds for when we can generalize.
• Unfortunately these bounds are very conservative.
• In practice, train on part of dataset and test on
  other part(s).
Extensions to Multiple Classes
• Conceptually straightforward – see Duda, Hart & Stork.

• The decision partitions the feature space into k subspaces $\Omega_1, \dots, \Omega_k$:

  $\Omega = \bigcup_{i=1}^{k} \Omega_i$, $\Omega_i \cap \Omega_j = \emptyset$ for $i \neq j$.


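
For k classes the same recipe applies: pick the decision with the smallest expected loss under the posterior. The sketch below assumes the posterior is already available as a list of k probabilities; the loss matrices and the function name are invented for illustration. Each input receives exactly one decision, which is why the resulting regions are disjoint and cover the feature space.

```python
def bayes_decision(posterior, loss):
    """Return the class index d minimizing sum_y loss[d][y] * posterior[y].

    posterior[y] is P(y | x) for y = 0, ..., k-1;
    loss[d][y] is the loss of deciding d when the true class is y.
    """
    classes = range(len(posterior))
    return min(classes,
               key=lambda d: sum(loss[d][y] * posterior[y] for y in classes))

# 0-1 loss for three classes: the rule reduces to picking the largest posterior.
ZERO_ONE = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0]]

# Hypothetical asymmetric loss: missing class 2 costs 5 instead of 1.
COSTLY_MISS = [[0.0, 1.0, 5.0],
               [1.0, 0.0, 5.0],
               [1.0, 1.0, 0.0]]

example_posterior = [0.2, 0.5, 0.3]
choice_zero_one = bayes_decision(example_posterior, ZERO_ONE)
choice_costly = bayes_decision(example_posterior, COSTLY_MISS)
```

With 0-1 loss the most probable class is chosen; raising the cost of missing class 2 moves the decision to class 2 even though its posterior is not the largest.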