Bayes_1ppt - Reza Shadmehr home page

							         580.691 Learning Theory
             Reza Shadmehr


           Bayesian learning 1:
Bayes rule, priors and maximum a posteriori
                     Frequentist vs. Bayesian Statistics

Frequentist Thinking
• True parameter: $w^*$
• Estimate of this parameter: $\hat{w}$
• Bias: $E[\hat{w}] - w^*$
• Variance: $\mathrm{var}[\hat{w}]$
• There are many different ways in which we can come up with estimates (e.g. the Maximum Likelihood estimate), and we can evaluate them.

Bayesian Thinking
• Does not have the concept of a true parameter. Rather, at every given time we have knowledge about $w$ (the prior), gain new data, and then update our knowledge using Bayes rule (the posterior).

$$ \underbrace{p(w \mid D)}_{\text{posterior distr.}} = \frac{\overbrace{p(w)}^{\text{prior distr.}} \; \overbrace{p(D \mid w)}^{\text{conditional distr.}}}{p(D)} = \frac{p(w, D)}{\int p(w, D)\, dw} $$

• Given Bayes rule, there is only ONE correct way of learning.
                           Binomial distribution and discrete random variables
        Suppose a random variable can take only one of two values (e.g., 0 and 1,
        success and failure, etc.). Such trials are termed Bernoulli trials.

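$$ x \in \{0, 1\}, \qquad P(x = 1) = \theta, \qquad P(x = 0) = 1 - \theta $$

$$ \mathbf{x} = \{ x^{(1)}, x^{(2)}, \dots, x^{(N)} \} $$

Probability density or distribution:

$$ p(x^{(1)}) = \theta^{x^{(1)}} (1 - \theta)^{1 - x^{(1)}} $$

Probability distribution of a specific sequence of successes and failures:

$$ p(\mathbf{x}) = \theta^{x^{(1)}} (1 - \theta)^{1 - x^{(1)}} \; \theta^{x^{(2)}} (1 - \theta)^{1 - x^{(2)}} \cdots \theta^{x^{(N)}} (1 - \theta)^{1 - x^{(N)}} $$

$n$ = number of times the trial succeeded:

$$ n = \sum_{i=1}^{N} x^{(i)} $$

$$ p(n) = \binom{N}{n} \theta^{n} (1 - \theta)^{N - n} = \frac{N!}{n!\,(N - n)!}\, \theta^{n} (1 - \theta)^{N - n} $$

$$ E[n] = N\theta $$

$$ \mathrm{var}[n] = N\theta(1 - \theta) $$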
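
As a quick numerical check of the binomial formulas on this slide, the following Python sketch evaluates p(n) for every possible n and confirms that the probabilities sum to one and that the mean and variance match N times the success probability and N times the success probability times its complement. The function name `binomial_pmf` and the example values N = 10 and a success probability of 0.3 are illustrative assumptions, not part of the original slides.

```python
from math import comb

def binomial_pmf(n, N, theta):
    """P(n successes in N Bernoulli trials), each succeeding with probability theta."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

# Assumed example values (not from the slides).
N, theta = 10, 0.3

pmf = [binomial_pmf(n, N, theta) for n in range(N + 1)]
total = sum(pmf)                                          # should be 1
mean = sum(n * p for n, p in enumerate(pmf))              # should equal N * theta
var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf)) # should equal N * theta * (1 - theta)

print(total, mean, var)
```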
        Poor performance of ML estimators with small data samples
• Suppose we have a coin and wish to estimate the outcome (head or tail) from observing a series of coin tosses. $\theta$ = probability of tossing a head.
• After observing $n$ coin tosses, we note that $D = \{ x^{(1)}, \dots, x^{(n)} \}$, out of which $h$ trials are heads.
• To estimate whether the next toss will be head or tail, we form an ML estimator:

$$ L(\theta) = p(x^{(1)}, \dots, x^{(n)} \mid \theta) \qquad \text{(probability of observing a particular sequence of heads and tails in } D\text{)} $$

$$ = p(x^{(1)} \mid \theta)\, p(x^{(2)} \mid \theta) \cdots p(x^{(n)} \mid \theta) $$

$$ = \theta^{h} (1 - \theta)^{n - h} $$

$$ \log L(\theta) = h \log \theta + (n - h) \log(1 - \theta) $$

$$ \frac{d}{d\theta} \log L(\theta) = \frac{h}{\theta} - \frac{n - h}{1 - \theta} = 0 $$

$$ \theta_{ML} = \frac{h}{n} $$
• After one toss, if it comes up tails, our ML estimate predicts zero probability of
seeing heads. If the first n tosses are all tails, the ML estimate continues to predict
zero probability of seeing heads.
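
The small-sample problem is easy to reproduce. The sketch below computes the ML estimate h/n from a list of tosses, encoded as 1 for a head and 0 for a tail; the function name `ml_estimate` and the example sequences are assumptions made for illustration.

```python
def ml_estimate(tosses):
    """ML estimate of the probability of a head: the fraction of heads (1s) observed."""
    if not tosses:
        raise ValueError("need at least one toss")
    return sum(tosses) / len(tosses)

# A single tail, or any run of tails, gives an estimate of exactly zero.
print(ml_estimate([0]))           # 0.0
print(ml_estimate([0, 0, 0, 0]))  # 0.0
print(ml_estimate([1, 0, 1, 1]))  # 0.75
```

With only tails in the data, the estimate rules out heads completely, no matter how few tosses were observed.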
            Including prior knowledge into the estimation process
• Even though the ML estimator might say $\theta_{ML} = 0$, we “know” that the coin can come up both heads and tails, i.e. $\theta > 0$.
• Our starting point is that $\theta$ is not just a number: we will give $\theta$ a full probability distribution function.
• Suppose we know that the coin is either fair ($\theta = 0.5$) with prob. $p$ or in favor of tails ($\theta = 0.4$) with probability $1 - p$.
• We want to combine this prior knowledge with new data $D$ (i.e. the number of heads in $n$ throws) to arrive at a posterior distribution for $\theta$. We will apply Bayes rule:
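$$ \underbrace{p(\theta \mid D)}_{\text{posterior distr.}} = \frac{p(\theta, D)}{p(D)} = \frac{\overbrace{p(\theta)}^{\text{prior distr.}} \; \overbrace{p(D \mid \theta)}^{\text{conditional distr.}}}{\int p(\theta)\, p(D \mid \theta)\, d\theta} $$

When $\theta$ can take only discrete values $\theta_i$, the integral in the denominator becomes a sum:

$$ p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{\sum_{i} p(\theta_i)\, p(D \mid \theta_i)} $$

 The numerator is just the joint distribution of $\theta$ and $D$, evaluated at a particular $D$.
 The denominator is the marginal distribution of the data $D$; that is, it is just a
 number that normalizes the numerator so that the posterior integrates to one.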
                     Bayesian estimation for a potentially biased coin
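     • Suppose that we believe that the coin is either fair, or that it is biased toward tails. $\theta$ = probability of tossing a head. After observing $n$ coin tosses, we note that $D = \{ x^{(1)}, \dots, x^{(n)} \}$, out of which $h$ trials are heads.

$$ p(\theta) = \begin{cases} p & \text{for } \theta = 0.5 \\ 1 - p & \text{for } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad p(D \mid \theta) = \theta^{h} (1 - \theta)^{n - h} $$

$$ P(\theta = 0.5 \mid D) = \frac{p \cdot 0.5^{h} \cdot 0.5^{n-h}}{p \cdot 0.5^{h} \cdot 0.5^{n-h} + (1 - p) \cdot 0.4^{h} \cdot 0.6^{n-h}} = \frac{p \cdot 0.5^{n}}{p \cdot 0.5^{n} + (1 - p) \cdot 0.4^{h} \cdot 0.6^{n-h}} $$

$$ P(\theta = 0.4 \mid D) = \frac{(1 - p) \cdot 0.4^{h} \cdot 0.6^{n-h}}{p \cdot 0.5^{n} + (1 - p) \cdot 0.4^{h} \cdot 0.6^{n-h}} $$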

Now we can accurately calculate the probability that we have a fair coin, given some data $D$. In
contrast to the ML estimate, which only gave us one number $\theta_{ML}$, we have here a full probability
distribution; that is, we also know how certain we are that we have a fair or unfair coin.

In some situations we would like a single number that represents our best guess of $\theta$. One
possibility for this best guess is the maximum a-posteriori (MAP) estimate.
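
The two-hypothesis posterior on this slide can be evaluated directly. The sketch below is a minimal version under the slide's assumptions (the coin has a head probability of either 0.5 or 0.4, and the prior probability of the fair coin is p); the function name `posterior_fair` and the example data are hypothetical.

```python
def posterior_fair(h, n, p):
    """P(coin is fair | h heads in n tosses), with prior P(fair) = p.

    The only alternative considered is a coin with head probability 0.4.
    """
    like_fair = 0.5 ** n                      # 0.5^h * 0.5^(n-h)
    like_biased = 0.4 ** h * 0.6 ** (n - h)
    evidence = p * like_fair + (1 - p) * like_biased
    return p * like_fair / evidence

# Assumed example: equal prior belief in both coins, 4 heads in 10 tosses.
p_fair = posterior_fair(h=4, n=10, p=0.5)
p_biased = 1.0 - p_fair   # the two posterior probabilities sum to one
print(p_fair, p_biased)
```

With no data (n = 0) the function returns the prior p unchanged, which is a useful sanity check on the normalization.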
                              Maximum a-posteriori estimate
  We define the MAP estimate as the maximum (i.e. mode) of the posterior
  distribution.

MAP estimator:

$$ \theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) $$

$$ \theta_{MAP} = \arg\max_{\theta} \left[ \log p(D \mid \theta) + \log p(\theta) \right] $$

The denominator $p(D)$ does not depend on $\theta$, so it can be dropped from the maximization.
The latter version makes the comparison to the maximum likelihood estimate
easy:

$$ \theta_{ML} = \arg\max_{\theta}\, p(D \mid \theta) = \arg\max_{\theta} \left[ \log p(D \mid \theta) \right] $$

$$ \theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta} \left[ \log p(D \mid \theta) + \log p(\theta) \right] $$

We see that ML and MAP are identical if $p(\theta)$ is a constant that does not
depend on $\theta$.
In that case our prior is a uniform distribution over the domain of $\theta$. We call
such a prior, for obvious reasons, a flat or uninformed prior.
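
The relationship between the two estimators can be illustrated with a brute-force search over candidate values of the head probability. This is only a sketch: the grid resolution, the data (3 heads in 10 tosses) and the non-flat prior proportional to theta times (1 - theta) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
from math import log

def log_likelihood(theta, h, n):
    """log p(D | theta) for a sequence with h heads in n tosses."""
    return h * log(theta) + (n - h) * log(1 - theta)

def argmax_on_grid(score, grid):
    """Return the grid point with the highest score."""
    return max(grid, key=score)

# Candidate values strictly inside (0, 1) so that the logarithms stay finite.
grid = [i / 1000 for i in range(1, 1000)]
h, n = 3, 10  # assumed data: 3 heads in 10 tosses

def flat_log_prior(theta):
    return 0.0  # log of a constant prior; any constant gives the same argmax

def peaked_log_prior(theta):
    # Assumed non-flat prior, proportional to theta * (1 - theta), peaked at 0.5.
    return log(theta) + log(1 - theta)

ml = argmax_on_grid(lambda t: log_likelihood(t, h, n), grid)
map_flat = argmax_on_grid(lambda t: log_likelihood(t, h, n) + flat_log_prior(t), grid)
map_peaked = argmax_on_grid(lambda t: log_likelihood(t, h, n) + peaked_log_prior(t), grid)

print(ml, map_flat, map_peaked)
```

The flat prior leaves the maximizer unchanged, while the peaked prior pulls the estimate from h/n toward 0.5.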
           Formulating a continuous prior for the coin toss problem
• In the last example the probability of tossing a head, represented by $\theta$, could
only be either 0.5 or 0.4. How should we choose a prior distribution if $\theta$ can be
anywhere between 0 and 1?
• Suppose we observed $n$ tosses. The probability that exactly $h$ of those
tosses were heads is:

$$ p(h \mid \theta) = \binom{n}{h} \theta^{h} (1 - \theta)^{n - h} = \frac{n!}{h!\,(n - h)!}\, \theta^{h} (1 - \theta)^{n - h} \qquad \text{(Binomial distribution)} $$

$\theta$ = probability of tossing a head

[Figure: binomial distribution $p(h \mid \theta)$ as a function of $h$ for $\theta = 0.5$, with $n = 10$ and $n = 20$.]
               Formulating a continuous prior for the coin toss problem
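    • $\theta$ represents the probability of a head. We want a continuous distribution that
    is defined between 0 and 1, and is 0 for $\theta = 0$ and $\theta = 1$.

$$ p(D \mid \theta) = \theta^{\alpha} (1 - \theta)^{n - \alpha} $$

$$ p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{n - \alpha} \qquad \text{(Beta distribution)} $$

$$ c = \int_{0}^{1} \theta^{\alpha} (1 - \theta)^{n - \alpha}\, d\theta \qquad \text{(normalizing constant)} $$

$\theta$ = probability of tossing a head

[Figure: two panels showing $p(\theta)$ as a function of $\theta$ from 0 to 1. Left panel: $\alpha = 4, n = 8$; $\alpha = 3, n = 6$; $\alpha = 2, n = 4$; $\alpha = 1, n = 2$. Right panel: $\alpha = 1$ with $n = 8$, $n = 6$, $n = 4$ and $n = 2$.]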
               Formulating a continuous prior for the coin toss problem
• In general, let's assume our knowledge comes in the form of a beta
distribution:

$$ p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{\beta}, \qquad c = \int_{0}^{1} \theta^{\alpha} (1 - \theta)^{\beta}\, d\theta $$

$$ p(D \mid \theta) = \theta^{h} (1 - \theta)^{n - h} $$

$$ p(\theta \mid D) = \frac{\frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{\beta}\, \theta^{h} (1 - \theta)^{n - h}}{\int_{0}^{1} \frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{\beta}\, \theta^{h} (1 - \theta)^{n - h}\, d\theta} = \frac{1}{d}\, \theta^{\alpha + h} (1 - \theta)^{\beta + n - h} $$

$$ d = \int_{0}^{1} \theta^{\alpha + h} (1 - \theta)^{\beta + n - h}\, d\theta $$

When we apply Bayes rule to integrate some old knowledge (the prior) in the form of a beta distribution with parameters $\alpha$ and $\beta$ with some new knowledge $h$ and $n$ (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, with parameters $\alpha + h$ and $\beta + n - h$.

Beta and binomial distributions are therefore called conjugate distributions.
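
The conjugate update can be checked numerically. The sketch below follows this slide's parameterization, in which the two parameters are the exponents of theta and (1 - theta); in the standard Beta(a, b) convention these would correspond to a - 1 and b - 1. It compares the closed-form posterior with a posterior obtained by applying Bayes rule and integrating the denominator numerically. All function names, the midpoint integration scheme and the example numbers are assumptions made for illustration.

```python
from math import gamma

def beta_norm(a, b):
    """Integral of theta^a (1-theta)^b over [0, 1] = Gamma(a+1) Gamma(b+1) / Gamma(a+b+2)."""
    return gamma(a + 1) * gamma(b + 1) / gamma(a + b + 2)

def beta_density(theta, a, b):
    """Density (1/c) theta^a (1-theta)^b with exponents a and b, as on the slide."""
    return theta ** a * (1 - theta) ** b / beta_norm(a, b)

def update(a, b, h, n):
    """Conjugate update: h heads in n tosses add h and n - h to the two exponents."""
    return a + h, b + n - h

def posterior_by_bayes(theta, a, b, h, n, steps=20000):
    """Posterior via Bayes rule, with the evidence computed by the midpoint rule."""
    def unnorm(t):
        return beta_density(t, a, b) * t ** h * (1 - t) ** (n - h)
    width = 1.0 / steps
    evidence = sum(unnorm((i + 0.5) * width) for i in range(steps)) * width
    return unnorm(theta) / evidence

# Assumed example: prior exponents (1, 1), then 3 heads in 10 tosses.
a_post, b_post = update(1, 1, h=3, n=10)
print(a_post, b_post)
print(beta_density(0.3, a_post, b_post), posterior_by_bayes(0.3, 1, 1, 3, 10))
```

The two printed densities should agree up to the error of the numerical integration, which is the content of the conjugacy result.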
                         MAP estimator for the coin toss problem
Let us look at the MAP estimator if we start with a prior of $\alpha = 1$, $n = 2$, i.e. we have
a slight belief in the fact that the coin is fair.

[Figure: the prior density $p(\theta)$ plotted over $\theta$ from 0 to 1; curve label: $n = 2$, $h = 1$.]

Our posterior is then:

$$ p(\theta \mid D) = \frac{1}{d}\, \theta^{h + 1} (1 - \theta)^{n + 1 - h} $$

Let's calculate the MAP estimate so that we can compare it to the ML estimate.

$$ \log p(\theta \mid D) = (h + 1) \log \theta + (n - h + 1) \log(1 - \theta) + \log \frac{1}{d} $$

$$ \frac{d}{d\theta} \log p(\theta \mid D) = \frac{h + 1}{\theta} - \frac{n - h + 1}{1 - \theta} = 0 $$

$$ \frac{1 - \theta}{\theta} = \frac{n - h + 1}{h + 1} $$

$$ \frac{1}{\theta} = \frac{n - h + 1 + h + 1}{h + 1} $$

$$ \theta_{MAP} = \frac{h + 1}{n + 2} $$

Note that after one toss, if we get a tail, our probability of tossing a head is 0.33, not
zero as in the ML case.
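
To make the comparison concrete, the sketch below evaluates the closed-form MAP estimate (h + 1)/(n + 2) next to the ML estimate h/n for a few assumed outcomes. The function names and the example data are hypothetical; the formula for the MAP estimate is the one derived on this slide for the weak "fair coin" prior.

```python
def map_estimate(h, n):
    """MAP estimate (h + 1) / (n + 2) under the slide's weak fair-coin prior."""
    return (h + 1) / (n + 2)

def ml_estimate(h, n):
    """ML estimate h / n."""
    return h / n

# Assumed example outcomes, given as (heads, tosses).
for h, n in [(0, 1), (0, 5), (3, 10), (300, 1000)]:
    print(h, n, ml_estimate(h, n), round(map_estimate(h, n), 4))
```

After a single tail the MAP estimate is 1/3 while the ML estimate is 0, and with many tosses the two estimates nearly coincide because the data outweigh the prior.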
                 Classification with a continuous conditional distribution
Assume you only know the height of a person, but not their gender. Can height
tell you something about gender?
Assume y=height and x=gender (0=male or 1=female).
     What we have: densities $p(y \mid x = 0)$ and $p(y \mid x = 1)$
     What we want: probability $P(x = 1 \mid y)$

[Figure: the two class-conditional densities $p(y \mid x = 1)$ and $p(y \mid x = 0)$ as functions of height $y$.]

$$ P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{\sum_{i=0}^{1} P(x = i)\, p(y \mid x = i)} $$

Height is normally distributed in the population of men and in the population of women,
with different means and similar variances. Let $x$ be an indicator variable for being
female. Then the conditional distribution of $y$ (the height) becomes:

$$ p(y \mid x = 1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (y - \mu_f)^2 \right) $$

$$ p(y \mid x = 0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (y - \mu_m)^2 \right) $$
                     Classification with a continuous conditional distribution
Let us further assume that we start with a prior distribution such that $x$ is 1
with probability $p$.

$$ P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{P(x = 1)\, p(y \mid x = 1) + P(x = 0)\, p(y \mid x = 0)} $$

$$ = \frac{p \exp\left( -\frac{1}{2\sigma^2} (y - \mu_f)^2 \right)}{p \exp\left( -\frac{1}{2\sigma^2} (y - \mu_f)^2 \right) + (1 - p) \exp\left( -\frac{1}{2\sigma^2} (y - \mu_m)^2 \right)} $$

$$ = \frac{1}{1 + \dfrac{(1 - p) \exp\left( -\frac{1}{2\sigma^2} (y - \mu_m)^2 \right)}{p \exp\left( -\frac{1}{2\sigma^2} (y - \mu_f)^2 \right)}} $$

$$ = \frac{1}{1 + \exp\left( \log\frac{1 - p}{p} - \frac{1}{2\sigma^2} \left[ (y - \mu_m)^2 - (y - \mu_f)^2 \right] \right)} $$

$$ = \frac{1}{1 + \exp\left( \log\frac{1 - p}{p} - \frac{1}{2\sigma^2} \left[ \mu_m^2 - \mu_f^2 - 2y(\mu_m - \mu_f) \right] \right)} $$

$$ = \frac{1}{1 + \exp\left( \boldsymbol{\theta}^T \mathbf{y} \right)} $$

$$ \boldsymbol{\theta} = \left[ \log\frac{1 - p}{p} - \frac{\mu_m^2 - \mu_f^2}{2\sigma^2}, \;\; \frac{\mu_m - \mu_f}{\sigma^2} \right]^T, \qquad \mathbf{y} = [1, \; y]^T $$

The posterior is a logistic function of a linear function of the data and parameters (remember this result in the section on classification!).

The maximum-likelihood argument would just have decided under which model the data would have been more likely.

The posterior distribution gives us the full probability that we have a male or female. We can also include prior knowledge in our scheme.
           Classification with a continuous conditional distribution
  Computing the probability that the subject is female, given that we
  observed height $y$:

$$ P(x = 1 \mid y) = \frac{1}{1 + \exp\left( \log\frac{1 - p}{p} - \frac{\mu_m^2 - \mu_f^2}{2\sigma^2} + \frac{\mu_m - \mu_f}{\sigma^2}\, y \right)} $$

$$ \mu_m = 176\ \text{cm}, \qquad \mu_f = 166\ \text{cm}, \qquad \sigma = 12\ \text{cm} $$

[Figure: posterior probability $P(x = 1 \mid y)$ as a function of height $y$ (axis range 120 to 220) for two prior probabilities, $P(x = 1) = 0.5$ and $P(x = 1) = 0.3$.]
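
The posterior for this example can be computed in two equivalent ways: directly from Bayes rule with the two Gaussian densities, or from the logistic form. The sketch below does both using the means and standard deviation given on this slide; the function and constant names are hypothetical, and the heights at which it is evaluated are assumed example values.

```python
from math import exp, log, pi, sqrt

# Values from the slide (heights in cm).
MU_M, MU_F, SIGMA = 176.0, 166.0, 12.0

def gauss(y, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def posterior_female_bayes(y, p):
    """P(x = 1 | y) by applying Bayes rule directly, with prior P(x = 1) = p."""
    num = p * gauss(y, MU_F, SIGMA)
    return num / (num + (1 - p) * gauss(y, MU_M, SIGMA))

def posterior_female_logistic(y, p):
    """The same posterior written as a logistic function of a linear function of y."""
    theta0 = log((1 - p) / p) - (MU_M ** 2 - MU_F ** 2) / (2 * SIGMA ** 2)
    theta1 = (MU_M - MU_F) / SIGMA ** 2
    return 1.0 / (1.0 + exp(theta0 + theta1 * y))

# Assumed example heights, evaluated for the two priors shown in the figure.
for p in (0.5, 0.3):
    for y in (150.0, 166.0, 171.0, 190.0):
        print(p, y, posterior_female_bayes(y, p), posterior_female_logistic(y, p))
```

With a prior of 0.5 the posterior crosses 0.5 at the midpoint of the two means (171 cm); lowering the prior to 0.3 shifts the whole curve toward shorter heights.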
                                      Summary
• Bayesian estimation involves the application of Bayes rule to combine a prior
density and a conditional density to arrive at a posterior density.

$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} $$

• Maximum a posteriori (MAP) estimation: If we need a “best guess” from our
posterior distribution, often the maximum of the posterior distribution is used.

$$ \theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) $$

• The MAP and ML estimates are identical when our prior is uniformly distributed
on $\theta$, i.e. is flat or uninformed.
• With a two-way classification problem and data that are Gaussian given the
category membership, the posterior is a logistic function of a linear function of the data.

$$ P(x = 1 \mid y) = \frac{1}{1 + \exp\left( \boldsymbol{\theta}^T \mathbf{y} \right)} $$
						