Tutorial 6
• Bias and variance of estimators
• The score and Fisher information
• Cramer-Rao inequality
236607 Visual Recognition Tutorial 1
Estimators and their Properties
• Let $\{p(x \mid \theta)\}_{\theta \in \Theta}$ be a parametric set of distributions.
Given a sample $D = x^{(n)} = (x_1, \ldots, x_n)$ drawn i.i.d. from one of
the distributions in the set, we would like to estimate its
parameter $\theta$ (thus identifying the distribution).
• An estimator for $\theta$ w.r.t. $D$ is any function $T(D)$.
Notice that an estimator is a random variable.
• How do we measure the quality of an estimator?
• Consistency: An estimator $T$ for $\theta$ is consistent if
$$T(x^{(n)}) \xrightarrow{\;p\;} \theta \quad \text{as } n \to \infty,$$
that is, if it converges to $\theta$ in probability.
This is a (desirable) asymptotic property that motivates us
to acquire large samples. But we should emphasize that we
are also interested in measures for finite (and small!)
sample sizes.
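As a rough illustration of consistency, the sketch below estimates by simulation the probability that the sample mean misses $\theta$ by more than a tolerance, for growing $n$. The choice of $N(\theta, 1)$ data, the tolerance `eps`, the number of trials and the seed are all assumptions made for the example, not part of the tutorial.

```python
import numpy as np

def exceed_probability(n, theta=2.0, eps=0.1, trials=2000, seed=0):
    """Monte Carlo estimate of P(|T(x^(n)) - theta| > eps) for the sample mean T."""
    rng = np.random.default_rng(seed)
    # Each row is one i.i.d. sample of size n drawn from N(theta, 1).
    x = rng.normal(loc=theta, scale=1.0, size=(trials, n))
    t = x.mean(axis=1)  # one estimate per sample
    return float(np.mean(np.abs(t - theta) > eps))

# The exceedance probability should shrink as n grows.
probs = {n: exceed_probability(n) for n in (10, 100, 1000)}
print(probs)
```

The probabilities are themselves noisy estimates, so only their overall downward trend is meaningful.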
Estimators and their Properties
• Bias: Define the bias of an estimator $\hat\theta$ to be $b(\hat\theta) = E_\theta[\hat\theta] - \theta$.
Here, the expectation is w.r.t. the distribution $p(x \mid \theta)$.
The estimator is unbiased if its bias is zero: $b(\hat\theta) = 0$.
• Example: the estimators $x_1$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for the mean
of a normal distribution are both unbiased.
The estimator $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ for its variance is biased,
whereas the estimator $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is unbiased.
• Variance: another important property of an estimator is its
variance $\mathrm{var}_{p(x \mid \theta)}(\hat\theta)$. We would like to find estimators with
minimum bias and variance.
• Which is more important, bias or variance?
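The bias and variance of the two variance estimators from the example can be checked by simulation. This is a minimal sketch; the values of $\sigma^2$, $n$, the number of trials and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000

# Each row is one i.i.d. sample of size n from N(0, sigma2).
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

var_biased = x.var(axis=1, ddof=0)    # (1/n) * sum (x_i - xbar)^2
var_unbiased = x.var(axis=1, ddof=1)  # (1/(n-1)) * sum (x_i - xbar)^2

# Empirical bias of each estimator; theory gives -sigma2/n and 0.
bias_biased = var_biased.mean() - sigma2
bias_unbiased = var_unbiased.mean() - sigma2

# Empirical variance of each estimator across the repeated samples.
spread_biased = var_biased.var()
spread_unbiased = var_unbiased.var()

print(bias_biased, bias_unbiased, spread_biased, spread_unbiased)
```

In this setup the biased estimator has the smaller variance, which is one reason the question of bias versus variance has no automatic answer.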
Risky Estimators
• Employ our decision-theoretic framework to measure the
quality of estimators.
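• Abbreviate $\hat\theta = T(x^{(n)})$ and consider the square error loss
function $\ell(\hat\theta, \theta) = (\hat\theta - \theta)^2$.
• The conditional risk associated with $\hat\theta$ when $\theta$ is the true
parameter:
$$R(\hat\theta \mid \theta) = E_\theta\big[(\hat\theta - \theta)^2\big] = \int (\hat\theta - \theta)^2 \, p(x^{(n)} \mid \theta) \, dx^{(n)}$$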
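• Claim: $R(\hat\theta \mid \theta) = \mathrm{var}(\hat\theta) + b(\hat\theta)^2$ = variance + bias$^2$
• Proof:
$$\begin{aligned}
E\big[(\hat\theta - \theta)^2\big] &= E\big[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2\big] \\
&= E\big[(\hat\theta - E\hat\theta)^2\big] + 2\,E\big[\hat\theta - E\hat\theta\big]\,(E\hat\theta - \theta) + (E\hat\theta - \theta)^2 \\
&= E\big[(\hat\theta - E\hat\theta)^2\big] + (E\hat\theta - \theta)^2 = \text{variance} + \text{bias}^2
\end{aligned}$$
The cross term vanishes because $E[\hat\theta - E\hat\theta] = 0$ and $E\hat\theta - \theta$ is a constant.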
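The decomposition of the risk into variance plus squared bias can be checked numerically. The sketch below uses a deliberately biased estimator, a shrunk sample mean $0.8\,\bar{x}$ for the mean of $N(\theta, 1)$; this estimator and all numeric settings are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 3.0, 10, 100_000

# Each row is one i.i.d. sample of size n from N(theta, 1).
x = rng.normal(theta, 1.0, size=(trials, n))

# A deliberately biased estimator: the sample mean shrunk toward zero.
theta_hat = 0.8 * x.mean(axis=1)

risk = np.mean((theta_hat - theta) ** 2)  # empirical squared-error risk
variance = theta_hat.var()                # empirical variance (ddof=0)
bias = theta_hat.mean() - theta           # empirical bias

print(risk, variance + bias ** 2)
```

With empirical moments the identity holds up to floating-point rounding, since it is the same algebra as in the proof applied to the sample averages.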
Bias vs. Variance
• So, for a given level of conditional risk, there is a tradeoff
between bias and variance.
• This tradeoff is among the most important facts in pattern
recognition and machine learning.
• Classical approach: Consider only unbiased estimators and
try to find those with minimum possible variance.
• This approach is not always fruitful:
– Unbiasedness only means that the average of the
estimator (w.r.t. $p(x \mid \theta)$) is $\theta$. It doesn't mean it will
be near $\theta$ for a particular sample (if variance is large).
– In general, an unbiased estimate is not guaranteed to
exist.
The Score
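• The score $v$ of the family $p(x \mid \theta)$ is the random variable
$$v = \frac{\partial}{\partial\theta} \ln p(x \mid \theta) = \frac{\partial p(x \mid \theta) / \partial\theta}{p(x \mid \theta)}$$
It measures the "sensitivity" of $p(x \mid \theta)$ as a function of the
parameter $\theta$.
• Claim: $E[v] = 0$
• Proof:
$$E[v] = \int \frac{\partial p(x \mid \theta) / \partial\theta}{p(x \mid \theta)} \, p(x \mid \theta) \, dx = \int \frac{\partial}{\partial\theta} p(x \mid \theta) \, dx = \frac{\partial}{\partial\theta} \int p(x \mid \theta) \, dx = \frac{\partial}{\partial\theta} 1 = 0$$
• Corollary: $\mathrm{var}[v] = E\big[(v - E[v])^2\big] = E[v^2]$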
The Score - Example
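• Consider the normal distribution $N(\theta, 1)$:
$$p(x \mid \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)$$
$$\ln p(x \mid \theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}(x - \theta)^2$$
$$v = \frac{\partial}{\partial\theta} \ln p(x \mid \theta) = x - \theta$$
• Clearly, $E[v] = E[x - \theta] = E[x] - \theta = 0$
• and $\mathrm{var}(v) = E[v^2] = E[(x - \theta)^2] = \sigma^2 = 1$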
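The two properties of the score for $N(\theta, 1)$, zero mean and unit variance, can be checked by drawing samples. A minimal sketch; the value of $\theta$, the sample size and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 1.5
x = rng.normal(theta, 1.0, size=500_000)  # draws from N(theta, 1)

v = x - theta           # score of N(theta, 1) evaluated at each draw
score_mean = v.mean()   # should be close to E[v] = 0
score_var = v.var()     # should be close to var(v) = 1

print(score_mean, score_var)
```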
The Score - Vector Form
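• In the case where $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, the score $v$ is
the vector whose $i$th component is
$$v_i = \frac{\partial}{\partial\theta_i} \ln p(x \mid \theta)$$
• Example:
$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$
$$\ln p(x \mid \mu, \sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x - \mu)^2$$
$$\frac{\partial}{\partial\mu} \ln p(x \mid \mu, \sigma) = \frac{x - \mu}{\sigma^2}$$
$$\frac{\partial}{\partial\sigma} \ln p(x \mid \mu, \sigma) = -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}$$
$$v = \left(\frac{x - \mu}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}\right)$$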
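One way to sanity-check the score vector of $N(\mu, \sigma^2)$ is to compare the closed-form derivatives with a central finite-difference approximation of the gradient of the log-density. The function names, the step size `h` and the test point below are hypothetical choices made for this sketch.

```python
import numpy as np

def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def score(x, mu, sigma):
    """Closed-form score vector (d/dmu, d/dsigma) of the log-density."""
    return np.array([(x - mu) / sigma ** 2,
                     -1.0 / sigma + (x - mu) ** 2 / sigma ** 3])

def numeric_score(x, mu, sigma, h=1e-6):
    """Central finite-difference approximation of the same two derivatives."""
    d_mu = (log_pdf(x, mu + h, sigma) - log_pdf(x, mu - h, sigma)) / (2 * h)
    d_sigma = (log_pdf(x, mu, sigma + h) - log_pdf(x, mu, sigma - h)) / (2 * h)
    return np.array([d_mu, d_sigma])

analytic = score(0.7, 0.2, 1.3)
numeric = numeric_score(0.7, 0.2, 1.3)
print(analytic, numeric)
```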
Fisher Information
• Fisher information: Designed to provide a measure of how
much information the parametric probability law $p(x \mid \theta)$
carries about the parameter $\theta$.
• An adequate definition of such information should possess
the following properties:
– The larger the sensitivity of $p(x \mid \theta)$ to changes in $\theta$, the
larger should be the information.
– The information should be additive: The information
carried by the combined law $p(x_1, x_2 \mid \theta)$ should be the
sum of those carried by $p(x_1 \mid \theta)$ and $p(x_2 \mid \theta)$.
– The information should be insensitive to the sign of the
change in $\theta$ and preferably positive.
– The information should be a deterministic quantity; it
should not depend on the specific random observation.
Fisher Information
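• Definition (scalar form): Fisher information (about $\theta$) is
the variance of the score:
$$J(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta} \ln p(x \mid \theta)\right)^2\right]$$
• Example: consider a random variable $x \sim N(\mu, \sigma^2)$:
$$\ln p(x \mid \mu, \sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x - \mu)^2$$
$$v = \frac{\partial}{\partial\mu} \ln p(x \mid \mu, \sigma) = \frac{x - \mu}{\sigma^2}$$
$$J(\mu) = E[v^2] = E\left[\frac{(x - \mu)^2}{\sigma^4}\right] = \frac{1}{\sigma^4} E\big[(x - \mu)^2\big] = \frac{1}{\sigma^2}$$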
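The value $J(\mu) = 1/\sigma^2$ can also be recovered by integrating $v^2\,p(x \mid \mu, \sigma)$ numerically. The sketch below uses a plain Riemann sum on a wide grid; the function name, grid width and resolution are assumptions chosen for illustration.

```python
import numpy as np

def fisher_information_mu(mu, sigma, num=200_001, width=12.0):
    """Approximate J(mu) = E[v^2] for N(mu, sigma^2) by a Riemann sum."""
    # Grid covering mu +/- width*sigma, where the density is effectively zero at the ends.
    x = np.linspace(mu - width * sigma, mu + width * sigma, num)
    dx = x[1] - x[0]
    pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    v = (x - mu) / sigma ** 2  # score with respect to mu
    return float(np.sum(v ** 2 * pdf) * dx)

J = fisher_information_mu(mu=1.0, sigma=2.0)
print(J, 1 / 2.0 ** 2)
```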
Fisher Information - Cntd.
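• Whenever $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, Fisher information
is the matrix $J(\theta) = \big[J_{i,j}(\theta)\big]$ where
$$J_{i,j}(\theta) = \mathrm{cov}_\theta\left(\frac{\partial}{\partial\theta_i} \log p(x \mid \theta),\; \frac{\partial}{\partial\theta_j} \log p(x \mid \theta)\right)$$
• Reminder:
$$\mathrm{cov}[X, Y] = E\big[(X - E[X])(Y - E[Y])\big]$$
• Remark: the Fisher information is only defined whenever
the distributions $p(x \mid \theta)$ satisfy some regularity conditions.
(For example, they should be differentiable w.r.t. $\theta_i$ and
all the distributions in the parametric family must have the
same support set.)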
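For the normal family parametrized by $(\mu, \sigma)$, the Fisher information matrix can be estimated as the sample covariance of the score vector. The sketch below compares that estimate with the closed form $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$, which is a standard result quoted here as an assumption since it is not derived in the tutorial; the parameter values, sample size and seed are also illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=400_000)

# Rows are the two score components, evaluated at every draw.
scores = np.stack([
    (x - mu) / sigma ** 2,                       # d/dmu log p
    -1.0 / sigma + (x - mu) ** 2 / sigma ** 3,   # d/dsigma log p
])

J_hat = np.cov(scores)  # 2x2 sample covariance of the score vector

# Standard closed form for N(mu, sigma^2) in the (mu, sigma) parametrization.
J_exact = np.array([[1 / sigma ** 2, 0.0],
                    [0.0, 2 / sigma ** 2]])

print(J_hat)
print(J_exact)
```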
Fisher Information - Cntd.
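• Claim: Let $x^{(n)} = (x_1, \ldots, x_n)$ be i.i.d. random variables $\sim p(x \mid \theta)$.
The score of $p(x^{(n)} \mid \theta)$ is the sum of the individual scores.
• Proof:
$$\begin{aligned}
v(x^{(n)}) &= \frac{\partial}{\partial\theta} \ln p(x^{(n)} \mid \theta) \\
&= \frac{\partial}{\partial\theta} \sum_i \ln p(x_i \mid \theta) \\
&= \sum_i \frac{\partial}{\partial\theta} \ln p(x_i \mid \theta) \\
&= \sum_i v(x_i)
\end{aligned}$$
• Example: If $x^{(n)} = (x_1, \ldots, x_n)$ are i.i.d. $\sim N(\mu, \sigma^2)$, the score is
$$\frac{\partial}{\partial\mu} \ln p(x^{(n)} \mid \mu, \sigma) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}$$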
Fisher Information - Cntd.
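• Based on $n$ i.i.d. samples, the Fisher information about $\theta$
is
$$\begin{aligned}
J_n(\theta) &= E\left[\left(\frac{\partial}{\partial\theta} \ln p(x^{(n)} \mid \theta)\right)^2\right] = E\big[v^2(x^{(n)})\big] = E\left[\left(\sum_{i=1}^{n} v(x_i)\right)^2\right] \\
&= \sum_{i=1}^{n} E\big[v^2(x_i)\big] = n J(\theta)
\end{aligned}$$
The cross terms $E[v(x_i)\,v(x_j)]$, $i \neq j$, vanish because the individual scores are independent and have zero mean.
• Thus, the Fisher information is additive w.r.t. i.i.d. random
variables.
• Example: Suppose $x^{(n)} = (x_1, \ldots, x_n)$ are i.i.d. $\sim N(\mu, \sigma^2)$. From the
previous example we know that the Fisher information
about the parameter $\mu$ based on one sample is $J(\mu) = 1/\sigma^2$.
Therefore, based on the entire sample, $J_n(\mu) = n/\sigma^2$.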
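Additivity can be illustrated by simulating the score of a whole sample and comparing its variance with $n/\sigma^2$. This is only a sketch; $\mu$, $\sigma$, $n$, the number of trials and the seed are assumed values.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 0.0, 2.0, 8, 200_000

# Each row is one i.i.d. sample of size n from N(mu, sigma^2).
x = rng.normal(mu, sigma, size=(trials, n))

# Score of the whole sample = sum of the individual scores (x_i - mu) / sigma^2.
sample_scores = ((x - mu) / sigma ** 2).sum(axis=1)

J_n_hat = sample_scores.var()  # Monte Carlo estimate of J_n(mu)
J_n_exact = n / sigma ** 2     # n * J(mu) with J(mu) = 1 / sigma^2

print(J_n_hat, J_n_exact)
```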
The Cramer-Rao Inequality
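• Theorem: Let $\hat\theta$ be an unbiased estimator for $\theta$. Then
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$$
• Proof: Using $E[v] = 0$ we have:
$$\begin{aligned}
E\big[(v - Ev)(\hat\theta - E\hat\theta)\big] &= E\big[v(\hat\theta - E\hat\theta)\big] \\
&= E[v\hat\theta] - E\hat\theta \, Ev \\
&= E[v\hat\theta]
\end{aligned}$$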
The Cramer-Rao Inequality - Cntd.
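• Now
$$\begin{aligned}
E[v\hat\theta] &= \int \frac{\partial p(x \mid \theta) / \partial\theta}{p(x \mid \theta)} \, \hat\theta \, p(x \mid \theta) \, dx \\
&= \int \frac{\partial}{\partial\theta} p(x \mid \theta) \, \hat\theta \, dx \\
&= \frac{\partial}{\partial\theta} \int p(x \mid \theta) \, \hat\theta \, dx \\
&= \frac{\partial}{\partial\theta} E[\hat\theta] = \frac{\partial\theta}{\partial\theta} = 1
\end{aligned}$$
where the last line uses the unbiasedness of $\hat\theta$.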
The Cramer-Rao Inequality - Cntd.
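• So, $E\big[(v - Ev)(\hat\theta - E\hat\theta)\big] = E[v\hat\theta] = 1$
• By the Cauchy-Schwarz inequality
$$\begin{aligned}
1 = \Big(E\big[(v - Ev)(\hat\theta - E\hat\theta)\big]\Big)^2 &\le E\big[(v - Ev)^2\big] \, E\big[(\hat\theta - E\hat\theta)^2\big] \\
&= E[v^2] \, \mathrm{var}(\hat\theta) \\
&= J(\theta) \, \mathrm{var}(\hat\theta)
\end{aligned}$$
• Therefore,
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$$
• For a biased estimator we have:
$$\mathrm{var}(\hat\theta) \ge \frac{\left(\frac{\partial}{\partial\theta} E[\hat\theta]\right)^2}{J(\theta)}$$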
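A small worked instance of the bound for a biased estimator, $\mathrm{var}(\hat\theta) \ge \big(\frac{\partial}{\partial\theta} E[\hat\theta]\big)^2 / J(\theta)$, may help. As an assumption made only for illustration, take the shrunk sample mean $\hat\mu_c = c\,\bar{x}$ with a fixed constant $c$, for $n$ i.i.d. observations from $N(\mu, \sigma^2)$, and use $J_n(\mu) = n/\sigma^2$:

```latex
\begin{aligned}
E[\hat\mu_c] &= c\,\mu, \qquad \frac{\partial}{\partial\mu} E[\hat\mu_c] = c, \\
\mathrm{var}(\hat\mu_c) &= c^2\,\frac{\sigma^2}{n}
  \;\ge\; \frac{c^2}{J_n(\mu)} = \frac{c^2}{n/\sigma^2} = c^2\,\frac{\sigma^2}{n}.
\end{aligned}
```

In this instance the bound holds with equality, and setting $c = 1$ recovers the unbiased case.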
The Cramer-Rao General Case
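• The Cramer-Rao inequality also holds in general (vector)
form: the error covariance matrix of an unbiased estimator $\hat{\boldsymbol\theta}$ is
bounded as follows:
$$C = E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] \ge J^{-1}(\boldsymbol\theta)$$
where the matrix inequality means that $C - J^{-1}(\boldsymbol\theta)$ is positive semidefinite.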
The Cramer-Rao Inequality - Cntd.
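• Example: Let $x^{(n)} = (x_1, \ldots, x_n)$ be i.i.d. $\sim N(\mu, \sigma^2)$. From the
previous example, $J_n(\mu) = n/\sigma^2$.
• Now let $\hat\mu(x^{(n)}) = \frac{1}{n}\sum_{i=1}^{n} x_i$ be an (unbiased) estimator for $\mu$.
$$\begin{aligned}
\mathrm{var}(\hat\mu) &= E\big[(\hat\mu - E\hat\mu)^2\big] = E[\hat\mu^2] - 2\mu E[\hat\mu] + \mu^2 = E[\hat\mu^2] - \mu^2 \\
&= \frac{1}{n^2} E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right] - \mu^2 = \frac{1}{n^2}\big(n\sigma^2 + n^2\mu^2\big) - \mu^2 \\
&= \sigma^2 / n
\end{aligned}$$
• So $\mathrm{var}(\hat\mu) = \sigma^2 / n$ matches the Cramer-Rao lower
bound.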
• Def: An unbiased estimator whose covariance meets the
Cramer-Rao lower bound is called efficient.
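The following sketch compares the simulated variance of the sample mean with the bound $\sigma^2/n$. For contrast it also simulates the sample median, which is a comparison added here as an assumption and not taken from the tutorial; all numeric settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, trials = 1.0, 3.0, 20, 100_000

# Each row is one i.i.d. sample of size n from N(mu, sigma^2).
x = rng.normal(mu, sigma, size=(trials, n))

mean_var = x.mean(axis=1).var()            # variance of the sample mean
median_var = np.median(x, axis=1).var()    # variance of the sample median
cramer_rao_bound = sigma ** 2 / n          # 1 / J_n(mu)

print(mean_var, median_var, cramer_rao_bound)
```

In this simulation the sample mean sits at the bound up to Monte Carlo error, while the median's variance is noticeably larger.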
Efficiency
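• Theorem (Efficiency): The unbiased estimator $\hat{\boldsymbol\theta}$ is
efficient, that is,
$$E\hat{\boldsymbol\theta} = \boldsymbol\theta, \qquad C = E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = J^{-1}(\boldsymbol\theta)$$
iff
$$J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$$
where $\boldsymbol\nu$ denotes the score vector.
• Proof (If): If $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$ then
$$E\big[J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t J^t(\boldsymbol\theta)\big] = J(\boldsymbol\theta)\, C\, J^t(\boldsymbol\theta) = E[\boldsymbol\nu \boldsymbol\nu^t] = J(\boldsymbol\theta)$$
meaning $C = J^{-1}(\boldsymbol\theta)$.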
Efficiency
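• Only if: Recall the cross covariance between $\boldsymbol\nu$ and $(\hat{\boldsymbol\theta} - \boldsymbol\theta)$:
$$E\big[\boldsymbol\nu (\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = I$$
The Cauchy-Schwarz inequality for random variables says
$$I = \Big(E\big[\boldsymbol\nu (\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big]\Big)^2 \le E[\boldsymbol\nu \boldsymbol\nu^t] \, E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = JC$$
For an efficient estimator $JC = I$, so equality holds, which requires $\hat{\boldsymbol\theta} - \boldsymbol\theta$ to be proportional to $\boldsymbol\nu$:
$$(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \alpha \boldsymbol\nu; \qquad C = \alpha^2 J; \qquad \alpha = J^{-1};$$
thus
$$J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$$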
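For the sample mean of $N(\mu, \sigma^2)$ data, the efficiency condition $J_n(\mu)(\hat\mu - \mu) = v$ can be checked directly on a single sample, since both sides are simple functions of the data. A minimal sketch with assumed parameter values and seed:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 0.3, 1.7, 50
x = rng.normal(mu, sigma, size=n)  # one i.i.d. sample from N(mu, sigma^2)

score = np.sum(x - mu) / sigma ** 2  # score of the whole sample w.r.t. mu
J_n = n / sigma ** 2                 # Fisher information of the whole sample
lhs = J_n * (x.mean() - mu)          # J_n(mu) * (mu_hat - mu)

print(lhs, score)
```

The two numbers agree up to rounding for every sample, not just on average, which is what the "iff" condition demands.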
Cramer-Rao Inequality and ML - Cntd.
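• Theorem: Suppose there exists an efficient estimator $\hat\theta$
for all $\theta$. Then the ML estimator $\hat\theta_{ML}$ is $\hat\theta$.
• Proof: By assumption
$$\mathrm{var}(\hat\theta) = \frac{1}{J(\theta)}$$
By the previous claim $\hat\theta - \theta = \dfrac{v}{J(\theta)}$, or
$$\frac{\partial}{\partial\theta} \log p(x \mid \theta) = J(\theta)(\hat\theta - \theta) \quad \text{for all } \theta$$
This holds at $\theta = \hat\theta_{ML}$, and since this is a maximum point
the left side is zero, so
$$\hat\theta = \hat\theta_{ML}$$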
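The theorem can be illustrated for the mean of $N(\mu, \sigma^2)$ with known $\sigma$, where the sample mean is efficient. The sketch below finds the maximum-likelihood estimate by a brute-force search over a grid of candidate values and compares it with the sample mean; the grid range and spacing, parameter values and seed are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma, n = 2.0, 1.5, 50
x = rng.normal(mu_true, sigma, size=n)  # one i.i.d. sample from N(mu_true, sigma^2)

# Candidate values of mu on a fine grid (spacing 2e-4).
grid = np.linspace(0.0, 4.0, 20_001)

# Log-likelihood of each candidate, up to an additive constant independent of mu.
log_lik = -((x[None, :] - grid[:, None]) ** 2).sum(axis=1) / (2 * sigma ** 2)

mu_ml = grid[np.argmax(log_lik)]  # grid approximation of the ML estimator
mu_eff = x.mean()                 # the efficient estimator (sample mean)

print(mu_ml, mu_eff)
```

The two estimates agree to within the grid spacing.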