Tutorial 6
• Bias and variance of estimators
• The score and Fisher information
• Cramer-Rao inequality
236607 Visual Recognition Tutorial
Estimators and their Properties
• Let $\{p(x|\theta)\}_{\theta \in \Theta}$ be a parametric set of distributions. Given a sample $D = x^{(n)} = (x_1, \ldots, x_n)$ drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution).
• An estimator for $\theta$ w.r.t. $D$ is any function $T(D)$. Notice that an estimator is a random variable.
• How do we measure the quality of an estimator?
• Consistency: An estimator $T$ for $\theta$ is consistent if $T(x^{(n)}) \xrightarrow{p} \theta$ as $n \to \infty$ (convergence in probability). This is a (desirable) asymptotic property that motivates us to acquire large samples. But we should emphasize that we are also interested in measures for finite (and small!) sample sizes.
Estimators and their Properties
• Bias: Define the bias of an estimator $\hat{\theta}$ to be $b(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta$. Here, the expectation is w.r.t. the distribution $p(x|\theta)$. The estimator is unbiased if its bias is zero, $b(\hat{\theta}) = 0$.
• Example: the estimators $x_1$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, for the mean of a normal distribution, are both unbiased. The estimator $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ for its variance is biased, whereas the estimator $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ is unbiased.
• Variance: another important property of an estimator is its variance $\mathrm{var}_{p(x|\theta)}(\hat{\theta})$. We would like to find estimators with minimum bias and variance.
• Which is more important, bias or variance?
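The bias of the two variance estimators in the example above can be checked by simulation. The following is a minimal sketch, not part of the original tutorial: the function names, the sample size n = 5, the true variance 4 and the number of trials are all assumptions chosen for illustration.

```python
import random

def variance_estimates(sample):
    """Return the (1/n, 1/(n-1)) variance estimates of one sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

def average_estimates(n=5, trials=20000, mu=0.0, sigma=2.0, seed=0):
    """Average both estimators over many i.i.d. normal samples of size n."""
    rng = random.Random(seed)
    total_biased = total_unbiased = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        biased, unbiased = variance_estimates(sample)
        total_biased += biased
        total_unbiased += unbiased
    return total_biased / trials, total_unbiased / trials

avg_biased, avg_unbiased = average_estimates()
# In expectation: (n-1)/n * sigma^2 = 3.2 for the 1/n estimator, sigma^2 = 4 for the 1/(n-1) one.
print(avg_biased, avg_unbiased)
```

The 1/n average should settle noticeably below the true variance, while the 1/(n-1) average should settle near it, up to Monte Carlo noise.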
Risky Estimators
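• Employ our decision-theoretic framework to measure the quality of estimators.
• Abbreviate $\hat{\theta} = T(x^{(n)})$ and consider the square error loss function $\lambda(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$
• The conditional risk associated with $\hat{\theta}$ when $\theta$ is the true parameter:
$$R(\hat{\theta}|\theta) = E_\theta\left[(\hat{\theta} - \theta)^2\right] = \int (\hat{\theta} - \theta)^2 \, p(x^{(n)}|\theta) \, dx^{(n)}$$
• Claim: $R(\hat{\theta}|\theta) = \mathrm{var}(\hat{\theta}) + b^2(\hat{\theta})$ = variance + bias$^2$
• Proof:
$$E(\hat{\theta} - \theta)^2 = E(\hat{\theta} - E\hat{\theta} + E\hat{\theta} - \theta)^2$$
$$= E(\hat{\theta} - E\hat{\theta})^2 + 2\,E\left[\hat{\theta} - E\hat{\theta}\right](E\hat{\theta} - \theta) + (E\hat{\theta} - \theta)^2$$
$$= E(\hat{\theta} - E\hat{\theta})^2 + (E\hat{\theta} - \theta)^2 = \text{variance} + \text{bias}^2$$
The middle term vanishes because $E[\hat{\theta} - E\hat{\theta}] = 0$.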
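The decomposition of the squared-error risk into variance plus squared bias can be verified numerically. This is a rough sketch under assumptions of my own: the estimator `shrunk_mean` (0.8 times the sample mean, deliberately biased), the true parameter 2.0 and the unit-variance normal model are illustrative choices, not taken from the tutorial.

```python
import random

def shrunk_mean(sample):
    """A deliberately biased estimator of the mean: 0.8 * sample mean."""
    return 0.8 * sum(sample) / len(sample)

def risk_decomposition(estimator, theta, n=10, trials=5000, seed=1):
    """Empirical risk, variance and bias of `estimator` on N(theta, 1) samples."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(theta, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    mean_est = sum(estimates) / trials
    risk = sum((t - theta) ** 2 for t in estimates) / trials
    variance = sum((t - mean_est) ** 2 for t in estimates) / trials
    bias = mean_est - theta
    return risk, variance, bias

risk, variance, bias = risk_decomposition(shrunk_mean, theta=2.0)
# The two printed numbers agree up to floating-point rounding.
print(risk, variance + bias ** 2)
```

With empirical averages in place of expectations the identity holds exactly (up to rounding), because the algebra of the proof applies to any averaging operator.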
Bias vs. Variance
• So, for a given level of conditional risk, there is a tradeoff between bias and variance.
• This tradeoff is among the most important facts in pattern recognition and machine learning.
• Classical approach: Consider only unbiased estimators and try to find those with minimum possible variance.
• This approach is not always fruitful:
– Unbiasedness only means that the average of the estimator (w.r.t. $p(x|\theta)$) is $\theta$. It doesn't mean the estimate will be near $\theta$ for a particular sample (if the variance is large).
– In general, an unbiased estimator is not guaranteed to exist.
The Score
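• The score $v$ of the family $p(x|\theta)$ is the random variable
$$v = \frac{\partial}{\partial\theta} \ln p(x|\theta) = \frac{\partial p(x|\theta) / \partial\theta}{p(x|\theta)}$$
It measures the "sensitivity" of $p(x|\theta)$ as a function of the parameter $\theta$.
• Claim: $E[v] = 0$
• Proof:
$$E[v] = \int \frac{\partial p(x|\theta)/\partial\theta}{p(x|\theta)} \, p(x|\theta) \, dx = \int \frac{\partial}{\partial\theta} p(x|\theta) \, dx = \frac{\partial}{\partial\theta} \int p(x|\theta) \, dx = \frac{\partial}{\partial\theta} 1 = 0$$
• Corollary: $\mathrm{var}[v] = E(v - E[v])^2 = E[v^2]$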
The Score - Example
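• Consider the normal distribution $N(\mu, 1)$:
$$p(x|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \mu)^2\right)$$
$$\ln p(x|\mu) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}(x - \mu)^2$$
$$v = \frac{\partial}{\partial\mu} \ln p(x|\mu) = x - \mu$$
• clearly, $E[v] = E[x - \mu] = E[x] - \mu = 0$
• and $\mathrm{var}(v) = E[v^2] = E[(x - \mu)^2] = \sigma^2 = 1$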
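As a sanity check of this example, the score of $N(\mu, 1)$ can be compared against a finite-difference derivative of the log-density, and its mean and variance estimated by sampling. This is a minimal sketch; the helper names, the value $\mu = 1.5$ and the step size `h` are assumptions for illustration.

```python
import math
import random

def log_density(x, mu):
    """Log-density of N(mu, 1) at x."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def score(x, mu):
    """Analytic score: derivative of log_density w.r.t. mu."""
    return x - mu

def numeric_score(x, mu, h=1e-5):
    """Central finite-difference approximation of the score."""
    return (log_density(x, mu + h) - log_density(x, mu - h)) / (2 * h)

rng = random.Random(2)
mu = 1.5
scores = [score(rng.gauss(mu, 1.0), mu) for _ in range(50000)]
mean_v = sum(scores) / len(scores)
var_v = sum((v - mean_v) ** 2 for v in scores) / len(scores)
# mean_v should be close to 0 and var_v close to 1, up to sampling noise.
print(mean_v, var_v)
```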
The Score - Vector Form
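• In case where $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, the score $v$ is the vector whose $i$th component is
$$v_i = \frac{\partial}{\partial\theta_i} \ln p(x|\theta)$$
• Example:
$$p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$
$$\ln p(x|\mu, \sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x - \mu)^2$$
$$\frac{\partial}{\partial\mu} \ln p(x|\mu, \sigma) = \frac{x - \mu}{\sigma^2}$$
$$\frac{\partial}{\partial\sigma} \ln p(x|\mu, \sigma) = -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}$$
$$v = \left(\frac{x - \mu}{\sigma^2}, \; -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}\right)$$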
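The two components of the score vector for the normal family with parameters $(\mu, \sigma)$ can be checked against numerical partial derivatives of the log-density. The sketch below is illustrative only; the function names, the test point and the step size `h` are assumptions.

```python
import math

def log_density(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def score_vector(x, mu, sigma):
    """Analytic score: partial derivatives w.r.t. mu and sigma."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
    return d_mu, d_sigma

def numeric_score_vector(x, mu, sigma, h=1e-6):
    """Central finite differences in each parameter separately."""
    d_mu = (log_density(x, mu + h, sigma) - log_density(x, mu - h, sigma)) / (2 * h)
    d_sigma = (log_density(x, mu, sigma + h) - log_density(x, mu, sigma - h)) / (2 * h)
    return d_mu, d_sigma

analytic = score_vector(2.0, 0.5, 1.5)
numeric = numeric_score_vector(2.0, 0.5, 1.5)
# The two tuples should agree to several decimal places.
print(analytic, numeric)
```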
Fisher Information
• Fisher information: Designed to provide a measure of how much information the parametric probability law $p(x|\theta)$ carries about the parameter $\theta$.
• An adequate definition of such information should possess the following properties:
– The larger the sensitivity of $p(x|\theta)$ to changes in $\theta$, the larger the information should be.
– The information should be additive: the information carried by the combined law $p(x_1, x_2|\theta)$ should be the sum of those carried by $p(x_1|\theta)$ and $p(x_2|\theta)$.
– The information should be insensitive to the sign of the change in $\theta$, and preferably positive.
– The information should be a deterministic quantity; it should not depend on the specific random observation.
Fisher Information
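• Definition (scalar form): Fisher information (about $\theta$) is the variance of the score
$$J(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \ln p(x|\theta)\right)^2\right]$$
• Example: consider a random variable $x \sim N(\mu, \sigma^2)$
$$\ln p(x|\mu, \sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x - \mu)^2$$
$$v = \frac{\partial}{\partial\mu} \ln p(x|\mu, \sigma) = \frac{x - \mu}{\sigma^2}$$
$$J(\mu) = E[v^2] = E\left[\frac{(x - \mu)^2}{\sigma^4}\right] = \frac{1}{\sigma^4} E(x - \mu)^2 = \frac{1}{\sigma^2}$$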
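The value $J(\mu) = 1/\sigma^2$ from this example can be approximated by averaging the squared score over simulated draws. A minimal sketch follows; the function name `fisher_information_mc` and the parameter values are hypothetical choices for illustration.

```python
import random

def fisher_information_mc(mu, sigma, trials=50000, seed=4):
    """Monte Carlo estimate of E[v^2] for the mean of N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.gauss(mu, sigma)
        v = (x - mu) / sigma ** 2  # score w.r.t. mu
        total += v * v
    return total / trials

estimate = fisher_information_mc(mu=1.0, sigma=2.0)
# The estimate should be close to 1 / sigma^2 = 0.25.
print(estimate, 1 / 2.0 ** 2)
```

Note that a larger $\sigma$ gives a smaller Fisher information: a wide distribution is less sensitive to shifts of its mean, so each observation says less about $\mu$.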
Fisher Information - Cntd.
• Whenever $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, Fisher information is the matrix $J(\theta) = \left(J_{i,j}(\theta)\right)$ where
$$J_{i,j}(\theta) = \mathrm{cov}\left(\frac{\partial}{\partial\theta_i} \log p(x|\theta), \; \frac{\partial}{\partial\theta_j} \log p(x|\theta)\right)$$
• Reminder:
$$\mathrm{cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
• Remark: the Fisher information is only defined whenever the distributions $p(x|\theta)$ satisfy some regularity conditions. (For example, they should be differentiable w.r.t. $\theta_i$, and all the distributions in the parametric family must have the same support set.)
Fisher Information - Cntd.
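• Claim: Let $x^{(n)} = (x_1, \ldots, x_n)$ be i.i.d. random variables $\sim p(x|\theta)$. The score of $p(x^{(n)}|\theta)$ is the sum of the individual scores.
• Proof:
$$v(x^{(n)}) = \frac{\partial}{\partial\theta} \ln p(x^{(n)}|\theta) = \frac{\partial}{\partial\theta} \sum_i \ln p(x_i|\theta) = \sum_i \frac{\partial}{\partial\theta} \ln p(x_i|\theta) = \sum_i v(x_i)$$
• Example: If $x^{(n)} = (x_1, \ldots, x_n)$ are i.i.d. $\sim N(\mu, \sigma^2)$, the score is
$$\frac{\partial}{\partial\mu} \ln p(x^{(n)}|\mu, \sigma) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}$$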
Fisher Information - Cntd.
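• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is
$$J_n(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \ln p(x^{(n)}|\theta)\right)^2\right] = E\left[v^2(x^{(n)})\right] = E\left[\left(\sum_{i=1}^{n} v(x_i)\right)^2\right] = \sum_{i=1}^{n} E\left[v^2(x_i)\right] = nJ(\theta)$$
The cross terms $E[v(x_i)v(x_j)]$, $i \neq j$, vanish because the samples are independent and $E[v] = 0$.
• Thus, the Fisher information is additive w.r.t. i.i.d. random variables.
• Example: Suppose $x^{(n)} = (x_1, \ldots, x_n)$ are i.i.d. $\sim N(\mu, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\mu$ based on one sample is $J(\mu) = 1/\sigma^2$. Therefore, based on the entire sample, $J_n(\mu) = n/\sigma^2$.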
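Additivity can be illustrated by estimating the Fisher information of a whole sample as the mean squared sum of individual scores. The following is a sketch under assumed values ($\sigma = 2$, $n = 1$ and $n = 5$); the function names are hypothetical.

```python
import random

def sample_score(sample, mu, sigma):
    """Score of an i.i.d. normal sample w.r.t. mu: sum of individual scores."""
    return sum((x - mu) / sigma ** 2 for x in sample)

def fisher_information_n(n, mu, sigma, trials=20000, seed=5):
    """Monte Carlo estimate of E[v(x^(n))^2] for samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        total += sample_score(sample, mu, sigma) ** 2
    return total / trials

j1 = fisher_information_n(1, 0.0, 2.0)
j5 = fisher_information_n(5, 0.0, 2.0)
# j5 should be roughly five times j1 (about 1.25 versus 0.25).
print(j1, j5)
```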
The Cramer-Rao Inequality
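• Theorem: Let $\hat{\theta}$ be an unbiased estimator for $\theta$. Then
$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{J(\theta)}$$
• Proof: Using $E[v] = 0$ we have:
$$E\left[(v - Ev)(\hat{\theta} - E\hat{\theta})\right] = E\left[v(\hat{\theta} - E\hat{\theta})\right] = E[v\hat{\theta}] - E\hat{\theta}\,Ev = E[v\hat{\theta}]$$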
The Cramer-Rao Inequality - Cntd.
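• Now
$$E[v\hat{\theta}] = \int \frac{\partial p(x|\theta)/\partial\theta}{p(x|\theta)} \, \hat{\theta} \, p(x|\theta) \, dx = \int \frac{\partial}{\partial\theta} p(x|\theta) \, \hat{\theta} \, dx = \frac{\partial}{\partial\theta} \int p(x|\theta) \, \hat{\theta} \, dx = \frac{\partial}{\partial\theta} E\hat{\theta} = \frac{\partial}{\partial\theta} \theta = 1$$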
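The Cramer-Rao Inequality - Cntd.
• So, $E\left[(v - Ev)(\hat{\theta} - E\hat{\theta})\right] = E[v\hat{\theta}] = 1$
• By the Cauchy-Schwarz inequality
$$1 = \left(E\left[(v - Ev)(\hat{\theta} - E\hat{\theta})\right]\right)^2 \leq E\left[(v - Ev)^2\right] E\left[(\hat{\theta} - E\hat{\theta})^2\right] = E[v^2] \, \mathrm{var}(\hat{\theta}) = J(\theta) \, \mathrm{var}(\hat{\theta})$$
• Therefore,
$$\mathrm{var}(\hat{\theta}) \geq \frac{1}{J(\theta)}$$
• For a biased estimator we have:
$$\mathrm{var}(\hat{\theta}) \geq \frac{\left(\frac{\partial}{\partial\theta} E\hat{\theta}\right)^2}{J(\theta)}$$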
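To make the biased-estimator bound concrete, here is a short worked example that is not in the original slides. It assumes a hypothetical shrinkage estimator $\hat{\mu}_c = c\,\bar{x}$ with a fixed constant $c$, applied to $n$ i.i.d. samples from $N(\mu, \sigma^2)$, and uses $J_n(\mu) = n/\sigma^2$.

```latex
\begin{align*}
E\hat{\mu}_c &= c\,\mu, \qquad \frac{\partial}{\partial\mu} E\hat{\mu}_c = c, \\
\text{bound:}\quad \frac{\left(\frac{\partial}{\partial\mu} E\hat{\mu}_c\right)^2}{J_n(\mu)} &= \frac{c^2 \sigma^2}{n}, \\
\mathrm{var}(\hat{\mu}_c) &= c^2\,\mathrm{var}(\bar{x}) = \frac{c^2 \sigma^2}{n}.
\end{align*}
```

Under these assumptions the estimator attains the biased bound. For $|c| < 1$ its variance is below $\sigma^2/n$, the bound for unbiased estimators, at the price of a bias $(c - 1)\mu$.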
The Cramer-Rao General Case
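• The Cramer-Rao inequality also holds in general (vector) form: the error covariance matrix for $\hat{\boldsymbol{\theta}}$ is bounded as follows:
$$C = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t\right] \geq J^{-1}(\boldsymbol{\theta})$$
Here the matrix inequality means that $C - J^{-1}(\boldsymbol{\theta})$ is positive semidefinite.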
The Cramer-Rao Inequality - Cntd.
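• Example: Let $x^{(n)} = (x_1, \ldots, x_n)$ be i.i.d. $\sim N(\mu, \sigma^2)$. From the previous example, $J_n(\mu) = n/\sigma^2$.
• Now let $\hat{\mu}(x^{(n)}) = \frac{1}{n}\sum_{i=1}^{n} x_i$ be an (unbiased) estimator for $\mu$.
$$\mathrm{var}(\hat{\mu}) = E(\hat{\mu} - E\hat{\mu})^2 = E\hat{\mu}^2 - 2\mu E\hat{\mu} + \mu^2 = E\hat{\mu}^2 - \mu^2$$
$$= \frac{1}{n^2} E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right] - \mu^2 = \frac{1}{n^2}\left(n\sigma^2 + n^2\mu^2\right) - \mu^2 = \sigma^2 / n$$
• So $\mathrm{var}(\hat{\mu}) = \sigma^2/n$ matches the Cramer-Rao lower bound.
• Def: An unbiased estimator whose covariance meets the Cramer-Rao lower bound is called efficient.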
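A simulation can illustrate what meeting the bound looks like in practice. The sketch below compares the sample mean with the sample median (also unbiased for the mean of a normal distribution, by symmetry) against the bound $\sigma^2/n$. The choice of the median as a comparison estimator, the helper name and all parameter values are assumptions for illustration, not part of the tutorial.

```python
import random
import statistics

def estimator_variance(estimator, n, mu, sigma, trials, seed):
    """Empirical variance of `estimator` over i.i.d. normal samples of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)

n, mu, sigma = 10, 1.0, 2.0
cr_bound = sigma ** 2 / n  # 1 / J_n(mu)
var_mean = estimator_variance(statistics.mean, n, mu, sigma, 20000, seed=6)
var_median = estimator_variance(statistics.median, n, mu, sigma, 20000, seed=6)
# var_mean should sit near the bound; var_median should be visibly larger.
print(cr_bound, var_mean, var_median)
```

The sample mean is efficient in this model, so its empirical variance should hover around the bound, while the median, although unbiased, pays for it with a larger variance.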
Efficiency
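• Theorem (Efficiency): The unbiased estimator $\hat{\boldsymbol{\theta}}$ (that is, $E\hat{\boldsymbol{\theta}} = \boldsymbol{\theta}$) is efficient, that is,
$$C = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t\right] = J^{-1}(\boldsymbol{\theta})$$
iff
$$J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$$
where $\boldsymbol{\nu}$ is the score vector.
• Proof (If): If $J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$ then
$$E\left[J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t J^t(\boldsymbol{\theta})\right] = J(\boldsymbol{\theta}) \, C \, J^t(\boldsymbol{\theta}) = E[\boldsymbol{\nu}\boldsymbol{\nu}^t] = J(\boldsymbol{\theta})$$
meaning
$$C = J^{-1}(\boldsymbol{\theta})$$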
Efficiency
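• Only if: Recall the cross covariance between $\boldsymbol{\nu}$ and $(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$:
$$E\left[\boldsymbol{\nu}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t\right] = I$$
The Cauchy-Schwarz inequality for random variables says
$$I = \left(E\left[\boldsymbol{\nu}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t\right]\right)^2 \leq E[\boldsymbol{\nu}\boldsymbol{\nu}^t] \, E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^t\right] = JC$$
For an efficient estimator $C = J^{-1}$, so equality holds, which requires
$$(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \alpha\boldsymbol{\nu}; \quad C = \alpha^2 J; \quad \alpha = J^{-1}$$
thus
$$J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$$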
Cramer-Rao Inequality and ML - Cntd.
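• Theorem: Suppose there exists an efficient estimator $\hat{\theta}$ for all $\theta$. Then the ML estimator $\hat{\theta}_{ML}$ is $\hat{\theta}$.
• Proof: By assumption
$$\mathrm{var}(\hat{\theta}) = \frac{1}{J(\theta)}$$
By the previous claim, $J(\theta)(\hat{\theta} - \theta) = v$, or
$$\frac{\partial}{\partial\theta} \log p(x|\theta) = J(\theta)(\hat{\theta} - \theta) \quad \text{for all } \theta$$
This holds at $\theta = \hat{\theta}_{ML}$, and since this is a maximum point the left side is zero, so $\hat{\theta}_{ML} = \hat{\theta}$.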
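For the normal mean, the efficiency condition and the ML conclusion can both be checked directly: with $J_n(\mu) = n/\sigma^2$ and $\hat{\mu}$ the sample mean, $J_n(\mu)(\hat{\mu} - \mu)$ should equal the sample score for every $\mu$, and the score should vanish at $\mu = \hat{\mu}$. The sketch below assumes this specific model; the function names and the simulated data are illustrative.

```python
import random

def sample_score(sample, mu, sigma):
    """Derivative of the log-likelihood of an i.i.d. normal sample w.r.t. mu."""
    return sum((x - mu) / sigma ** 2 for x in sample)

def efficiency_gap(sample, mu, sigma):
    """J_n(mu) * (sample mean - mu) minus the score; zero if the condition holds."""
    n = len(sample)
    mean = sum(sample) / n
    fisher = n / sigma ** 2
    return fisher * (mean - mu) - sample_score(sample, mu, sigma)

rng = random.Random(7)
sigma = 2.0
sample = [rng.gauss(1.0, sigma) for _ in range(25)]
sample_mean = sum(sample) / len(sample)

gaps = [efficiency_gap(sample, mu, sigma) for mu in (-3.0, 0.0, 1.0, 4.5)]
score_at_mean = sample_score(sample, sample_mean, sigma)
# All gaps and the score at the sample mean should be zero up to rounding,
# so the likelihood is stationary exactly at the efficient estimator.
print(gaps, score_at_mean)
```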