                      Some Properties of the Normal Distribution

                                   Jianxin Wu
                     GVU Center and College of Computing
                       Georgia Institute of Technology

                                October 18, 2005


1     Introduction
The normal distribution is the most widely used probability distribution in
statistical pattern recognition and machine learning. The nice properties of the
normal distribution might be the main reason for its popularity.
    In this short note,¹ I try to organize the basic facts about the normal distribution. There is no advanced theory here. However, in order to understand these facts, some linear algebra and mathematical analysis are needed, and these are not always covered sufficiently in undergraduate texts. This note attempts to pool these facts together, in the hope that readers will find it useful.

    ¹ I planned to write a short note listing some properties of the normal distribution. However, somehow I decided to keep it self-contained. The result is that it has grown rather fat.


2     Definition
2.1     Univariate Normal
The probability density function of a univariate normal distribution has the
following form:
        p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),                    (1)
in which µ is the expected value of x, and σ 2 is the variance. We assume that
σ > 0.
    We have to first verify that eq. (1) is a valid probability density function. It is obvious that p(x) ≥ 0 always holds for x ∈ R. From eq. (96) in Appendix A, we know that

        \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx = \sqrt{t\pi}    for t > 0.

Applying this equation, we have

        \int_{-\infty}^{\infty} p(x)\, dx
            = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx                    (2)
            = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{2\sigma^2} \right) dx                    (3)
            = \frac{1}{\sqrt{2\pi}\,\sigma} \sqrt{2\sigma^2 \pi} = 1,                    (4)
which means that p(x) is a valid p.d.f.
    The distribution with p.d.f. p(x) = (1/√(2π)) exp(−x²/2) is called the standard normal distribution. In Appendix A, it is shown that the mean and standard deviation of the standard normal distribution are 0 and 1, respectively. By a change of variables, it is easy to show that µ = ∫_{−∞}^{∞} x p(x) dx and σ² = ∫_{−∞}^{∞} (x − µ)² p(x) dx for a general normal distribution.
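    As a quick illustration (not part of the original note), the following Python sketch checks eq. (1) numerically with arbitrary, assumed values of µ and σ: the density integrates to 1 and has the stated mean and variance.

    # Numerical check of eq. (1) on a fine grid (assumed mu and sigma).
    import numpy as np

    mu, sigma = 1.5, 2.0
    x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
    p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    dx = x[1] - x[0]

    print((p * dx).sum())                    # ~ 1.0  (eq. (4))
    print((x * p * dx).sum())                # ~ mu
    print(((x - mu) ** 2 * p * dx).sum())    # ~ sigma^2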

2.2    Multivariate Normal
The probability density function of a multivariate normal distribution has the
following form:

        p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right),                    (5)

in which x is a d-dimensional vector, µ is the d-dimensional mean, and Σ is the
d-by-d covariance matrix. We assume that Σ is a symmetric, positive definite
matrix.
    We have to first verify that eq. (5) is a valid probability density function.
It is obvious that p(x) ≥ 0 always holds for x ∈ R^d. Next we diagonalize Σ as Σ = U^T Λ U, in which U is an orthogonal matrix containing the eigenvectors of Σ, Λ = diag(λ1, . . . , λd) is a diagonal matrix with the eigenvalues of Σ on its diagonal, and |Λ| = |Σ|. Let us define a new random vector as

                                  y = Λ−1/2 U (x − µ).                                 (6)

    The mapping from y to x is one-to-one. The determinant of the Jacobian matrix is

        \left| \frac{\partial y}{\partial x} \right| = \left| \Lambda^{-1/2} \right| = |\Sigma|^{-1/2}.

Now we are ready to calculate the integral

        \int p(x)\, dx
            = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) dx                    (7)
            = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}\, |\Sigma|^{1/2} \exp\left( -\frac{1}{2} y^T y \right) dy                    (8)
            = \prod_{i=1}^{d} \int \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y_i^2}{2} \right) dy_i                    (9)
            = \prod_{i=1}^{d} 1 = 1,                    (10)

in which y_i is the i-th component of y, i.e. y = (y_1, . . . , y_d)^T. This establishes the validity of the multivariate normal density function.
    Since y is a random vector, it has a density, denoted as p_Y(y). Using the inverse transform method, we get

        p_Y(y)
            = p(\mu + U^T \Lambda^{1/2} y)\, \left| U^T \Lambda^{1/2} \right|                    (11)
            = \frac{\left| U^T \Lambda^{1/2} \right|}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \left( U^T \Lambda^{1/2} y \right)^T \Sigma^{-1} \left( U^T \Lambda^{1/2} y \right) \right)                    (12)
            = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right).                    (13)

   The density defined by

        p_Y(y) = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right)                    (14)

is called a spherical normal distribution. Let z be a random vector formed by a subset of the components of y. By marginalization it is clear that p_Z(z) = (2π)^{−|z|/2} exp(−z^T z/2), and specifically p_{Y_i}(y_i) = (1/√(2π)) exp(−y_i²/2). Using this fact, it is straightforward to show that the mean vector and covariance matrix of a spherical normal distribution are 0 and I, respectively.
    Using the inverse transform of eq. (6), we can easily calculate the mean
vector and covariance matrix of the density p(x).

        E(x) = E(\mu + U^T \Lambda^{1/2} y) = \mu + E(U^T \Lambda^{1/2} y) = \mu                    (15)
        E\left[ (x-\mu)(x-\mu)^T \right]
            = E\left[ (U^T \Lambda^{1/2} y)(U^T \Lambda^{1/2} y)^T \right]                    (16)
            = U^T \Lambda^{1/2} E(y y^T)\, \Lambda^{1/2} U                    (17)
            = U^T \Lambda^{1/2} \Lambda^{1/2} U                    (18)
            = U^T \Lambda U = \Sigma.                    (19)
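    The following small Python sketch (an illustration with assumed values, not part of the original derivation) evaluates the density of eq. (5) directly and through the change of variables of eq. (6); the two results coincide.

    # Evaluate eq. (5) directly and via y = Lambda^{-1/2} U (x - mu).
    import numpy as np

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    x = np.array([0.5, -1.0])
    d = len(mu)
    diff = x - mu

    # Direct evaluation of eq. (5).
    p_direct = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
               / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

    # Via the spherical variable y, divided by the Jacobian |Sigma|^{1/2}.
    lam, V = np.linalg.eigh(Sigma)      # Sigma = V diag(lam) V^T, so U = V^T
    U = V.T
    y = np.diag(lam ** -0.5) @ U @ diff
    p_via_y = np.exp(-0.5 * y @ y) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

    print(p_direct, p_via_y)            # the two numbers agree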


3     Notation and Parameterization
When we have a density of the form in eq. (5), it is often written as

                                  x ∼ N (µ, Σ)                               (20)

or,
                                     N (x; µ, Σ)                             (21)
    In most cases we will use the mean vector µ and covariance matrix Σ to
express a normal density. This is called the moment parameterization. There
is another parameterization of a normal density, called the canonical parame-
terization. In the canonical parameterization, a normal density is expressed
as
        p(x) = \exp\left( \alpha + \eta^T x - \frac{1}{2} x^T \Lambda x \right),                    (22)

in which \alpha = -\frac{1}{2}\left( d \log 2\pi - \log|\Lambda| + \eta^T \Lambda^{-1} \eta \right) is a normalization constant.
The parameters in these two representations are related by

                                 Λ     = Σ−1                                 (23)
                                 η     = Σ−1 µ                               (24)
                                 Σ     = Λ−1                                 (25)
                                 µ     = Λ−1 η.                              (26)

Note that our notation is overloaded: Λ has different meanings in eq. (22) and eq. (6). In eq. (22), Λ is a parameter of the canonical parameterization of a normal density, and it is not necessarily diagonal. In eq. (6), Λ is a diagonal matrix formed by the eigenvalues of Σ. It is straightforward to show
that the moment parameterization and canonical parameterization of the nor-
mal distribution are equivalent. In some cases the canonical parameterization
is more convenient to use than the moment parameterization.
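    As an illustration (with assumed numerical values), the conversion between the two parameterizations in eqs. (23)-(26) is a pair of matrix inversions:

    # Moment <-> canonical conversion, eqs. (23)-(26).
    import numpy as np

    mu = np.array([1.0, 2.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])

    Lam = np.linalg.inv(Sigma)        # eq. (23)
    eta = Lam @ mu                    # eq. (24)

    Sigma_back = np.linalg.inv(Lam)   # eq. (25), round trip
    mu_back = Sigma_back @ eta        # eq. (26)

    print(np.allclose(Sigma, Sigma_back), np.allclose(mu, mu_back))  # True True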


4     Linear Operation and Summation
4.1    Univariate case
Suppose x1 ∼ N(µ1, σ1²) and x2 ∼ N(µ2, σ2²) are two independent univariate normal random variables. It is obvious that ax1 + b ∼ N(aµ1 + b, a²σ1²), in which a and b are two scalars.
    Now consider a random variable z = x1 + x2 . The density of z is calculated
as:




        p_Z(z)
            = \iint_{z = x_1 + x_2} p_{X_1}(x_1)\, p_{X_2}(x_2)\, dx_1\, dx_2                    (27)
            = \iint_{z = x_1 + x_2 - \mu_1 - \mu_2} p_{X_1}(x_1 + \mu_1)\, p_{X_2}(x_2 + \mu_2)\, dx_1\, dx_2                    (28)
            = \int p_{X_1}(x_1 + \mu_1)\, p_{X_2}(z - x_1 - \mu_1)\, dx_1                    (29)
            = \frac{1}{2\pi \sigma_1 \sigma_2} \int \exp\left( -\frac{x^2}{2\sigma_1^2} - \frac{(z - x - \mu_1 - \mu_2)^2}{2\sigma_2^2} \right) dx                    (30)
            = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \int \exp\left( -\frac{\left( x - \frac{(z - \mu_1 - \mu_2)\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \right)^2}{\frac{2\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}} \right) dx                    (31)
            = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \sqrt{\frac{2\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\, \pi}                    (32)
            = \frac{1}{\sqrt{2\pi}\, \sqrt{\sigma_1^2 + \sigma_2^2}} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right),                    (33)

in which the step from eq. (31) to eq. (32) used the result of eq. (96). Hence the sum of two independent univariate normal random variables is again a normal random variable, whose mean and variance are the sums of the component means and variances, i.e. z ∼ N(µ1 + µ2, σ1² + σ2²). This summation rule is easily generalized to n independent normal random variables.
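    A quick Monte Carlo sketch of this rule (with arbitrary assumed parameters, not part of the original note) is:

    # Empirical check of z = x1 + x2 ~ N(mu1 + mu2, sigma1^2 + sigma2^2).
    import numpy as np

    rng = np.random.default_rng(0)
    mu1, s1, mu2, s2 = 1.0, 2.0, -3.0, 0.5
    n = 1_000_000

    z = rng.normal(mu1, s1, n) + rng.normal(mu2, s2, n)
    print(z.mean(), mu1 + mu2)            # close to each other
    print(z.var(), s1 ** 2 + s2 ** 2)     # close to each other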

4.2    Multivariate Case
Suppose x ∼ N(µ, Σ) is a d-dimensional normal random variable, A is a q-by-d matrix, and b is a q-dimensional vector. Then z = Ax + b is a q-dimensional normal random variable: z ∼ N(Aµ + b, AΣA^T).
    This fact is proved using the characteristic function (see Appendix B). The characteristic function of z is

        \varphi_Z(t)
            = E_Z\left[ \exp(i t^T z) \right]                    (34)
            = E_X\left[ \exp\left( i t^T (Ax + b) \right) \right]                    (35)
            = \exp(i t^T b)\, E_X\left[ \exp\left( i (A^T t)^T x \right) \right]                    (36)
            = \exp(i t^T b) \exp\left( i (A^T t)^T \mu - \frac{1}{2} (A^T t)^T \Sigma (A^T t) \right)                    (37)
            = \exp\left( i t^T (A\mu + b) - \frac{1}{2} t^T (A \Sigma A^T) t \right),                    (38)
in which the step from eq. (37) to eq. (38) used eq. (106) in Appendix B. Appendix B states that if a characteristic function ϕ(t) is of the form exp(i t^T µ − ½ t^T Σ t), then the underlying density p(x) is normal with mean µ and covariance matrix Σ. Applying this fact to eq. (38), we immediately get

                             z ∼ N (Aµ + b, AΣAT ).                         (39)

    Suppose x1 ∼ N(µ1, Σ1) and x2 ∼ N(µ2, Σ2) are two independent d-dimensional normal random variables, and define a new random vector z = x1 + x2. We could calculate the probability density function p_Z(z) using the same method as in the univariate case, but the calculation is complex and requires the matrix inversion lemma of Appendix C.
    The characteristic function simplifies the calculation. Using eq. (109) in Appendix B, we get

        \varphi_Z(t)
            = \varphi_{X_1}(t)\, \varphi_{X_2}(t)                    (40)
            = \exp\left( i t^T \mu_1 - \frac{1}{2} t^T \Sigma_1 t \right) \exp\left( i t^T \mu_2 - \frac{1}{2} t^T \Sigma_2 t \right)                    (41)
            = \exp\left( i t^T (\mu_1 + \mu_2) - \frac{1}{2} t^T (\Sigma_1 + \Sigma_2) t \right),                    (42)
which immediately gives us z ∼ N(µ1 + µ2, Σ1 + Σ2). The sum of two multivariate normal random variables is as easy to compute as in the univariate case: sum the mean vectors and covariance matrices. The same rule applies when summing several multivariate normal random variables.
    Now that we have the tool of linear transformations, let us revisit eq. (6). For convenience we repeat the equation here: x ∼ N(µ, Σ), and

                              y = Λ−1/2 U (x − µ).                          (43)

    Using the properties of linear transformations of a normal density, y is indeed normal (in section 2.2 we painfully calculated p_Y(y) using the inverse transform method), and it has mean vector 0 and covariance matrix I.
    The transformation in eq. (6) is called the whitening transformation, since the transformed density has an identity covariance matrix.
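    The following Python sketch (assumed mean, covariance, and sample size) applies the whitening transformation of eq. (6) to sampled data; the whitened samples have approximately zero mean and identity covariance.

    # Whitening y = Lambda^{-1/2} U (x - mu) applied to samples.
    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([2.0, -1.0])
    Sigma = np.array([[3.0, 1.2],
                      [1.2, 1.0]])

    x = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are samples

    lam, V = np.linalg.eigh(Sigma)          # Sigma = V diag(lam) V^T
    U = V.T                                 # so that Sigma = U^T diag(lam) U
    W = np.diag(lam ** -0.5) @ U            # whitening matrix Lambda^{-1/2} U
    y = (x - mu) @ W.T

    print(y.mean(axis=0))                   # ~ [0, 0]
    print(np.cov(y, rowvar=False))          # ~ identity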


5    Geometry and Mahalanobis Distance
Figure 1 shows a bivariate normal density function. A normal density has only one mode, located at the mean vector, and the shape of the density is determined by the covariance matrix.
   Figure 2 shows the equal probability contour of a bivariate normal random
variable. All points on a given equal probability contour must have the following
term evaluated to a constant value:

        r^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu) = c.                    (44)

The quantity r²(x, µ) is called the Mahalanobis distance from x to µ, given the covariance matrix Σ. Eq. (44) defines a hyperellipsoid in d-dimensional space, which means that the equal probability contours of a normal density are hyperellipsoids. The principal axes of these hyperellipsoids are given by the eigenvectors of Σ, and the lengths of the axes are proportional to the square roots of the eigenvalues associated with these eigenvectors.

                        [Figure 1: Bivariate normal p.d.f.]

        [Figure 2: Equal probability contour of a bivariate normal distribution]
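    A minimal sketch of eq. (44), with an assumed covariance matrix and a helper function name of our own choosing: two points at different Euclidean distances from µ can lie on the same Mahalanobis contour.

    # Mahalanobis distance r^2(x, mu) = (x - mu)^T Sigma^{-1} (x - mu).
    import numpy as np

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[4.0, 0.0],
                      [0.0, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    def mahalanobis_sq(x, mu, Sigma_inv):
        d = x - mu
        return d @ Sigma_inv @ d

    # Both points lie on the contour r^2 = 1.
    print(mahalanobis_sq(np.array([2.0, 0.0]), mu, Sigma_inv))  # 1.0
    print(mahalanobis_sq(np.array([0.0, 1.0]), mu, Sigma_inv))  # 1.0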


6     Conditioning
Suppose x1 and x2 are two multivariate normal random variables with joint p.d.f.

        p\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
            = \frac{1}{(2\pi)^{(d_1+d_2)/2} |\Sigma|^{1/2}}
              \exp\left( -\frac{1}{2}
              \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T
              \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}
              \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right),

in which d1 and d2 are the dimensionalities of x1 and x2 respectively, and Σ is partitioned as \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. The matrices Σ12 and Σ21 are the cross-covariance matrices between x1 and x2, and satisfy Σ12 = Σ21^T.
    The marginal distributions x1 ∼ N (µ1 , Σ11 ) and x2 ∼ N (µ2 , Σ22 ) are easy
to get from the joint distribution. We are interested in computing the condi-
tional probability density p(x1 |x2 ).
    We will need to compute the inverse of Σ, and this task is completed by using the Schur complement (see Appendix C). For notational simplicity, we denote the Schur complement of Σ11 as S11, defined as S11 = Σ22 − Σ21 Σ11^{-1} Σ12. Similarly, the Schur complement of Σ22 is S22 = Σ11 − Σ12 Σ22^{-1} Σ21.
    Applying eq. (119) and noticing that Σ12 = Σ21^T, we get (writing x1 − µ1 as x1 and x2 − µ2 as x2 for brevity)

        \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}
            = \begin{pmatrix}
                  S_{22}^{-1} & -S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\
                  -\Sigma_{22}^{-1} \Sigma_{12}^T S_{22}^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{12}^T S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1}
              \end{pmatrix}                    (45)

and

        \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T
        \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}
        \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}
            = x_1^T S_{22}^{-1} x_1 - 2\, x_1^T S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1} x_2
              + x_2^T \left( \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{12}^T S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \right) x_2
            = (x_1 - \Sigma_{12} \Sigma_{22}^{-1} x_2)^T S_{22}^{-1} (x_1 - \Sigma_{12} \Sigma_{22}^{-1} x_2) + x_2^T \Sigma_{22}^{-1} x_2.                    (46)

    Thus we can split the joint distribution as

        p\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
            = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}}
              \exp\left( -\frac{1}{2} (x_1 - \Sigma_{12}\Sigma_{22}^{-1} x_2)^T S_{22}^{-1} (x_1 - \Sigma_{12}\Sigma_{22}^{-1} x_2) \right)
              \cdot \frac{1}{(2\pi)^{d_2/2} |\Sigma_{22}|^{1/2}}
              \exp\left( -\frac{1}{2} x_2^T \Sigma_{22}^{-1} x_2 \right),                    (47)

in which we used the fact that |Σ| = |Σ22| |S22| (from eq. (118) in Appendix C).
    Since the second term on the right-hand side of eq. (47) is the marginal p(x2) and p(x1, x2) = p(x1|x2) p(x2), we get the conditional probability p(x1|x2) as

        p(x_1|x_2)
            = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}}
              \exp\left( -\frac{1}{2} (x_1 - \Sigma_{12}\Sigma_{22}^{-1} x_2)^T S_{22}^{-1} (x_1 - \Sigma_{12}\Sigma_{22}^{-1} x_2) \right),                    (48)

still writing x1 and x2 for the centered vectors. Restoring the original (uncentered) variables, this is

        x_1 | x_2 \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; S_{22} \right)                    (49)
                  \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).                    (50)


7      Product of Gaussians
Suppose p1(x) = N(x; µ1, Σ1) and p2(x) = N(x; µ2, Σ2) are two d-dimensional normal densities. Sometimes we want to compute the density that is proportional to the product of these two normal densities, i.e. p_X(x) = α p1(x) p2(x), in which α is a proper normalization constant that makes p_X(x) a valid density function.
   In this task the canonical notation (see section 3) will be extremely helpful.
Writing the two normal densities in canonical form,

        p_1(x) = \exp\left( \alpha_1 + \eta_1^T x - \frac{1}{2} x^T \Lambda_1 x \right)                    (51)
        p_2(x) = \exp\left( \alpha_2 + \eta_2^T x - \frac{1}{2} x^T \Lambda_2 x \right),                    (52)

the density p_X(x) is easy to compute:

        p_X(x) = \alpha\, p_1(x)\, p_2(x)
               = \exp\left( \alpha + (\eta_1 + \eta_2)^T x - \frac{1}{2} x^T (\Lambda_1 + \Lambda_2) x \right),                    (53)

where α now absorbs all terms that do not depend on x.

This equation states that in the canonical parameterization, computing a product of Gaussians amounts to simply summing the parameters.
    This result is readily extended to the product of n normal densities. Suppose we have n normal densities p_i(x), whose parameters in the canonical parameterization are η_i and Λ_i, i = 1, 2, . . . , n. Then p_X(x) = α \prod_{i=1}^{n} p_i(x) is also a normal density, given by

        p_X(x) = \exp\left( \alpha + \left( \sum_{i=1}^{n} \eta_i \right)^T x - \frac{1}{2} x^T \left( \sum_{i=1}^{n} \Lambda_i \right) x \right).                    (54)


    Now let us go back to the moment parameterization. Suppose we have n normal densities p_i(x) = N(x; µ_i, Σ_i), i = 1, 2, . . . , n. Then p_X(x) = α \prod_{i=1}^{n} p_i(x) is normal,

        p_X(x) = N(x; \mu, \Sigma),                    (55)

where

        \Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} + \cdots + \Sigma_n^{-1}                    (56)
        \Sigma^{-1}\mu = \Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2 + \cdots + \Sigma_n^{-1}\mu_n.                    (57)
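    In code, eqs. (56)-(57) amount to adding precisions and precision-weighted means; the following sketch uses two assumed densities.

    # Fusing two Gaussians: sum precisions and precision-weighted means.
    import numpy as np

    mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 4.0]])
    mu2, Sigma2 = np.array([1.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])

    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    Sigma = np.linalg.inv(P1 + P2)              # eq. (56)
    mu = Sigma @ (P1 @ mu1 + P2 @ mu2)          # eq. (57)

    print(mu)
    print(Sigma)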

    Now we have listed some properties of the normal distribution. Next let’s
see how these properties are applied.


8     Application I: Parameter Estimation
8.1     Maximum Likelihood Estimation
Let us suppose that we have a d-dimensional multivariate normal random variable x ∼ N(µ, Σ), and n i.i.d. (independent and identically distributed) samples D = {x1, x2, . . . , xn} drawn from this distribution. The task is to estimate the parameters µ and Σ.

   The log-likelihood function of observing the data set D given parameters µ
and Σ is:

        \ell(\mu, \Sigma \,|\, D)                    (58)
            = \log \prod_{i=1}^{n} p(x_i)                    (59)
            = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log|\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).                    (60)

    Taking derivatives of the log-likelihood with respect to µ and Σ^{-1} gives (see Appendix D):

        \frac{\partial \ell}{\partial \mu} = \sum_{i=1}^{n} \Sigma^{-1} (x_i - \mu)                    (61)
        \frac{\partial \ell}{\partial \Sigma^{-1}} = \frac{n}{2} \Sigma - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T,                    (62)

in which eq. (61) used eq. (124) and the chain rule, and eq. (62) used eq. (131), eq. (132), and the fact that Σ = Σ^T.
    In order to find the maximum likelihood solution, we want to find the max-
imum of the likelihood function. Setting both eq. (61) and eq. (62) to 0 gives
us the solution:
        \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i                    (63)
        \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})(x_i - \mu_{ML})^T                    (64)

    Eq. (63) and eq. (64) clearly state that the maximum likelihood estimates of the mean vector and covariance matrix are just the sample mean and sample covariance matrix, respectively.
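    A short Python sketch of eqs. (63)-(64), with data drawn from an assumed Gaussian, follows; note the 1/n (biased) sample covariance.

    # ML estimates: sample mean and (1/n) sample covariance.
    import numpy as np

    rng = np.random.default_rng(2)
    mu_true = np.array([1.0, -1.0])
    Sigma_true = np.array([[2.0, 0.8],
                           [0.8, 1.0]])
    X = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)

    mu_ml = X.mean(axis=0)                          # eq. (63)
    centered = X - mu_ml
    Sigma_ml = centered.T @ centered / X.shape[0]   # eq. (64)

    print(mu_ml)
    print(Sigma_ml)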

8.2    Bayesian Parameter Estimation
In this Bayesian estimation example, we assume that the covariance matrix
is known. Let us suppose that we have a d-dimensional multivariate normal
density x ∼ N (µ, Σ), and n i.i.d samples D = {x1 , x2 , . . . , xn } sampled from
this distribution. We also need a prior on the parameter µ. Let’s assume that
the prior is µ ∼ N(µ0, Σ0). The task is to estimate the parameter µ.
    Note that we assume µ0, Σ0, and Σ are all known. The only parameter to be estimated is the mean vector µ.
    In Bayesian estimation, instead of finding a point estimate µ̂ in the parameter space that maximizes the likelihood, we calculate the posterior density p(µ|D) for the parameter, and use the entire distribution of µ as our estimate of this parameter.
    Applying Bayes' rule,

        p(\mu|D) = \alpha\, p(D|\mu)\, p_0(\mu)                    (65)
                 = \alpha \prod_{i=1}^{n} p(x_i|\mu)\, p_0(\mu),                    (66)

in which α is a normalization constant that does not depend on µ.
    Applying the result in section 7, we know that p(µ|D) is also normal, and

        p(\mu|D) = N(\mu; \mu_n, \Sigma_n)                    (67)

where

        \Sigma_n^{-1} = n\Sigma^{-1} + \Sigma_0^{-1}                    (68)
        \Sigma_n^{-1}\mu_n = \Sigma^{-1} \sum_{i=1}^{n} x_i + \Sigma_0^{-1}\mu_0.                    (69)

    Both µn and Σn can be calculated from known quantities, so we have determined the posterior distribution p(µ|D) of µ given the data set D.
    We chose the normal distribution as the prior family. Usually, the prior distribution is chosen such that the posterior has the same functional form as the prior; a prior and posterior related in this way are said to be conjugate. We have seen that the normal distribution has the nice property that both the prior and the posterior are normal, i.e. the normal distribution is auto-conjugate.
    After p(µ|D) is determined, a new sample is classified by calculating the probability

        p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu.                    (70)

    Eq. (70) and eq. (27) have the same form. Thus we can guess that p(x|D) is again normal, and

        p(x|D) = N(x; \mu_n, \Sigma + \Sigma_n).                    (71)

The guess is correct, and it is easy to verify by repeating the steps in eq. (27) through eq. (33).
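    A minimal sketch of the posterior update in eqs. (68)-(69), under assumed values for Σ, the prior, and the data, is:

    # Bayesian update of the mean with known covariance Sigma.
    import numpy as np

    rng = np.random.default_rng(3)
    Sigma = np.array([[1.0, 0.2],
                      [0.2, 0.5]])            # known likelihood covariance
    mu0 = np.array([0.0, 0.0])                # prior mean
    Sigma0 = np.eye(2) * 10.0                 # broad prior covariance

    X = rng.multivariate_normal([1.0, -2.0], Sigma, size=200)   # observed data
    n = X.shape[0]

    Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma0_inv)                     # eq. (68)
    mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + Sigma0_inv @ mu0)         # eq. (69)

    print(mu_n)        # close to the sample mean under this weak prior
    print(Sigma_n)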


9     Application II: Kalman Filter
9.1     The Model
The Kalman filter addresses the problem of estimating a state vector x in a discrete-time process, given a linear dynamic model

                             xk = Axk−1 + wk−1 ,                             (72)



and a linear measurement model

                                 z k = Hxk + v k .                           (73)

    The process noise w_k and measurement noise v_k are assumed to be normal:

        w_k \sim N(0, Q)                    (74)
        v_k \sim N(0, R)                    (75)

    At time k − 1, assume that we know the distribution of x_{k-1}; the task is to estimate the posterior probability of x_k at time k, given the current observation z_k and the previous state estimate p(x_{k-1}).
    From a broader point of view, the task can be formulated as estimating the posterior probability of x_k at time k given all the previous state estimates and all the observations up to time step k. Under certain Markovian assumptions, it is not hard to prove that the two formulations are equivalent.
    In the Kalman filter setup we assume the prior is normal, i.e. at time t = 0, p(x_0) = N(x; \mu_0, P_0). Instead of Σ, here we use P to denote a covariance matrix, in order to match the notation in the Kalman filter literature.

9.2    The Estimation
I will show that, with the help of the properties of Gaussians we have obtained,
it is quite easy to derive the Kalman filter equations.
    The Kalman filter can be separated into two (related) steps. In the first step, based on the estimate p(x_{k-1}) and the dynamic model (72), we get an estimate p(x_k^-). The minus sign indicates that this estimate is made before the measurement is taken into account. In the second step, based on p(x_k^-) and the measurement model (73), we get the final estimate p(x_k).
    First, let us estimate p(x_k^-). Assume that at time k − 1 the estimate we already have is a normal distribution,

        p(x_{k-1}) = N(x; \mu_{k-1}, P_{k-1}).                    (76)

This assumption coincides with the prior p(x_0). We will show that, after the Kalman filter update, p(x_k) is again normal, which makes the assumption self-consistent.
    Applying the linear operation equation (39) to the dynamic model (72), we immediately get the estimate for x_k^-:

        x_k^- \sim N(\mu_k^-, P_k^-)                    (77)
        \mu_k^- = A \mu_{k-1}                    (78)
        P_k^- = A P_{k-1} A^T + Q                    (79)

    The estimate p(x_k^-) conditioned on the observation z_k gives p(x_k), the estimate we want. Thus the conditioning property (50) can be used. In order to use eq. (50), we first have to build the joint covariance matrix. Since Var(z_k) = H P_k^- H^T + R (applying eq. (39) to eq. (73)) and

        \mathrm{Cov}(z_k, x_k^-) = \mathrm{Cov}(H x_k^- + v_k,\, x_k^-)                    (80)
                                 = \mathrm{Cov}(H x_k^-,\, x_k^-)                    (81)
                                 = H P_k^-,                    (82)

the joint covariance matrix of (x_k^-, z_k) is

        \begin{pmatrix} P_k^- & P_k^- H^T \\ H P_k^- & H P_k^- H^T + R \end{pmatrix}.                    (83)

    Applying the conditioning equation (50), we get

        p(x_k) = p(x_k^- | z_k)                    (84)
               = N(x; \mu_k, P_k)                    (85)
        P_k = P_k^- - P_k^- H^T (H P_k^- H^T + R)^{-1} H P_k^-                    (86)
        \mu_k = \mu_k^- + P_k^- H^T (H P_k^- H^T + R)^{-1} (z_k - H \mu_k^-)                    (87)

    Equations (77)-(79) and (84)-(87) are the Kalman filter updating rules.
    The term P_k^- H^T (H P_k^- H^T + R)^{-1} appears in both eq. (86) and eq. (87). Defining

        K_k = P_k^- H^T (H P_k^- H^T + R)^{-1},                    (88)

the equations simplify to

        P_k = (I - K_k H) P_k^-                    (89)
        \mu_k = \mu_k^- + K_k (z_k - H \mu_k^-).                    (90)

The term K_k is called the Kalman gain matrix, and the term z_k - H \mu_k^- is called the innovation.
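    The following Python sketch runs one predict/update iteration of eqs. (77)-(79) and (88)-(90) on an assumed constant-velocity model; the model matrices and numbers are illustrative choices, not from the original note.

    # One Kalman filter iteration for a 2-d state (position, velocity).
    import numpy as np

    dt = 1.0
    A = np.array([[1.0, dt],
                  [0.0, 1.0]])              # dynamic model
    H = np.array([[1.0, 0.0]])              # we only measure the position
    Q = 0.01 * np.eye(2)                    # process noise covariance
    R = np.array([[0.25]])                  # measurement noise covariance

    mu, P = np.array([0.0, 1.0]), np.eye(2) # estimate at time k-1
    z = np.array([1.2])                     # measurement at time k

    # Predict step, eqs. (78)-(79).
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q

    # Update step, eqs. (88)-(90).
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # Kalman gain
    mu_new = mu_pred + K @ (z - H @ mu_pred)                  # innovation z - H mu^-
    P_new = (np.eye(2) - K @ H) @ P_pred

    print(mu_new)
    print(P_new)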


A      Gaussian Integral
We will compute the integral of the univariate normal p.d.f. in this section. The trick in doing this integration is to consider two independent univariate Gaussians at the same time:

        \left( \int_{-\infty}^{\infty} e^{-x^2}\, dx \right)^2
            = \int_{-\infty}^{\infty} e^{-x^2}\, dx \int_{-\infty}^{\infty} e^{-y^2}\, dy                    (91)
            = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)}\, dx\, dy                    (92)
            = \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2}\, dr\, d\theta                    (93)
            = 2\pi \left[ -\frac{1}{2} e^{-r^2} \right]_{0}^{\infty}                    (94)
            = \pi,                    (95)

in which eq. (93) changes to polar coordinates, and the extra r that appears inside the integral is the determinant of the Jacobian. Taking the square root gives \int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}.
    The above integral is easily extended to

        f(t) = \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx = \sqrt{t\pi},                    (96)

in which we assume t > 0. Then we have

        \frac{df}{dt} = \frac{d}{dt} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx                    (97)
                      = \int_{-\infty}^{\infty} \frac{x^2}{t^2} \exp\left( -\frac{x^2}{t} \right) dx                    (98)

and

        \frac{df}{dt} = \frac{d}{dt} \sqrt{t\pi} = \frac{1}{2} \sqrt{\frac{\pi}{t}}.                    (99)

The above two equations give us

        \int_{-\infty}^{\infty} x^2 \exp\left( -\frac{x^2}{t} \right) dx = \frac{t^2}{2} \sqrt{\frac{\pi}{t}}.                    (100)
    Applying eq. (100) with t = 2, we have

        \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \frac{1}{\sqrt{2\pi}} \cdot \frac{4}{2} \sqrt{\frac{\pi}{2}} = 1.                    (101)

It is obvious that

        \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 0,                    (102)

since x exp(−x²/2) is an odd function.
    Eq. (102) and eq. (101) prove that the mean and standard deviation of the standard normal distribution are 0 and 1, respectively.

B     Characteristic Functions
The characteristic function of a random variable with p.d.f. p(x) is defined as its Fourier transform

        \varphi(t) = E\left[ e^{i t^T x} \right],                    (103)

in which i = \sqrt{-1}.
    Let us compute the characteristic function of a normal variate:

        \varphi(t) = E\left[ \exp(i t^T x) \right]                    (104)
                   = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) + i t^T x \right) dx                    (105)
                   = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right).                    (106)
    Since the characteristic function is defined as a Fourier transform, the inverse Fourier transform of ϕ(t) is exactly p(x); in other words, a random variable is completely determined by its characteristic function. Hence, when we see a characteristic function ϕ(t) of the form exp(i t^T µ − ½ t^T Σ t), we know that the underlying density is normal with mean µ and covariance matrix Σ.
    Suppose that x and y are two independent random vectors with the same dimension, and define a new random vector z = x + y. Then

        p_Z(z) = \iint_{z = x + y} p_X(x)\, p_Y(y)\, dx\, dy                    (107)
               = \int p_X(x)\, p_Y(z - x)\, dx.                    (108)

Since convolution in the original space corresponds to multiplication in the Fourier space, we have

        \varphi_Z(t) = \varphi_X(t)\, \varphi_Y(t),                    (109)

which means that the characteristic function of the sum of two independent random variables is the product of the characteristic functions of the summands.


C     Schur Complement and the Matrix Inversion
      Lemma
The Schur complement is very useful in computing the inverse of a block matrix.
  Suppose M is a block matrix expressed as

        M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},                    (110)

in which A and D are non-singular square matrices. We want to compute M −1 .

    Some algebraic manipulations give

        \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix} M \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}                    (111)
            = \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}                    (112)
            = \begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix} \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}                    (113)
            = \begin{pmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & S_A \end{pmatrix},                    (114)

in which the term D − CA^{-1}B is called the Schur complement of A, denoted as S_A.
    The equation XMY = Z implies that M^{-1} = Y Z^{-1} X. Hence we have

        M^{-1} = \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix} \begin{pmatrix} A & 0 \\ 0 & S_A \end{pmatrix}^{-1} \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix}                    (115)
               = \begin{pmatrix} A^{-1} & -A^{-1}B S_A^{-1} \\ 0 & S_A^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix}                    (116)
               = \begin{pmatrix} A^{-1} + A^{-1}B S_A^{-1} C A^{-1} & -A^{-1}B S_A^{-1} \\ -S_A^{-1} C A^{-1} & S_A^{-1} \end{pmatrix}.                    (117)

    Taking the determinant of both sides of eq. (115) gives

        |M| = |A|\, |S_A|.                    (118)

    We can also compute M^{-1} by using the Schur complement of D, in the following way:

        M^{-1} = \begin{pmatrix} S_D^{-1} & -S_D^{-1} B D^{-1} \\ -D^{-1} C S_D^{-1} & D^{-1} + D^{-1} C S_D^{-1} B D^{-1} \end{pmatrix}                    (119)
        |M| = |D|\, |S_D|.                    (120)

    Eq. (117) and eq. (119) are two different representations of the same matrix M^{-1}, which means that the corresponding blocks in these two equations must be equal, e.g. S_D^{-1} = A^{-1} + A^{-1}B S_A^{-1} C A^{-1}. This result is known as the matrix inversion lemma:

        S_D^{-1} = (A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1}.                    (121)

    The following result, which comes from equating the upper-right blocks, is also useful:

        A^{-1}B(D - CA^{-1}B)^{-1} = (A - BD^{-1}C)^{-1}BD^{-1}.                    (122)

This formula and the matrix inversion lemma are useful in the derivation of the
Kalman filter equations.
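    These identities are easy to verify numerically; the following sketch uses randomly generated blocks (an illustration only, with blocks kept well conditioned by adding a multiple of the identity).

    # Numerical check of eqs. (118) and (121) on random blocks.
    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 3)) + 3 * np.eye(3)
    B = rng.standard_normal((3, 2))
    C = rng.standard_normal((2, 3))
    D = rng.standard_normal((2, 2)) + 3 * np.eye(2)

    M = np.block([[A, B], [C, D]])
    Ainv, Dinv = np.linalg.inv(A), np.linalg.inv(D)
    SA = D - C @ Ainv @ B                       # Schur complement of A
    SD = A - B @ Dinv @ C                       # Schur complement of D

    # Matrix inversion lemma, eq. (121).
    lhs = np.linalg.inv(SD)
    rhs = Ainv + Ainv @ B @ np.linalg.inv(SA) @ C @ Ainv
    print(np.allclose(lhs, rhs))                # True

    # Determinant identity, eq. (118).
    print(np.allclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(SA)))  # True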

D     Vector and Matrix Derivatives
Suppose A is a matrix, x is a vector, and y denotes either a scalar or a vector, depending on the context. The partial derivative of a scalar y with respect to A is defined as

        \left( \frac{\partial y}{\partial A} \right)_{ij} = \frac{\partial y}{\partial a_{ij}},                    (123)

where a_{ij} is the (i, j)-th entry of the matrix A.
    From the definition (123), it is easy to get the following rule:

        \frac{\partial}{\partial x} (x^T y) = \frac{\partial}{\partial x} (y^T x) = y.                    (124)
    For an n-by-n square matrix A, the determinant M_{ij} of the matrix obtained by removing from A its i-th row and j-th column is called a minor of A. The scalar c_{ij} = (-1)^{i+j} M_{ij} is called a cofactor of A. The matrix A_{cof} with c_{ij} in its (i, j)-th entry is called the cofactor matrix of A. Finally, the adjoint matrix of A is defined as the transpose of the cofactor matrix:

        A_{adj} = A_{cof}^T.                    (125)
    There are some well-known facts about the minors, determinant, and adjoint of a matrix:

        |A| = \sum_{j} a_{ij} c_{ij}                    (126)
        A^{-1} = \frac{1}{|A|} A_{adj}.                    (127)
    Since M_{ij} is computed with the i-th row and j-th column removed, it does not depend on a_{ij}, and neither does c_{ij}. Thus we have

        \frac{\partial}{\partial a_{ij}} |A| = c_{ij},                    (128)

or

        \frac{\partial}{\partial A} |A| = A_{cof},                    (129)

which in turn shows that

        \frac{\partial}{\partial A} |A| = A_{cof} = A_{adj}^T = |A| (A^{-1})^T.                    (130)
    Using the chain rule, we immediately get that for a positive definite matrix A,

        \frac{\partial}{\partial A} \log |A| = (A^{-1})^T.                    (131)

    Applying the definition (123), it is easy to show that for a square matrix A,

        \frac{\partial}{\partial A} (x^T A x) = x x^T,                    (132)

since x^T A x = \sum_{i} \sum_{j} a_{ij} x_i x_j, where x = (x_1, x_2, . . . , x_n)^T.
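    Eqs. (131) and (132) can be checked with finite differences; the sketch below uses an assumed positive definite A and vector x, with a helper of our own naming.

    # Finite-difference check of eqs. (131) and (132).
    import numpy as np

    rng = np.random.default_rng(5)
    n = 3
    A = rng.standard_normal((n, n))
    A = A @ A.T + n * np.eye(n)            # make A positive definite
    x = rng.standard_normal(n)
    eps = 1e-6

    def numeric_grad(f, A, eps):
        G = np.zeros_like(A)
        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                E = np.zeros_like(A)
                E[i, j] = eps
                G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
        return G

    g1 = numeric_grad(lambda M: np.log(np.linalg.det(M)), A, eps)
    print(np.allclose(g1, np.linalg.inv(A).T, atol=1e-5))      # eq. (131)

    g2 = numeric_grad(lambda M: x @ M @ x, A, eps)
    print(np.allclose(g2, np.outer(x, x), atol=1e-5))          # eq. (132)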




