Some Properties of the Normal Distribution

Jianxin Wu
GVU Center and College of Computing
Georgia Institute of Technology

October 18, 2005

1 Introduction

The normal distribution is the most widely used probability distribution in statistical pattern recognition and machine learning. Its nice properties are probably the main reason for this popularity.

In this short note,[1] I try to organize the basic facts about the normal distribution. There is no advanced theory here; however, understanding these facts requires some linear algebra and analysis that are not always covered sufficiently in undergraduate texts. This note attempts to pool these facts together, in the hope that the collection will be useful.

[1] I planned to write a short note listing some properties of the normal distribution. However, somehow I decided to keep it self-contained. The result is that it has become rather fat.

2 Definition

2.1 Univariate Normal

The probability density function of a univariate normal distribution has the following form:

    p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),   (1)

in which \mu is the expected value of x and \sigma^2 is the variance; we assume \sigma > 0.

We first have to verify that eq. (1) is a valid probability density function. It is obvious that p(x) \ge 0 holds for all x \in \mathbb{R}. From eq. (96) in Appendix A we know that \int_{-\infty}^{\infty} \exp(-x^2/t)\,dx = \sqrt{t\pi} for t > 0. Applying this equation with t = 2\sigma^2, we have

    \int_{-\infty}^{\infty} p(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx   (2)
      = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{2\sigma^2} \right) dx   (3)
      = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot \sqrt{2\sigma^2 \pi} = 1,   (4)

which means that p(x) is a valid p.d.f.

The distribution with p.d.f. \frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2}) is called the standard normal distribution. In Appendix A it is shown that the mean and standard deviation of the standard normal distribution are 0 and 1, respectively. By a change of variables, it is then easy to show that \mu = \int_{-\infty}^{\infty} x\,p(x)\,dx and \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 p(x)\,dx for a general normal distribution.
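As a quick numerical sanity check of eq. (1) and the moments above, the density can be integrated on a fine grid. This is only an illustrative sketch, assuming NumPy; the values mu = 1.5 and sigma = 2.0 are arbitrary choices, not from the note.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule; avoids NumPy-version differences around np.trapz."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Illustrative parameters (not from the note).
mu, sigma = 1.5, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

total = trapz(p, x)                  # eq. (2)-(4): should be 1
mean = trapz(x * p, x)               # should be mu
var = trapz((x - mean) ** 2 * p, x)  # should be sigma^2

print(total, mean, var)
```

Truncating the integral at mu ± 10 sigma is harmless here, since the tail mass beyond ten standard deviations is negligible.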
2.2 Multivariate Normal

The probability density function of a multivariate normal distribution has the following form:

    p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right),   (5)

in which x is a d-dimensional vector, \mu is the d-dimensional mean vector, and \Sigma is the d-by-d covariance matrix. We assume that \Sigma is symmetric and positive definite.

We first have to verify that eq. (5) is a valid probability density function. It is obvious that p(x) \ge 0 holds for all x \in \mathbb{R}^d. Next we diagonalize \Sigma as \Sigma = U^T \Lambda U, in which U is an orthogonal matrix containing the eigenvectors of \Sigma, and \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d) is a diagonal matrix containing the eigenvalues of \Sigma in its diagonal entries; note that |\Lambda| = |\Sigma|. Let us define a new random vector as

    y = \Lambda^{-1/2} U (x - \mu).   (6)

The mapping from x to y is one-to-one, and the determinant of the Jacobian matrix is |\partial y / \partial x| = |\Lambda^{-1/2}| = |\Sigma|^{-1/2}. Now we are ready to calculate the integral:

    \int p(x)\,dx = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) dx   (7)
      = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \, |\Sigma|^{1/2} \exp\left( -\frac{1}{2} y^T y \right) dy   (8)
      = \prod_{i=1}^{d} \int \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y_i^2}{2} \right) dy_i   (9)
      = \prod_{i=1}^{d} 1 = 1,   (10)

in which y_i is the i-th component of y, i.e. y = (y_1, \ldots, y_d). This establishes the validity of the multivariate normal density function.

Since y is a random vector, it has a density, denoted as p_Y(y). Using the inverse transform method, we get

    p_Y(y) = p(\mu + U^T \Lambda^{1/2} y) \, \left| U^T \Lambda^{1/2} \right|   (11)
      = \frac{\left| U^T \Lambda^{1/2} \right|}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (U^T \Lambda^{1/2} y)^T \Sigma^{-1} (U^T \Lambda^{1/2} y) \right)   (12)
      = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right).   (13)

The density defined by

    p_Y(y) = \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} y^T y \right)   (14)

is called a spherical normal distribution. Let z be a random vector formed by a subset of the components of y. By marginalization it is clear that p_Z(z) = \frac{1}{(2\pi)^{|z|/2}} \exp\left( -\frac{1}{2} z^T z \right), and specifically p_{Y_i}(y_i) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y_i^2}{2} \right). Using this fact, it is straightforward to show that the mean vector and covariance matrix of a spherical normal distribution are 0 and I, respectively. Using the inverse transform of eq.
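The transform of eq. (6) can be sketched numerically: after it is applied, the sample covariance should be close to the identity. A minimal sketch, assuming NumPy; the mean, covariance, and random seed below are illustrative choices.

```python
import numpy as np

# Whitening transform of eq. (6): y = Λ^{-1/2} U (x - µ).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are samples

# np.linalg.eigh returns Σ = V Λ Vᵀ with eigenvectors in the columns of V,
# so U = Vᵀ in the note's convention Σ = Uᵀ Λ U.
lam, V = np.linalg.eigh(Sigma)
U = V.T
W = np.diag(lam ** -0.5) @ U          # Λ^{-1/2} U
Y = (X - mu) @ W.T                    # y = W (x - µ), applied row-wise

C = np.cov(Y.T)                       # should be close to the identity
print(np.round(C, 2))
```

The sample covariance of the whitened data is close to I up to Monte Carlo error, matching the spherical normal of eq. (14).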
(6), we can easily calculate the mean vector and covariance matrix of the density p(x):

    E(x) = E(\mu + U^T \Lambda^{1/2} y) = \mu + U^T \Lambda^{1/2} E(y) = \mu,   (15)

and

    E\left[ (x-\mu)(x-\mu)^T \right] = E\left[ (U^T \Lambda^{1/2} y)(U^T \Lambda^{1/2} y)^T \right]   (16)
      = U^T \Lambda^{1/2} E(y y^T) \Lambda^{1/2} U   (17)
      = U^T \Lambda^{1/2} \Lambda^{1/2} U   (18)
      = \Sigma.   (19)

3 Notation and Parameterization

When we have a density of the form in eq. (5), it is often written as

    x \sim N(\mu, \Sigma)   (20)

or

    N(x; \mu, \Sigma).   (21)

In most cases we will use the mean vector \mu and covariance matrix \Sigma to express a normal density. This is called the moment parameterization. There is another parameterization of a normal density, called the canonical parameterization, in which a normal density is expressed as

    p(x) = \exp\left( \alpha + \eta^T x - \frac{1}{2} x^T \Lambda x \right),   (22)

in which \alpha = -\frac{1}{2} \left( d \log 2\pi - \log |\Lambda| + \eta^T \Lambda^{-1} \eta \right) is a normalization constant. The parameters in the two representations are related by

    \Lambda = \Sigma^{-1}   (23)
    \eta = \Sigma^{-1} \mu   (24)
    \Sigma = \Lambda^{-1}   (25)
    \mu = \Lambda^{-1} \eta.   (26)

Notice that there is a possible confusion in our notation: \Lambda has different meanings in eq. (22) and eq. (6). In eq. (22), \Lambda is a parameter of the canonical parameterization of a normal density, and is not necessarily diagonal. In eq. (6), \Lambda is the diagonal matrix formed by the eigenvalues of \Sigma. It is straightforward to show that the moment parameterization and the canonical parameterization of the normal distribution are equivalent. In some cases the canonical parameterization is more convenient to use than the moment parameterization.

4 Linear Operation and Summation

4.1 Univariate Case

Suppose x_1 \sim N(\mu_1, \sigma_1^2) and x_2 \sim N(\mu_2, \sigma_2^2) are two independent univariate normal random variables. It is obvious that a x_1 + b \sim N(a \mu_1 + b, a^2 \sigma_1^2), in which a and b are two scalars. Now consider a random variable z = x_1 + x_2.
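The equivalence of the two parameterizations (eqs. (22)-(26)) can be checked numerically by evaluating both forms at the same point. A minimal sketch, assuming NumPy; the 2-d parameters and test point are illustrative.

```python
import numpy as np

# Moment ↔ canonical parameterization, eqs. (22)-(26). Illustrative values.
mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
d = mu.size

Lam = np.linalg.inv(Sigma)            # Λ = Σ^{-1}, eq. (23)
eta = Lam @ mu                        # η = Σ^{-1} µ, eq. (24)
_, logdet_Lam = np.linalg.slogdet(Lam)
# α = -(1/2)(d log 2π - log|Λ| + ηᵀ Λ^{-1} η), from eq. (22)
alpha = -0.5 * (d * np.log(2 * np.pi) - logdet_Lam
                + eta @ np.linalg.inv(Lam) @ eta)

def p_moment(x):
    """Eq. (5): moment parameterization."""
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
        (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

def p_canonical(x):
    """Eq. (22): canonical parameterization."""
    return np.exp(alpha + eta @ x - 0.5 * x @ Lam @ x)

x_test = np.array([0.7, 0.2])
print(p_moment(x_test), p_canonical(x_test))
```

Both functions return the same density value, confirming that the two parameterizations describe the same distribution.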
The density of z is calculated as follows:

    p_Z(z) = \iint_{z = x_1 + x_2} p_{X_1}(x_1) \, p_{X_2}(x_2) \, dx_1 \, dx_2   (27)
      = \iint_{z = x_1 + x_2 - \mu_1 - \mu_2} p_{X_1}(x_1 + \mu_1) \, p_{X_2}(x_2 + \mu_2) \, dx_1 \, dx_2   (28)
      = \int p_{X_1}(x + \mu_1) \, p_{X_2}(z - x - \mu_1) \, dx   (29)
      = \frac{1}{2\pi \sigma_1 \sigma_2} \int \exp\left( -\frac{x^2}{2\sigma_1^2} - \frac{(z - x - \mu_1 - \mu_2)^2}{2\sigma_2^2} \right) dx   (30)
      = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \int \exp\left( -\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2 \sigma_2^2} \left( x - \frac{(z - \mu_1 - \mu_2)\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \right)^2 \right) dx   (31)
      = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right) \sqrt{ \frac{2\pi \sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} }   (32)
      = \frac{1}{\sqrt{2\pi (\sigma_1^2 + \sigma_2^2)}} \exp\left( -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right),   (33)

in which the step from eq. (31) to eq. (32) used the result of eq. (96). The sum of two independent univariate normal random variables is again a normal random variable, with the mean values and variances summed up respectively, i.e. z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). The summation rule is easily generalized to n independent normal random variables.

4.2 Multivariate Case

Suppose x \sim N(\mu, \Sigma) is a d-dimensional normal random variable, A is a q-by-d matrix, and b is a q-dimensional vector. Then z = Ax + b is a q-dimensional normal random variable: z \sim N(A\mu + b, A\Sigma A^T). This fact is proved using the characteristic function (see Appendix B). The characteristic function of z is

    \varphi_Z(t) = E_Z[\exp(i t^T z)]   (34)
      = E_X\left[ \exp\left( i t^T (Ax + b) \right) \right]   (35)
      = \exp(i t^T b) \, E_X\left[ \exp\left( i (A^T t)^T x \right) \right]   (36)
      = \exp(i t^T b) \exp\left( i (A^T t)^T \mu - \frac{1}{2} (A^T t)^T \Sigma (A^T t) \right)   (37)
      = \exp\left( i t^T (A\mu + b) - \frac{1}{2} t^T (A \Sigma A^T) t \right),   (38)

in which the step from eq. (37) to eq. (38) used eq. (106) in Appendix B. Appendix B also states that if a characteristic function \varphi(t) is of the form \exp(i t^T \mu - \frac{1}{2} t^T \Sigma t), then the underlying density p(x) is normal with mean \mu and covariance matrix \Sigma. Applying this fact to eq. (38), we immediately get

    z \sim N(A\mu + b, \, A \Sigma A^T).   (39)

Suppose x \sim N(\mu_1, \Sigma_1) and y \sim N(\mu_2, \Sigma_2) are two independent d-dimensional normal random variables, and define a new random vector z = x + y. We could calculate the probability density function p_Z(z) using the same method as we used in the univariate case.
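The linear-transformation rule of eq. (39) can be illustrated with a Monte Carlo check. A minimal sketch, assuming NumPy; all constants (µ, Σ, A, b, seed, sample size) are illustrative choices.

```python
import numpy as np

# Monte Carlo check of eq. (39): if x ~ N(µ, Σ), then z = A x + b has
# mean Aµ + b and covariance A Σ Aᵀ.
rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 1.0]])       # q = 2, d = 3
b = np.array([0.0, 3.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Z = X @ A.T + b                       # apply z = A x + b to each sample

print(np.round(Z.mean(axis=0), 2))    # ≈ Aµ + b
print(np.round(np.cov(Z.T), 2))       # ≈ A Σ Aᵀ
```

The empirical mean and covariance of z agree with A µ + b and A Σ Aᵀ up to sampling error.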
However, that calculation is complex, and we would have to apply the matrix inversion lemma of Appendix C. The characteristic function simplifies the calculation. Using eq. (109) in Appendix B, we get

    \varphi_Z(t) = \varphi_X(t) \, \varphi_Y(t)   (40)
      = \exp\left( i t^T \mu_1 - \frac{1}{2} t^T \Sigma_1 t \right) \exp\left( i t^T \mu_2 - \frac{1}{2} t^T \Sigma_2 t \right)   (41)
      = \exp\left( i t^T (\mu_1 + \mu_2) - \frac{1}{2} t^T (\Sigma_1 + \Sigma_2) t \right),   (42)

which immediately gives us z \sim N(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2). The sum of two multivariate normal random variables is as easy to compute as in the univariate case: sum up the mean vectors and covariance matrices. The rule is the same for summing several multivariate normal random variables.

Now that we have the tool of linear transformation, let us revisit eq. (6). For convenience we repeat the equation here:

    x \sim N(\mu, \Sigma), \quad y = \Lambda^{-1/2} U (x - \mu).   (43)

Using the properties of linear transformations of a normal density, y is indeed normal (in section 2.2 we painfully calculated p_Y(y) using the inverse transform method), with mean vector 0 and covariance matrix I. Applying eq. (6) is called the whitening transformation, since the transformed density has an identity covariance matrix.

5 Geometry and Mahalanobis Distance

Figure 1 shows a bivariate normal density function.

Figure 1: Bivariate normal p.d.f.

A normal density has only one mode, located at the mean vector, and the shape of the density is determined by the covariance matrix. Figure 2 shows the equal probability contours of a bivariate normal random variable.

Figure 2: Equal probability contours of a bivariate normal distribution.

All points on a given equal probability contour must have the following term evaluate to a constant value:

    r^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu) = c.   (44)

r^2(x, \mu) is called the Mahalanobis distance from x to \mu, given the covariance matrix \Sigma. Eq. (44) defines a hyperellipsoid in d dimensions, which means that the equal probability contours are hyperellipsoids in d-dimensional space.
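As a small numerical sketch of eq. (44) (NumPy assumed; the covariance and points below are illustrative): points with different Euclidean distances from µ can have the same Mahalanobis distance, and hence lie on the same equal probability contour.

```python
import numpy as np

# Mahalanobis distance of eq. (44): r²(x, µ) = (x - µ)ᵀ Σ⁻¹ (x - µ).
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x):
    d = x - mu
    return d @ Sigma_inv @ d

# (2, 0) and (0, 1) have Euclidean distances 2 and 1 from µ, yet both lie
# on the contour r² = 1: the ellipse is stretched along the high-variance axis.
r2_a = mahalanobis_sq(np.array([2.0, 0.0]))
r2_b = mahalanobis_sq(np.array([0.0, 1.0]))
print(r2_a, r2_b)
```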
The principal axes of this hyperellipsoid are given by the eigenvectors of \Sigma, and the lengths of these axes are proportional to the square roots of the eigenvalues associated with these eigenvectors.

6 Conditioning

Suppose x_1 and x_2 are two multivariate normal random variables with joint p.d.f.

    p\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \frac{1}{(2\pi)^{(d_1+d_2)/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right),

in which d_1 and d_2 are the dimensionalities of x_1 and x_2 respectively, and \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. The matrices \Sigma_{12} and \Sigma_{21} are covariance matrices between x_1 and x_2, satisfying \Sigma_{12} = \Sigma_{21}^T.

The marginal distributions x_1 \sim N(\mu_1, \Sigma_{11}) and x_2 \sim N(\mu_2, \Sigma_{22}) are easy to get from the joint distribution. We are interested in computing the conditional probability density p(x_1 | x_2).

We will need to compute the inverse of \Sigma, and this task is completed by using the Schur complement (see Appendix C). For notational simplicity, we denote the Schur complement of \Sigma_{11} as S_{11}, defined as S_{11} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}. Similarly, the Schur complement of \Sigma_{22} is S_{22} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.

Applying eq. (119) and noticing that \Sigma_{12} = \Sigma_{21}^T, we get (writing x_1 - \mu_1 as \tilde{x}_1 and x_2 - \mu_2 as \tilde{x}_2)

    \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} S_{22}^{-1} & -S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1} \Sigma_{12}^T S_{22}^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{12}^T S_{22}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \end{pmatrix}   (45)

and

    \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix} = (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2)^T S_{22}^{-1} (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2) + \tilde{x}_2^T \Sigma_{22}^{-1} \tilde{x}_2.   (46)

Thus we can split the joint distribution as

    p\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}} \exp\left( -\frac{1}{2} (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2)^T S_{22}^{-1} (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2) \right) \cdot \frac{1}{(2\pi)^{d_2/2} |\Sigma_{22}|^{1/2}} \exp\left( -\frac{1}{2} \tilde{x}_2^T \Sigma_{22}^{-1} \tilde{x}_2 \right),   (47)

in which we used the fact that |\Sigma| = |\Sigma_{22}| \, |S_{22}| (from eq. (118) in Appendix C). Since the second term in the right hand side of eq.
(47) is the marginal p(x_2), and p(x_1, x_2) = p(x_1 | x_2) \, p(x_2), we now get the conditional probability p(x_1 | x_2) as

    p(x_1 | x_2) = \frac{1}{(2\pi)^{d_1/2} |S_{22}|^{1/2}} \exp\left( -\frac{1}{2} (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2)^T S_{22}^{-1} (\tilde{x}_1 - \Sigma_{12} \Sigma_{22}^{-1} \tilde{x}_2) \right),   (48)

where \tilde{x}_1 = x_1 - \mu_1 and \tilde{x}_2 = x_2 - \mu_2, or

    x_1 | x_2 \sim N(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \, S_{22})   (49)
             \sim N(\mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \, \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}).   (50)

7 Product of Gaussians

Suppose p_1(x) = N(x; \mu_1, \Sigma_1) and p_2(x) = N(x; \mu_2, \Sigma_2) are two d-dimensional normal densities. Sometimes we want to compute the density that is proportional to the product of the two normal densities, i.e. p_X(x) = \alpha \, p_1(x) \, p_2(x), in which \alpha is a proper normalization constant that makes p_X(x) a valid density function.

In this task the canonical parameterization (see section 3) is extremely helpful. Writing the two normal densities in canonical form,

    p_1(x) = \exp\left( \alpha_1 + \eta_1^T x - \frac{1}{2} x^T \Lambda_1 x \right)   (51)
    p_2(x) = \exp\left( \alpha_2 + \eta_2^T x - \frac{1}{2} x^T \Lambda_2 x \right),   (52)

the density p_X(x) is easy to compute:

    p_X(x) = \alpha \, p_1(x) \, p_2(x) = \exp\left( \alpha' + (\eta_1 + \eta_2)^T x - \frac{1}{2} x^T (\Lambda_1 + \Lambda_2) x \right),   (53)

in which \alpha' collects all the constant terms. This equation states that in the canonical parameterization, in order to compute a product of Gaussians, we just sum the parameters.

This result readily extends to the product of n normal densities. Suppose we have n normal densities p_i(x), whose parameters in the canonical parameterization are \eta_i and \Lambda_i, i = 1, 2, \ldots, n. Then p_X(x) = \alpha \prod_{i=1}^{n} p_i(x) is also a normal density, given by

    p_X(x) = \exp\left( \alpha' + \left( \sum_{i=1}^{n} \eta_i \right)^T x - \frac{1}{2} x^T \left( \sum_{i=1}^{n} \Lambda_i \right) x \right).   (54)

Now let us go back to the moment parameterization. Suppose we have n normal densities p_i(x) = N(x; \mu_i, \Sigma_i), i = 1, 2, \ldots, n. Then p_X(x) = \alpha \prod_{i=1}^{n} p_i(x) is normal,

    p_X(x) = N(x; \mu, \Sigma)   (55)

where

    \Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} + \cdots + \Sigma_n^{-1}   (56)
    \Sigma^{-1} \mu = \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2 + \cdots + \Sigma_n^{-1} \mu_n.   (57)

Now we have listed some properties of the normal distribution. Next let us see how these properties are applied.
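Eqs. (56) and (57) can be verified numerically in one dimension by multiplying two Gaussian densities on a grid and normalizing. A minimal sketch, assuming NumPy; all constants are illustrative.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule; avoids NumPy-version differences around np.trapz."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def pdf(x, m, s):
    return np.exp(-((x - m) ** 2) / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# Illustrative parameters: product of N(0, 1) and N(3, 4).
mu1, s1 = 0.0, 1.0
mu2, s2 = 3.0, 2.0
x = np.linspace(-10, 15, 100_001)

prod = pdf(x, mu1, s1) * pdf(x, mu2, s2)
prod /= trapz(prod, x)                           # normalize the product

var_pred = 1.0 / (1.0 / s1**2 + 1.0 / s2**2)     # eq. (56), 1-d case
mu_pred = var_pred * (mu1 / s1**2 + mu2 / s2**2) # eq. (57), 1-d case

mu_num = trapz(x * prod, x)
var_num = trapz((x - mu_num) ** 2 * prod, x)
print(mu_num, mu_pred, var_num, var_pred)
```

The numerically computed mean and variance of the normalized product match the predictions of eqs. (56)-(57).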
8 Application I: Parameter Estimation

8.1 Maximum Likelihood Estimation

Suppose we have a d-dimensional multivariate normal random variable x \sim N(\mu, \Sigma), and n i.i.d. (independent and identically distributed) samples D = \{x_1, x_2, \ldots, x_n\} drawn from this distribution. The task is to estimate the parameters \mu and \Sigma.

The log-likelihood of observing the data set D given parameters \mu and \Sigma is

    \ell(\mu, \Sigma \,|\, D) = \log \prod_{i=1}^{n} p(x_i)   (58, 59)
      = -\frac{nd}{2} \log(2\pi) + \frac{n}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).   (60)

Taking derivatives of the log-likelihood with respect to \mu and \Sigma^{-1} gives (see Appendix D)

    \frac{\partial \ell}{\partial \mu} = \sum_{i=1}^{n} \Sigma^{-1} (x_i - \mu)   (61)
    \frac{\partial \ell}{\partial \Sigma^{-1}} = \frac{n}{2} \Sigma - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T,   (62)

in which eq. (61) used eq. (124) and the chain rule, and eq. (62) used eq. (131), eq. (132), and the fact that \Sigma = \Sigma^T. The notation in eq. (61) is a little confusing: there are two \Sigma symbols on the right hand side, the first (\sum) representing a summation and the second the covariance matrix.

To find the maximum likelihood solution, we want to find the maximum of the likelihood function. Setting both eq. (61) and eq. (62) to 0 gives the solution:

    \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i   (63)
    \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})(x_i - \mu_{ML})^T.   (64)

Eqs. (63) and (64) clearly state that the maximum likelihood estimates of the mean vector and covariance matrix are just the sample mean and the sample covariance matrix, respectively.

8.2 Bayesian Parameter Estimation

In this Bayesian estimation example, we assume that the covariance matrix is known. Suppose we have a d-dimensional multivariate normal density x \sim N(\mu, \Sigma), and n i.i.d. samples D = \{x_1, x_2, \ldots, x_n\} drawn from this distribution. We also need a prior on the parameter \mu; let us assume that the prior is \mu \sim N(\mu_0, \Sigma_0). The task is to estimate the parameter \mu. Note that we assume \mu_0, \Sigma_0, and \Sigma are all known. The only parameter to be estimated is the mean vector \mu.
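Before moving on, the maximum likelihood estimates of eqs. (63) and (64) can be checked on simulated data. A minimal sketch, assuming NumPy; the "true" parameters and seed are illustrative choices.

```python
import numpy as np

# Eqs. (63)-(64): the ML estimates are the sample mean and the sample
# covariance (with the 1/n factor, not the unbiased 1/(n-1) factor).
rng = np.random.default_rng(42)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)
n = X.shape[0]

mu_ml = X.mean(axis=0)           # eq. (63): sample mean
diff = X - mu_ml
Sigma_ml = diff.T @ diff / n     # eq. (64): sample covariance, 1/n factor

print(np.round(mu_ml, 2))
print(np.round(Sigma_ml, 2))
```

With a large sample, both estimates are close to the true parameters, as expected from the consistency of maximum likelihood.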
In Bayesian estimation, instead of finding the point \hat{\mu} in the parameter space that gives maximum likelihood, we calculate the posterior density of the parameter, p(\mu | D), and use the entire distribution of \mu as our estimate of this parameter. Applying Bayes' law,

    p(\mu | D) = \alpha \, p(D | \mu) \, p_0(\mu)   (65)
      = \alpha \prod_{i=1}^{n} p(x_i | \mu) \, p_0(\mu),   (66)

in which \alpha is a normalization constant that does not depend on \mu. Applying the result of section 7, we know that p(\mu | D) is also normal, and

    p(\mu | D) = N(\mu; \mu_n, \Sigma_n)   (67)

where

    \Sigma_n^{-1} = n \Sigma^{-1} + \Sigma_0^{-1}   (68)
    \Sigma_n^{-1} \mu_n = n \Sigma^{-1} \bar{x} + \Sigma_0^{-1} \mu_0,   (69)

in which \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i is the sample mean. Both \mu_n and \Sigma_n can be calculated from known quantities. So we have determined the posterior distribution p(\mu | D) of \mu given the data set D.

We chose the normal distribution as the prior family. Usually, the prior distribution is chosen such that the posterior belongs to the same functional form as the prior; a prior and posterior chosen in this way are said to be conjugate. We have seen that the normal distribution has the nice property that the posterior is normal whenever the prior is normal, i.e. the normal distribution is self-conjugate.

After p(\mu | D) is determined, a new sample is classified by calculating the probability

    p(x | D) = \int_{\mu} p(x | \mu) \, p(\mu | D) \, d\mu.   (70)

Eq. (70) has the same form as eq. (27). Thus we can guess that p(x | D) is again normal,

    p(x | D) = N(x; \mu_n, \Sigma + \Sigma_n).   (71)

The guess is correct, and is easy to verify by repeating the steps of eqs. (27) through (33).

9 Application II: Kalman Filter

9.1 The Model

The Kalman filter addresses the problem of estimating a state vector x in a discrete-time process, given a linear dynamic model

    x_k = A x_{k-1} + w_{k-1}   (72)

and a linear measurement model

    z_k = H x_k + v_k.   (73)

The process noise w_k and measurement noise v_k are assumed to be normal:

    w \sim N(0, Q)   (74)
    v \sim N(0, R).   (75)

At time k-1, assuming we know the distribution of x_{k-1}, the task is to estimate the posterior probability of x_k at time k, given the current observation z_k and the previous state estimate p(x_{k-1}).
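Returning for a moment to section 8.2, the posterior of eqs. (67)-(69) can be verified numerically in one dimension by multiplying the prior and the likelihood on a grid. A minimal sketch, assuming NumPy; all constants are illustrative.

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule; avoids NumPy-version differences around np.trapz."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# 1-d case of eqs. (68)-(69): with known σ², prior µ ~ N(µ0, σ0²), and
# data x1..xn, the posterior over µ has 1/σn² = n/σ² + 1/σ0² and
# µn/σn² = n x̄/σ² + µ0/σ0².
rng = np.random.default_rng(7)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
data = rng.normal(1.5, sigma, size=20)
n = data.size

var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)                  # eq. (68)
mu_n = var_n * (n * data.mean() / sigma**2 + mu0 / sigma0**2)   # eq. (69)

# Grid-based posterior: prior × likelihood, then normalize.
mu_grid = np.linspace(-5, 5, 100_001)
log_post = (-((mu_grid - mu0) ** 2) / (2 * sigma0**2)
            - ((data[None, :] - mu_grid[:, None]) ** 2).sum(axis=1)
              / (2 * sigma**2))
post = np.exp(log_post - log_post.max())
post /= trapz(post, mu_grid)

mu_num = trapz(mu_grid * post, mu_grid)
var_num = trapz((mu_grid - mu_num) ** 2 * post, mu_grid)
print(mu_num, mu_n, var_num, var_n)
```

The grid-based posterior mean and variance agree with the closed-form µn and Σn, as the product-of-Gaussians rule predicts.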
From a broader point of view, the task can be formulated as estimating the posterior probability of x_k at time k, given all the previous state estimates and all the observations up to time step k. Under certain Markovian assumptions, it is not hard to prove that the two problem formulations are equivalent.

In the Kalman filter setup, we assume the prior is normal, i.e. at time k = 0, p(x_0) = N(x; \mu_0, P_0). Instead of \Sigma, here we use P to denote a covariance matrix, in order to match the notation in the Kalman filter literature.

9.2 The Estimation

I will show that, with the help of the properties of Gaussians we have obtained, it is quite easy to derive the Kalman filter equations. The Kalman filter can be separated into two (related) steps. In the first step, based on the estimate p(x_{k-1}) and the dynamic model (72), we get an estimate p(x_k^-); the minus sign indicates that this estimate is made before we take the measurement into account. In the second step, based on p(x_k^-) and the measurement model (73), we get the final estimate p(x_k).

First, let us estimate p(x_k^-). Assume that at time k-1 the estimate we already have is a normal distribution,

    p(x_{k-1}) = N(x; \mu_{k-1}, P_{k-1}).   (76)

This assumption coincides well with the normal prior p(x_0). We will show that, under this assumption, after the Kalman filter update p(x_k) is again normal, which makes the assumption self-consistent. Applying the linear operation equation (39) to the dynamic model (72), we immediately get the estimate for x_k^-:

    x_k^- \sim N(\mu_k^-, P_k^-)   (77)
    \mu_k^- = A \mu_{k-1}   (78)
    P_k^- = A P_{k-1} A^T + Q.   (79)

The estimate p(x_k^-) conditioned on the observation z_k gives p(x_k), the estimate we want. Thus the conditioning property (50) can be used. In order to use eq. (50), we have to build the joint covariance matrix first. Since \mathrm{Var}(z_k) = H P_k^- H^T + R (applying eq. (39) to eq.
(73)) and

    \mathrm{Cov}(z_k, x_k^-) = \mathrm{Cov}(H x_k^- + v_k, \, x_k^-)   (80)
      = \mathrm{Cov}(H x_k^-, x_k^-)   (81)
      = H P_k^-,   (82)

the joint covariance matrix of (x_k^-, z_k) is

    \begin{pmatrix} P_k^- & P_k^- H^T \\ H P_k^- & H P_k^- H^T + R \end{pmatrix}.   (83)

Applying the conditioning equation (50), we get

    p(x_k) = p(x_k^- | z_k)   (84)
      = N(\mu_k, P_k)   (85)
    P_k = P_k^- - P_k^- H^T (H P_k^- H^T + R)^{-1} H P_k^-   (86)
    \mu_k = \mu_k^- + P_k^- H^T (H P_k^- H^T + R)^{-1} (z_k - H \mu_k^-).   (87)

Equations (77)-(79) and (84)-(87) are the Kalman filter updating rules. The term P_k^- H^T (H P_k^- H^T + R)^{-1} appears in both eq. (86) and eq. (87). Defining

    K_k = P_k^- H^T (H P_k^- H^T + R)^{-1},   (88)

the equations are simplified as

    P_k = (I - K_k H) P_k^-   (89)
    \mu_k = \mu_k^- + K_k (z_k - H \mu_k^-).   (90)

The term K_k is called the Kalman gain matrix, and the term z_k - H \mu_k^- is called the innovation.

A Gaussian Integral

We will compute the integral of the univariate normal p.d.f. in this section. The trick in doing this integration is to consider two independent univariate Gaussians at one time:

    \left( \int_{-\infty}^{\infty} e^{-x^2} dx \right)^2 = \int_{-\infty}^{\infty} e^{-x^2} dx \int_{-\infty}^{\infty} e^{-y^2} dy   (91)
      = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)} \, dx \, dy   (92)
      = \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2} \, dr \, d\theta   (93)
      = 2\pi \left[ -\frac{1}{2} e^{-r^2} \right]_0^{\infty}   (94)
      = \pi,   (95)

in which eq. (93) changes to polar coordinates, and the extra r that appears inside the integral is the determinant of the Jacobian. Hence \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}. The above integral is easily extended (by the substitution x \mapsto x/\sqrt{t}) to

    f(t) = \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx = \sqrt{t\pi},   (96)

in which we assume t > 0. Then we have

    \frac{df}{dt} = \frac{d}{dt} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{t} \right) dx = \int_{-\infty}^{\infty} \frac{x^2}{t^2} \exp\left( -\frac{x^2}{t} \right) dx   (97, 98)

and

    \frac{df}{dt} = \frac{d}{dt} \sqrt{t\pi} = \frac{1}{2} \sqrt{\frac{\pi}{t}}.   (99)

The above two equations give us

    \int_{-\infty}^{\infty} x^2 \exp\left( -\frac{x^2}{t} \right) dx = \frac{t^2}{2} \sqrt{\frac{\pi}{t}}.   (100)

Applying eq. (100) with t = 2, we have

    \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \frac{1}{\sqrt{2\pi}} \cdot \frac{4}{2} \sqrt{\frac{\pi}{2}} = 1.   (101)

It is obvious that

    \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = 0,   (102)

since x \exp\left( -\frac{x^2}{2} \right) is an odd function. Eqs. (102) and (101) prove that the mean and standard deviation of the standard normal distribution are 0 and 1, respectively.

B Characteristic Functions

The characteristic function of a random variable with p.d.f.
p(x) is defined as its Fourier transform

    \varphi(t) = E[e^{i t^T x}],   (103)

in which i = \sqrt{-1}. Let us compute the characteristic function of a normal random variable:

    \varphi(t) = E[\exp(i t^T x)]   (104)
      = \int \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) + i t^T x \right) dx   (105)
      = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right).   (106)

Since the characteristic function is defined as a Fourier transform, the inverse Fourier transform of \varphi(t) recovers exactly p(x); i.e. a random variable is completely determined by its characteristic function. So when we see that a characteristic function \varphi(t) is of the form \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right), we know that the underlying density is normal with mean \mu and covariance matrix \Sigma.

Suppose x and y are two independent random vectors with the same dimension, and define a new random vector z = x + y. Then

    p_Z(z) = \iint_{z = x + y} p_X(x) \, p_Y(y) \, dx \, dy   (107)
      = \int p_X(x) \, p_Y(z - x) \, dx.   (108)

Since a convolution in the original space becomes a product in the Fourier space, we have

    \varphi_Z(t) = \varphi_X(t) \, \varphi_Y(t),   (109)

which means that the characteristic function of the sum of two independent random variables is just the product of the characteristic functions of the summands.

C Schur Complement and the Matrix Inversion Lemma

The Schur complement is very useful in computing the inverse of a block matrix. Suppose M is a block matrix expressed as

    M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},   (110)

in which A and D are non-singular square matrices. We want to compute M^{-1}. Some algebraic manipulation gives

    \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix} M \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}   (111)
      = \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}   (112)
      = \begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix} \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix}   (113)
      = \begin{pmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & S_A \end{pmatrix},   (114)

in which the term D - CA^{-1}B is called the Schur complement of A, denoted as S_A. The identity X M Y = Z implies M^{-1} = Y Z^{-1} X. Hence we have

    M^{-1} = \begin{pmatrix} I & -A^{-1}B \\ 0 & I \end{pmatrix} \begin{pmatrix} A & 0 \\ 0 & S_A \end{pmatrix}^{-1} \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix}   (115)
      = \begin{pmatrix} A^{-1} & -A^{-1}B S_A^{-1} \\ 0 & S_A^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -CA^{-1} & I \end{pmatrix}   (116)
      = \begin{pmatrix} A^{-1} + A^{-1}B S_A^{-1} C A^{-1} & -A^{-1}B S_A^{-1} \\ -S_A^{-1} C A^{-1} & S_A^{-1} \end{pmatrix}.   (117)

Taking the determinant of both sides of eq. (115) gives

    |M| = |A| \, |S_A|.
(118)

We can also compute M^{-1} using the Schur complement of D, S_D = A - BD^{-1}C, in the following way:

    M^{-1} = \begin{pmatrix} S_D^{-1} & -S_D^{-1} B D^{-1} \\ -D^{-1} C S_D^{-1} & D^{-1} + D^{-1} C S_D^{-1} B D^{-1} \end{pmatrix}   (119)
    |M| = |D| \, |S_D|.   (120)

Eq. (117) and eq. (119) are two different representations of the same matrix M^{-1}, which means that the corresponding blocks in the two equations must be equal, e.g. S_D^{-1} = A^{-1} + A^{-1} B S_A^{-1} C A^{-1}. This result is known as the matrix inversion lemma:

    S_D^{-1} = (A - BD^{-1}C)^{-1} = A^{-1} + A^{-1} B (D - CA^{-1}B)^{-1} C A^{-1}.   (121)

The following result, which comes from equating the upper right blocks, is also useful:

    A^{-1} B (D - CA^{-1}B)^{-1} = (A - BD^{-1}C)^{-1} B D^{-1}.   (122)

This formula and the matrix inversion lemma are useful in the derivation of the Kalman filter equations.

D Vector and Matrix Derivatives

Suppose y is a scalar, A is a matrix, and x and y are vectors. The partial derivative of y with respect to A is defined as

    \left( \frac{\partial y}{\partial A} \right)_{ij} = \frac{\partial y}{\partial a_{ij}},   (123)

where a_{ij} is the (i, j)-th component of the matrix A. From the definition (123), it is easy to get the following rule:

    \frac{\partial}{\partial x} (x^T y) = \frac{\partial}{\partial x} (y^T x) = y.   (124)

For an n-by-n square matrix A, the determinant M_{ij} of the submatrix defined by removing from A the i-th row and j-th column is called a minor of A. The scalar c_{ij} = (-1)^{i+j} M_{ij} is called a cofactor of A. The matrix A_{cof} with c_{ij} in its (i, j)-th entry is called the cofactor matrix of A. Finally, the adjoint matrix of A is defined as the transpose of the cofactor matrix:

    A_{adj} = A_{cof}^T.   (125)

There are some well-known facts about the minors, determinant, and adjoint of a matrix:

    |A| = \sum_{j} a_{ij} c_{ij}   (126)
    A^{-1} = \frac{1}{|A|} A_{adj}.   (127)

Since M_{ij} removes the i-th row and j-th column of A, it does not depend on a_{ij}, and neither does c_{ij}. Thus we have

    \frac{\partial}{\partial a_{ij}} |A| = c_{ij},   (128)

or

    \frac{\partial}{\partial A} |A| = A_{cof},   (129)

which in turn shows that

    \frac{\partial}{\partial A} |A| = A_{cof} = A_{adj}^T = |A| \, (A^{-1})^T.   (130)

Using the chain rule, we immediately get that for a positive definite matrix A,

    \frac{\partial}{\partial A} \log |A| = (A^{-1})^T.
(131)

Applying the definition (123), it is easy to show that for a square matrix A,

    \frac{\partial}{\partial A} (x^T A x) = x x^T,   (132)

since x^T A x = \sum_i \sum_j a_{ij} x_i x_j, where x = (x_1, x_2, \ldots, x_n).