# Feature Extraction

Lecturer: 虞台文

## Content

- Principal Component Analysis (PCA)
- PCA Calculation for the Fewer-Sample Case
- Factor Analysis
- Fisher's Linear Discriminant Analysis
- Multiple Discriminant Analysis

## Principal Component Analysis (PCA)

- PCA is a linear procedure that finds the directions in input space along which most of the energy of the input lies.
  - Feature extraction
  - Dimension reduction
- It is also called the (discrete) Karhunen-Loève transform, or the Hotelling transform.
### The Basic Concept

Assume the data $\mathbf{x}$ (a random vector) has zero mean. PCA finds a unit vector $\mathbf{w}$ that reflects the largest amount of variance of the data. That is,

$$\mathbf{w}^* = \arg\max_{\|\mathbf{w}\|=1} E[(\mathbf{w}^T \mathbf{x})^2]$$

Remark: the covariance matrix $C$ is symmetric and positive semidefinite.

### The Method

$$E[(\mathbf{w}^T \mathbf{x})^2] = E[\mathbf{w}^T \mathbf{x}\mathbf{x}^T \mathbf{w}] = \mathbf{w}^T E[\mathbf{x}\mathbf{x}^T]\mathbf{w} = \mathbf{w}^T C \mathbf{w}$$

where the covariance matrix is estimated from $N$ samples as

$$E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T = C$$

The problem is therefore

$$\mathbf{w}^* = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^T C \mathbf{w}$$

### The Method

$$\text{maximize } f(\mathbf{w}) = \mathbf{w}^T C \mathbf{w} \quad \text{subject to } g(\mathbf{w}) = \mathbf{w}^T\mathbf{w} - 1 = 0$$

By the method of Lagrange multipliers, define

$$L(\mathbf{w}) = f(\mathbf{w}) - \lambda g(\mathbf{w}) = \mathbf{w}^T C \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

An extreme point $\mathbf{w}^*$ satisfies

$$\nabla_\mathbf{w} L(\mathbf{w}^*) = \nabla_\mathbf{w} f(\mathbf{w}^*) - \lambda \nabla_\mathbf{w} g(\mathbf{w}^*) = 0$$

Since $\nabla_\mathbf{w} L(\mathbf{w}) = 2C\mathbf{w} - 2\lambda\mathbf{w}$, setting $\nabla_\mathbf{w} L(\mathbf{w}) = 0$ gives

$$C\mathbf{w} = \lambda\mathbf{w}$$

### Discussion

At the extreme points, $\mathbf{w}^T C \mathbf{w} = \lambda\, \mathbf{w}^T \mathbf{w} = \lambda$. Thus $\mathbf{w}$ is an eigenvector of $C$, and $\lambda$ is its corresponding eigenvalue.

- Let $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d$ be the eigenvectors of $C$ whose corresponding eigenvalues are $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$.
- They are called the principal components of $C$.
- Their significance can be ordered according to their eigenvalues.
- Since $C$ is symmetric and positive semidefinite, its eigenvectors are orthogonal; hence they form a basis of the feature space.
- For dimensionality reduction, keep only a few of them.
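The eigendecomposition view of PCA can be sketched in NumPy as follows; the toy data, the variable names, and the choice of $k$ are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal PCA sketch: eigendecompose the covariance matrix, order the
# eigenvectors by eigenvalue, and keep the k most significant components.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # N x n toy data
X -= X.mean(axis=0)                      # PCA assumes zero-mean data

C = X.T @ X / len(X)                     # C = (1/N) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(C)     # C is symmetric PSD -> use eigh
order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                    # keep the k principal components
W = eigvecs[:, :k]                       # n x k basis
Z = X @ W                                # reduced-dimension representation
```

Projecting onto the leading eigenvectors keeps the directions of largest variance, which is exactly the $\arg\max$ problem solved above.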
### Applications

- Image processing
- Signal processing
- Compression
- Feature extraction
- Pattern recognition

### Example

Projecting the data onto the most significant axis facilitates classification, and also achieves dimensionality reduction.
### Issues

- PCA is effective for identifying the multivariate signal distribution; hence it is good for signal reconstruction.
- However, the most significant component obtained by PCA may be inappropriate for pattern classification: the direction that best preserves variance is not necessarily the direction that best separates the classes.
### Whitening

Whitening is a process that transforms a random vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ (assumed zero mean) into $\mathbf{z} = (z_1, z_2, \ldots, z_n)^T$ with zero mean and unit variance:

$$E[\mathbf{z}\mathbf{z}^T] = I$$

- $\mathbf{z}$ is said to be white or sphered.
- This implies that all of its elements are uncorrelated.
- However, it does not imply that its elements are independent.
### Whitening Transform

Let $V$ be a whitening transform, $\mathbf{z} = V\mathbf{x}$. Then

$$E[\mathbf{z}\mathbf{z}^T] = V E[\mathbf{x}\mathbf{x}^T] V^T = V C_x V^T$$

Decompose $C_x$ as $C_x = EDE^T$; clearly, $D$ is a diagonal matrix (of eigenvalues) and $E$ is an orthonormal matrix (of eigenvectors). Setting

$$V = D^{-1/2} E^T$$

gives $E[\mathbf{z}\mathbf{z}^T] = D^{-1/2}E^T (EDE^T) E D^{-1/2} = I$.
### Whitening Transform

If $V$ is a whitening transform and $U$ is any orthonormal matrix, then $UV$ (i.e., $V$ followed by a rotation) is also a whitening transform.

Proof:

$$E[\mathbf{z}\mathbf{z}^T] = UV E[\mathbf{x}\mathbf{x}^T] V^T U^T = U V C_x V^T U^T = U I U^T = I$$
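The whitening construction, and the fact that any rotation of a whitening transform still whitens, can be checked numerically. The data and the particular rotation below are illustrative assumptions.

```python
import numpy as np

# Sketch of the whitening transform V = D^{-1/2} E^T.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 3))  # correlated data
X -= X.mean(axis=0)                    # zero mean

Cx = X.T @ X / len(X)
d, E = np.linalg.eigh(Cx)              # Cx = E D E^T
V = np.diag(d ** -0.5) @ E.T           # whitening transform
Z = X @ V.T                            # z = V x for each sample

# UV is also a whitening transform for any orthonormal U (a rotation here).
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
Z2 = X @ (U @ V).T
```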
### Why Whitening?

- With PCA, we usually choose several major eigenvectors as the basis for representation.
- This basis is efficient for reconstruction, but may be inappropriate for other applications, e.g., classification.
- After whitening, we are free to rotate the basis, and the rotation can be chosen to produce more interesting features.
## PCA Calculation for the Fewer-Sample Case

### Complexity of the PCA Calculation

Let the covariance matrix $C = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T$ be of size $n \times n$. The time complexity of the eigendecomposition by direct computation is $O(n^3)$. Is there a more efficient method in the case that $N \ll n$?
### PCA from Fewer Samples

Consider $N$ samples $\mathbf{x}_i = (x_1, \ldots, x_n)^T$, $i = 1, \ldots, N$, with $N \ll n$. Define the data matrix

$$D = (\mathbf{x}_1, \ldots, \mathbf{x}_N), \qquad C = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T = \frac{1}{N} D D^T$$

Define the $N \times N$ matrix $T = \frac{1}{N} D^T D$, and let $\mathbf{e}_i$ ($i = 1, \ldots, N$) be the orthonormal eigenvectors of $T$ with corresponding eigenvalues $\lambda_i$, i.e.,

$$T\mathbf{e}_i = \lambda_i \mathbf{e}_i \quad\Longleftrightarrow\quad \frac{1}{N} D^T D\, \mathbf{e}_i = \lambda_i \mathbf{e}_i \qquad (i = 1, \ldots, N)$$
### Eigenvectors of C

Pre-multiplying by $D$ shows that $D\mathbf{e}_i$ is an eigenvector of $C$:

$$C(D\mathbf{e}_i) = \frac{1}{N} D D^T D \mathbf{e}_i = D\left(\frac{1}{N} D^T D\, \mathbf{e}_i\right) = \lambda_i (D\mathbf{e}_i) \qquad (i = 1, \ldots, N)$$

Define the normalized vectors

$$\mathbf{p}_i = \frac{1}{\sqrt{\lambda_i N}}\, D\mathbf{e}_i \qquad (i = 1, \ldots, N)$$

Then

$$\mathbf{p}_i^T \mathbf{p}_j = \frac{1}{\sqrt{\lambda_i \lambda_j}\, N}\, \mathbf{e}_i^T D^T D\, \mathbf{e}_j = \frac{1}{\sqrt{\lambda_i \lambda_j}}\, \mathbf{e}_i^T T \mathbf{e}_j = \frac{\lambda_j}{\sqrt{\lambda_i \lambda_j}}\, \mathbf{e}_i^T \mathbf{e}_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

so the $\mathbf{p}_i$ are orthonormal eigenvectors of $C$ with eigenvalues $\lambda_i$.
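The fewer-sample trick above can be sketched as follows: eigendecompose the small $N \times N$ matrix $T$ and recover eigenvectors of the large $n \times n$ covariance. The dimensions and data are illustrative assumptions.

```python
import numpy as np

# Fewer-sample PCA: N << n, so work with T = (1/N) D^T D (N x N)
# instead of C = (1/N) D D^T (n x n).
rng = np.random.default_rng(2)
N, n = 10, 400
D = rng.normal(size=(n, N))             # data matrix, columns are samples x_i
D -= D.mean(axis=1, keepdims=True)      # zero mean

T = D.T @ D / N                         # small N x N matrix
lam, E = np.linalg.eigh(T)
keep = lam > 1e-10                      # drop numerically zero eigenvalues
lam, E = lam[keep], E[:, keep]

P = D @ E / np.sqrt(lam * N)            # columns p_i = D e_i / sqrt(lambda_i N)
C = D @ D.T / N                         # full covariance, for verification only
```

Eigendecomposing $T$ costs $O(N^3)$ instead of $O(n^3)$, which is the point of the construction.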
## Factor Analysis

### What is a Factor?

- If several variables correlate highly, they might measure aspects of a common underlying dimension.
  - These dimensions are called factors.
- Factors are classification axes along which the measures can be plotted.
- The higher the correlations among a set of variables, the more a factor can explain the intercorrelations between those variables.
### Graph Representation

Variables can be plotted against the factor axes, e.g., Verbal Skill ($F_2$) versus Quantitative Skill ($F_1$), with loadings ranging from $-1$ to $+1$ on each axis.
### What is Factor Analysis?

- A method for investigating whether a number of variables of interest $Y_1, Y_2, \ldots, Y_n$ are linearly related to a smaller number of unobservable factors $F_1, F_2, \ldots, F_m$.
- Used for data reduction and summarization.
- A statistical approach to analyzing the interrelationships among a large number of variables and explaining these variables in terms of their common underlying dimensions (factors).

### Example

Observable data (e.g., test scores) may be explained by unobservable factors such as quantitative skill and verbal skill.
### The Model

$$\begin{aligned}
Y_1 &= \beta_{11} F_1 + \beta_{12} F_2 + \cdots + \beta_{1m} F_m + e_1 \\
Y_2 &= \beta_{21} F_1 + \beta_{22} F_2 + \cdots + \beta_{2m} F_m + e_2 \\
&\;\;\vdots \\
Y_n &= \beta_{n1} F_1 + \beta_{n2} F_2 + \cdots + \beta_{nm} F_m + e_n
\end{aligned}$$

In matrix form, $\mathbf{y} = B\mathbf{f} + \boldsymbol{\varepsilon}$, where

- $\mathbf{y}$: observation vector, $E[\mathbf{y}] = 0$
- $\mathbf{f}$: factor vector, $E[\mathbf{f}] = 0$, $E[\mathbf{f}\mathbf{f}^T] = I$
- $\boldsymbol{\varepsilon}$: Gaussian noise vector, $E[\boldsymbol{\varepsilon}] = 0$, $E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \mathrm{diag}[\sigma_1^2, \ldots, \sigma_n^2]$

The covariance of the observations is then

$$E[\mathbf{y}\mathbf{y}^T] = C_y = E[(B\mathbf{f} + \boldsymbol{\varepsilon})(B\mathbf{f} + \boldsymbol{\varepsilon})^T] = BB^T + Q$$

where $Q = \mathrm{diag}[\sigma_1^2, \ldots, \sigma_n^2]$.

The left-hand side $C_y$ can be estimated from data; the right-hand side $BB^T + Q$ is obtained from the model. Element-wise,

$$C_y = \begin{pmatrix} s_{Y_1}^2 & s_{Y_1Y_2} & \cdots & s_{Y_1Y_n} \\ s_{Y_2Y_1} & s_{Y_2}^2 & \cdots & s_{Y_2Y_n} \\ \vdots & & \ddots & \vdots \\ s_{Y_nY_1} & s_{Y_nY_2} & \cdots & s_{Y_n}^2 \end{pmatrix}
= \begin{pmatrix} \sum_j \beta_{1j}^2 & \sum_j \beta_{1j}\beta_{2j} & \cdots & \sum_j \beta_{1j}\beta_{nj} \\ \sum_j \beta_{2j}\beta_{1j} & \sum_j \beta_{2j}^2 & \cdots & \sum_j \beta_{2j}\beta_{nj} \\ \vdots & & \ddots & \vdots \\ \sum_j \beta_{nj}\beta_{1j} & \sum_j \beta_{nj}\beta_{2j} & \cdots & \sum_j \beta_{nj}^2 \end{pmatrix}
+ \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}$$

where all sums run over $j = 1, \ldots, m$. In particular, each variance decomposes as

$$\mathrm{Var}[Y_i] = s_{Y_i}^2 = \underbrace{\beta_{i1}^2 + \beta_{i2}^2 + \cdots + \beta_{im}^2}_{\text{communality (explained)}} + \underbrace{\sigma_i^2}_{\text{specific variance (unexplained)}}$$
### Goal

Our goal is to minimize the unexplained (specific) variance

$$\mathrm{trace}[C_y - BB^T] = \mathrm{trace}[Q]$$

Hence,

$$B^* = \arg\min_B \mathrm{trace}[C_y - BB^T]$$
### Uniqueness

Is the solution unique? No: there are infinitely many solutions. If $B^*$ is a solution and $T$ is an orthonormal transformation (rotation), then $B^*T$ is also a solution, since $(B^*T)(B^*T)^T = B^*TT^T B^{*T} = B^*B^{*T}$.
### Example

Which of two equivalent loading matrices is better?

$$B_1 = \begin{pmatrix} 0.5 & 0.5 \\ 0.3 & 0.3 \\ 0.5 & -0.5 \end{pmatrix} \qquad B_2 = \begin{pmatrix} 0.707 & 0 \\ 0.231 & 0 \\ 0 & 0.707 \end{pmatrix}$$

Left ($B_1$): each factor has nonzero loadings on all variables. Right ($B_2$): each factor controls different variables, which is easier to interpret.

### The Method

A common estimation approach is the principal component method. Eigendecompose

$$C_y = E\Lambda E^T = [\mathbf{e}_1, \ldots, \mathbf{e}_m, \ldots, \mathbf{e}_n]\,\mathrm{diag}[\lambda_1, \ldots, \lambda_m, \ldots, \lambda_n]\,[\mathbf{e}_1, \ldots, \mathbf{e}_m, \ldots, \mathbf{e}_n]^T$$

and set

$$B = [\mathbf{e}_1, \ldots, \mathbf{e}_m]\,\mathrm{diag}[\lambda_1^{1/2}, \ldots, \lambda_m^{1/2}], \qquad Q = C_y - BB^T$$
### Example

Applying the principal component method with $m = 2$ factors to a sample covariance matrix $C_y$ yields a loading matrix such as

$$B = \begin{pmatrix} 3.136773 & 0.023799 \\ 0.132190 & 2.237858 \\ 0.127697 & 1.731884 \end{pmatrix}$$
                    
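The principal component method can be sketched directly from the formulas above; the covariance matrix $C_y$ below is an illustrative assumption, not the example from the slides.

```python
import numpy as np

# Principal component method for factor analysis:
# B = [e_1..e_m] diag(sqrt(lambda_1)..sqrt(lambda_m)), Q = Cy - B B^T.
Cy = np.array([[2.0, 0.8, 0.6],
               [0.8, 1.5, 0.5],
               [0.6, 0.5, 1.2]])          # assumed sample covariance (PSD)

lam, E = np.linalg.eigh(Cy)
order = np.argsort(lam)[::-1]             # decreasing eigenvalues
lam, E = lam[order], E[:, order]

m = 2                                     # number of factors
B = E[:, :m] * np.sqrt(lam[:m])           # n x m loading matrix
Q = Cy - B @ B.T                          # residual; diag holds specific variances

communality = (B ** 2).sum(axis=1)        # explained part of each Var[Y_i]
```

Each diagonal entry of $C_y$ splits into communality plus specific variance, matching the decomposition of $\mathrm{Var}[Y_i]$ above.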
### Factor Rotation

$$B' = BT, \qquad B = \begin{pmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\ \vdots & & & \vdots \\ \beta_{n1} & \beta_{n2} & \cdots & \beta_{nm} \end{pmatrix}, \qquad T = \begin{pmatrix} t_{11} & t_{12} & \cdots & t_{1m} \\ t_{21} & t_{22} & \cdots & t_{2m} \\ \vdots & & & \vdots \\ t_{m1} & t_{m2} & \cdots & t_{mm} \end{pmatrix}$$

Here $B$ is the loading matrix and $T$ is an orthonormal rotation matrix. Common rotation criteria:

- Varimax
- Quartimax
- Equimax
- Orthomax
### Varimax

Criterion: maximize $\sum_{i=1}^{m} F_i^2$ subject to $\mathbf{t}_i^T \mathbf{t}_j = \delta_{ij}$.

Write $B^T = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_n]$ and $T = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_m]$, so that the rotated loading matrix is

$$BT = [b_{jk}]_{n \times m}, \qquad b_{jk} = \boldsymbol{\beta}_j^T \mathbf{t}_k$$

$F_i^2$ is the variance of the squared loadings in the $i$th column of $BT$:

$$F_i^2 = \frac{1}{n}\left[\sum_{j=1}^{n} b_{ji}^4 - \frac{1}{n}\left(\sum_{j=1}^{n} b_{ji}^2\right)^2\right]$$

Dropping the constant factor $1/n$, construct the Lagrangian

$$L(T, \Lambda) = \sum_{i=1}^{m}\left[\sum_{j=1}^{n}(\boldsymbol{\beta}_j^T\mathbf{t}_i)^4 - \frac{1}{n}\left(\sum_{j=1}^{n}(\boldsymbol{\beta}_j^T\mathbf{t}_i)^2\right)^2\right] - \sum_{i=1}^{m}\sum_{j=1}^{m} 2\lambda_{ij}\,\mathbf{t}_i^T\mathbf{t}_j$$

The gradient with respect to $\mathbf{t}_k$ is

$$\frac{\partial L(T, \Lambda)}{\partial \mathbf{t}_k} = 4\sum_{j=1}^{n} (\boldsymbol{\beta}_j^T \mathbf{t}_k)^3 \boldsymbol{\beta}_j - \frac{4}{n}\left(\sum_{j=1}^{n} (\boldsymbol{\beta}_j^T \mathbf{t}_k)^2\right)\sum_{j=1}^{n} (\boldsymbol{\beta}_j^T \mathbf{t}_k)\boldsymbol{\beta}_j - 4\sum_{i=1}^{m} \lambda_{ik} \mathbf{t}_i$$

Writing $c_{jk} = b_{jk}^3$ and $d_k = \sum_{j=1}^{n} b_{jk}^2$, this becomes

$$\frac{\partial L(T, \Lambda)}{\partial \mathbf{t}_k} = 4\sum_{j=1}^{n}\left(c_{jk} - \frac{1}{n} d_k b_{jk}\right)\boldsymbol{\beta}_j - 4\sum_{i=1}^{m} \lambda_{ik} \mathbf{t}_i$$

Define $C = [b_{jk}^3]_{n \times m}$, $D = \mathrm{diag}[d_1, \ldots, d_m]$, and $A = [\lambda_{ij}]_{m \times m}$. Then $\partial L/\partial \mathbf{t}_k$ is the $k$th column of

$$\frac{\partial L(T, \Lambda)}{\partial T} = 4\left[B^T C - \frac{1}{n} B^T B T D - TA\right] = 4[M - TA], \qquad M = B^T\left[C - \frac{1}{n} BTD\right]$$

Goal: $L(T, \Lambda)$ reaches its maximum once $TA = M$. This suggests the following iteration.

Initially,

- obtain $B_0$ by whatever method, e.g., PCA;
- set $T_0$ as the initial approximation of the rotation matrix, e.g., $T_0 = I$.

Iteratively execute the following procedure:

1. $B_1 = B_0 T_0$.
2. Evaluate $C_1$, $D_1$, and $M_1$ (these require $B_1$).
3. Find $T_1$ and $A_1$ such that $T_1 A_1 = M_1$: pre-multiplying each side by its transpose gives $A_1^2 = M_1^T M_1 = U\Sigma U^T$ (since $T_1$ is orthonormal), so $A_1 = U\Sigma^{1/2}U^T$ and $T_1 = M_1 A_1^{-1}$.
4. If $T_1 \approx T_0$, stop; otherwise set $T_0 \leftarrow T_1$ and repeat.
Equivalently, writing the rotated loadings as $\gamma_{ij} = \boldsymbol{\beta}_i^T \mathbf{t}_j$ for $B' = BT = [\gamma_{ij}]_{n \times m}$, the varimax criterion maximizes

$$J(T) = \sum_{i=1}^{m} F_i^2, \qquad F_i^2 = \mathrm{Var}[\gamma_{\cdot i}^2] = \frac{1}{n}\sum_{j=1}^{n}\gamma_{ji}^4 - \frac{1}{n^2}\left(\sum_{j=1}^{n}\gamma_{ji}^2\right)^2$$

so that, up to the constant factor $1/n$,

$$J(T) \propto \sum_{i=1}^{m}\left[\sum_{j=1}^{n}(\boldsymbol{\beta}_j^T\mathbf{t}_i)^4 - \frac{1}{n}\left(\sum_{j=1}^{n}(\boldsymbol{\beta}_j^T\mathbf{t}_i)^2\right)^2\right]$$
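The varimax iteration can be sketched as follows. The step solving $T_1 A_1 = M_1$ is implemented with an SVD: if $M = U\Sigma V^T$, then $MA^{-1} = M(M^TM)^{-1/2} = UV^T$. The loading matrix `B0` is an illustrative assumption.

```python
import numpy as np

def varimax(B0, n_iter=100, tol=1e-8):
    """Sketch of the varimax iteration T A = M described above."""
    n, m = B0.shape
    T = np.eye(m)                              # T0 = I
    for _ in range(n_iter):
        BT = B0 @ T                            # rotated loadings b_jk
        C = BT ** 3                            # c_jk = b_jk^3
        D = np.diag((BT ** 2).sum(axis=0))     # d_k = sum_j b_jk^2
        M = B0.T @ (C - BT @ D / n)            # M = B^T [C - (1/n) B T D]
        U, s, Vt = np.linalg.svd(M)            # solve T A = M with T orthonormal
        T_new = U @ Vt                         # T = M (M^T M)^{-1/2} = U V^T
        if np.abs(T_new - T).max() < tol:      # stop once T stabilizes
            T = T_new
            break
        T = T_new
    return B0 @ T, T

B0 = np.array([[0.7, 0.4], [0.5, -0.6], [0.3, 0.5], [0.6, -0.2]])
B_rot, T = varimax(B0)
```

The returned rotation is orthonormal, and the rotated loadings have column-wise variance of squared loadings at least as large as the initial ones.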
## Fisher's Linear Discriminant Analysis

### Main Concept

- PCA seeks directions that are efficient for representation.
- Discriminant analysis seeks directions that are efficient for discrimination.
### Criterion: The Two-Category Case

We compare the classification efficiencies of projections onto different directions $\mathbf{w}$, with $\|\mathbf{w}\| = 1$. The class means and their projections are

$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i} \mathbf{x}, \qquad \tilde{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i} \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\left(\frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{x}\right) = \mathbf{w}^T\mathbf{m}_i$$
### Between-Class Scatter

Define the between-class scatter matrix

$$S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

The separation of the projected means is

$$(\tilde{m}_1 - \tilde{m}_2)^2 = (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T S_B \mathbf{w}$$

The larger, the better.
### Within-Class Scatter

Define the per-class and within-class scatter matrices

$$S_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T, \qquad S_W = \sum_{i=1}^{2} S_i$$

The scatter of the projected samples of class $i$ is

$$\tilde{s}_i^2 = \sum_{\mathbf{x} \in D_i} (\mathbf{w}^T\mathbf{x} - \tilde{m}_i)^2 = \mathbf{w}^T S_i \mathbf{w}$$

so the total within-class scatter of the projection is

$$\tilde{s}^2 = \tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^T(S_1 + S_2)\mathbf{w} = \mathbf{w}^T S_W \mathbf{w}$$

The smaller, the better.
### Goal

Define the criterion (a generalized Rayleigh quotient)

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}, \qquad \mathbf{w}^* = \arg\max_{\mathbf{w}} J(\mathbf{w})$$

The length of $\mathbf{w}$ is immaterial.
### Generalized Eigenvector

To maximize $J(\mathbf{w})$, $\mathbf{w}$ is the generalized eigenvector associated with the largest generalized eigenvalue; that is,

$$S_B \mathbf{w} = \lambda S_W \mathbf{w} \quad\text{or}\quad S_W^{-1} S_B \mathbf{w} = \lambda\mathbf{w}$$

Since $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$, we always have $S_B\mathbf{w} = c\,(\mathbf{m}_1 - \mathbf{m}_2)$ for some scalar $c$; because the length of $\mathbf{w}$ is immaterial, the solution is simply

$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
### Proof

$$\frac{dJ(\mathbf{w})}{d\mathbf{w}} = \frac{2 S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} - \frac{2 S_W \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} \cdot \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$

Setting $dJ(\mathbf{w})/d\mathbf{w} = 0$ gives

$$S_B \mathbf{w} = \left(\frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}\right) S_W \mathbf{w} = \lambda S_W \mathbf{w}$$

i.e., $S_W^{-1} S_B \mathbf{w} = \lambda\mathbf{w}$ with $\lambda = J(\mathbf{w})$, and hence $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$.

### Example

Projections onto $\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ separate the two classes better than projections onto other candidate directions.
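The closed-form solution above can be sketched on illustrative two-class Gaussian data (the data and variable names are assumptions, not from the slides):

```python
import numpy as np

# Fisher's linear discriminant: w = S_W^{-1} (m1 - m2).
rng = np.random.default_rng(3)
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(200, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=[1.0, 3.0], size=(200, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)            # per-class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                            # within-class scatter
SB = np.outer(m1 - m2, m1 - m2)         # between-class scatter

w = np.linalg.solve(SW, m1 - m2)        # w = S_W^{-1} (m1 - m2)
w /= np.linalg.norm(w)                  # the length of w is immaterial

def J(v):
    return (v @ SB @ v) / (v @ SW @ v)  # generalized Rayleigh quotient
```

By construction, $J$ is at least as large at $\mathbf{w}$ as at any other direction, e.g., the coordinate axes.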
## Multiple Discriminant Analysis

### Generalization of Fisher's Linear Discriminant

For the $c$-class problem, we seek a $(c-1)$-dimensional projection for efficient discrimination.

### Scatter Matrices in Feature Space

Total scatter matrix:

$$S_T = \sum_{\mathbf{x}} (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T$$

Within-class scatter matrix:

$$S_W = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$

Between-class scatter matrix:

$$S_B = \sum_{i=1}^{c} n_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T, \qquad S_T = S_B + S_W$$
### The (c−1)-Dimensional Projection

The projection space is described by a $d \times (c-1)$ matrix

$$W = [\mathbf{w}_1 \;\; \mathbf{w}_2 \;\; \cdots \;\; \mathbf{w}_{c-1}]$$

### Scatter Matrices in the Projection Space

$$\tilde{S}_T = W^T S_T W, \qquad \tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$$
### Criterion

$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}, \qquad W^* = \arg\max_{W} J(W)$$
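A sketch of multiple discriminant analysis for $c = 3$ classes follows: the projection is built from the $c-1$ leading generalized eigenvectors of $S_W^{-1}S_B$. The data are illustrative assumptions.

```python
import numpy as np

# MDA: project d=4 dimensional data from c=3 classes onto c-1=2 dimensions.
rng = np.random.default_rng(4)
classes = [rng.normal(loc=mu, size=(100, 4))
           for mu in ([0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0])]

X = np.vstack(classes)
m = X.mean(axis=0)                       # global mean
SW = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)
SB = sum(len(Xi) * np.outer(Xi.mean(axis=0) - m, Xi.mean(axis=0) - m)
         for Xi in classes)

# Generalized eigenproblem S_B w = lambda S_W w, via S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real           # d x (c-1) projection matrix
Y = X @ W                                # projected samples
```

Since $S_B$ is a sum of $c$ rank-one terms around the global mean, its rank is at most $c-1$, which is why only $c-1$ useful directions exist.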
