BackPropagation - PowerPoint


CHAPTER 11

Backpropagation
Ming-Feng Yeh

Objectives
A generalization of the LMS algorithm, called backpropagation, can be used to train multilayer networks.
Backpropagation is an approximate steepest descent algorithm in which the performance index is mean square error.
In order to calculate the derivatives, we need to use the chain rule of calculus.

Motivation
The perceptron learning rule and the LMS algorithm were designed to train single-layer perceptron-like networks.
They are only able to solve linearly separable classification problems.
The multilayer perceptron, trained by the backpropagation algorithm popularized through Parallel Distributed Processing, is currently the most widely used neural network.

Three-Layer Network
[Figure: a three-layer feedforward network.]
Number of neurons in each layer: $R - S^1 - S^2 - S^3$

Pattern Classification: XOR Gate
The limitations of the single-layer perceptron (Minsky & Papert, 1969):
$p_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; t_1 = 0 \qquad p_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; t_2 = 1 \qquad p_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\; t_3 = 1 \qquad p_4 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; t_4 = 0$

Two-Layer XOR Network
Two-layer, 2-2-1 network.
[Figure: each neuron in the first layer forms an individual decision boundary; the second-layer neuron combines the two decisions with an AND operation.]

Solved Problem P11.1
Design a multilayer network to distinguish these categories:
$p_1 = \begin{bmatrix} 1 & 1 & -1 & -1 \end{bmatrix}^T, \quad p_2 = \begin{bmatrix} -1 & -1 & 1 & 1 \end{bmatrix}^T$ (Class I)
$p_3 = \begin{bmatrix} 1 & -1 & 1 & -1 \end{bmatrix}^T, \quad p_4 = \begin{bmatrix} -1 & 1 & -1 & 1 \end{bmatrix}^T$ (Class II)
A single linear boundary would require
Class I: $Wp_1 + b > 0$, $Wp_2 + b > 0$; Class II: $Wp_3 + b < 0$, $Wp_4 + b < 0$.
Because $p_2 = -p_1$ and $p_4 = -p_3$, these four inequalities cannot hold simultaneously: there is no hyperplane that can separate the two categories.

Solution of Problem P11.1
[Figure: a two-layer network that solves the problem by combining AND and OR operations across its two layers.]

Function Approximation
Two-layer, 1-2-1 network:
$f^1(n) = \frac{1}{1 + e^{-n}}, \qquad f^2(n) = n$
Nominal parameter values:
$w^1_{1,1} = 10, \quad w^1_{2,1} = 10, \quad b^1_1 = -10, \quad b^1_2 = 10$
$w^2_{1,1} = 1, \quad w^2_{1,2} = 1, \quad b^2 = 0$
[Figure: network response $a^2$ versus $p$ for the nominal parameters, $-2 \le p \le 2$.]

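As an illustration (not part of the original slides), the nominal response plotted on this slide can be reproduced in a few lines of NumPy; the array shapes are my own convention.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# Nominal parameters from the slide (1-2-1 network)
W1 = np.array([[10.0], [10.0]])   # first-layer weights
b1 = np.array([[-10.0], [10.0]])  # first-layer biases
W2 = np.array([[1.0, 1.0]])       # second-layer weights
b2 = np.array([[0.0]])            # second-layer bias

p = np.linspace(-2, 2, 201)              # network input, as in the plots
a1 = logsig(W1 @ p.reshape(1, -1) + b1)  # hidden-layer output (2 x 201)
a2 = W2 @ a1 + b2                        # linear output layer (1 x 201)
# a2 rises in two "steps" centered near p = -1 and p = 1 (see the next slide)
```
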
Function Approximation
The centers of the steps occur where the net input to a neuron in the first layer is zero:
$n^1_1 = w^1_{1,1} p + b^1_1 = 0 \;\Rightarrow\; p = -\frac{b^1_1}{w^1_{1,1}} = -\frac{-10}{10} = 1$
$n^1_2 = w^1_{2,1} p + b^1_2 = 0 \;\Rightarrow\; p = -\frac{b^1_2}{w^1_{2,1}} = -\frac{10}{10} = -1$
The steepness of each step can be adjusted by changing the network weights.

Effect of Parameter Changes
[Figure: network response as the first-layer bias $b^1_2$ is varied over 20, 15, 10, 5, 0; $-2 \le p \le 2$.]

Effect of Parameter Changes
[Figure: network response as the second-layer weight $w^2_{1,1}$ is varied over 1.0, 0.5, 0.0, -0.5, -1.0; $-2 \le p \le 2$.]

Effect of Parameter Changes
[Figure: network response as the second-layer weight $w^2_{1,2}$ is varied over 1.0, 0.5, 0.0, -0.5, -1.0; $-2 \le p \le 2$.]

Effect of Parameter Changes
[Figure: network response as the second-layer bias $b^2$ is varied over 1.0, 0.5, 0.0, -0.5, -1.0; $-2 \le p \le 2$.]

Function Approximation
Two-layer networks, with sigmoid transfer functions in the hidden layer and linear transfer functions in the output layer, can approximate virtually any function of interest to any degree of accuracy, provided sufficiently many hidden units are available.

Backpropagation Algorithm
For multilayer networks the output of one layer becomes the input to the following layer:
$a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), \quad m = 0, 1, \dots, M-1$
$a^0 = p, \qquad a = a^M$

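As a hedged sketch of this recursion (not from the slides), forward propagation can be written as a short loop; representing the network as Python lists of NumPy weight matrices, bias vectors, and transfer functions is an assumption made for illustration.

```python
import numpy as np

def forward(p, weights, biases, transfer_fns):
    """Propagate the input through all M layers: a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1})."""
    a = p                                 # a^0 = p
    for W, b, f in zip(weights, biases, transfer_fns):
        a = f(W @ a + b)                  # the output of one layer is the input to the next
    return a                              # a = a^M, the network output
```

Writing it as a loop over (W, b, f) triples keeps the layer count M implicit in the length of the lists.
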
Performance Index
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}$
Mean square error: $F(x) = E[e^2] = E[(t - a)^2]$
Vector case: $F(x) = E[e^T e] = E[(t - a)^T (t - a)]$
Approximate mean square error: $\hat F(x) = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k)\, e(k)$
Approximate steepest descent algorithm:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \frac{\partial \hat F}{\partial w^m_{i,j}}, \qquad b^m_i(k+1) = b^m_i(k) - \alpha \frac{\partial \hat F}{\partial b^m_i}$

Chain Rule
$\frac{d f(n(w))}{dw} = \frac{d f(n)}{dn} \cdot \frac{d n(w)}{dw}$
Example: if $f(n) = e^n$ and $n = 2w$, then $f(n(w)) = e^{2w}$ and
$\frac{d f(n(w))}{dw} = \frac{d f(n)}{dn} \cdot \frac{d n(w)}{dw} = (e^n)(2)$
Approximate mean square error:
$\hat F(x) = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k)\, e(k)$
Applying the chain rule to the update equations:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \frac{\partial \hat F}{\partial w^m_{i,j}} = w^m_{i,j}(k) - \alpha \frac{\partial \hat F}{\partial n^m_i} \frac{\partial n^m_i}{\partial w^m_{i,j}}$
$b^m_i(k+1) = b^m_i(k) - \alpha \frac{\partial \hat F}{\partial b^m_i} = b^m_i(k) - \alpha \frac{\partial \hat F}{\partial n^m_i} \frac{\partial n^m_i}{\partial b^m_i}$

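A quick numerical check of the $e^n$, $n = 2w$ example may help; this sketch (my own, not from the slides) compares the chain-rule result $2e^{2w}$ with a central finite-difference estimate at an arbitrary point.

```python
import numpy as np

w = 0.7                        # arbitrary point at which to check the derivative
f = lambda w: np.exp(2 * w)    # f(n(w)) = e^{2w}, since f(n) = e^n and n = 2w

analytic = np.exp(2 * w) * 2                     # df/dn * dn/dw = e^n * 2
numeric = (f(w + 1e-6) - f(w - 1e-6)) / 2e-6     # central finite difference
print(analytic, numeric)       # the two values agree to several decimal places
```
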
The net input to the $i$th neuron of layer $m$:
$n^m_i = \sum_{j=1}^{S^{m-1}} w^m_{i,j} a^{m-1}_j + b^m_i, \qquad \frac{\partial n^m_i}{\partial w^m_{i,j}} = a^{m-1}_j, \quad \frac{\partial n^m_i}{\partial b^m_i} = 1$
The sensitivity of $\hat F$ to changes in the $i$th element of the net input at layer $m$: $s^m_i \equiv \frac{\partial \hat F}{\partial n^m_i}$
Gradient:
$\frac{\partial \hat F}{\partial w^m_{i,j}} = \frac{\partial \hat F}{\partial n^m_i} \cdot \frac{\partial n^m_i}{\partial w^m_{i,j}} = s^m_i\, a^{m-1}_j, \qquad \frac{\partial \hat F}{\partial b^m_i} = \frac{\partial \hat F}{\partial n^m_i} \cdot \frac{\partial n^m_i}{\partial b^m_i} = s^m_i$

Steepest Descent Algorithm
The steepest descent algorithm for the approximate mean square error:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \frac{\partial \hat F}{\partial n^m_i} \frac{\partial n^m_i}{\partial w^m_{i,j}} = w^m_{i,j}(k) - \alpha\, s^m_i\, a^{m-1}_j$
$b^m_i(k+1) = b^m_i(k) - \alpha \frac{\partial \hat F}{\partial n^m_i} \frac{\partial n^m_i}{\partial b^m_i} = b^m_i(k) - \alpha\, s^m_i$
Matrix form:
$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m$
where
$s^m \equiv \frac{\partial \hat F}{\partial n^m} = \begin{bmatrix} \dfrac{\partial \hat F}{\partial n^m_1} \\ \dfrac{\partial \hat F}{\partial n^m_2} \\ \vdots \\ \dfrac{\partial \hat F}{\partial n^m_{S^m}} \end{bmatrix}$

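In code, the matrix form of the update is two assignments per layer. This is a minimal sketch under the assumption that the sensitivity $s^m$ and the previous-layer output $a^{m-1}$ are already available as NumPy column vectors.

```python
import numpy as np

def update_layer(W, b, s, a_prev, alpha):
    """Apply W^m(k+1) = W^m(k) - alpha * s^m (a^{m-1})^T and b^m(k+1) = b^m(k) - alpha * s^m."""
    W_new = W - alpha * s @ a_prev.T   # outer product of sensitivity and previous-layer output
    b_new = b - alpha * s
    return W_new, b_new
```
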
BP the Sensitivity
Backpropagation: a recurrence relationship in which the sensitivity at layer $m$ is computed from the sensitivity at layer $m+1$.
Jacobian matrix:
$\frac{\partial n^{m+1}}{\partial n^m} = \begin{bmatrix} \dfrac{\partial n^{m+1}_1}{\partial n^m_1} & \dfrac{\partial n^{m+1}_1}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_1}{\partial n^m_{S^m}} \\ \dfrac{\partial n^{m+1}_2}{\partial n^m_1} & \dfrac{\partial n^{m+1}_2}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_2}{\partial n^m_{S^m}} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_1} & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_{S^m}} \end{bmatrix}$

Matrix Representation
The $i,j$ element of the Jacobian matrix:
$\frac{\partial n^{m+1}_i}{\partial n^m_j} = \frac{\partial \left( \sum_{l=1}^{S^m} w^{m+1}_{i,l} a^m_l + b^{m+1}_i \right)}{\partial n^m_j} = w^{m+1}_{i,j} \frac{\partial a^m_j}{\partial n^m_j} = w^{m+1}_{i,j} \frac{\partial f^m(n^m_j)}{\partial n^m_j} = w^{m+1}_{i,j}\, \dot f^m(n^m_j)$
Therefore
$\frac{\partial n^{m+1}}{\partial n^m} = W^{m+1} \dot F^m(n^m), \qquad \dot F^m(n^m) = \begin{bmatrix} \dot f^m(n^m_1) & 0 & \cdots & 0 \\ 0 & \dot f^m(n^m_2) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dot f^m(n^m_{S^m}) \end{bmatrix}$

Recurrence Relation
The recurrence relation for the sensitivity:
$s^m = \frac{\partial \hat F}{\partial n^m} = \left( \frac{\partial n^{m+1}}{\partial n^m} \right)^T \frac{\partial \hat F}{\partial n^{m+1}} = \dot F^m(n^m) (W^{m+1})^T \frac{\partial \hat F}{\partial n^{m+1}} = \dot F^m(n^m) (W^{m+1})^T s^{m+1}$
The sensitivities are propagated backward through the network from the last layer to the first layer:
$s^M \rightarrow s^{M-1} \rightarrow \cdots \rightarrow s^2 \rightarrow s^1$

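Because $\dot F^m(n^m)$ is diagonal, the recurrence can be evaluated with an elementwise product instead of forming the matrix explicitly. The sketch below is illustrative only; the helper `df`, which returns the elementwise transfer-function derivative, is my own notation.

```python
import numpy as np

def backprop_sensitivity(s_next, W_next, df, n):
    """s^m = F'^m(n^m) (W^{m+1})^T s^{m+1}, with the diagonal matrix applied elementwise."""
    return df(n) * (W_next.T @ s_next)
```
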
Backpropagation Algorithm
At the final layer:
$s^M_i = \frac{\partial \hat F}{\partial n^M_i} = \frac{\partial (t - a)^T (t - a)}{\partial n^M_i} = \frac{\partial \sum_{j=1}^{S^M} (t_j - a_j)^2}{\partial n^M_i} = -2 (t_i - a_i) \frac{\partial a_i}{\partial n^M_i}$
$\frac{\partial a_i}{\partial n^M_i} = \frac{\partial a^M_i}{\partial n^M_i} = \frac{\partial f^M(n^M_i)}{\partial n^M_i} = \dot f^M(n^M_i)$
$s^M_i = -2 (t_i - a_i)\, \dot f^M(n^M_i)$
Matrix form: $s^M = -2 \dot F^M(n^M) (t - a)$

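The same elementwise shortcut applies at the output layer; in this small sketch `df_M` is an assumed helper returning $\dot f^M$ evaluated elementwise at $n^M$.

```python
def output_sensitivity(t, a, df_M, n_M):
    """s^M = -2 F'^M(n^M) (t - a), applied elementwise."""
    return -2.0 * df_M(n_M) * (t - a)
```
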
Summary
The first step is to propagate the input forward through the network:
$a^0 = p$
$a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), \quad m = 0, 1, \dots, M-1$
$a = a^M$
The second step is to propagate the sensitivities backward through the network:
Output layer: $s^M = -2 \dot F^M(n^M) (t - a)$
Hidden layers: $s^m = \dot F^m(n^m) (W^{m+1})^T s^{m+1}, \quad m = M-1, \dots, 2, 1$
The final step is to update the weights and biases:
$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T$
$b^m(k+1) = b^m(k) - \alpha\, s^m$

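Putting the three steps together, one backpropagation iteration might look like the following sketch. The data structures (Python lists of weight matrices, bias column vectors, transfer functions and their derivatives) are assumptions made for illustration, not something prescribed by the slides.

```python
import numpy as np

def bp_iteration(p, t, W, b, f, df, alpha):
    """One backpropagation step: forward pass, backward sensitivities, parameter update."""
    M = len(W)

    # 1) Propagate the input forward: a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1})
    a = [p]                                   # a[0] = a^0 = p
    n = []
    for m in range(M):
        n.append(W[m] @ a[m] + b[m])
        a.append(f[m](n[m]))

    # 2) Propagate the sensitivities backward
    s = [None] * M
    s[M - 1] = -2.0 * df[M - 1](n[M - 1]) * (t - a[M])     # output layer
    for m in range(M - 2, -1, -1):                         # hidden layers
        s[m] = df[m](n[m]) * (W[m + 1].T @ s[m + 1])

    # 3) Update the weights and biases
    for m in range(M):
        W[m] = W[m] - alpha * s[m] @ a[m].T
        b[m] = b[m] - alpha * s[m]
    return W, b
```
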
BP Neural Network
[Figure: a general multilayer network — inputs $p_1, \dots, p_R$ feed layer 1; weight $w^m_{i,j}$ connects neuron $j$ of layer $m-1$ to neuron $i$ of layer $m$; layer $M$ produces the outputs $a^M_1, \dots, a^M_{S^M}$.]

Ex: Function Approximation
$g(p) = 1 + \sin\left( \frac{\pi}{4} p \right)$
[Figure: the input $p$ is applied both to $g(p)$ and to the 1-2-1 network; the error $e$ is the difference between the target $g(p)$ and the network output.]

Network Architecture
[Figure: the 1-2-1 network used in the example, with input $p$ and output $a$.]

Initial Values
$W^1(0) = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}, \quad b^1(0) = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}, \quad W^2(0) = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix}, \quad b^2(0) = \begin{bmatrix} 0.48 \end{bmatrix}$
[Figure: initial network response $a^2$ versus $p$, compared with the sine-wave target, for $-2 \le p \le 2$.]

Forward Propagation
Initial input: $a^0 = p = 1$
Output of the 1st layer:
$a^1 = f^1(W^1 a^0 + b^1) = \mathrm{logsig}\left( \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} [1] + \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} \right) = \mathrm{logsig}\left( \begin{bmatrix} -0.75 \\ -0.54 \end{bmatrix} \right) = \begin{bmatrix} \dfrac{1}{1 + e^{0.75}} \\ \dfrac{1}{1 + e^{0.54}} \end{bmatrix} = \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix}$
Output of the 2nd layer:
$a^2 = f^2(W^2 a^1 + b^2) = \mathrm{purelin}\left( \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix} + 0.48 \right) = 0.446$
Error:
$e = t - a = \left( 1 + \sin\left( \frac{\pi}{4} p \right) \right) - a^2 = \left( 1 + \sin\left( \frac{\pi}{4} \right) \right) - 0.446 = 1.261$

Transfer Func. Derivatives
$\dot f^1(n) = \frac{d}{dn}\left( \frac{1}{1 + e^{-n}} \right) = \frac{e^{-n}}{(1 + e^{-n})^2} = \left( 1 - \frac{1}{1 + e^{-n}} \right) \left( \frac{1}{1 + e^{-n}} \right) = (1 - a^1)(a^1)$
$\dot f^2(n) = \frac{d}{dn}(n) = 1$

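The identity $\dot f^1(n) = (1 - a^1)(a^1)$ means the derivative can be computed directly from the layer output already produced in the forward pass. A small numerical check (my own sketch, not from the slides):

```python
import numpy as np

n = np.linspace(-3, 3, 7)
a = 1.0 / (1.0 + np.exp(-n))            # logsig output

deriv_from_a = (1.0 - a) * a            # (1 - a)(a), reuses the forward-pass output
deriv_direct = np.exp(-n) / (1.0 + np.exp(-n)) ** 2   # d/dn of logsig(n)
print(np.allclose(deriv_from_a, deriv_direct))        # True
```
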
Backpropagation
The second layer sensitivity:
$s^2 = -2 \dot F^2(n^2)(t - a) = -2 [\dot f^2(n^2)]\, e = -2 \cdot 1 \cdot 1.261 = -2.522$
The first layer sensitivity:
$s^1 = \dot F^1(n^1) (W^2)^T s^2 = \begin{bmatrix} (1 - a^1_1)(a^1_1) & 0 \\ 0 & (1 - a^1_2)(a^1_2) \end{bmatrix} \begin{bmatrix} w^2_{1,1} \\ w^2_{1,2} \end{bmatrix} s^2$
$= \begin{bmatrix} (1 - 0.321)(0.321) & 0 \\ 0 & (1 - 0.368)(0.368) \end{bmatrix} \begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522) = \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}$

Weight Update
Learning rate $\alpha = 0.1$:
$W^2(1) = W^2(0) - \alpha\, s^2 (a^1)^T = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} - 0.1 [-2.522] \begin{bmatrix} 0.321 & 0.368 \end{bmatrix} = \begin{bmatrix} 0.171 & -0.0772 \end{bmatrix}$
$b^2(1) = b^2(0) - \alpha\, s^2 = [0.48] - 0.1 [-2.522] = [0.732]$
$W^1(1) = W^1(0) - \alpha\, s^1 (a^0)^T = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} [1] = \begin{bmatrix} -0.265 \\ -0.420 \end{bmatrix}$
$b^1(1) = b^1(0) - \alpha\, s^1 = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} = \begin{bmatrix} -0.475 \\ -0.140 \end{bmatrix}$

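The worked example on the preceding slides (Initial Values through Weight Update) can be replayed numerically. The sketch below simply plugs the initial values into the equations above; it is illustrative only, and it reproduces the quoted numbers ($a^2 \approx 0.446$, $e \approx 1.261$, $s^2 \approx -2.522$, $s^1 \approx [-0.0495,\ 0.0997]^T$) up to the rounding used on the slides.

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

# Initial values
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
alpha, p = 0.1, np.array([[1.0]])

# Forward propagation
a0 = p
n1 = W1 @ a0 + b1; a1 = logsig(n1)           # a1 ~ [0.321, 0.368]^T
n2 = W2 @ a1 + b2; a2 = n2                   # purelin output, ~0.446
t = 1 + np.sin(np.pi / 4 * p)                # target g(p)
e = t - a2                                   # error ~1.261

# Backpropagation of sensitivities
s2 = -2 * 1.0 * e                            # linear output layer: derivative is 1, ~ -2.522
s1 = (1 - a1) * a1 * (W2.T @ s2)             # ~ [-0.0495, 0.0997]^T

# Weight and bias updates
W2, b2 = W2 - alpha * s2 @ a1.T, b2 - alpha * s2   # ~ [0.171, -0.0772], [0.732]
W1, b1 = W1 - alpha * s1 @ a0.T, b1 - alpha * s1   # ~ [-0.265, -0.420]^T, [-0.475, -0.140]^T
```
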
Choice of Network Structure
Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers.
We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.

Illustrated Example 1
$g(p) = 1 + \sin\left( \frac{i\pi}{4} p \right), \quad -2 \le p \le 2$
1-3-1 Network
[Figure: responses of the trained 1-3-1 network for $i = 1, 2, 4, 8$.]

Illustrated Example 2
$g(p) = 1 + \sin\left( \frac{6\pi}{4} p \right), \quad -2 \le p \le 2$
[Figure: responses of trained 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks.]

Convergence
$g(p) = 1 + \sin(\pi p), \quad -2 \le p \le 2$
[Figure: two training runs — one converging to the global minimum, one converging to a local minimum.]
The numbers next to each curve indicate the sequence of iterations.

Generalization
In most cases the multilayer network is trained with a finite number of examples of proper network behavior: $\{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}$.
This training set is normally representative of a much larger class of possible input/output pairs.
Can the network successfully generalize what it has learned to the total population?

Generalization Example
$g(p) = 1 + \sin\left( \frac{\pi}{4} p \right)$
Training points: $p = -2, -1.6, -1.2, \dots, 1.6, 2$
[Figure: a 1-2-1 network generalizes well, while a 1-9-1 network does not generalize well, on these training points over $-2 \le p \le 2$.]
For a network to be able to generalize, it should have fewer parameters than there are data points in the training set.
