CHAPTER 11
Back-Propagation
Ming-Feng Yeh
Objectives

A generalization of the LMS algorithm, called backpropagation, can be used to train multilayer networks.
Backpropagation is an approximate steepest descent algorithm in which the performance index is the mean square error.
To calculate the derivatives, we need the chain rule of calculus.


Motivation

The perceptron learning rule and the LMS algorithm were designed to train single-layer perceptron-like networks.
Such networks are only able to solve linearly separable classification problems.
Backpropagation was popularized by the Parallel Distributed Processing (PDP) research group.
The multilayer perceptron, trained by the backpropagation algorithm, is currently the most widely used neural network.

Three-Layer Network

[Figure: three-layer feedforward network]

Number of neurons in each layer: $R$–$S^1$–$S^2$–$S^3$
Pattern Classification: XOR Gate

The limitations of the single-layer perceptron (Minsky & Papert, 1969):

$\mathbf{p}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; t_1 = 0 \qquad
 \mathbf{p}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; t_2 = 1 \qquad
 \mathbf{p}_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\; t_3 = 1 \qquad
 \mathbf{p}_4 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; t_4 = 0$


Two-Layer XOR Network

Two-layer, 2-2-1 network.

[Figure: each first-layer neuron forms an individual decision boundary; the second-layer neuron combines the two decisions (AND) to solve the XOR problem.]
Solved Problem P11.1

Design a multilayer network to distinguish these categories:

Class I:  $\mathbf{p}_1 = \begin{bmatrix} 1 & 1 & -1 & -1 \end{bmatrix}^T, \quad
           \mathbf{p}_2 = \begin{bmatrix} -1 & -1 & 1 & 1 \end{bmatrix}^T$
Class II: $\mathbf{p}_3 = \begin{bmatrix} 1 & -1 & 1 & -1 \end{bmatrix}^T, \quad
           \mathbf{p}_4 = \begin{bmatrix} -1 & 1 & -1 & 1 \end{bmatrix}^T$

Class I requires $\mathbf{W}\mathbf{p}_1 + b > 0$ and $\mathbf{W}\mathbf{p}_2 + b > 0$;
Class II requires $\mathbf{W}\mathbf{p}_3 + b < 0$ and $\mathbf{W}\mathbf{p}_4 + b < 0$.

There is no hyperplane that can separate these two categories.
Solution of Problem P11.1

[Figure: two-layer network solution; a first layer of AND neurons followed by an OR output neuron separates the two categories.]
Function Approximation

Two-layer, 1-2-1 network with transfer functions
$f^1(n) = \dfrac{1}{1 + e^{-n}}, \qquad f^2(n) = n$

Nominal parameter values:
$w^1_{1,1} = 10, \quad w^1_{2,1} = 10, \quad b^1_1 = -10, \quad b^1_2 = 10$
$w^2_{1,1} = 1, \quad w^2_{1,2} = 1, \quad b^2 = 0$

[Figure: network response $a^2$ versus $p$ for the nominal parameters, $-2 \le p \le 2$]
Function Approximation

The centers of the steps occur where the net input to a neuron in the first layer is zero:
$n^1_1 = w^1_{1,1}\,p + b^1_1 = 0 \;\Rightarrow\; p = -\dfrac{b^1_1}{w^1_{1,1}} = -\dfrac{-10}{10} = 1$
$n^1_2 = w^1_{2,1}\,p + b^1_2 = 0 \;\Rightarrow\; p = -\dfrac{b^1_2}{w^1_{2,1}} = -\dfrac{10}{10} = -1$

The steepness of each step can be adjusted by changing the network weights.
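As a check, here is a minimal numpy sketch (not part of the slides) that evaluates the nominal 1-2-1 network over a few inputs; the response steps are centered near $p = -1$ and $p = 1$, where the first-layer net inputs cross zero.

```python
import numpy as np

# Nominal 1-2-1 parameters from the slide
W1 = np.array([[10.0], [10.0]])   # first-layer weights (2x1)
b1 = np.array([-10.0, 10.0])      # first-layer biases
W2 = np.array([[1.0, 1.0]])       # second-layer weights (1x2)
b2 = np.array([0.0])              # second-layer bias

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def response(p):
    a1 = logsig(W1.flatten() * p + b1)    # first layer (logsig)
    return (W2 @ a1 + b2).item()          # second layer (linear)

for p in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(p, round(response(p), 3))
# The output rises from about 0 to 1 around p = -1 and from 1 to 2 around p = +1.
```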

Effect of Parameter Changes

[Figure: network response as the bias $b^1_2$ is varied (curves for $b^1_2$ = 20, 15, 10, 5, 0), $-2 \le p \le 2$]
Effect of Parameter Changes

[Figure: network response as the second-layer weight $w^2_{1,1}$ is varied (curves for 1.0, 0.5, 0.0, -0.5, -1.0), $-2 \le p \le 2$]
Effect of Parameter Changes

[Figure: network response as the second-layer weight $w^2_{1,2}$ is varied (curves for 1.0, 0.5, 0.0, -0.5, -1.0), $-2 \le p \le 2$]
Effect of Parameter Changes

[Figure: network response as the bias $b^2$ is varied (curves for 1.0, 0.5, 0.0, -0.5, -1.0), $-2 \le p \le 2$]
Function Approximation

Two-layer networks, with sigmoid transfer functions in the hidden layer and linear transfer functions in the output layer, can approximate virtually any function of interest to any degree of accuracy, provided sufficiently many hidden units are available.


Backpropagation Algorithm

For multilayer networks, the output of one layer becomes the input to the following layer:
$\mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1}\mathbf{a}^m + \mathbf{b}^{m+1}), \quad m = 0, 1, \ldots, M-1$
$\mathbf{a}^0 = \mathbf{p}, \qquad \mathbf{a} = \mathbf{a}^M$
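The forward pass maps directly onto a loop over layers. Below is a minimal numpy sketch of that recursion (the helper name and calling convention are assumptions, not part of the slides); the example values are the nominal 1-2-1 parameters from the function-approximation example.

```python
import numpy as np

def forward(p, weights, biases, transfer_fns):
    """Propagate the input p through the layers:
    a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), with a^0 = p."""
    a = p                                   # a^0 = p
    for W, b, f in zip(weights, biases, transfer_fns):
        a = f(W @ a + b)                    # one layer of the recursion
    return a                                # network output a = a^M

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))
purelin = lambda n: n
weights = [np.array([[10.0], [10.0]]), np.array([[1.0, 1.0]])]
biases  = [np.array([-10.0, 10.0]),    np.array([0.0])]
print(forward(np.array([1.0]), weights, biases, [logsig, purelin]))
```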




Performance Index

Training set: $\{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\}$
Mean square error: $F(\mathbf{x}) = E[e^2] = E[(t - a)^2]$
Vector case: $F(\mathbf{x}) = E[\mathbf{e}^T\mathbf{e}] = E[(\mathbf{t} - \mathbf{a})^T(\mathbf{t} - \mathbf{a})]$
Approximate mean square error: $\hat{F}(\mathbf{x}) = (\mathbf{t}(k) - \mathbf{a}(k))^T(\mathbf{t}(k) - \mathbf{a}(k)) = \mathbf{e}^T(k)\,\mathbf{e}(k)$
Approximate steepest descent algorithm:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \dfrac{\partial \hat{F}}{\partial w^m_{i,j}}, \qquad
 b^m_i(k+1) = b^m_i(k) - \alpha \dfrac{\partial \hat{F}}{\partial b^m_i}$

Chain Rule

$\dfrac{d f(n(w))}{d w} = \dfrac{d f(n)}{d n} \cdot \dfrac{d n(w)}{d w}$

Example: if $f(n) = e^n$ and $n = 2w$, so that $f(n(w)) = e^{2w}$, then
$\dfrac{d f(n(w))}{d w} = \dfrac{d f(n)}{d n} \cdot \dfrac{d n(w)}{d w} = (e^n)(2)$

Approximate mean square error:
$\hat{F}(\mathbf{x}) = (\mathbf{t}(k) - \mathbf{a}(k))^T(\mathbf{t}(k) - \mathbf{a}(k)) = \mathbf{e}^T(k)\,\mathbf{e}(k)$

Applying the chain rule to the updates:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \dfrac{\partial \hat{F}}{\partial w^m_{i,j}}
 = w^m_{i,j}(k) - \alpha \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial w^m_{i,j}}$
$b^m_i(k+1) = b^m_i(k) - \alpha \dfrac{\partial \hat{F}}{\partial b^m_i}
 = b^m_i(k) - \alpha \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial b^m_i}$
Sensitivity & Gradient

The net input to the $i$th neuron of layer $m$:
$n^m_i = \sum_{j=1}^{S^{m-1}} w^m_{i,j}\, a^{m-1}_j + b^m_i, \qquad
 \dfrac{\partial n^m_i}{\partial w^m_{i,j}} = a^{m-1}_j, \qquad
 \dfrac{\partial n^m_i}{\partial b^m_i} = 1$

The sensitivity of $\hat{F}$ to changes in the $i$th element of the net input at layer $m$:
$s^m_i \equiv \dfrac{\partial \hat{F}}{\partial n^m_i}$

Gradient:
$\dfrac{\partial \hat{F}}{\partial w^m_{i,j}} = \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial w^m_{i,j}} = s^m_i\, a^{m-1}_j, \qquad
 \dfrac{\partial \hat{F}}{\partial b^m_i} = \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial b^m_i} = s^m_i \cdot 1 = s^m_i$
Steepest Descent Algorithm

The steepest descent algorithm for the approximate mean square error:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial w^m_{i,j}} = w^m_{i,j}(k) - \alpha\, s^m_i\, a^{m-1}_j$
$b^m_i(k+1) = b^m_i(k) - \alpha \dfrac{\partial \hat{F}}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial b^m_i} = b^m_i(k) - \alpha\, s^m_i$

Matrix form:
$\mathbf{W}^m(k+1) = \mathbf{W}^m(k) - \alpha\, \mathbf{s}^m (\mathbf{a}^{m-1})^T$
$\mathbf{b}^m(k+1) = \mathbf{b}^m(k) - \alpha\, \mathbf{s}^m$
where
$\mathbf{s}^m \equiv \dfrac{\partial \hat{F}}{\partial \mathbf{n}^m} =
 \begin{bmatrix} \partial \hat{F} / \partial n^m_1 \\ \partial \hat{F} / \partial n^m_2 \\ \vdots \\ \partial \hat{F} / \partial n^m_{S^m} \end{bmatrix}$
Backpropagating the Sensitivities

Backpropagation is a recurrence relationship in which the sensitivity at layer $m$ is computed from the sensitivity at layer $m+1$.

Jacobian matrix:
$\dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m} =
\begin{bmatrix}
\dfrac{\partial n^{m+1}_1}{\partial n^m_1} & \dfrac{\partial n^{m+1}_1}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_1}{\partial n^m_{S^m}} \\
\dfrac{\partial n^{m+1}_2}{\partial n^m_1} & \dfrac{\partial n^{m+1}_2}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_2}{\partial n^m_{S^m}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_1} & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_{S^m}}
\end{bmatrix}$
Matrix Representation

The $(i, j)$ element of the Jacobian matrix:
$\dfrac{\partial n^{m+1}_i}{\partial n^m_j}
 = \dfrac{\partial \left( \sum_{l=1}^{S^m} w^{m+1}_{i,l} a^m_l + b^{m+1}_i \right)}{\partial n^m_j}
 = w^{m+1}_{i,j} \dfrac{\partial a^m_j}{\partial n^m_j}
 = w^{m+1}_{i,j}\, \dot{f}^m(n^m_j)$

In matrix form:
$\dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m} = \mathbf{W}^{m+1} \dot{\mathbf{F}}^m(\mathbf{n}^m), \qquad
\dot{\mathbf{F}}^m(\mathbf{n}^m) =
\begin{bmatrix}
\dot{f}^m(n^m_1) & 0 & \cdots & 0 \\
0 & \dot{f}^m(n^m_2) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \dot{f}^m(n^m_{S^m})
\end{bmatrix}$
Recurrence Relation

The recurrence relation for the sensitivity:
$\mathbf{s}^m = \dfrac{\partial \hat{F}}{\partial \mathbf{n}^m}
 = \left( \dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m} \right)^T \dfrac{\partial \hat{F}}{\partial \mathbf{n}^{m+1}}
 = \dot{\mathbf{F}}^m(\mathbf{n}^m)(\mathbf{W}^{m+1})^T \dfrac{\partial \hat{F}}{\partial \mathbf{n}^{m+1}}
 = \dot{\mathbf{F}}^m(\mathbf{n}^m)(\mathbf{W}^{m+1})^T \mathbf{s}^{m+1}$

The sensitivities are propagated backward through the network from the last layer to the first layer:
$\mathbf{s}^M \rightarrow \mathbf{s}^{M-1} \rightarrow \cdots \rightarrow \mathbf{s}^2 \rightarrow \mathbf{s}^1$
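In code, one step of this recurrence is a single matrix expression. The sketch below is an assumption about data layout (sensitivities and net inputs stored as 1-D numpy arrays), not part of the slides.

```python
import numpy as np

def sensitivity_step(df_n_m, W_next, s_next):
    """s^m = diag(f'^m(n^m)) (W^{m+1})^T s^{m+1}.
    df_n_m : f'^m(n^m) evaluated elementwise, shape (S^m,)
    W_next : W^{m+1}, shape (S^{m+1}, S^m)
    s_next : s^{m+1}, shape (S^{m+1},)"""
    return df_n_m * (W_next.T @ s_next)   # elementwise product replaces the diagonal matrix
```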




Backpropagation Algorithm

At the final layer:
$s^M_i = \dfrac{\partial \hat{F}}{\partial n^M_i}
 = \dfrac{\partial (\mathbf{t} - \mathbf{a})^T(\mathbf{t} - \mathbf{a})}{\partial n^M_i}
 = \dfrac{\partial \sum_{j=1}^{S^M} (t_j - a_j)^2}{\partial n^M_i}
 = -2 (t_i - a_i) \dfrac{\partial a_i}{\partial n^M_i}$

$\dfrac{\partial a_i}{\partial n^M_i} = \dfrac{\partial a^M_i}{\partial n^M_i} = \dfrac{\partial f^M(n^M_i)}{\partial n^M_i} = \dot{f}^M(n^M_i)$

$s^M_i = -2 (t_i - a_i)\, \dot{f}^M(n^M_i)$

$\mathbf{s}^M = -2 \dot{\mathbf{F}}^M(\mathbf{n}^M)(\mathbf{t} - \mathbf{a})$
Summary

The first step is to propagate the input forward through the network:
$\mathbf{a}^0 = \mathbf{p}$
$\mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1}\mathbf{a}^m + \mathbf{b}^{m+1}), \quad m = 0, 1, \ldots, M-1$
$\mathbf{a} = \mathbf{a}^M$

The second step is to propagate the sensitivities backward through the network:
Output layer: $\mathbf{s}^M = -2 \dot{\mathbf{F}}^M(\mathbf{n}^M)(\mathbf{t} - \mathbf{a})$
Hidden layers: $\mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m)(\mathbf{W}^{m+1})^T \mathbf{s}^{m+1}, \quad m = M-1, \ldots, 2, 1$

The final step is to update the weights and biases:
$\mathbf{W}^m(k+1) = \mathbf{W}^m(k) - \alpha\, \mathbf{s}^m (\mathbf{a}^{m-1})^T$
$\mathbf{b}^m(k+1) = \mathbf{b}^m(k) - \alpha\, \mathbf{s}^m$
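Putting the three steps together, here is a minimal numpy sketch of one backpropagation iteration for a two-layer logsig/purelin network; the function name, shapes, and default learning rate are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def bp_iteration(p, t, W1, b1, W2, b2, alpha=0.1):
    """One steepest-descent backpropagation step for a two-layer
    network with a logsig hidden layer and a linear output layer."""
    # Step 1: forward propagation
    a0 = p
    a1 = logsig(W1 @ a0 + b1)
    a2 = W2 @ a1 + b2                    # purelin output layer

    # Step 2: backpropagate the sensitivities
    e = t - a2
    s2 = -2.0 * e                        # s^2 = -2 F'^2(n^2)(t - a), F'^2 = I
    df1 = (1.0 - a1) * a1                # logsig derivative in terms of a^1
    s1 = df1 * (W2.T @ s2)               # s^1 = F'^1(n^1) (W^2)^T s^2

    # Step 3: update weights and biases
    W2 = W2 - alpha * np.outer(s2, a1)
    b2 = b2 - alpha * s2
    W1 = W1 - alpha * np.outer(s1, a0)
    b1 = b1 - alpha * s1
    return W1, b1, W2, b2
```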
BP Neural Network

[Figure: layered network diagram from layer 1 through layer M, with weights $w^m_{i,j}$ connecting neuron $j$ of layer $m-1$ to neuron $i$ of layer $m$, inputs $p_1, \ldots, p_R$, and outputs $a^M_1, \ldots, a^M_{S^M}$]
Ex: Function Approximation

$g(p) = 1 + \sin\!\left(\dfrac{\pi}{4}\, p\right)$

[Figure: the input $p$ is applied both to $g(p)$ and to the 1-2-1 network; the error $e$ is the difference between the target $g(p)$ and the network output]
Network Architecture

[Figure: 1-2-1 network with input $p$ and output $a$]
Initial Values

$\mathbf{W}^1(0) = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}, \quad
 \mathbf{b}^1(0) = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}, \quad
 \mathbf{W}^2(0) = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix}, \quad
 \mathbf{b}^2(0) = \begin{bmatrix} 0.48 \end{bmatrix}$

[Figure: initial network response $a^2$ versus $p$ compared with the sine-wave target, $-2 \le p \le 2$]
Forward Propagation

Initial input: $\mathbf{a}^0 = \mathbf{p} = 1$

Output of the 1st layer:
$\mathbf{a}^1 = \mathbf{f}^1(\mathbf{W}^1\mathbf{a}^0 + \mathbf{b}^1)
 = \mathrm{logsig}\!\left( \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} [1] + \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} \right)
 = \mathrm{logsig}\!\left( \begin{bmatrix} -0.75 \\ -0.54 \end{bmatrix} \right)
 = \begin{bmatrix} \dfrac{1}{1 + e^{0.75}} \\[2mm] \dfrac{1}{1 + e^{0.54}} \end{bmatrix}
 = \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix}$

Output of the 2nd layer:
$a^2 = f^2(\mathbf{W}^2\mathbf{a}^1 + \mathbf{b}^2)
 = \mathrm{purelin}\!\left( \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix} + 0.48 \right) = 0.446$

Error:
$e = t - a = \left( 1 + \sin\!\left(\dfrac{\pi}{4}\, p\right) \right) - a^2
 = \left( 1 + \sin\!\left(\dfrac{\pi}{4}\right) \right) - 0.446 = 1.261$

Transfer Func. Derivatives

$\dot{f}^1(n) = \dfrac{d}{dn}\!\left( \dfrac{1}{1 + e^{-n}} \right)
 = \dfrac{e^{-n}}{(1 + e^{-n})^2}
 = \left( 1 - \dfrac{1}{1 + e^{-n}} \right)\left( \dfrac{1}{1 + e^{-n}} \right)
 = (1 - a^1)(a^1)$

$\dot{f}^2(n) = \dfrac{d}{dn}(n) = 1$
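A quick check of the logsig-derivative identity, as a sketch with an arbitrarily chosen net input:

```python
import math

n = -0.75                               # arbitrary test value (assumption)
a = 1.0 / (1.0 + math.exp(-n))          # logsig output
h = 1e-6
numeric = (1.0 / (1.0 + math.exp(-(n + h)))
           - 1.0 / (1.0 + math.exp(-(n - h)))) / (2.0 * h)
print((1.0 - a) * a, numeric)           # both ~0.218, confirming f'(n) = (1 - a) a
```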


Backpropagation

The second layer sensitivity:
$s^2 = -2\dot{\mathbf{F}}^2(\mathbf{n}^2)(\mathbf{t} - \mathbf{a}) = -2\,\dot{f}^2(n^2)\, e = -2 \cdot 1 \cdot 1.261 = -2.522$

The first layer sensitivity:
$\mathbf{s}^1 = \dot{\mathbf{F}}^1(\mathbf{n}^1)(\mathbf{W}^2)^T \mathbf{s}^2
 = \begin{bmatrix} (1 - a^1_1)(a^1_1) & 0 \\ 0 & (1 - a^1_2)(a^1_2) \end{bmatrix}
   \begin{bmatrix} w^2_{1,1} \\ w^2_{1,2} \end{bmatrix} s^2$

$\mathbf{s}^1 = \begin{bmatrix} (1 - 0.321)(0.321) & 0 \\ 0 & (1 - 0.368)(0.368) \end{bmatrix}
   \begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522)
 = \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}$
Weight Update

Learning rate $\alpha = 0.1$:

$\mathbf{W}^2(1) = \mathbf{W}^2(0) - \alpha\, \mathbf{s}^2 (\mathbf{a}^1)^T
 = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} - 0.1\,[-2.522]\begin{bmatrix} 0.321 & 0.368 \end{bmatrix}
 = \begin{bmatrix} 0.171 & -0.0772 \end{bmatrix}$

$\mathbf{b}^2(1) = \mathbf{b}^2(0) - \alpha\, \mathbf{s}^2 = [0.48] - 0.1\,[-2.522] = [0.732]$

$\mathbf{W}^1(1) = \mathbf{W}^1(0) - \alpha\, \mathbf{s}^1 (\mathbf{a}^0)^T
 = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}[1]
 = \begin{bmatrix} -0.265 \\ -0.420 \end{bmatrix}$

$\mathbf{b}^1(1) = \mathbf{b}^1(0) - \alpha\, \mathbf{s}^1
 = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}
 = \begin{bmatrix} -0.475 \\ -0.140 \end{bmatrix}$
Choice of Network Structure

Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers.
We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.


Illustrated Example 1

$g(p) = 1 + \sin\!\left(\dfrac{i\pi}{4}\, p\right)$, approximated by a 1-3-1 network.

[Figure: 1-3-1 network responses for i = 1, 2, 4, and 8 over $-2 \le p \le 2$]
Illustrated Example 2

$g(p) = 1 + \sin\!\left(\dfrac{6\pi}{4}\, p\right), \quad -2 \le p \le 2$

[Figure: responses of 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks over $-2 \le p \le 2$]
Convergence

$g(p) = 1 + \sin(\pi p), \quad -2 \le p \le 2$

[Figure: left panel, convergence to the global minimum; right panel, convergence to a local minimum]

The numbers on each curve indicate the sequence of iterations.
Generalization

In most cases the multilayer network is trained with a finite number of examples of proper network behavior: $\{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\}$
This training set is normally representative of a much larger class of possible input/output pairs.
Can the network successfully generalize what it has learned to the total population?


Generalization Example

$g(p) = 1 + \sin\!\left(\dfrac{\pi}{4}\, p\right)$, sampled at $p = -2, -1.6, -1.2, \ldots, 1.6, 2$

[Figure: left, a 1-2-1 network generalizes well; right, a 1-9-1 network does not generalize well]

For a network to be able to generalize, it should have fewer parameters than there are data points in the training set.

				