					Chapter 3

Image Acquisition

Recording of the image information by projection of
   intensities upon the recording medium:
   photographical film, CCD-sensor, etc.

For the continuous case (photographic film) the
   intensity is given by a continuous function f(x,y).
   Here x and y indicate the coordinates on the film
   (i.e. the origin is at the lower left corner).
f(x,y) is the brightness or light intensity value at the
    point with coordinates x,y
   (e.g. 0 ... 1).
In the discrete case (the image consists of ‘pixels’, or
    ‘picture elements’)
   the image is a matrix with M rows (lines) and N
   columns.
   Let i and k be the row and column indices,
   so i = 1 ... M, k = 1 ... N.
    In digital image processing the intensity function
    can only assume discrete values (e.g. 0, 1, ...).
    Thus denote the intensity function with f(i,k).
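As a minimal illustration (pure Python; the matrix size and the gray levels are made-up sample values), an image can be stored as an M x N matrix and read out via f(i,k):

```python
# A discrete image: M x N matrix of integer gray levels.
# f(i, k) is the intensity at row i, column k (1-based indices, as in the text).

M, N = 3, 4  # 3 rows, 4 columns (illustrative values)

image = [
    [0,  64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
]

def f(i, k):
    """Intensity at row i, column k with i = 1 ... M, k = 1 ... N."""
    return image[i - 1][k - 1]

print(f(1, 1), f(3, 4))  # 0 208
```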




Image preprocessing
Problem: Images delivered by the camera often
   have insufficient quality: noise, distortions, bad
   contrast, illumination problems, motion blurring

Image preprocessing: low-level processing to make
   the image usable for further processing steps (e.g.
   object recognition):
   ‘image filtering’

Image filtering

a) Averaging
Local calculation of the average intensity for a
   neighborhood of adjacent pixels:

For example, E3 is replaced by the average of itself
   and its four neighbors E1, E2, E4, E5:

       E3' = (E1 + E2 + E3 + E4 + E5) / 5

or spreading the averaging process onto regions of
    any size:

       g(i,k) = (1/|Region|) · Σ ((i,k) ∈ Region) f(i,k)

Effect of averaging: noise reduction; the image is
"flattened" by the removal of small disturbances or data gaps

Linear filter operator

Disadvantage: Edges in the image (important
   structural features) are also flattened by averaging

b) Median filtering
Creation of the local median value: sort a 3x3 pixel
   matrix around E5 in ascending order into a list.
   Gray level of E5 is then replaced by the median
   value of the list

     List: (1, 44, 77, 140, 190)

     Median is 77

     Average is (1+44+77+140+190)/5 = 90.4

Non-linear operator

                                    E1    E2     E3

                                    E4    E5     E6
                                    E7    E8     E9
Advantage of the median: less sensitive with respect
   to data gaps and random disturbances

   Pixel intensity values: (99,100,0,101,100)

   (value 0 is a data gap)

   Median: 100 (correct)
   Average: 400/5 = 80
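The example above can be checked directly with the standard library (a pure-Python sketch of the median's robustness):

```python
import statistics

# Pixel intensity values from the text; the 0 is a data gap.
values = [99, 100, 0, 101, 100]

median = statistics.median(values)   # robust against the gap
average = sum(values) / len(values)  # dragged down by the gap

print(median, average)  # 100 80.0
```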
Example (original image of a hammer):

(Figures: the original image; random noise added to the
original image; averaging with a 3x3 and a 9x9 mask;
median filtering. The median suppresses the noise with
little smoothing.)
c) Fourier transform

Universal method for the image preprocessing, can
   be used globally and locally. Contrast
   enhancement, noise filtering, edge detection

Filter characteristics can be adjusted in many ways


Basic idea: decompose a (periodic) function f into
   partial functions of different frequencies.

   E.g. a horizontal line in the image (row of pixels)
   can be regarded as a function f of one parameter:

   x: pixel number along the image line

   f(x): gray level of the x-th pixel
f is decomposed into individual frequencies

theorem: any (periodic) function can be decomposed
   in this way

f is only defined in a small sector (the image range);
     f can thus be regarded as one section of a periodic function

Result: f is represented as a weighted sum of
   component functions sin(nx) and cos(nx),
   Short: sn, cn denote sin(nx), cos(nx).

For large n, the function sin(nx) is a function with a
   high frequency

   Consider a (horizontal) line in the image

   Noise reduction: calculate fourier decomposition
   and delete all partial functions with a high
   frequency, i.e. all partial functions sin(nx), cos(nx)
   for large n
   Edge enhancement: delete all partial functions
   with low frequency:
         edges are sudden changes of intensity
         within one line

         abrupt changes correspond to steep
         ascents and can only be represented with
         high frequencies

         -> deleting the low frequencies enhances
         the sharp ascents

Reverse transformation: After deleting certain
   components of the form sin(nx), cos(nx),
   reconstruct f from the remaining components only

Similar procedure is possible with a 2D image matrix
   instead of a single line

(2D fourier transform)
Advantage: FT is independent of direction
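The noise-reduction procedure can be sketched for a single image line, assuming NumPy is available; the signal, the cutoff index 10, and the noise frequency below are illustrative choices:

```python
import numpy as np

# One image line as a 1D signal: slow variation plus a high-frequency
# disturbance (all frequencies and amplitudes here are illustrative).
n = 128
x = np.arange(n)
low = np.sin(2 * np.pi * 2 * x / n)           # low-frequency content
noise = 0.3 * np.sin(2 * np.pi * 40 * x / n)  # high-frequency disturbance
f = low + noise

# Fourier decomposition; delete all partial functions with frequency
# index above 10, then reconstruct f by the reverse transformation.
F = np.fft.fft(f)
F[11:n - 10] = 0           # keep bins 0..10 and the mirrored bins -10..-1
g = np.fft.ifft(F).real

print(np.max(np.abs(g - low)))  # tiny: the noise component is gone
```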


(Figures: the original image and the original image drawn
as a function; its Fourier transform as a power spectrum
and as a function; low-pass and high-pass filter masks;
the image after low-pass filtering and after high-pass
filtering, each obtained by reverse transformation. A
marked vertical line in the image is also drawn as a
function, after low-pass resp. high-pass filtering.)

Edge detection

An edge appears in an image as a sudden change of
   gray levels (intensities)

Goal of edge detection: find line segments or curves
  where such sudden changes occur in the image

Fundamental method for most edge finding
   procedures: mathematical differentiation

   I.e. regard image line as function f(x), where x is
   the index (pixel number)

   Calculate derivative df/dx or f ’(x) from f(x)

As long as the gray level remains constant, the
    derivative is zero
For edges, the derivative is distinct from zero

For two-dimensional images (whole image instead of
   single line in image), the derivative of the
   intensity function f(x,y) depends on the direction

Example: derivative along a line y = y0 of the intensity
   function, i.e. of f(x, y0).

(Figures: f(x, y0), its first derivative in x-direction, and
its second derivative in x-direction.)


The absolute value of the first derivative corresponds to
   the magnitude of the gray level change:

    Edge detection with a threshold value:
    Whenever the absolute value of the first derivative
    exceeds the threshold: report an edge at this pixel

    Problem: How can the threshold value be determined?
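A minimal sketch of this threshold-based detector on a single image line (pure Python; the line data and the threshold are illustrative):

```python
def detect_edges(line, threshold):
    """Report indices x where |f(x) - f(x-1)| exceeds the threshold."""
    return [x for x in range(1, len(line))
            if abs(line[x] - line[x - 1]) > threshold]

# Constant gray levels (derivative 0) with one abrupt change at x = 4.
line = [10, 10, 10, 10, 200, 200, 200]
print(detect_edges(line, threshold=50))  # [4]
```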
Second derivative is more sensitive to changes than
   first derivative (steeper ascend)
Second derivative is also highly sensitive towards
   noise. Therefore it is typically applied only after
   noise filtering.

   Important advantage:
   At each edge f''(x,y) has a zero-crossing.
   -> No threshold value is necessary for the second
   derivative.
Derivatives deliver different results for different
   directions:

    Direction-less edge detection
    Edge detection without having to specify any
    direction in advance

Gradients method: a procedure based on the first
   derivative of the image matrix

Definition of the gradient for a function of two
   variables f(x,y):

G(x,y) = ( ∂f(x,y)/∂x , ∂f(x,y)/∂y ) =: (fx, fy)

The direction of this gradient vector is the direction of
   the greatest ascent/descent at the starting point (x,y).

The absolute value of the gradient

                |G(x,y)| = √( fx² + fy² )

is a measure for the amount of change,
independent of direction.

Absolute value of gradients is such a direction-less
   edge detector

If a preferred direction is desired:
   instead of taking the absolute value,
   multiply the gradient with a vector perpendicular
   to the preferred edge direction

Calculation of the gradient
Approximation of the partial derivatives by simple
   differences:

   fx ≈ ( f(x,y) - f(x - Δx, y) ) / Δx

   fy ≈ ( f(x,y) - f(x, y - Δy) ) / Δy

Here Δx = Δy = 1 can be assumed. For the case of
   discrete image matrices:

              Δfi = f(i,k) - f(i - 1, k)

              Δfk = f(i,k) - f(i, k - 1)

Implementation of both formulas as operators
''operator masks'':

                                  -1           +1


Application of the operator
     move the mask over the entire image matrix and
     multiply each intensity value with the weighting
     factors (here +1 and –1)

Improvement: Use of more than two pixels and extra
   weighting of some pixels:

     (reduces noise sensitivity)
     Example for such operators:

              Gx                          Gy

         -1     0    +1             +1    +2    +1

         -2     0    +2              0     0     0

         -1     0    +1             -1    -2    -1

The combination of these two operators is known as
the Sobel operator:

      GS(x,y) = √( Gx²(x,y) + Gy²(x,y) )
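The two masks and their combination GS can be sketched in pure Python (border pixels are simply set to 0 here; the test image is an illustrative choice):

```python
import math

KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # Gx mask
KY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # Gy mask

def sobel(img):
    """Gradient magnitude GS = sqrt(Gx^2 + Gy^2) for interior pixels."""
    M, N = len(img), len(img[0])
    gs = [[0.0] * N for _ in range(M)]
    for i in range(1, M - 1):
        for k in range(1, N - 1):
            gx = sum(KX[a][b] * img[i - 1 + a][k - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(KY[a][b] * img[i - 1 + a][k - 1 + b]
                     for a in range(3) for b in range(3))
            gs[i][k] = math.sqrt(gx * gx + gy * gy)
    return gs

# Vertical edge between columns 1 and 2: GS peaks along the edge.
img = [[0, 0, 9, 9]] * 4
mag = sobel(img)
print(mag[1])  # [0.0, 36.0, 36.0, 0.0]
```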

There are many other possibilities for defining
operators based on the first derivative or gradients

   G(x,y) = MAX(abs(fx, absfy))
Common feature of nearly all gradient-based
   maximum at the position of the edge has width of
   several pixels
   Desirable: width only one pixel
   ->line thinning procedures necessary

   -> Erosion, dilation
   (based on the Minkowski sum)
Examples (Sobel operator):

(Figures: result of the full Sobel operator (Gs), of Gx
alone (vertical edges), and of Gy alone (horizontal
edges).)

Operators can be tested online under:

Image segmentation

Partitioning of the image into regions (segments or
   ''semantic units'') according to appropriate
   homogeneity criteria

Example: Image with one large object. Instead of
   finding the object’s edges, find the region
   representing the object in the image

Result: more compact descriptions of the scene,

        a higher abstraction level

Two procedures can be distinguished :

a) Homogeneity-oriented segmentation:
    construction of regions of similar image elements
    until a discontinuity is encountered

b) Discontinuity-oriented segmentation: search for
    edges first, then connect the edges to region
    boundaries using appropriate criteria


Construction of connected regions (search for
homogeneous regions)
Simplest method : threshold value

Starting from a starting point (x,y), all surrounding
   image points are checked for their gray level.

   A new image point at the position (x’,y’) is
   included in the region, if:

   abs( f(x,y) - f(x’,y’) ) <= T
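This region-growing rule can be sketched as a breadth-first search over 4-neighbors (pure Python; the sample image, the seed, and T are illustrative):

```python
from collections import deque

def grow_region(img, seed, T):
    """Grow a region from `seed`: a 4-neighbor (x', y') is included
    whenever abs(f(x,y) - f(x',y')) <= T for the pixel it was reached from."""
    M, N = len(img), len(img[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < M and 0 <= ny < N and (nx, ny) not in region
                    and abs(img[x][y] - img[nx][ny]) <= T):
                region.add((nx, ny))
                queue.append((nx, ny))
    return region

img = [[10, 11, 90],
       [12, 11, 91],
       [90, 92, 90]]
print(sorted(grow_region(img, (0, 0), T=5)))  # the dark top-left region
```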

Very simple procedure. Problems :

Where is the best starting point for a region?

How to choose the threshold value for a region?
Frequently used method: region-oriented
   segmentation through partitioning and
   reassembling ("Split-and-Merge")

Starting from the entire image, the current region is
   partitioned into four quadratic sub-regions

   Each quadratic sub-region is partitioned further
   (into four smaller quadrants) if the threshold value
   inside the region is exceeded (Splitting).

Algorithm stops, if no further splitting or merging is
   possible.

(Figures: successive split and merge steps, shown both
on the image and as a tree with root START and children
1, 2, 3, 4. Splitting partitions a region into four
quadrants, e.g. region 2 into 21, 22, 23, 24 and region
23 into 231, ..., 234; merging recombines sub-regions
such as 23 and 24 or 321 and 324.)
Representation of the solution as a tree:

  - Expansion of a node, if the threshold value is
    exceeded

  - Non-expanded nodes are merged into one node

"Split-and-Merge" avoids the problem of searching for
   a good starting point.

   However: a threshold value is still needed.

For scenes with known lighting and approximately
  known image content:

   Determination of the threshold value "by
   educated guessing" of the user

In appropriate cases: use grey level histogram to find
    threshold value:


(Figure: gray level histogram with four maxima, labeled 1 to 4.)

x-axis: gray levels (e.g. 0..255)

y-axis: number of pixels with grey level 0, number of
   pixels with grey level 1, ...

A large region contains many points of the same grey
   level. Thus, if there are several regions with distinct
   grey levels, the histogram of the image has several
   distinct maxima and minima.

The threshold value limits are then placed at the
   minima. This reduces the danger of splitting regions
   which actually belong together.

Threshold value determination is particularly simple
in case of a histogram with two distinct maxima (so-
called “bimodal histogram”):

   output image has only two gray levels
   image can be processed as binary image:

        Image elements of one gray level value are
        assigned to objects, the other gray level
        value represents the background
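For a bimodal histogram, the valley between the two largest maxima can be found with a simple search (pure Python sketch; the tiny image and its gray levels are made up):

```python
def histogram(img, levels=256):
    """Count how many pixels have each gray level."""
    h = [0] * levels
    for row in img:
        for g in row:
            h[g] += 1
    return h

def valley_threshold(hist):
    """Place the threshold at the minimum between the two largest maxima."""
    peaks = sorted(range(len(hist)), key=lambda g: hist[g])[-2:]
    lo, hi = min(peaks), max(peaks)
    return min(range(lo, hi + 1), key=lambda g: hist[g])

# Dark object on a bright background (two distinct gray-level clusters).
img = [[20, 21, 20, 200],
       [22, 20, 201, 200],
       [20, 200, 200, 202]]
t = valley_threshold(histogram(img))
binary = [[1 if g > t else 0 for g in row] for row in img]
print(t, binary[0])  # the threshold lies between the two clusters
```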

Advantages of pure binary image processing:

  - very simple hardware

  - fast ("real-time segmentation")

  - straightforward analysis

These advantages are partially offset by the
   necessity of special lighting
Edge-orientated segmentation

Construction of contours by following sequences of
   object edges in the image

Requires edge detection procedures (derivatives,
   Sobel operator, etc.)

Connect adjacent edges to regions

Problem : How to connect edges, particularly in the
   case of alternative solutions?

Two possibilities :
Contour detection is reduced to the search of a path
   through a graph. Graph describes all transitions
   between an image point and its “neighbor”.

As soon as the path has been found, the region is
   described by its contour.
Advantage of the second method: It is possible to
  use prior knowledge about the expected contours

By changing the heuristic used for the search one
   can adapt to the respective problem. But the
   determination of this heuristic can be difficult.

Procedure :

In discrete images an edge consists of ‘edge
elements’:

   An edge element is the segment of an edge
   between two adjacent pixels

   Example: Pixels A and B have one edge element
   in common, denoted by (A, B)
(Figure: pixels A, B, C in the upper row and D, E below.
The edge elements are represented as nodes in a graph
with starting point A; nodes such as (A,D), (A,B) and
(D,E) are connected by transitions with costs k(A,D),
k(A,B), k(D,E), k(B,E).)

   Nodes in the graph are edge elements

   e.g. edge element between pixel A and B
   denoted by (A,B)
Successor of a node:

   Edge element (A, B) has two end points

   The successors of the node (A, B) are the
   edge elements touching these end points

   Example: (B, E) is a successor of (A, B)

Starting from a point S, adding a new edge element
   to an existing contour causes a certain cost

The cost function g(n) describes the cost for a path
   from the starting point S to node n

The costs between any two nodes ni and nj are
   denoted by k(ni, nj)

Thus edge detection is reduced to the search of the
   shortest path in a graph. The set of all nodes
   (edge elements), which are along the shortest
   path, describes the desired contour

The cost function depends on the intensity difference
   between the two image points P1 and P2 that
   determine the edge element:

   Large intensity difference between the pixels:
   edge element (A, B) has low cost

   Small intensity difference: pixels A and B belong
   to the same region. Thus (A, B) should not be
   part of the contour -> (A, B) receives high cost

Example for a contour :

               A          B        C
               10         10       5
               D          E        F
               10         5        5
               G          H        I
               10         5        4

Graph:

(Figure: the graph of edge elements (A,B), (B,C), (A,D),
(C,F), (D,E), (B,E), (E,F), (E,H), (G,H), (H,I),
connected by transitions whose costs (e.g. 4, 5, 9, 10)
are derived from the gray levels above.)

Finding a shortest path in a graph is equivalent to
   finding a path with lowest total cost.
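A standard way to find such a lowest-total-cost path is Dijkstra's algorithm, sketched here in pure Python; the small graph below uses hypothetical edge-element names and costs, not exactly those of the figure:

```python
import heapq

def cheapest_path(start, goal, cost):
    """Dijkstra's algorithm: lowest-total-cost path from start to goal.

    `cost` maps a node to a list of (neighbor, edge_cost) pairs."""
    best = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        g, n = heapq.heappop(heap)
        if n == goal:                      # reconstruct the path
            path = [n]
            while n in prev:
                n = prev[n]
                path.append(n)
            return g, path[::-1]
        if g > best.get(n, float("inf")):  # stale heap entry
            continue
        for m, k in cost.get(n, []):
            if g + k < best.get(m, float("inf")):
                best[m] = g + k
                prev[m] = n
                heapq.heappush(heap, (g + k, m))
    return None

# Hypothetical edge-element graph with costs k(ni, nj).
cost = {"AB": [("BC", 10), ("BE", 5)],
        "BC": [("CF", 5)],
        "BE": [("EF", 5), ("EH", 5)],
        "EH": [("HI", 5)],
        "CF": [], "EF": [], "HI": []}
print(cheapest_path("AB", "HI", cost))  # (15, ['AB', 'BE', 'EH', 'HI'])
```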

Several methods for finding shortest paths in graphs
   are known

Graph searching techniques:

   Given: a start node and one or more goal nodes
   (goal states). Wanted: an optimal (low-cost) path
   connecting the start node to a goal node

In case of contour detection: a start node (= starting
    image point) and several target nodes:

   a) the closed contour is completely visible in the image:

       starting point = end point

   b) the contour lies only partially inside the image:
   the end point is a pixel on the image boundary

Breadth-first search

Breadth-first search always finds the shortest path, if
   one exists

   But: the number of nodes searched grows exponentially

   -> impractical in most cases

   Improvement: include previous knowledge via the
   cost function, i.e.

   ‘always only follow the one path currently having
   lowest cost’ instead of all paths as in breadth-first
   search (-> A* algorithm)
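This best-first idea can be sketched as the A* algorithm: expand only the path with the lowest g(n) + h(n). A minimal pure-Python version on a grid (the 5x5 grid, unit step costs, and Manhattan-distance heuristic are illustrative choices):

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A*: always expand the path with lowest g(n) + h(n),
    instead of all paths as in breadth-first search."""
    best = {start: 0}
    heap = [(h(start), 0, start)]
    while heap:
        _, g, n = heapq.heappop(heap)
        if n == goal:
            return g
        for m, k in neighbors(n):
            if g + k < best.get(m, float("inf")):
                best[m] = g + k
                heapq.heappush(heap, (g + k + h(m), g + k, m))
    return None

# 4-connected 5x5 grid with unit step cost.
def neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

goal = (4, 4)
manhattan = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
print(a_star((0, 0), goal, neighbors, manhattan))  # 8
```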
Application of the A*-algorithm to segmentation

Nodes again correspond to edge elements, as above

Cost function depends on the gradient as above, i.e.
   the lower the intensity difference between the two
   pixels of an edge element, the higher the costs of
   this edge element

The heuristic function depends on the expected contour:
   the greater the deviation from the expected
   contour, the higher the heuristic cost of this path

Example for the calculation of the heuristic function
   h(n) for a node n:

a) Assume squares must be detected in the image.
    Then choose h such that costs increase with the
    distance from the starting point. Additionally, any
    deviation from a straight line which does not
    amount to ±90° should cause high cost

b) Likewise, to detect circles of known radius, the
    function h can be chosen proportional to the
    difference between the observed radius of
    curvature and a default radius
Thus h is only useful, if the shape of objects to be
   detected is known in advance

Object recognition

The goal of object recognition in images is to extract
   assertions of the form:

   "Image region X with the properties Y is an
   apple (a dog, an assembly part, ...), if projected
   with method Z onto the image area."

To make such assertions, a model is needed, i.e.

   a) Knowledge of all objects potentially occurring
   in the image and

   b) Means for the description of the current image
Approach:

a) Comparison between the characteristics of prototypes
    and the observed features (characteristics) in the
    image on the basis of statistical methods (decision
    theory)

b) Classification with neural networks

Statistical approach:

An object in the image is characterized by n
   descriptors xi (e.g. length, area, color, ...)

Assemble the descriptors into a so-called
   characteristic vector:

                       x = (x1, ..., xn)

The decision is made in such a way that a presented
   object is mapped to a certain object class i.

The decision is made by evaluating a decision function.

If M different object classes have to be distinguished,
    M decision functions are needed, which map the
    characteristic vector to the different classes.

The characteristic vector x of the object is inserted
   into all decision functions. The object belongs to
   that class whose decision function value is smaller
   (greater) than all other decision function values,
   i.e. xa belongs to the class i, if

di(xa) < dj(xa) j \ j=iIn the simplest case the
    decision function is the Euclidian distance
    between the characteristic vector x and the
    prototype vector of the class

di(x) = absx – miThe prototype vector for each
    class is computed by "averaging" over a large
    number N of actual characteristic vectors:

                   mi =  xk
                       Nk = 1
Very fast (parallelism !) decision making is possible,
   but the process is relatively rigid
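The minimum-distance classification can be sketched in a few lines (pure Python; the two classes and their descriptor values are made up):

```python
import math

def prototype(samples):
    """m_i = (1/N) * sum over the N characteristic vectors of one class."""
    n = len(samples)
    return [sum(v[j] for v in samples) / n for j in range(len(samples[0]))]

def classify(x, prototypes):
    """Assign x to the class i with the smallest d_i(x) = |x - m_i|."""
    def dist(m):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))
    return min(prototypes, key=lambda name: dist(prototypes[name]))

# Hypothetical descriptors (length, area) for two object classes.
prototypes = {"screw":  prototype([[5.0, 1.0], [5.2, 1.2]]),
              "washer": prototype([[1.0, 3.0], [1.2, 3.2]])}
print(classify([4.9, 1.1], prototypes))  # screw
```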

Interpretation with neural networks
Characteristic vector as above

    But first of all an adequate model is computed by
    training through back-propagation


Internal system parameters (weight values) are
adapted to best fit the training data

adaptation in small steps, similar to numerical
optimization methods

Basic model

(Figure: inputs x1, ..., xn with weights g1, ..., gn feed a
summation node +, followed by a threshold comparison
> S producing the output.)

    Binary input values x1,...,xn (only 0 or 1)

    Output also binary

    Threshold value S

    Weights g1,...,gn


    Alternative 1: State storage

(Figure: inputs x1, ..., xn with weights g1, ..., gn feed a
cell with internal state Z and function f, followed by the
threshold comparison > S.)

    A more general function f replaces the summation;
    in the above model f(x1,...,xn) = x1 + ... + xn

    Function f depends on the state Z; Z is stored

Alternative 2
    Combination with a logic gate

(Figure: logic gates l1, l2, ... preprocess the inputs
before the weighted summation and threshold
comparison > S.)

Logic gates transform many binary inputs into one
binary output

Goal: determine the values gi from a number of training
data sets

each training data set has a known classification

determine the values gi such that a new, previously
unclassified data set can be classified correctly

Training instances are pre-classified, i.e. each
training instance is marked as a positive or a
negative example for the feature to be learned
Neural network

(Figure: input cells feed a layer of threshold cells
> S1, ..., > Sn, whose outputs feed the output cells
> S1', ..., > Sm'.)

resp. (threshold values transformed into weights):

(Figure: input layer I and output layer O with a hidden
layer in between; weights gij (e.g. g11) lead from I into
the hidden layer and weights hij from the hidden layer
into O; every cell compares against the threshold 0
(> 0), with an extra constant input 1 feeding each
layer.)

Binary input in the cells of layer I

Binary output in layer O

Procedure for the calculation of an output for a given
input:

    1.   Multiplication of the weights gij with the input
         values

    2.   Threshold value comparison in the hidden
         layer

    3.   Propagation into the output layer,
         multiplication with the weights hij

    4.   Threshold value comparison in the output
         layer

    Given training instances, the weights gij and hij
    must be calculated:

        Error back-propagation


           In the beginning all weights are random

           The first input vector (x1,...,xn) of the training
           set is propagated through the network in
           forward direction (i.e. from left to right)

           This delivers an output vector (y1,...,ym)

           The desired output is given in the training data

           The actual output (y1,...,ym) is compared to the
           desired output (ŷ1,...,ŷm)
Call the difference between actual and desired
output the error F

Adapt weights gij and hij in such a way, that the
error F is minimized

Thus: consider the value
(y1 - ŷ1)² at the first output cell a1

Adaptation of the weight value h11 such that the
error value (y1 - ŷ1)² is reduced

Short cut: write h instead of h11

Consider the function F(h,x) in the rectangular
sector shown:

(Figure: the sub-network from input x over the weight
h = h11 to the first hidden cell, with the rectangular
sector marked. Zoomed in, with the error calculation
added at the end: input x, weight h, threshold cell,
output y, error (y - ŷ)².)


The error function F(x,h) is transformed into a
differentiable function by substituting the threshold
value function at a1.

I.e. the previous simple 0-1 threshold value computation
is substituted by a differentiable function.

Specifically: replace the step function by the sigmoid
function

    s(u) = 1 / (1 + e^(-u))

Node a1 then transforms its input to the value

    y = s(hx)

(different choices of the latter function are possible)

How to adapt h?

     Determine the one-dimensional derivative F’x(h) of
     F(x,h) with respect to h (x is regarded as a
     constant here!)

(Figure: graph of F(h) with two points h1 and h2 on
either side of the minimum.)

x is constant, thus we can write Fx(h)
instead of F(x,h)

To reduce the error, simply subtract the value F’x(h)
from h:

        If F’x(h) is negative (i.e. at h2), subtracting it
        increases h, and the error is reduced.
        If F’x(h) is positive (i.e. at h1), subtracting it
        decreases h, and the error is also reduced.

Calculation of the derivative of Fx(h):

    Set        r(h, x) = hx

               s(u) = 1 / (1 + e^(-u))

               t(y) = (y - ŷ)²

as above, write

          rx(h) = r(h,x) = hx

Thus:

    Fx(h) = t(s(rx(h)))        (ŷ and x are constant!)

The derivative of the sigmoid function is

    s'(u) = s(u) (1 - s(u))

(Exercise: check the latter formula!)


    F'x(h)    = t'(s(rx(h))) · s'(rx(h)) · r'x(h)

    (one-dimensional chain rule!)

              = 2(s(rx(h)) - ŷ) · s'(rx(h)) · x

              = 2(s(u) - ŷ) · s(u)(1 - s(u)) · x

    where u = rx(h) = xh
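The chain-rule result can be verified numerically against a finite-difference quotient (pure Python; the values chosen for h, x and the desired output ŷ are arbitrary):

```python
import math

def s(u):
    """Sigmoid replacing the 0-1 threshold: s(u) = 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def F(h, x, y_hat):
    """Error of the single cell: (s(hx) - y_hat)^2."""
    return (s(h * x) - y_hat) ** 2

def F_prime(h, x, y_hat):
    """Analytic derivative: 2(s(u) - y_hat) * s(u)(1 - s(u)) * x, u = xh."""
    u = x * h
    return 2 * (s(u) - y_hat) * s(u) * (1 - s(u)) * x

h, x, y_hat, eps = 0.7, 1.5, 1.0, 1e-6
numeric = (F(h + eps, x, y_hat) - F(h - eps, x, y_hat)) / (2 * eps)
print(abs(F_prime(h, x, y_hat) - numeric))  # close to 0
```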
Visualization (one-dimensional):

(Figure: graph of F(h) with points h1 and h2.)

At h1 the derivative F'x(h) is positive,
thus subtracting F'x(h) reduces the error.

At h2, F'x(h) is negative,
thus subtracting F'x(h) also reduces the error.

Extension to the overall network:

    Extend the error function F to the entire network

    Take the derivative of

         F(x1,...,xn) (g11,...,gkl, h11,...,hrs)

    with respect to each individual gij (resp. hij)

         (partial derivatives)

    Change the weight vector in the direction of the
    negative gradient, i.e. 'in the direction of'

         -∇F = -( ∂F/∂g11 , ... , ∂F/∂hrs )

    i.e. subtract ∇F from the weight vector

         G = (g11,...,gkl, h11,...,hrs)

    The derivative of F with respect to the weights gij
    can be calculated as above

Recall the derivative of a two-dimensional function

    F : mapping R² to R

    The vector ∇F of the partial derivatives is
    perpendicular to the isocurves
    ∇F points towards the direction of the steepest
    ascent

    Example: F: (x,y) -> x

         (The graph of the function is a plane in R³;
         the vector of the partial derivatives is the
         constant vector (1, 0).)
    Adaptation of the weight vector G through

         G - η·∇F

    Thereby η is a constant (the step size), which must
    be determined experimentally

Caution: the individual weight changes calculated above
must not be applied before all remaining weight changes
have been calculated and stored

(otherwise a typical programming bug would arise: the
calculated weight changes also depend on the current
values of the remaining weights)
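A complete weight-update loop for a single sigmoid cell illustrates the rule "compute all changes first, then apply" (pure Python; the AND training set, the step size η = 1.0, and the iteration count are illustrative choices):

```python
import math

def s(u):
    return 1.0 / (1.0 + math.exp(-u))

# Pre-classified training instances: logical AND, third input fixed at 1
# (the constant-1 input realizes the threshold as a weight).
data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
g = [0.0, 0.0, 0.0]  # weights start at 0 here for reproducibility
eta = 1.0            # step-size constant, determined experimentally

def error(weights):
    """Summed squared error over the training set."""
    return sum((s(sum(w * xi for w, xi in zip(weights, x))) - y) ** 2
               for x, y in data)

before = error(g)
for _ in range(2000):
    # First compute ALL weight changes from the current weights ...
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        u = sum(w * xi for w, xi in zip(g, x))
        d = 2 * (s(u) - y) * s(u) * (1 - s(u))
        for j in range(3):
            grad[j] += d * x[j]
    # ... and only then apply them.
    g = [w - eta * dg for w, dg in zip(g, grad)]

print(before, error(g))  # the error drops substantially
```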


    -    gradient descent with a momentum

    -    more layers
