Pattern Matching

Shared by: HC120216145935
posted:
2/16/2012
language:
English
pages:
68
							      Pattern Matching


     a    b    a       c   a   a   b
                               1
     a    b    a       c   a   b
                           4   3   2
          a    b       a   c   a   b




Strings

    A string is a sequence of characters
    Examples of strings:
        Java program
        HTML document
        DNA sequence
        Digitized image
    An alphabet S is the set of possible characters for a family of strings
    Examples of alphabets:
        ASCII
        Unicode
        {0, 1}
        {A, C, G, T}
    Let P be a string of size m
        A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
        A prefix of P is a substring of the type P[0 .. i]
        A suffix of P is a substring of the type P[i .. m - 1]
    Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
    Applications:
        Text editors
        Search engines
        Biological research

Brute-Force Algorithm

    The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
      a match is found, or
      all placements of the pattern have been tried
    Brute-force pattern matching runs in time O(nm)
    Example of worst case:
      T = aaa … ah
      P = aaah
      may occur in images and DNA sequences
      unlikely in English text

    Algorithm BruteForceMatch(T, P)
        Input text T of size n and pattern P of size m
        Output starting index of a substring of T equal to P or -1 if no such substring exists
        for i ← 0 to n - m
            { test shift i of the pattern }
            j ← 0
            while j < m and T[i + j] = P[j]
                j ← j + 1
            if j = m
                return i { match at i }
            { otherwise a mismatch occurred: try the next shift }
        return -1 { no match anywhere }
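The brute-force pseudocode above translates almost line for line into Python. The following is a minimal sketch, not a tuned implementation; the function name brute_force_match is chosen here for illustration.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # each possible shift of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # characters agree so far
        if j == m:                      # all m characters matched
            return i
    return -1                           # no shift produced a match


print(brute_force_match("abacaabadcabacabaabb", "abacab"))  # 10
print(brute_force_match("aaaaaaaa", "aaah"))                # -1
```

The second call is the worst-case shape from the slide: every shift matches three characters before failing on the fourth.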
Boyer-Moore Heuristics

    The Boyer-Moore pattern matching algorithm is based on two heuristics
     Start at the end: Compare P with a subsequence of T moving backwards
     Character-jump heuristic: When a mismatch occurs at T[i] = c
          If P contains c, shift P to align the last occurrence of c in P with T[i]
          Else, shift P to align P[0] with T[i + 1]
    Example
     [Figure: Boyer-Moore search for P = rithm in T = "a pattern matching algorithm". Six alignments are tried and 11 character comparisons are made in total; the last alignment matches all five characters of the pattern, compared right to left.]

Last-Occurrence Function

       Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet S to build the last-occurrence function L mapping S to integers, where L(c) is defined as
           the largest index i such that P[i] = c or
           -1 if no such index exists
       Example:
           P = abacab
           S = {a, b, c, d}

           c       a    b    c    d
           L(c)    4    5    3    -1

       The last-occurrence function can be represented by an array indexed by the numeric codes of the characters
       The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of S
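As an illustration of the definition above, the last-occurrence function can be sketched in Python. This sketch uses a dictionary rather than the array indexed by character codes that the slide mentions; that substitution, and the name last_occurrence, are assumptions made for brevity.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each character of the alphabet to its last index in pattern, or -1."""
    last = {c: -1 for c in alphabet}     # O(s): every character starts at -1
    for i, c in enumerate(pattern):      # O(m): later indices overwrite earlier ones
        last[c] = i
    return last


print(last_occurrence("abacab", "abcd"))  # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```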


The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, S)
   L ← lastOccurrenceFunction(P, S)
   i ← m - 1
   j ← m - 1
   repeat
       if T[i] = P[j]
           if j = 0
                return i { match at i }
           else
                i ← i - 1
                j ← j - 1
       else
           { character-jump }
           last ← L[T[i]]
           i ← i + m - min(j, 1 + last)
           j ← m - 1
   until i > n - 1 { beyond text length }
   return -1 { no match }

   In the figures, last is abbreviated "l".
   Case 1: j ≤ 1 + l
       [Figure: the text character a at position i mismatches b at pattern position j, and the last occurrence l of a in the pattern lies to the right of j. i advances by m - j.]
   Case 2: 1 + l ≤ j
       [Figure: the text character a at position i mismatches b at pattern position j, and the last occurrence l of a in the pattern lies to the left of j. i advances by m - (1 + l), which aligns that a with T[i].]
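The pseudocode above can be sketched in Python as follows. This is a simplified version using only the character-jump heuristic from these slides; the dictionary-based last-occurrence table and the name boyer_moore_match are assumptions of the sketch.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the first index of pattern in text, or -1 (character-jump heuristic only)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = {c: k for k, c in enumerate(pattern)}   # last occurrence of each pattern character
    i = j = m - 1                                  # start comparing at the end of the pattern
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                           # whole pattern matched
            i -= 1
            j -= 1
        else:
            l = last.get(text[i], -1)              # -1 when the character is not in the pattern
            i += m - min(j, 1 + l)                 # character jump
            j = m - 1                              # restart at the end of the pattern
    return -1


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # 10
```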
Update function?

               i ← i + m - min(j, last + 1)

       Why the min?
           If the last occurrence of the text character in the pattern is to the left of where you are looking, you just shift so that the characters align. That advances i by m - (last + 1).
           If the last occurrence is to the right of where you are looking, aligning the characters would require a NEGATIVE shift. That is not good. In that case, the whole pattern just shifts right by 1. HOWEVER, the code to do that is NOT obvious.
               Recall that j starts out at m - 1 and then decreases. i also starts out at the end of the pattern and decreases. Thus, if you add m - j to i, you are really moving the starting point to just one higher than the starting value of i for the current pass. Try it with some numbers to see.
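To "try it with some numbers", the following small sketch checks the claim for every possible number of matched characters. The helper name next_pass_start and the starting value 6 are hypothetical, chosen only for illustration.

```python
def next_pass_start(i: int, j: int, m: int) -> int:
    """Value of i after the update i <- i + m - j (the case min(j, last + 1) = j)."""
    return i + m - j


m = 6
start = 6                        # i at the start of a pass (hypothetical value)
for matched in range(m):         # number of characters matched before the mismatch
    i, j = start - matched, (m - 1) - matched   # i and j decrease together
    assert next_pass_start(i, j, m) == start + 1

print(next_pass_start(4, 3, 6))  # 7
```

Whatever the number of matched characters, the next pass starts exactly one position to the right of where the current pass started, which is a shift of the whole pattern by 1.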

Example

    T = a b a c a a b a d c a b a c a b a a b b   (indices 0 to 19)
    P = a b a c a b

    [Figure: Boyer-Moore run of P on T. Six alignments are tried and 13 character comparisons are made in total; the last alignment matches P at T[10..15].]
    After the mismatch at comparison 4: i = 4, j = 3, last(a) = 4, so j < last(a) + 1. With m = 6, SOOO i = 4 + 6 - 3 = 7.
    The last a is to the right of where we are currently. Just shift the pattern to the right by 1.

Analysis

       Boyer-Moore’s algorithm runs in time O(nm + |S|)
       The |S| comes from initializing the last function. We expect |S| to be less than nm, but if it weren’t, we add it to be safe.
       Example of worst case:
           T = aaa … a
           P = baaa
       The worst case may occur in images and DNA sequences but is unlikely in English text
       Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text

       [Figure: the pattern b a a a a a slid along the text a a a a a a a a a. Each of the four alignments costs 6 comparisons before failing on the b, for 24 comparisons in total.]



The KMP Algorithm - Motivation

   Knuth-Morris-Pratt’s algorithm compares the pattern to the text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
   It takes advantage of the fact that we KNOW what we have already seen in matching the pattern.
   When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
   Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

   [Figure: text . . a b a a b x . . . with the pattern a b a a b a mismatching at its last character (x against a). At this point, the prefix "ab" of the pattern matches the suffix of the PARTIAL pattern "abaab". The pattern slides right so that this prefix lines up with that suffix. No need to repeat these comparisons; resume comparing at the mismatched text character by setting j to 2.]

KMP Failure Function

   Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
   The failure function F(j) is defined as the size of the largest prefix of the pattern that is also a suffix of the partial pattern P[1..j]
   Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j - 1), which resets where we are in the pattern.
   Notice we don’t reset i, but just continue from this point on.

       j       0    1    2    3    4    5
       P[j]    a    b    a    a    b    a
       F(j)    0    0    1    1    2    3

   [Figure: text . . a b a a b x . . . with the pattern a b a a b a mismatching at j = 5; the pattern is then realigned so that comparison resumes at pattern position F(j - 1) = 2.]

The KMP Algorithm

   At each iteration of the while-loop, either
     i increases by one, or
     i stays where it is and the shift amount i - j increases by at least one (observe that F(j - 1) < j)
   One might worry about getting stuck in the section where i doesn’t increase.
   Amortized Analysis: while sometimes we just shift the pattern without moving i, we can’t do that forever, as we have to have moved forward in i and j before we can just shift the pattern.
   Hence, there are no more than 2n iterations of the while-loop
   Thus, KMP’s algorithm runs in optimal time O(n) plus the cost of computing the failure function.

   Algorithm KMPMatch(T, P)
       F ← failureFunction(P)
       i ← 0
       j ← 0
       while i < n
           if T[i] = P[j]
               if j = m - 1
                   return i - j { match }
               else { keep going }
                   i ← i + 1
                   j ← j + 1
           else
               if j > 0
                   j ← F[j - 1]
               else
                   { at first position, can’t use the failure function }
                   i ← i + 1
       return -1 { no match }
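The matching loop above can be sketched in Python as follows. To keep the sketch focused on the loop itself, the failure table is passed in as a precomputed list; that parameter and the name kmp_match are assumptions of this sketch. The table used in the example call is the one these slides tabulate for the pattern abacab.

```python
def kmp_match(text: str, pattern: str, failure: list[int]) -> int:
    """Return the first index of pattern in text, or -1, given the pattern's failure table."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j          # match starts i - j characters into the text
            i += 1                    # keep going
            j += 1
        elif j > 0:
            j = failure[j - 1]        # shift the pattern; i is NOT reset
        else:
            i += 1                    # at first position, can't use the failure function
    return -1


F = [0, 0, 1, 0, 1, 2]                # failure table for "abacab"
print(kmp_match("abacaabaccabacabaabb", "abacab", F))  # 10
```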

Computing the Failure Function

   The failure function can be represented by an array and can be computed in O(m) time
   The construction is similar to the KMP algorithm itself
   At each iteration of the while-loop, either
        i increases by one, or
        the shift amount i - j increases by at least one (observe that F(j - 1) < j)
   Hence, there are no more than 2m iterations of the while-loop
   So the total complexity of KMP is O(m + n)

   Algorithm failureFunction(P)
       F[0] ← 0
       i ← 1
       j ← 0
       while i < m
           if P[i] = P[j]
               { we have matched j + 1 chars }
               F[i] ← j + 1
               i ← i + 1
               j ← j + 1
           else if j > 0 then
               { use failure function to shift P }
               j ← F[j - 1]
           else
               F[i] ← 0 { no match }
               i ← i + 1
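The failureFunction pseudocode above can be sketched in Python as shown below; the name failure_function is chosen for illustration. The two example calls use patterns whose failure tables are tabulated in these slides.

```python
def failure_function(pattern: str) -> list[int]:
    """F[i] = length of the longest proper prefix of pattern that is a suffix of pattern[1..i]."""
    m = len(pattern)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            F[i] = j + 1              # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]              # use the failure function to shift the pattern
        else:
            F[i] = 0                  # no match
            i += 1
    return F


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
print(failure_function("abacab"))  # [0, 0, 1, 0, 1, 2]
```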

Example

    T = a b a c a a b a c c a b a c a b a a b b
    P = a b a c a b

        j       0    1    2    3    4    5
        P[j]    a    b    a    c    a    b
        F(j)    0    0    1    0    1    2

    Note, we start comparing from the left.
    Comparisons 1 to 6, pattern at shift 0: Fail at 6, reset j to 1 (F of prev loc)
    Comparison 7, pattern at shift 4: Fail at 7, reset j to 0 (F of prev loc)
    Comparisons 8 to 12, pattern at shift 5: Fail at 12, reset j to 0 (F of prev loc)
    Comparison 13, pattern at shift 9: Fail at 13, reset j to 0
    Comparisons 14 to 19, pattern at shift 10: all six characters match

Binary Failure Function

    Your programming assignment (due 10/29) extends the
     idea of the KMP string matching.
    If the input is binary in nature (only two symbols are used
     - such as x/y or 0/1), when you fail to match an x, you
     know you are looking at a y.
    Normally, when you fail, you say – “How much of the
     PREVIOUS pattern matches?” And then check the current
     location again.
    With binary input, you can say, “How much of the string
     including the real value of what I was trying to match can
     shift on top of each other?”

Example
     Text String   x   x   y   x   x   y   x   x   y   x   x   y   y   y   y   x   x   x   x   x   x


      Pattern      x   x   y   x   x   y   x   y


     Compare       0   0   0   0   0   0   0   ^


        Shift                  x   x   y   x   x   y   x   y


     Compare                                       0   0   ^


        Shift                              x   x   y   x   x   y   x   y


     Compare                                                   0   ^


        Shift                                                          x   x   y   x   x   y   x   y


     Compare                                                           ^




Can you find the Binary Failure Function (given the regular
failure function?) for the TWO pattern strings below?
                     i    P[i]      Bfail   F
                     0    x                 0
                     1    x                 1
                     2    y                 0
                     3    x                 1
                     4    x                 2
                     5    y                 3
                     6    x                 4
                     7    y                 0
                     8    x                 1
                      9   y                 0
                     10   x                 1
                     11   y                 0



                     i       p[i]               Bfail       F
                     0       a                              0
                     1       a                              1
                     2       b                              0
                     3       a                              1
                     4       a                              2
                     5       b                              3
                     6       a                              4
                     7       a                              5
                     8       a                              2
                     9       b                              3




Can you find the Binary Failure Function (given the regular
failure function?)
                     i    P[i]      Bfail   F
                     0    x         0       0
                     1    x         0       1
                     2    y         2       0
                     3    x         0       1
                     4    x         0       2
                     5    y         2       3
                     6    x         0       4
                     7    y         5       0
                     8    x         0       1
                      9   y         2       0
                     10   x         0       1
                     11   y         2       0



                     i       p[i]               Bfail       F
                     0       a                  0           0
                     1       a                  0           1
                     2       b                  2           0
                     3       a                  0           1
                     4       a                  0           2
                     5       b                  2           3
                     6       a                  0           4
                     7       a                  0           5
                     8       a                  6           2
                     9       b                  2           3




Tries

[Figure: compressed trie of the suffixes of "minimize"]

 19                                        Tries   2/16/2012 7:34 AM
Preprocessing Strings

    Preprocessing the pattern speeds up pattern matching
     queries
        After preprocessing the pattern, KMP’s algorithm performs pattern
         matching in time proportional to the text size
    If the text is large, immutable and searched for often (e.g.,
     works by Shakespeare), we may want to preprocess the
     text instead of the pattern
    A trie (pronounced TRY) is a compact data structure for
     representing a set of strings, such as all the words in a text
        A trie supports pattern matching queries in time proportional to
         the pattern size


Standard Trie


   The standard trie for a set of strings S is an ordered tree such that:
       Each node but the root is labeled with a character
       The children of a node are alphabetically ordered
       The paths from the external nodes to the root yield the strings of S
   Example: standard trie for the set of strings
    S = { bear, bell, bid, bull, buy, sell, stock, stop }


                                b                                    s

                e                i               u          e                t

         a              l       d         l             y   l                o

         r              l                 l                 l            c         p

                                                                         k
Standard Trie (cont)


    A standard trie uses O(n) space and supports searches,
     insertions and deletions in time O(dm), where:
     n total size of the strings in S
     m size of the string parameter of the operation
     d size of the alphabet


                         b                                      s

            e            i           u             e                    t

       a          l      d      l          y       l                    o

       r          l             l                  l                c         p

                                                                    k
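A standard trie for the example set can be sketched in Python as below. The class and function names are hypothetical. Note one deliberate difference from the slide: children are kept in a dictionary, so finding a child takes expected constant time instead of the O(d) scan that gives the O(dm) bound; an explicit end-of-word flag stands in for the external nodes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # maps a character to the child TrieNode
        self.is_word = False      # True when the path from the root spells a stored string


def trie_insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def trie_contains(root: TrieNode, word: str) -> bool:
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False          # fell off the trie
    return node.is_word


root = TrieNode()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie_insert(root, w)

print(trie_contains(root, "bull"), trie_contains(root, "bu"))  # True False
print(sorted(root.children))                                   # ['b', 's']
```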
Word Matching with a Trie

    We insert the words of the text into a trie
    Each leaf stores the occurrences of the associated word in the text

    Text (each line is preceded by the index of its first character):
         0   see a bear? sell stock!
        24   see a bull? buy stock!
        47   bid stock! bid stock!
        69   hear the bell? stop!

    Occurrences stored at the leaves of the trie:
        bear: 6       bell: 78       bid: 47, 58    bull: 30    buy: 36
        hear: 69      see: 0, 24     sell: 12       stop: 84
        stock: 17, 40, 51, 62
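The word index above can be reproduced with a short Python sketch. Everything named here is an assumption made for illustration: the nested-dictionary trie, the "$" key that holds a word's occurrence list, the regular expression used to split words, and the stop-word set (the slide's trie omits "a" and "the").

```python
import re


def build_word_index(text: str, stop_words=frozenset()) -> dict:
    """Insert every word of text into a trie; each word's end node stores its start indices."""
    root = {}
    for match in re.finditer(r"[a-z]+", text):
        word = match.group()
        if word in stop_words:
            continue
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(match.start())   # "$" marks the occurrence list
    return root


def occurrences(root: dict, word: str) -> list:
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])


text = ("see a bear? sell stock! see a bull? buy stock! "
        "bid stock! bid stock! hear the bell? stop!")
index = build_word_index(text, stop_words={"a", "the"})
print(occurrences(index, "stock"))  # [17, 40, 51, 62]
print(occurrences(index, "see"))    # [0, 24]
```

A query walks one trie edge per pattern character, so its cost depends on the length of the word and not on the length of the text.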
Compressed Trie

       A compressed trie has internal nodes of degree at least two
       It is obtained from standard trie by compressing chains of “redundant” nodes

       [Figure: compressed trie for { bear, bell, bid, bull, buy, sell, stock, stop }, drawn above the corresponding standard trie]
           b
               e
                   ar
                   ll
               id
               u
                   ll
                   y
           s
               ell
               to
                   ck
                   p

Compact Representation

    Compact representation of a compressed trie for an array of strings:
        Stores at the nodes ranges of indices instead of substrings in order to make nodes
         a fixed size
        Uses O(s) space, where s is the number of strings in the array
        Serves as an auxiliary index structure

                                  0 1 2 3 4                          0 1 2 3                            0 1 2 3
                    S[0] =        s e e                    S[4] =    b u l l             S[7] =         h e a r
                    S[1] =        b e a r                  S[5] =    b u y               S[8] =         b e l l
                    S[2] =        s e l l                  S[6] =    b i d               S[9] =         s t o p
                    S[3] =        s t o c k

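    [Figure: compressed trie in which each node stores a triple i, j, k standing for the substring S[i][j..k], shown in parentheses]
        1, 0, 0 (b)
            1, 1, 1 (e)
                1, 2, 3 (ar)
                8, 2, 3 (ll)
            6, 1, 2 (id)
            4, 1, 1 (u)
                4, 2, 3 (ll)
                5, 2, 2 (y)
        7, 0, 3 (hear)
        0, 0, 0 (s)
            0, 1, 1 (e)
                0, 2, 2 (e)
                2, 2, 3 (ll)
            3, 1, 2 (to)
                3, 3, 4 (ck)
                9, 3, 3 (p)
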
    Suffix Trie

   The compressed trie doesn’t work if you don’t start at the beginning of the
    word. Suppose you were allowed to start ANYWHERE in the word.
   The suffix trie of a string X is the compressed trie of all the suffixes of X


                             m i n i m i z e
                             0 1 2 3 4 5 6 7



          e          i                            mi            nimize              ze


      mize        nimize        ze       nimize         ze


Suffix Trie (cont) – showing as numbers


    Compact representation of the suffix trie for a string X of size n from
     an alphabet of size d
         Uses O(n) space
         Supports arbitrary pattern matching queries in X in O(dm) time, where m is
          the size of the pattern


                                m i n i m i z e
                                0 1 2 3 4 5 6 7



         7, 7         1, 1                          0, 1             2, 7              6, 7

            4, 7      2, 7       6, 7        2, 7          6, 7
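To illustrate how a suffix trie answers arbitrary substring queries, here is a deliberately simplified Python sketch. It builds an UNCOMPRESSED trie of all suffixes using nested dictionaries, so it takes O(n^2) space in the worst case, unlike the compact O(n) representation described above; the function names are hypothetical.

```python
def build_suffix_trie(x: str) -> dict:
    """Uncompressed trie of all suffixes of x (one node per character)."""
    root = {}
    for start in range(len(x)):
        node = root
        for ch in x[start:]:
            node = node.setdefault(ch, {})
    return root


def contains_substring(root: dict, pattern: str) -> bool:
    """Every substring of x is a prefix of some suffix, so just walk down from the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(contains_substring(trie, "nimi"))  # True
print(contains_substring(trie, "zen"))   # False
print(sorted(trie))                      # ['e', 'i', 'm', 'n', 'z']
```

The five first characters at the root correspond to the five children of the root in the figure (e, i, mi, nimize, ze).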

Encoding Trie


    A code is a mapping of each character of an alphabet to a binary code-
     word
    A prefix code is a binary code such that no code-word is the prefix of
     another code-word
    An encoding trie represents a prefix code
        Each leaf stores a character
        The code word of a character is given by the path from the root to the leaf
         storing the character (0 for a left child and 1 for a right child)

         character    a     b      c      d     e
         code-word    00    010    011    10    11

         [Figure: encoding trie with leaves a, d, e at depth 2 and leaves b, c at depth 3]
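Decoding with an encoding trie can be sketched in Python using the code table above. The nested-dictionary trie keyed by the bits "0" and "1", and all function names, are assumptions of this sketch.

```python
CODE = {"a": "00", "b": "010", "c": "011", "d": "10", "e": "11"}


def build_encoding_trie(code: dict) -> dict:
    """Internal nodes are dicts keyed by '0'/'1'; leaves are the characters themselves."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch               # the leaf stores the character
    return root


def encode(text: str, code: dict) -> str:
    return "".join(code[ch] for ch in text)


def decode(bits: str, root: dict) -> str:
    out, node = [], root
    for bit in bits:
        node = node[bit]                  # 0 = left child, 1 = right child
        if isinstance(node, str):         # reached a leaf: emit it and restart at the root
            out.append(node)
            node = root
    return "".join(out)


trie = build_encoding_trie(CODE)
print(encode("bad", CODE))                # 0100010
print(decode("0100010", trie))            # bad
```

Because no code-word is a prefix of another, the decoder never needs a separator between code-words: reaching a leaf is unambiguous.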
Encoding Trie (cont)


    Given a text string X, we want to find a prefix code for the characters of X
     that yields a small encoding for X
        Frequent characters should have short code-words
        Rare characters should have long code-words
    Example
        X = abracadabra
        T1 encodes X into 29 bits
        T2 encodes X into 24 bits

         [Figure: T1 has leaves c, d, b at depth 2 and leaves a, r at depth 3; T2 has leaves a, b, r at depth 2 and leaves c, d at depth 3]

Huffman’s Algorithm

    Given a string X, Huffman’s algorithm constructs a prefix code that minimizes the size of the encoding of the string.
    It runs in time O(n + d log d), where n is the size of the string and d is the number of distinct characters of the string
    A heap-based priority queue is used as an auxiliary structure

    Algorithm HuffmanEncoding(X)
      Input string X of size n
      Output optimal encoding trie for X
      C ← distinctCharacters(X)
      computeFrequencies(C, X)
      Q ← new empty heap
      for all c ∈ C
        T ← new single-node tree storing c
        Q.insert(getFrequency(c), T)
      while Q.size() > 1
        f1 ← Q.minKey()
        T1 ← Q.removeMin()
        f2 ← Q.minKey()
        T2 ← Q.removeMin()
        T ← join(T1, T2)
        Q.insert(f1 + f2, T)
      return Q.removeMin()
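The algorithm above can be sketched with Python's heapq module as the heap-based priority queue. The tuple representation of trees, the tie-breaking counter, and the function name huffman_code are assumptions of this sketch; ties between equal frequencies may be broken differently from the slides, which changes the individual code-words but not the total encoded length.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(x: str) -> dict:
    """Return a prefix code (character -> bit string) minimizing the encoded size of x."""
    if not x:
        return {}
    tie = count()                                   # tie-breaker so trees are never compared
    heap = [(f, next(tie), ch) for ch, f in Counter(x).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)             # two trees of smallest frequency
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))   # join(T1, T2)
    tree = heap[0][2]

    code = {}

    def walk(node, prefix):
        if isinstance(node, str):                   # leaf: a single character
            code[node] = prefix or "0"              # lone-character input gets a 1-bit code
        else:
            walk(node[0], prefix + "0")             # left child
            walk(node[1], prefix + "1")             # right child

    walk(tree, "")
    return code


code = huffman_code("abracadabra")
print(sum(len(code[ch]) for ch in "abracadabra"))   # 23
```

The 23 bits are the sum of the merged weights 2 + 4 + 6 + 11 for the frequencies 5, 2, 2, 1, 1.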
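Example

X = abracadabra
Frequencies

     a   b   c   d   r
     5   2   1   1   2

[Figure: construction of the Huffman tree, step by step]
    1. Join c (1) and d (1) into a tree of weight 2
    2. Join b (2) and r (2) into a tree of weight 4 (the choice of which two is not unique)
    3. Join the trees of weight 2 and 4 into a tree of weight 6
    4. Join a (5) and the tree of weight 6 into the final tree of weight 11
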
At Seats

   X = catinthehat
   Frequencies in English text
       i    n    b    a    c    t    h    e
       8    8    2    8    3    9    6    12



           Create tree

           Use to decode




Text Similarity
Detect similarity to focus on, or ignore, slight differences
a. DNA analysis
b. Web crawlers omit duplicate pages, distinguish between similar ones
c. Updated files, archiving, delta files, and editing distance

Longest Common Subsequence
One measure of similarity is the
length of the longest common
subsequence between two texts.
This is NOT a contiguous substring,
so it loses a great deal of structure.
I doubt that it is an effective metric
for all types of similarity, unless
the subsequence is a substantial
part of the whole text.
LCS algorithm uses the dynamic
programming approach

Recall: the first step is to find the
recursion. How do we write LCS in
terms of other LCS problems?
 The parameters for the smaller
problems being composed to solve a
larger problem are the lengths of a
prefix of X and a prefix of Y.


Find recursion:

Let L(i,j) be the length of the LCS of the
two prefixes X[0..i] and Y[0..j].

Suppose we know L(i, j), L(i+1, j) and L(i,
j+1) and want to know L(i+1, j+1).
a. If X[i+1] = Y[j+1] then the best we can
do is L(i+1, j+1) = L(i, j) + 1.
b. If X[i+1] != Y[j+1]
   then L(i+1, j+1) = max(L(i, j+1), L(i+1, j))


           Longest Common Subsequence
         One measure of similarity is the length of the
     longest common subsequence between         two texts.

     *    a    b    c    d    g    h    t      h      m s
*    0    0    0    0    0    0    0    0      0      0   0
a    0    1    1    1    1    1    1    1      1      1   1
e    0    1    1    1    1    1    1    1      1      1   1
d    0    1    1    1    2    2    2    2      2      2   2
f    0    1    1    1    2    2    2    2      2      2   2
h    0    1    1    1    2    2    3    3      3      3   3
h    0    1    1    1    2    2    3    3      4      4   4
This algorithm initializes the array or table for L
by putting 0’s along the borders, then uses a
simple nested loop to fill in the values row by row.
Thus it runs in O(nm) time.

While the algorithm only tells the length of the
LCS, the actual string can easily be found by
working backward through the table (and
strings), noting points at which the two
characters are equal
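The table fill and the backward walk can be sketched as follows. This is a minimal illustration, not the textbook's code; the function name lcs and the 0-based string indexing (with row 0 and column 0 as the zero border) are assumptions made for the example.

```python
def lcs(x, y):
    """Return (length, one LCS string) for strings x and y."""
    n, m = len(x), len(y)
    # table[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 are the zero borders.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk backward from the bottom-right corner, collecting matched characters.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            chars.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return table[n][m], "".join(reversed(chars))


print(lcs("aedfhh", "abcdghthms"))
```

The two strings in the call are the row and column labels of the table in these notes, so the returned length should agree with the bottom-right cell.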


         Longest Common Subsequence
     Mark each cell with the information needed to generate the string.
     Every diagonal move shows a character that is part of the LCS.


     *    a    b   c    d    g   h    t      h      m s
*    0    0    0   0    0    0   0    0      0      0   0
a    0    1    1   1    1    1   1    1      1      1   1
e    0    1    1   1    1    1   1    1      1      1   1
d    0    1    1   1    2    2   2    2      2      2   2
f    0    1    1   1    2    2   2    2      2      2   2
h    0    1    1   1    2    2   3    3      3      3   3
h    0    1    1   1    2    2   3    3      4      4   4
                 Try this one…

     *   i   d   o   n   o   t    l      i      k   e
*    0   0   0   0   0   0   0    0      0      0   0
n    0
o    0
t    0
i    0
c    0
e    0
The rest of the material in these notes is not in your text (except as exercises)



    Sequence Comparisons
        Problems in molecular biology involve finding the
         minimum number of edit steps which are required to
         change one string into another.
        Three types of edit steps: insert, delete, replace.
        (replace may cost extra as it is like delete and insert)
        The non-edit step is “match” – costing zero.
        Example: abbc → babb
        abbc → bbc → babc → babb (3 steps)
        abbc → babbc → babb (2 steps)
        We are trying to minimize the number of steps.


Idea: look at making just one position right. Find all
the ways you could do so.

    Count how much each would cost (using
    recursion) and pick the best cost.
   Then use dynamic programming. Orderly
    way of limiting the exponential number of
    combinations to think about.
   For ease in coding, we make the last
    character correct (rather than any other).



First steps to dynamic programming

    Think of the problem recursively. Find your prototype –
     what comes in and what comes out.
         Int C(n,m) returns the cost of turning the first
          n characters of the source string (A) into the
          first m characters of the destination string (B).
         Now, find the recursion. You have a helper who
          will do ANY smaller sub-problem of the same
          variety. What will you have them do? Be lazy.
          Let the helper do MOST of the work.



Types of edit steps: insert, delete, replace, match. Consider
match to be “free” but the others to cost 1.
C(n,m) is the cost of changing A(n), the first n characters of A,
into B(m), the first m characters of B.
Base cases: C(0,m) = m (all inserts) and C(n,0) = n (all deletes).
There are four possibilities (pick the cheapest)


      1. If we delete a_n, we need to change A(n-1) to B(m).
           The cost is C(n,m) = C(n-1,m) + 1
      2. If we insert a new value at the end of A(n) to match b_m,
           we would still have to change A(n) to B(m-1). The
           cost is C(n,m) = C(n,m-1) + 1
      3. If we replace a_n with b_m, we still have to change A(n-1)
           to B(m-1). The cost is C(n,m) = C(n-1,m-1) + 1
      4. If a_n = b_m and we match a_n with b_m, we still have to change A(n-1)
           to B(m-1). The cost is C(n,m) = C(n-1,m-1)



        We have turned one problem into three problems - just
         slightly smaller.
        Bad situation - unless we can reuse results. Dynamic
         Programming.
        We store the results of C(i,j) for i = 0..n and j = 0..m.
        If we need to reconstruct how we would achieve the
         change, we store both the cost and an indication of
         which set of subproblems was used.
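A minimal sketch of the recursion with stored results, assuming match costs 0 and every other step costs 1. The name edit_cost is illustrative, and functools.lru_cache stands in for the table of stored C(i,j) values.

```python
from functools import lru_cache


def edit_cost(a, b):
    """Cost of turning string a into string b (insert/delete/replace cost 1, match 0)."""

    @lru_cache(maxsize=None)  # remembers each C(n, m) so it is computed only once
    def c(n, m):
        if n == 0:
            return m  # insert the first m characters of b
        if m == 0:
            return n  # delete the first n characters of a
        best = min(
            c(n - 1, m) + 1,      # delete a_n
            c(n, m - 1) + 1,      # insert b_m
            c(n - 1, m - 1) + 1,  # replace a_n with b_m
        )
        if a[n - 1] == b[m - 1]:
            best = min(best, c(n - 1, m - 1))  # match is free
        return best

    return c(len(a), len(b))


print(edit_cost("abbc", "babb"))
```

Without the cache the same subproblems are recomputed many times; with it, at most (n+1)(m+1) distinct calls are evaluated.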




We also store M(i,j), which indicates which of the four decisions leads to the best result.




         Complexity: O(mn) - but needs O(mn) space as well.
         Consider changing do to redo:
         Consider changing mane to mean:




At your seats try
Changing “mane” to “mean”

         *   m    e             a     n
     *   0
     m
     a
     n
     e

Changing “mane” to “mean”

         *     m     e             a     n
     *   0     I-1   I-2           I-3   I-4
     m   D-1   M-0   I-1           I-2   I-3
     a   D-2   D-1   R-1           M-1   I-2
     n   D-3   D-2   R-2           D-2   M-1
     e   D-4   D-3   M-2           D-3   D-2
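A table like the one above can be produced bottom-up. The sketch below is one way to do it under the same cost assumptions (match 0, others 1). The names edit_table and edit_script are illustrative. When several decisions tie, this sketch prefers M, then R, D, I, so tied cells may carry a different letter than a hand-filled table while the costs agree.

```python
def edit_table(a, b):
    """Return table[i][j] = (decision, cost) for turning a[:i] into b[:j]."""
    n, m = len(a), len(b)
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = ("", 0)
    for j in range(1, m + 1):
        table[0][j] = ("I", j)  # build b[:j] from nothing: j inserts
    for i in range(1, n + 1):
        table[i][0] = ("D", i)  # erase a[:i]: i deletes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = []
            if a[i - 1] == b[j - 1]:
                options.append((table[i - 1][j - 1][1], "M"))
            options.append((table[i - 1][j - 1][1] + 1, "R"))
            options.append((table[i - 1][j][1] + 1, "D"))
            options.append((table[i][j - 1][1] + 1, "I"))
            cost, decision = min(options, key=lambda o: o[0])  # first cheapest wins ties
            table[i][j] = (decision, cost)
    return table


def edit_script(a, b):
    """Walk back from the bottom-right cell and list the decisions in order."""
    table = edit_table(a, b)
    i, j = len(a), len(b)
    steps = []
    while i > 0 or j > 0:
        decision = table[i][j][0]
        steps.append(decision)
        if decision in ("M", "R"):
            i, j = i - 1, j - 1
        elif decision == "D":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(steps))


print(edit_script("mane", "mean"))
```

The number of non-M letters in the returned script equals the cost stored in the bottom-right cell.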

Changing “do” to “redo”
Assume: match is free; others are 1.
I show the choices as I- or R-, etc,
but could have shown with an arrow as well

        *       r       e       d       o
*       0       I-1     I-2     I-3     I-4
d       D-1     R-1     R-2     M-2     I-3
o       D-2     R-2     R-2     R-3     M-2

Another problem:
Longest Increasing Subsequence of single
list

     Find the longest increasing subsequence in a sequence of
        distinct integers.
      Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11
     Why do we care? Classic problem:
      1. computational biology: related to MUMmer system
        for aligning genomes.
      2. Card games
      3. Airline boarding problem
      4. Maximization problem in a random environment.

     How do we solve?



 Longest Increasing Subsequence of single
 list
Find the longest increasing subsequence in a sequence of distinct integers.
Idea 1. Given a sequence of size less than m, we can find its longest increasing subsequence.
   (Recursion) What is the problem? Can we use a subproblem to solve the larger
   problem? Will the solution to the smaller problem be a part of the solution
   for the larger problem?

 Case 1: The new element either can be added to the longest subsequence or not
 Case 2: It is possible that it can be added to a non-selected subsequence
  (creating a sequence of equal length - but having a smaller ending point)
 Case 3: It can be added to a non-selected sub-sequence creating a sequence
  of smaller length but successors make it a good choice.

 Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11
The longest increasing subsequence of a prefix such as 5 1 10 2 20 30 40 is not
  part of the complete solution


 Idea 2. Given a sequence of size < m,
we know how to find the longest increasing
subsequence for EVERY smaller problem.


     We don’t know which problem we want to add to.
     What is the complexity? For each n, we call n-1 subproblems
      which are 1 smaller. Looks exponential.
     We would need to store results of subproblems some way.




Idea: if you have two subsequences of length x,
the one with the smaller end value is preferable.

 BIS: an array of the best (least value) ending point for an increasing
   subsequence of each length. Treat BIS(0) as -infinity and unused
   entries as +infinity.
 For s = 1 to n (or recursively the other way)
  For k = s downto 1 until the correct spot is found
       If BIS(k) > A(s) and BIS(k-1) < A(s)
                BIS(k) = A(s)




     Actually, we don't need the sequential search, as we can do a
       binary search. The whole scan then takes O(n log n).
     5 1 10 2 12 8 15 18 45 6 7 3 8 9
     Each row lists the values stored in BIS for that length, in the order
       they were stored; the last one is the final value.
     Length   BIS
       1      5 1
       2      10 2
       3      12 8 6 3
       4      15 7
       5      18 8
       6      45 9
        To output the sequence would be difficult as you don't
         know where the sequence is. You would have to
         reconstruct. You only know the length of the longest
         increasing subsequence.
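The binary-search version can be sketched with Python's bisect module. The function name best_ends is an illustrative assumption; the returned list plays the role of BIS, with index k-1 holding the entry for length k.

```python
from bisect import bisect_left


def best_ends(seq):
    """Return BIS as a list: entry k-1 is the smallest ending value of an
    increasing subsequence of length k. Its length is the LIS length."""
    bis = []
    for value in seq:
        k = bisect_left(bis, value)  # first slot whose ending value is >= value
        if k == len(bis):
            bis.append(value)  # value extends the longest subsequence found so far
        else:
            bis[k] = value     # value is a smaller ending point for length k+1
    return bis


ends = best_ends([5, 1, 10, 2, 12, 8, 15, 18, 45, 6, 7, 3, 8, 9])
print(len(ends), ends)
```

The final list holds the last value of each row of the BIS table, and its length is the length of the longest increasing subsequence. It is not itself guaranteed to be a subsequence of the input.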




Try: 8 1 4 2 9 10 3 5 14 11 12 7

Length   End Value   1st Replacement   2nd Replacement
1        8           1
2        4           2
3        9           3
4        10          5
5        14          11                7
6        12
Probabilistic Algorithms

        Suppose we have a collection of items and want to find a number
         that is greater than the median (the number for which half are
         bigger).
         How would you solve it?




        We could sort them - O(n log n) and then select one in last half.
        We could find the biggest - but stop looking half way through.
         O(n/2)
        Cannot guarantee one in the upper half in less than n/2
         comparisons.
        What if you just wanted good odds?
        Pick two numbers and keep the larger one. What is the probability it is in
         the lower half?



Pick two numbers
 There are four equally likely possibilities:
   both are lower than the median
   the first is lower, the other higher
   the first is higher, the other lower
   both are higher

 If we pick the larger of the two numbers,
 we will be right 75% of the time! We only lose if both are in
    the lower half.




Select k elements and pick the biggest; the
probability of being correct is 1 - 1/2^k. Good
odds - controlled odds.

    Termed a Monte Carlo algorithm. It may
  give the wrong result with very small probability.
      The method is named after the city in the principality of Monaco, because
      a roulette wheel is a simple random number generator. The name and the systematic
      development of Monte Carlo methods date from about 1944.
  Another type of probabilistic algorithm is one that never gives
    a wrong result, but its running time is not guaranteed.
   Termed a Las Vegas algorithm, as you are guaranteed
    success if you try long enough and don’t care how much you
    spend.
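The pick-k-and-keep-the-largest idea can be sketched in a few lines. The function names are illustrative, and picking with replacement (so the k picks are independent) is an assumption made to match the 1 - 1/2^k figure.

```python
import random


def probably_above_median(items, k, rng=random):
    """Monte Carlo: return the largest of k randomly chosen items.

    The answer can be wrong (in the lower half), but only if all k picks are.
    """
    return max(rng.choice(items) for _ in range(k))


def success_probability(k):
    """Chance that at least one of k independent picks lands in the upper half."""
    return 1 - 0.5 ** k


print(probably_above_median(list(range(100)), 5), success_probability(5))
```

Raising k buys better odds at the cost of k - 1 comparisons, which is still independent of the size of the collection.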




A coloring Problem: Las Vegas Style




      Let S be a set with n elements. (n only affects complexity, not the
       algorithm)
      Let S1, S2, ... Sk be a collection of distinct (in some way different)
     subsets of S, each containing exactly r elements, such that k ≤ 2^(r-2).
       (We will use this fact to bound the time)
      GOAL: Color each element of S with one of two colors (red or
       blue) such that each subset Si contains at least one red and one
       blue element.




Idea


        Try coloring them randomly and then just checking to
         see if you happen to win. Checking is fast, as you can
         quit checking each subset when you see one of each.
         You can quit checking the collection (and announce
         failure) when any single color subset is found.
        What is the probability that all items in a set of r
         elements are red? 1/2^r,
        as each of the two colors is equally likely to be
         assigned and there are r items in the set.




What is the probability that any one of the subsets in the collection
is all red?
        At most k/2^r = 1/2^r + 1/2^r + … + 1/2^r
      Since we are looking for the "or" of a set of events, we
       add the probabilities (this gives an upper bound).
       k is bounded by 2^(r-2), so k * 1/2^r <= 1/4
      The probability that some subset is all blue or all red is at most one half.
       (double the probability of all red)
      If our random coloring fails, we simply try again until success.
       Our expected number of attempts is at most 2.
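The try-again-until-success loop can be sketched as below. The function names, the "red"/"blue" strings, and the sample subsets (r = 4, k = 4 = 2^(r-2)) are illustrative assumptions.

```python
import random


def is_valid(coloring, subsets):
    """True if every subset contains at least one red and one blue element.

    all() stops at the first single-color subset, so failure is announced early.
    """
    return all(len({coloring[x] for x in s}) == 2 for s in subsets)


def las_vegas_coloring(elements, subsets, rng=random):
    """Recolor at random until the coloring is valid; returns (coloring, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        coloring = {x: rng.choice(("red", "blue")) for x in elements}
        if is_valid(coloring, subsets):
            return coloring, attempts


subsets = [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}, {0, 2, 4, 5}]
coloring, attempts = las_vegas_coloring(range(6), subsets)
print(coloring, attempts)
```

The result is always correct when the function returns; only the number of attempts is random.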




Finding a Majority


        Let E be a sequence of integers x1, x2, x3, ... xn. The
         multiplicity of x in E is the number of times x appears in E.
         A number z is a majority in E if its multiplicity is greater
         than n/2.
        Problem: given a sequence of numbers, find the majority in
         the sequence or determine that none exists.
        NOTE: we don’t want to merely find who has the most
         votes, but determine who has more than half of the votes.




        For example, suppose there is an election. Candidates are
         represented as integers. Votes are represented as a list of
         candidate numbers.
        We are assuming no limit on the number of possible
         candidates.




Ideas

     1. sort the list O(n log n)
     2. Go through the list, incrementing the count of each candidate. To
         look up a candidate, we need to store the candidates
         somewhere. With a balanced tree of candidate names, the complexity
         would be n log c (where c is the number of candidates). Note, if we
         don’t know how many candidates there are, we can’t give them indices.
     3. Quick select. See if median (kth largest item) occurs more than n/2
         times. O(n) (Find the median, and then make a pass through seeing
         how many times it occurs.)
     4. Take a small sample. Find the majority - then count how many times
         it occurs in the whole list.
     5. Make one pass - Discard elements that won’t affect majority.




Our algorithm will find a possible majority.


        Algorithm: find two unequal elements. Delete them. Find the
         majority in the smaller list. Then see if it is a majority in the original
         list.
        How do we remove elements? It is easy. We scan the list in order.
        We are looking for a pair to eliminate.
        Let i be the current position. All the items before xi which have not
         been eliminated have the same value. All you really need to keep is
         that candidate value, C, and the number of times it occurs among the
         items not yet deleted.




     Note:
      If there is a majority and if xi ≠ xj and we remove both of
       them, then the majority in the old list is the majority in the
       new list.
      Reasoning: if xi is the majority, it had to be more than half,
       so throwing out a subset where it is EXACTLY half won’t
       affect the majority.
      If xi is not a majority, throwing it out won’t matter.
      If xi is a majority, there are m xi’s out of n, where m > n/2.
       Notice if we subtract one from both sides, we get
                 m-1 > n/2 - 1 = (n-2)/2
      If we remove two unequal elements, at most one of them is the majority
       value, so at least m-1 copies remain among n-2 elements, and m-1 > (n-2)/2.
      The converse is not true. If there is no majority, removing
       two may make something a majority in the smaller list:
       1,2,4,5,5 (remove 1 and 2, and 5 is a majority of 4,5,5).

 For example:


List:      1 4 6 3 4 4 4 2 9 0 2 4 1 4 2 2 3 2 4 2
Occurs:    1 X 1 X 1 2 3 2 1 X 1 X 1 X 1 2 1 2 1 2
Candidate: 1 ? 6 ? 4 4 4 4 4 ? 2 ? 1 ? 2 2 2 2 2 2
(X marks a count of zero; ? means there is no current candidate.)
2 is a candidate, but is not a majority in the whole list.
 Complexity: n-1 compares to find a candidate. n-1
   compares to test if it is a majority.
So why do this over other ways? Simple to code. No different in terms
   of complexity, but interesting to think about.
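The one-pass elimination followed by a verification pass can be sketched as follows. The function names are illustrative, and the sketch assumes None never appears as a vote.

```python
def majority_candidate(seq):
    """One pass: cancel pairs of unequal elements; return the surviving candidate or None."""
    candidate, count = None, 0
    for x in seq:
        if count == 0:
            candidate, count = x, 1  # nothing left uneliminated: x becomes the candidate
        elif x == candidate:
            count += 1
        else:
            count -= 1               # x and one copy of the candidate eliminate each other
    return candidate if count > 0 else None


def majority(seq):
    """Return the element occurring more than len(seq)/2 times, or None."""
    candidate = majority_candidate(seq)
    if candidate is not None and 2 * sum(1 for x in seq if x == candidate) > len(seq):
        return candidate
    return None


votes = [1, 4, 6, 3, 4, 4, 4, 2, 9, 0, 2, 4, 1, 4, 2, 2, 3, 2, 4, 2]
print(majority_candidate(votes), majority(votes))
```

The second pass is required: the first pass only guarantees that if a majority exists, it is the surviving candidate.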



