							String Searching
          CSCI 2720
         Spring 2007
     Eileen Kraemer
String Search
 A common word processor facility is to search
 for a given word in a document. Generally, the
 problem is to search for occurrences of a short
 string in a long string.


  Example: search for "the" in "Do the first then do the other one"
History of String Search
The brute force algorithm:
   invented in the dawn of computer history
   re-invented many times, still common
Knuth & Pratt invented a better one in 1970
   invented independently by Morris
   published 1977 as “Knuth-Morris-Pratt”
Boyer & Moore found a better one before 1976
   found independently by Gosper
Karp & Rabin found a “better” one in 1980
 The obvious algorithm is to try the word at each possible
  place, and compare all the characters:
   for i := 0 to n-m do                  (doc length n)
      for j := 0 to m-1 do               (word length m)
          compare word[j] with doc[i+j]
          if not equal, exit the inner loop
 The complexity is at worst O(m*n) and best
  O(n).
Improving String Search

Surprisingly, there is a faster algorithm
where you compare the last characters first:
  Do the first then do the other one
   the
         compare 'e' with ' ': they differ, and ' ' does not occur in the word, so move along 3 places

   Do the first then do the other one
                 the can only move along 2 places
Improved string search, continued

In every case where the document
 character is not one of the characters in
 the word, we can move along m places.
 Sometimes, it is less.
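The shifting idea above can be sketched in Python. This is a minimal sketch of the simplified last-character rule only (often attributed to Horspool), not the full Boyer-Moore method covered later; the function name `last_char_shift_search` and the `shift` table are illustrative assumptions, not part of the slides.

```python
def last_char_shift_search(word, doc):
    """Return the first index of word in doc, or -1 (last-character shift rule)."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    # For each character of the word except the last one, record how far its
    # last occurrence is from the end of the word. Any other character allows
    # a full shift of m places.
    shift = {c: m - 1 - j for j, c in enumerate(word[:-1])}
    k = 0
    while k <= n - m:
        i = m - 1
        # Compare the last characters first, moving right to left.
        while i >= 0 and word[i] == doc[k + i]:
            i -= 1
        if i < 0:
            return k
        # Shift according to the document character under the word's last position.
        k += shift.get(doc[k + m - 1], m)
    return -1


print(last_char_shift_search("the", "Do the first then do the other one"))
```

For the word "the", the table holds a shift of 2 for 't' and 1 for 'h'; every other document character, including the space in the first comparison, gives the full shift of 3.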
Problem Definition, terminology
 Let p be the pattern string
 Let t be the target string
 Let k be the index of the character in the target
  string that “lies over” the first character of the
  pattern
 Given two strings, p and t, over the alphabet Σ,
  determine whether p occurs as a substring of t
 That is, determine whether there exists k such
  that p=Substring(t,k,|p|).
Straightforward string searching
function SimpleStringSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
 for k from 0 to Length(t) - Length(p) do
        i <- 0
        while i < Length(p) and p[i] = t[k+i] do
                 i <- i+1
        if i == Length(p) then return k
 return -1
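The pseudocode above translates almost line for line into Python. This is a minimal sketch; the name `simple_string_search` is an illustrative choice, and the example strings are the ones used in the trace that follows.

```python
def simple_string_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    for k in range(len(t) - len(p) + 1):
        i = 0
        # Advance i while the pattern keeps matching the target at offset k.
        while i < len(p) and p[i] == t[k + i]:
            i += 1
        if i == len(p):
            return k
    return -1


print(simple_string_search("ABCD", "ABCEFGABCDE"))  # 6
```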
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
p[0]  p[1]  p[2]  p[3]
A     B     C     D
Y     Y     Y     N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
      p[0]  p[1]  p[2]  p[3]
      A     B     C     D
      N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
            p[0]  p[1]  p[2]  p[3]
            A     B     C     D
            N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                  p[0]  p[1]  p[2]  p[3]
                  A     B     C     D
                  N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                        p[0]  p[1]  p[2]  p[3]
                        A     B     C     D
                        N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                              p[0]  p[1]  p[2]  p[3]
                              A     B     C     D
                              N
SimpleStringSearch
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                                    p[0]  p[1]  p[2]  p[3]
                                    A     B     C     D
                                    Y     Y     Y     Y
Straightforward string searching
Worst case:
  Pattern string always matches completely except for last
   character
  Example: search for XXXXXXY in target string of
   XXXXXXXXXXXXXXXXXXXX
  Outer loop executed once for every character in target
   string
  Inner loop executed once for every character in pattern
  Running time: O(|p| * |t|)
 Okay if patterns are short, but better algorithms
  exist
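The worst-case example above can be checked by counting character comparisons. This is a small sketch; `count_comparisons` is a hypothetical helper written for this illustration, not something from the slides.

```python
def count_comparisons(p, t):
    """Run the straightforward search; return (location, comparisons made)."""
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        for i in range(len(p)):
            comparisons += 1
            if p[i] != t[k + i]:
                break
        else:
            # The inner loop finished without a mismatch: p occurs at k.
            return k, comparisons
    return -1, comparisons


print(count_comparisons("XXXXXXY", "X" * 20))  # (-1, 98)
```

The pattern is tried at 14 positions and each attempt makes all 7 comparisons before failing on the final Y, giving 14 * 7 = 98 comparisons.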
Knuth-Morris-Pratt

Avoids the O(|p| * |t|) worst case of the straightforward search
Key idea:
   if pattern fails to match, slide pattern to right by
   as many boxes as possible without permitting a
   match to go unnoticed
Knuth-Morris-Pratt
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
X     Y     X     Y     X     Y     c

p[0]  p[1]  p[2]  p[3]  p[4]
X     Y     X     Y     Z
Y     Y     Y     Y     N

            X     Y     X     Y     Z
            Y     Y     Y     Y     ?
Knuth-Morris-Pratt

Correct motion of pattern depends on both
 location of mismatch and the mismatching
 character
If c == X : move 2 boxes to right
If c == E : move 5 boxes to right
If c == Z : target found; alg terminates
Knuth-Morris-Pratt

Goal: determine d, number of boxes to
 right pattern should move; smallest d such
 that:
    p[0] = t[k+d]
    p[1] = t[k+d+1]
    p[2] = t[k+d+2]
    …
    p[i-d] = t[k+i]
Knuth-Morris-Pratt

Note: can be stated largely in terms of
 pattern alone.
Value of d depends only on:
  The pattern
  The value of i
  The mismatching character c (at t[k+i])
Knuth-Morris-Pratt
 Can define a function KMPskip(p,i,c) to give
  correct d
  Return smallest integer d such that 0 <= d <= i, such that
   p[i-d] == c and p[j] == p[j+d] for each 0 <= j <= i-d-1
  Return i+1 if no such d exists


 Calculate all values of KMPskip for pattern p and
  store it in KMPskiparray
 do lookup at each mismatch
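The definition of KMPskip can be transcribed directly into Python. This is a minimal sketch that follows the definition literally rather than computing the table efficiently; the lowercase name `kmp_skip` and the use of '?' to stand for any character outside the pattern are assumptions made for the illustration.

```python
def kmp_skip(p, i, c):
    """Smallest d with 0 <= d <= i such that p[i-d] == c and
    p[j] == p[j+d] for each 0 <= j <= i-d-1; i+1 if no such d exists."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


# One row per target character c, one entry per pattern position i.
for c in "XYZ?":
    print(c, [kmp_skip("XYXYZ", i, c) for i in range(5)])
```

A skip of 0 means the character matched, so the search simply advances to the next pattern position.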
Knuth-Morris-Pratt
 For pattern ABCD:
            A   B   C   D
     A      0   1   2   3
     B      1   0   3   4
     C      1   2   0   4
     D      1   2   3   0
     other  1   2   3   4
Knuth-Morris-Pratt
 For pattern XYXYZ:
            X   Y   X   Y   Z
     X      0   1   0   3   2
     Y      1   0   3   0   5
     Z      1   2   3   4   0
     other  1   2   3   4   5
Knuth-Morris-Pratt
Function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
KMPskiparray <- ComputeKMPskiparray(p)
k <- 0
i <- 0
While k <= Length(t) - Length(p) do
   if i == Length(p) then return k
   d <- KMPskiparray[i, t[k+i]]
   k <- k + d
   i <- i + 1 - d
Return -1
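A runnable Python sketch of this search loop follows. It assumes the skip table is built by brute force from the KMPskip definition and stored in a dictionary keyed by (position, character), with characters that do not occur in the pattern falling back to i+1; the names `kmp_skip` and `kmp_search` are illustrative.

```python
def kmp_skip(p, i, c):
    """Skip distance from the KMPskip definition (brute-force evaluation)."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def kmp_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    # Precompute entries for characters that occur in p; any other
    # character at position i gives a skip of i + 1.
    skip = {(i, c): kmp_skip(p, i, c) for i in range(len(p)) for c in set(p)}
    k = i = 0
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        d = skip.get((i, t[k + i]), i + 1)
        k += d          # slide the pattern d boxes to the right
        i += 1 - d      # positions 0..i-d of the pattern are known to match
    return -1


print(kmp_search("XYXYZ", "XYXYXYZ"))  # 2
```

Each iteration moves k+i forward by exactly one, so the loop examines every target character at most once.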
The Boyer-Moore Algorithm
Similar to KMP in that:
  Pattern compared against target
  On mismatch, move as far to right as possible
Different from KMP in that:
  Pattern is compared from right to left instead
   of left to right
Does that make a difference?
  Yes!! – much faster on long targets; many
   characters in target string are never examined
   at all
Boyer-Moore example
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
p[0]  p[1]  p[2]  p[3]
A     B     C     D
                  N


There is no E in the pattern: thus the pattern can’t match if any of its characters lie
under t[3]. So, move four boxes to the right.
Boyer-Moore example
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                        p[0]  p[1]  p[2]  p[3]
                        A     B     C     D
                                          N


Again, no match. But the mismatching target character, B, occurs in the pattern at p[1].
So move two boxes to the right, which lines p[1] up with that B.
Boyer-Moore example
t[0]  t[1]  t[2]  t[3]  t[4]  t[5]  t[6]  t[7]  t[8]  t[9]  t[10]
A     B     C     E     F     G     A     B     C     D     E
                                    p[0]  p[1]  p[2]  p[3]
                                    A     B     C     D
                                    Y     Y     Y     Y
Boyer-Moore : another example
t[k]    t[k+1]  …               t[k+i]                      t[k+m-1]
                …               c       E       …   R       G
p[0]    p[1]    …       p[i-1]  p[i]    p[i+1]  …           p[m-1]
L       E       …       S       D       E       …   R       G
                                N       Y       Y   Y       Y

Problem: determine d, the number of boxes that the pattern can be moved to
the right.


d should be smallest integer such that t[k+m-1]= p[m-1-d], t[k+m-2] = p[m-2-d],
… t[k+i] = p[i-d]
The Boyer-Moore Algorithm
 We said:
  d should be smallest integer such that:
      t[k+m-1] = p[m-1-d]
      t[k+m-2] = p[m-2-d]
      …
      t[k+i] = p[i-d]
  Reminder:
      k = starting index in target string
      m = length of pattern
      i = index of mismatch in pattern string
  Problem: statement is valid only for d<= i
      Need to ensure that we don’t “fall off” the left edge of the
       pattern
Boyer-Moore : another example
t[k]                                    t[k+5]                  t[k+8]
                                        c       X       Y       Z
p[0]    p[1]    p[2]    p[3]    p[4]    p[5]    p[6]    p[7]    p[8]
Y       Z       W       X       Y       Z       X       Y       Z
                                        N       Y       Y       Y


If c == W, then d should be 3

If c == R, then d should be 7
BMSkip
 Let m = |p|
 For any character c and any i such that 0 <= i <
  m, define BMSkip(p,i,c) to be:
  The amount the pattern can move to the right when
   characters i+1 through m-1 of the pattern match
   corresponding characters in the target but p[i] doesn’t
   match character c.
 Then BMSkip(p,i,c) should return the smallest
  d such that:
  p[j] = p[j-d] for all j such that max(i+1,d) <= j <= m-1, and
  p[i-d] = c if d <= i
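The two conditions above can be evaluated literally in Python. This is a minimal brute-force sketch, not an efficient table construction; the name `bm_skip` is illustrative, and the loop starts at d = 1 on the assumption that the function is only called on a mismatch (c differs from p[i]), which rules out d = 0.

```python
def bm_skip(p, i, c):
    """Smallest d such that p[j] == p[j-d] for max(i+1, d) <= j <= m-1,
    and p[i-d] == c whenever d <= i. Assumes c != p[i]."""
    m = len(p)
    for d in range(1, m):
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        if suffix_ok and (d > i or p[i - d] == c):
            return d
    # d = m always satisfies both conditions: the pattern moves past the window.
    return m


p = "YZWXYZXYZ"
print(bm_skip(p, 5, "W"), bm_skip(p, 5, "R"))  # 3 7
```

The printed values agree with the preceding example: a mismatch against W at position 5 allows a shift of 3, and a mismatch against R allows a shift of 7.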
    Boyer-Moore
      For pattern ABCD:
      Columns: the position in the pattern where the mismatch occurs
      Rows: the mismatching character in the target
      Entries: skip this many spaces

                   A   B   C   D
            A      -   4   4   3
            B      4   -   4   2
            C      4   4   -   1
            D      4   4   4   -
            other  4   4   4   4
    Boyer-Moore
      For pattern XYXYZ:
      Columns: the position in the pattern where the mismatch occurs
      Rows: the mismatching character in the target
      Entries: skip this many spaces

                   X   Y   X   Y   Z
            X      -   5   -   5   2
            Y      5   -   5   -   1
            Z      5   5   5   5   -
            other  5   5   5   5   5
Note:
Entries in the Boyer-Moore arrays are
 generally larger than with KMP; thus, the
 pattern will move faster
The table is not consulted on a match (hence the
 blank entries)
BMSearch
Function BMSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
   BMSkiparray <- ComputeBMSkipArray(p)
   k <- 0
   while k <= Length(t) - Length(p) do
         i <- Length(p) - 1
         while i >= 0 and p[i] = t[k+i] do
                    i <- i - 1
         if i = -1 then return k
         k <- k + BMSkiparray[i,t[k+i]]
   return -1
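The same search can be sketched in runnable Python. The skip table here is filled in by brute force from the skip definition given earlier, stored in a dictionary keyed by (position, character), with a separate `other` list for characters that do not occur in the pattern; the names `bm_skip`, `bm_search`, `skip`, and `other` are assumptions made for this illustration.

```python
def bm_skip(p, i, c):
    """Skip distance from the definition; c=None stands for any
    character that does not occur in the pattern."""
    m = len(p)
    for d in range(1, m):
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        if suffix_ok and (d > i or p[i - d] == c):
            return d
    return m


def bm_search(p, t):
    """Find p in t; return its location, or -1 if p is not a substring of t."""
    m = len(p)
    # Entries are only needed for mismatches, so pairs with c == p[i] are omitted.
    skip = {(i, c): bm_skip(p, i, c)
            for i in range(m) for c in set(p) if c != p[i]}
    other = [bm_skip(p, i, None) for i in range(m)]
    k = 0
    while k <= len(t) - m:
        i = m - 1
        # Compare right to left until a mismatch or the whole pattern matches.
        while i >= 0 and p[i] == t[k + i]:
            i -= 1
        if i == -1:
            return k
        k += skip.get((i, t[k + i]), other[i])
    return -1


print(bm_search("ABCD", "ABCEFGABCDE"))  # 6
```

On the example from the earlier slides this makes only three alignments (k = 0, 4, 6), so several target characters are never examined.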
The Karp-Rabin Algorithm Idea
Karp & Rabin found an algorithm which is:
   almost as fast as Boyer-Moore
   simple enough to understand easily
   can be adapted for 2-dimensional searches for
    patterns in pictures
Go back to the brute force idea, but now use a
single number to represent the word you are
searching for, and a single number for the current
portion of the document you are comparing
against.
The Karp-Rabin Algorithm
 Suppose we are searching for 4-letter words. Then the
  whole (English) word fits in one (computer) word w of 4
  bytes. If the current 4 bytes of the document are also in
  one word d, a single comparison can match the two in
  one step. To move along the document, shift d and add
  in the next character.
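 For longer words, use hashing. The characters of the
  word and the document are combined into single hash
  numbers wh and dh. The hash number dh can be
  updated in constant time by removing the contribution of
  the character that leaves the window and adding in the code
  for the next character. Equal hash numbers only suggest a
  match, so the characters are then compared to confirm it.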
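The hashing idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the slides do not specify a hash function, so the polynomial hash, the `base` and `mod` values, and the name `karp_rabin_search` are all illustrative choices. The names `wh` and `dh` follow the slide.

```python
def karp_rabin_search(word, doc, base=256, mod=1_000_003):
    """Return the first index of word in doc, or -1, using a rolling hash."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    wh = dh = 0
    for j in range(m):
        wh = (wh * base + ord(word[j])) % mod
        dh = (dh * base + ord(doc[j])) % mod
    for k in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a real comparison.
        if wh == dh and doc[k:k + m] == word:
            return k
        if k < n - m:
            # Drop doc[k] from the hash and add in doc[k + m].
            dh = ((dh - ord(doc[k]) * high) * base + ord(doc[k + m])) % mod
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))
```

The explicit comparison after a hash match keeps the result correct even when two different strings happen to hash to the same number.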

						