# String Searching


```
String Searching
CSCI 2720
Spring 2007
Eileen Kraemer
String Search
A common word processor facility is to search
for a given word in a document. Generally, the
problem is to search for occurrences of a short
string in a long string.

Example: find the word “the” in the document “Do the first then do the other one”
History of String Search
The brute force algorithm:
invented in the dawn of computer history
re-invented many times, still common
Knuth & Pratt invented a better one in 1970
invented independently by Morris
published 1976 as “Knuth-Morris-Pratt”
Boyer & Moore found a better one before 1976
found independently by Gosper
Karp & Rabin found a “better” one in 1980
 The obvious algorithm is to try the word at each possible
place, and compare all the characters:
for i := 0 to n-m do                (doc length n)
    for j := 0 to m-1 do            (word length m)
        compare word[j] with doc[i+j]
        if not equal, exit the inner loop
 The complexity is at worst O(m*n) and best
O(n).
Improving String Search

Surprisingly, there is a faster algorithm
where you compare the last characters first:
Do the first then do the other one
the
  ^ compare ‘e’ with ‘ ’: fail, and ‘ ’ is not in the word, so move along 3 places

When the document character under ‘e’ is a ‘t’ instead, “the” can only move along 2 places
Improved string search, continued

In every case where the document
character is not one of the characters in
the word, we can move along m places.
Sometimes, it is less.
Problem Definition, terminology
 Let p be the pattern string
 Let t be the target string
 Let k be the index of the character in the target
string that “lies over” the first character of the
pattern
 Given two strings, p and t, over the alphabet Σ,
determine whether p occurs as a substring of t
 That is, determine whether there exists k such
that p = Substring(t,k,|p|).
Straightforward string searching
function SimpleStringSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    for k from 0 to Length(t) - Length(p) do
        i <- 0
        while i < Length(p) and p[i] = t[k+i] do
            i <- i+1
        if i == Length(p) then return k
    return -1
SimpleStringSearch example
         t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
t:        A    B    C    E    F    G    A    B    C    D    E

k = 0:    A    B    C    D   (Y Y Y N)
k = 1:         A    B    C    D   (N)
k = 2:              A    B    C    D   (N)
k = 3:                   A    B    C    D   (N)
k = 4:                        A    B    C    D   (N)
k = 5:                             A    B    C    D   (N)
k = 6:                                  A    B    C    D   (Y Y Y Y)

Each row shows the pattern ABCD placed at position k; Y/N are the results of the
comparisons p[i] = t[k+i], left to right. The search returns k = 6.
Straightforward string searching
Worst case:
Pattern string always matches completely except for last
character
Example: search for XXXXXXY in target string of
XXXXXXXXXXXXXXXXXXXX
Outer loop executed once for every character in target
string
Inner loop executed once for every character in pattern
Θ(|p| * |t|)
 Okay if patterns are short, but better algorithms
exist
Knuth-Morris-Pratt

Improves on the Θ(|p| * |t|) worst case of the straightforward search
Key idea:
 if pattern fails to match, slide pattern to right by
as many boxes as possible without permitting a
match to go unnoticed
Knuth-Morris-Pratt
         t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
t:        X    Y    X    Y    X    Y    c

p:        X    Y    X    Y    Z
          Y    Y    Y    Y    N

p:                  X    Y    X    Y    Z
                    Y    Y    Y    Y    ?
Knuth-Morris-Pratt

Correct motion of pattern depends on both
location of mismatch and the mismatching
character
If c == X : move 2 boxes to right
If c == E : move 5 boxes to right
If c == Z : target found; alg terminates
Knuth-Morris-Pratt

Goal: determine d, the number of boxes to the
right the pattern should move; the smallest d such
that:
p[0] = t[k+d]
p[1] = t[k+d+1]
p[2] = t[k+d+2]
…
p[i-d] = t[k+i]
Knuth-Morris-Pratt

Note: can be stated largely in terms of
pattern alone.
Value of d depends only on:
The pattern
The value of i
The mismatching character c (at t[k+i])
Knuth-Morris-Pratt
 Can define a function KMPskip(p,i,c) to give
correct d
Return the smallest integer d such that 0 <= d <= i,
p[i-d] == c, and p[j] == p[j+d] for each 0 <= j <= i-d-1
Return i+1 if no such d exists

 Calculate all values of KMPskip for pattern p and
store it in KMPskiparray
 do lookup at each mismatch
Knuth-Morris-Pratt
 For pattern ABCD:
(rows: character c in the target; columns: position i in the pattern)

         A   B   C   D
A        0   1   2   3
B        1   0   3   4
C        1   2   0   4
D        1   2   3   0
other    1   2   3   4
Knuth-Morris-Pratt
 For pattern XYXYZ:
(rows: character c in the target; columns: position i in the pattern)

         X   Y   X   Y   Z
X        0   1   0   3   2
Y        1   0   3   0   5
Z        1   2   3   4   0
other    1   2   3   4   5
Knuth-Morris-Pratt
function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    KMPskiparray <- ComputeKMPskiparray(p)
    k <- 0
    i <- 0
    while k <= Length(t) - Length(p) do
        if i == Length(p) then return k
        d <- KMPskiparray[i, t[k+i]]
        k <- k + d
        i <- i + 1 - d
    return -1
The Boyer-Moore Algorithm
Similar to KMP in that:
Pattern compared against target
On mismatch, move as far to right as possible
Different from KMP in that:
Compare the pattern from right to left instead
of left to right
Does that make a difference?
Yes!! – much faster on long targets; many
characters in target string are never examined
at all
Boyer-Moore example
         t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
t:        A    B    C    E    F    G    A    B    C    D    E

k = 0:    A    B    C    D
                         N

There is no E in the pattern: thus the pattern can’t match if any characters lie
under t[3]. So, move four boxes to the right.

k = 4:                        A    B    C    D
                                             N

Again, no match. But there is a B in the pattern. So move two boxes to the
right.

k = 6:                                  A    B    C    D
                                        Y    Y    Y    Y
Boyer-Moore : another example
          t[k]   t[k+1]   …            t[k+i]            …       t[k+m-1]
target:                   …            c        E        …   R   G
          p[0]   p[1]     …   p[i-1]   p[i]     p[i+1]   …       p[m-1]
pattern:  L      E        …   S        D        E        …   R   G
match:                                 N        Y        Y   Y   Y

Problem: determine d, the number of boxes that the pattern can be moved to
the right.

d should be smallest integer such that t[k+m-1]= p[m-1-d], t[k+m-2] = p[m-2-d],
… t[k+i] = p[i-d]
The Boyer-Moore Algorithm
 We said:
d should be the smallest integer such that:
 t[k+m-1] = p[m-1-d]
 t[k+m-2] = p[m-2-d]
 …
 t[k+i] = p[i-d]
Reminder:
 k = starting index in target string
 m = length of pattern
 i = index of mismatch in pattern string
Problem: statement is valid only for d<= i
 Need to ensure that we don’t “fall off” the left edge of the
pattern
Boyer-Moore : another example
          t[k]                               t[k+5]               t[k+8]
target:                                      c      X      Y      Z
          p[0]   p[1]   p[2]   p[3]   p[4]   p[5]   p[6]   p[7]   p[8]
pattern:  Y      Z      W      X      Y      Z      X      Y      Z
match:                                       N      Y      Y      Y

If c == W, then d should be 3

If c == R, then d should be 7
BMPSkip
 Let m = |p|
 For any character c and any i such that 0<= i <
m , define BMPSkip(p,i,c) to be:
The amount the pattern can move to the right when
characters i+1 through m-1 of the pattern match
corresponding characters in the target but p[i] doesn’t
match character c.
 Then BMPSkip(p,i,c) should return the smallest
d such that:
p[j] = p[j-d] for all j such that max(i+1,d) <= j <= m-1, and
p[i-d] = c if d <= i
Boyer-Moore
 For pattern ABCD:
Columns: the position in the pattern, shown by the character at that position.
Rows: the mismatching character in the target.
Entries: skip this many spaces.

         A   B   C   D
A        -   4   4   3
B        4   -   4   2
C        4   4   -   1
D        4   4   4   -
other    4   4   4   4
Boyer-Moore
 For pattern XYXYZ:
Columns: the position in the pattern, shown by the character at that position.
Rows: the mismatching character in the target.
Entries: skip this many spaces.

         X   Y   X   Y   Z
X        -   5   -   5   2
Y        5   -   5   -   1
Z        5   5   5   5   -
other    5   5   5   5   5
Note:

entries in the Boyer-Moore arrays are
generally larger than with KMP; thus, the
pattern will move faster
Table not consulted on a match (thus, the
blank entries)
BMSearch
function BMSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    BMSkiparray <- ComputeBMSkipArray(p)
    k <- 0
    while k <= Length(t) - Length(p) do
        i <- Length(p) - 1
        while i >= 0 and p[i] = t[k+i] do
            i <- i - 1
        if i = -1 then return k
        k <- k + BMSkiparray[i,t[k+i]]
    return -1
The Karp-Rabin Algorithm Idea
Karp & Rabin found an algorithm which is:
almost as fast as Boyer-Moore
simple enough to understand easily
can be adapted for 2-dimensional searches for
patterns in pictures
Go back to the brute force idea, but now use a
single number to represent the word you are
searching for, and a single number for the current
portion of the document you are comparing
against.
The Karp-Rabin Algorithm
 Suppose we are searching for 4-letter words. Then the
whole (English) word fits in one (computer) word w of 4
bytes. If the current 4 bytes of the document are also in
one word d, a single comparison can match the two in
one step. To move along the document, shift d and add
in the next character.
 For longer words, use hashing. The characters of the
word and the document are combined into single hash
numbers wh and dh. The hash number dh can be
updated by doing a suitable sum and adding in the code
for the next character.

```
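The Karp-Rabin slides describe the hashing idea without pseudocode, so here is a minimal Python sketch of one way it could look. The function name `karp_rabin_search`, the polynomial hash, and the `base` and `mod` values are assumptions made for illustration; the slides do not specify a hash function. The names `wh` and `dh` follow the slides. Because two different strings can share a hash, the sketch confirms each hash hit with a direct comparison.

```python
def karp_rabin_search(word: str, doc: str,
                      base: int = 256, mod: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of word in doc, or -1.

    Illustrative sketch: base and mod are arbitrary choices, not values
    taken from the slides.
    """
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Weight of the leftmost character in a window of length m.
    high = pow(base, m - 1, mod)

    # wh: hash of the word; dh: hash of the current window of the document.
    wh = dh = 0
    for j in range(m):
        wh = (wh * base + ord(word[j])) % mod
        dh = (dh * base + ord(doc[j])) % mod

    for i in range(n - m + 1):
        # Equal hashes only suggest a match; compare the characters to
        # rule out a collision.
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Slide the window: drop doc[i], shift, add in doc[i + m].
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % mod
    return -1
```

For the examples used earlier in the slides, `karp_rabin_search("the", "Do the first then do the other one")` returns 3 and `karp_rabin_search("ABCD", "ABCEFGABCDE")` returns 6. Updating `dh` takes constant time per position, which is what makes the approach competitive with the brute force loop it is based on.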