String Searching
CSCI 2720
Spring 2007
Eileen Kraemer
String Search
A common word processor facility is to search
for a given word in a document. Generally, the
problem is to search for occurrences of a short
string in a long string.
Example: find the word "the" in the document "Do the first then do the other one".
History of String Search
The brute force algorithm:
invented in the dawn of computer history
re-invented many times, still common
Knuth & Pratt invented a better one in 1970
invented independently by Morris
published 1977 as “Knuth-Morris-Pratt”
Boyer & Moore found a better one before 1976
found independently by Gosper
Karp & Rabin found a “better” one in 1980
The obvious algorithm is to try the word at each possible
place, and compare all the characters:
  for i := 0 to n-m do                 (doc length n)
    for j := 0 to m-1 do               (word length m)
      compare word[j] with doc[i+j]
      if not equal, exit the inner loop
The complexity is at worst O(m*n) and at best O(n).
Improving String Search
Surprisingly, there is a faster algorithm
where you compare the last characters first:
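  Do the first then do the other one
  the
compare ‘e’ with ‘ ’: they differ, and ‘ ’ is not in the word, so move along 3 places
  Do the first then do the other one
   the
when the document character under the ‘e’ is a ‘t’, which is in the word, we can only move along 2 places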
Improved string search, continued
In every case where the document
character is not one of the characters in
the word, we can move along m places.
Sometimes, it is less.
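The last-character rule can be sketched in Python as follows. This is a minimal sketch of the Horspool simplification of Boyer-Moore, which may differ from the full method developed later in these slides. The names `last_char_shifts` and `search_last_first` are illustrative assumptions, not from the slides.

```python
def last_char_shifts(word):
    """Map each character of word[:-1] to how far the word may move
    when that character sits under the word's last position."""
    m = len(word)
    # Later occurrences overwrite earlier ones, so the smallest safe shift wins.
    return {ch: m - 1 - j for j, ch in enumerate(word[:-1])}


def search_last_first(word, doc):
    """Return the first index of word in doc, or -1 (last characters compared first)."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    shifts = last_char_shifts(word)
    k = 0
    while k <= n - m:
        j = m - 1
        while j >= 0 and word[j] == doc[k + j]:
            j -= 1
        if j < 0:
            return k
        # A document character that is not in the word allows a full move of m places.
        k += shifts.get(doc[k + m - 1], m)
    return -1


print(last_char_shifts("the"))   # {'t': 2, 'h': 1}
print(search_last_first("the", "Do the first then do the other one"))   # 3
```

For the word "the", a ‘t’ under the last position allows a move of 2 places, an ‘h’ a move of 1, and any other character a move of 3.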
Problem Definition, terminology
Let p be the pattern string
Let t be the target string
Let k be the index of the character in the target
string that “lies over” the first character of the
pattern
Given two strings, p and t, over the alphabet Σ,
determine whether p occurs as a substring of t
That is, determine whether there exists k such
that p=Substring(t,k,|p|).
Straightforward string searching
function SimpleStringSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
  for k from 0 to Length(t) - Length(p) do
    i <- 0
    while i < Length(p) and p[i] = t[k+i] do
      i <- i+1
    if i == Length(p) then return k
  return -1
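A direct Python transcription of SimpleStringSearch, as a minimal sketch; the snake_case name `simple_string_search` is an assumption made for this example.

```python
def simple_string_search(p, t):
    """Find p in t; return its location or -1 if p is not a substring of t."""
    for k in range(len(t) - len(p) + 1):   # every alignment of p against t
        i = 0
        while i < len(p) and p[i] == t[k + i]:
            i += 1
        if i == len(p):                    # every pattern character matched
            return k
    return -1


print(simple_string_search("ABCD", "ABCEFGABCDE"))   # 6
```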
SimpleStringSearch
Trace for p = ABCD in t = ABCEFGABCDE (Y = characters match, N = mismatch):
      t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
       A    B    C    E    F    G    A    B    C    D    E
k=0    A    B    C    D
       Y    Y    Y    N
k=1         A    B    C    D
            N
k=2              A    B    C    D
                 N
k=3                   A    B    C    D
                      N
k=4                        A    B    C    D
                           N
k=5                             A    B    C    D
                                N
k=6                                  A    B    C    D
                                     Y    Y    Y    Y
Straightforward string searching
Worst case:
Pattern string always matches completely except for last
character
Example: search for XXXXXXY in target string of
XXXXXXXXXXXXXXXXXXXX
Outer loop executed about once for every character in target
string (|t| - |p| + 1 times)
Inner loop executed once for every character in pattern
Running time: Θ(|p| * |t|)
Okay if patterns are short, but better algorithms
exist
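To make the worst case concrete, the sketch below counts character comparisons made by the straightforward search. The helper name `count_comparisons` is an assumption made for this example, and the target is taken to be twenty X's.

```python
def count_comparisons(p, t):
    """Return (location, number of character comparisons) for the simple search."""
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p):
            comparisons += 1
            if p[i] != t[k + i]:
                break
            i += 1
        if i == len(p):
            return k, comparisons
    return -1, comparisons


# Worst case: 14 alignments, each failing only on the 7th character.
print(count_comparisons("X" * 6 + "Y", "X" * 20))   # (-1, 98)
# Best case: every alignment fails on its first character.
print(count_comparisons("ABCD", "Z" * 20))          # (-1, 17)
```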
Knuth-Morris-Pratt
Improves on the Θ(|p| * |t|) worst case of the straightforward search
Key idea:
if pattern fails to match, slide pattern to right by
as many boxes as possible without permitting a
match to go unnoticed
Knuth-Morris-Pratt
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
 X    Y    X    Y    X    Y    c
p[0] p[1] p[2] p[3] p[4]
 X    Y    X    Y    Z
 Y    Y    Y    Y    N
          p[0] p[1] p[2] p[3] p[4]      (pattern moved 2 boxes to the right)
           X    Y    X    Y    Z
           Y    Y    Y    Y    ?
Knuth-Morris-Pratt
Correct motion of pattern depends on both
location of mismatch and the mismatching
character
If c == X : move 2 boxes to right
If c == E : move 5 boxes to right (E does not occur in the pattern)
If c == Z : match found; algorithm terminates
Knuth-Morris-Pratt
Goal: determine d, number of boxes to
right pattern should move; smallest d such
that:
p[0] = t[k+d]
p[1] = t[k+d+1]
p[2] = t[k+d+2]
…
p[i-d] = t[k+i]
Knuth-Morris-Pratt
Note: can be stated largely in terms of
pattern alone.
Value of d depends only on:
The pattern
The value of i
The mismatching character c (at t[k+i])
Knuth-Morris-Pratt
Can define a function KMPskip(p,i,c) to give
correct d
Return smallest integer d such that 0 <= d <= i,
p[i-d] == c and p[j] == p[j+d] for each 0 <= j <= i-d-1
Return i+1 if no such d exists
Calculate all values of KMPskip for pattern p and
store them in KMPskiparray
do lookup at each mismatch
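The definition of KMPskip can be evaluated directly. The following is a minimal, unoptimized sketch that follows the definition literally. Storing the table as one dictionary per pattern position, with characters outside the pattern handled by the caller, is an assumption made for this example.

```python
def kmp_skip(p, i, c):
    """Smallest d in 0..i with p[i-d] == c and p[j] == p[j+d] for 0 <= j <= i-d-1,
    or i+1 if no such d exists."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def compute_kmp_skip_array(p):
    """One dict per position i, keyed by the characters that occur in p.
    Any other character skips i+1 boxes."""
    return [{c: kmp_skip(p, i, c) for c in set(p)} for i in range(len(p))]


for c in "XYZ":
    print(c, [kmp_skip("XYXYZ", i, c) for i in range(5)])
# X [0, 1, 0, 3, 2]
# Y [1, 0, 3, 0, 5]
# Z [1, 2, 3, 4, 0]
```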
Knuth-Morris-Pratt
For pattern ABCD (columns: position in the pattern; rows: character c in the target):
        A  B  C  D
A       0  1  2  3
B       1  0  3  4
C       1  2  0  4
D       1  2  3  0
other   1  2  3  4
Knuth-Morris-Pratt
For pattern XYXYZ (columns: position in the pattern; rows: character c in the target):
        X  Y  X  Y  Z
X       0  1  0  3  2
Y       1  0  3  0  5
Z       1  2  3  4  0
other   1  2  3  4  5
Knuth-Morris-Pratt
Function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
  KMPskiparray <- ComputeKMPskiparray(p)
  k <- 0
  i <- 0
  while k <= Length(t) - Length(p) do
    if i == Length(p) then return k
    d <- KMPskiparray[i, t[k+i]]
    k <- k + d
    i <- i + 1 - d
  return -1
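A runnable Python sketch of KMPSearch. It builds the skip table by evaluating the KMPskip definition literally, which is simple but not the fastest way to do the preprocessing. The dictionary-per-position table and the snake_case names are assumptions made for this example.

```python
def kmp_skip(p, i, c):
    """KMPskip(p, i, c) evaluated straight from its definition."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def kmp_search(p, t):
    """Find p in t; return its location or -1 if p is not a substring of t."""
    m = len(p)
    skip = [{c: kmp_skip(p, i, c) for c in set(p)} for i in range(m)]
    k = i = 0
    while k <= len(t) - m:
        if i == m:
            return k
        # Characters that never occur in p take the "other" entry, i + 1.
        d = skip[i].get(t[k + i], i + 1)
        k += d             # slide the pattern d boxes to the right
        i = i + 1 - d      # d == 0 means a match: advance within the pattern
    return -1


print(kmp_search("XYXYZ", "XYXYXYZ"))    # 2
print(kmp_search("ABCD", "ABCEFGABCDE")) # 6
```

Each target character is looked at once, so after preprocessing the search loop runs at most Length(t) + 1 times.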
The Boyer-Moore Algorithm
Similar to KMP in that:
Pattern compared against target
On mismatch, move as far to right as possible
Different from KMP in that:
Compares the pattern from right to left instead
of left to right
Does that make a difference?
Yes!! Much faster on long targets; many
characters in target string are never examined
at all
Boyer-Moore example
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
 A    B    C    E    F    G    A    B    C    D    E
p[0] p[1] p[2] p[3]
 A    B    C    D
                N
There is no E in the pattern: thus the pattern can’t match if any characters lie
under t[3]. So, move four boxes to the right.
Boyer-Moore example
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
 A    B    C    E    F    G    A    B    C    D    E
                    p[0] p[1] p[2] p[3]
                     A    B    C    D
                                    N
Again, no match. But there is a B in the pattern. So move two boxes to the
right.
Boyer-Moore example
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
 A    B    C    E    F    G    A    B    C    D    E
                              p[0] p[1] p[2] p[3]
                               A    B    C    D
                               Y    Y    Y    Y
Boyer-Moore : another example
t[k]  t[k+1]  …            t[k+i]  t[k+i+1]  …  t[k+m-2]  t[k+m-1]
              …            c       E         …  R         G
p[0]  p[1]    …   p[i-1]   p[i]    p[i+1]    …  p[m-2]    p[m-1]
L     E       …   S        D       E         …  R         G
                           N       Y         Y  Y         Y
Problem: determine d, the number of boxes that the pattern can be moved to
the right.
d should be smallest integer such that t[k+m-1] = p[m-1-d], t[k+m-2] = p[m-2-d],
… t[k+i] = p[i-d]
The Boyer-Moore Algorithm
We said:
d should be smallest integer such that:
t[k+m-1] = p[m-1-d]
t[k+m-2] = p[m-2-d]
…
t[k+i] = p[i-d]
Reminder:
k = starting index in target string
m = length of pattern
i = index of mismatch in pattern string
Problem: statement is valid only for d<= i
Need to ensure that we don’t “fall off” the left edge of the
pattern
Boyer-Moore : another example
t[k]                               t[k+5]               t[k+8]
                                    c      X      Y      Z
p[0]   p[1]   p[2]   p[3]   p[4]   p[5]   p[6]   p[7]   p[8]
 Y      Z      W      X      Y      Z      X      Y      Z
                                    N      Y      Y      Y
If c == W, then d should be 3
If c == R, then d should be 7
BMPSkip
Let m = |p|
For any character c and any i such that 0<= i <
m , define BMPSkip(p,i,c) to be:
The amount the pattern can move to the right when
characters i+1 through m-1 of the pattern match
corresponding characters in the target but p[i] doesn’t
match character c.
Then BMPSkip(p,i,c) should return the smallest
d such that:
p[j] = p[j-d] for all j such that max(i+1,d) <= j <= m-1, and
p[i-d] = c if d <= i
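The definition of BMPSkip can be checked by evaluating it literally. The following is a minimal, unoptimized sketch. The function name `bm_skip` is an assumption, and d starts at 1 because the function is only consulted on a mismatch, where c differs from p[i].

```python
def bm_skip(p, i, c):
    """Smallest d >= 1 such that p[j] == p[j-d] for max(i+1, d) <= j <= m-1,
    and p[i-d] == c whenever d <= i."""
    m = len(p)
    for d in range(1, m):
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        char_ok = d > i or p[i - d] == c
        if suffix_ok and char_ok:
            return d
    return m   # moving the whole pattern past the window is always safe


print(bm_skip("YZWXYZXYZ", 5, "W"))   # 3
print(bm_skip("YZWXYZXYZ", 5, "R"))   # 7
print(bm_skip("ABCD", 3, "B"))        # 2
```

The first two calls reproduce the values d = 3 and d = 7 from the YZWXYZXYZ example.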
Boyer-Moore
For pattern ABCD (columns: position of the mismatch in the pattern; rows: mismatching character in the target; entries: how many spaces to skip):
        A  B  C  D
A       -  4  4  3
B       4  -  4  2
C       4  4  -  1
D       4  4  4  -
other   4  4  4  4
Boyer-Moore
For pattern XYXYZ (columns: position of the mismatch in the pattern; rows: mismatching character in the target; entries: how many spaces to skip):
        X  Y  X  Y  Z
X       -  5  -  5  2
Y       5  -  5  -  1
Z       5  5  5  5  -
other   5  5  5  5  5
Note:
entries in the Boyer-Moore arrays are
generally larger than with KMP; thus, the
pattern will move faster
Table not consulted on a match (thus, the
blank entries)
BMSearch
Function BMSearch(string p,t): int
{Find p in t; return its location or -1 if p is not a substring of t}
  BMSkiparray <- ComputeBMSkipArray(p)
  k <- 0
  while k <= Length(t) - Length(p) do
    i <- Length(p) - 1
    while i >= 0 and p[i] = t[k+i] do
      i <- i - 1
    if i = -1 then return k
    k <- k + BMSkiparray[i,t[k+i]]
  return -1
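A runnable Python sketch of BMSearch. The skip table is filled by evaluating the skip definition literally, so the preprocessing here is simple rather than fast. The table layout (one dictionary per position plus a separate "other" row) and the snake_case names are assumptions made for this example.

```python
def bm_skip(p, i, c):
    """Skip distance on a mismatch at pattern position i against target character c."""
    m = len(p)
    for d in range(1, m):
        suffix_ok = all(p[j] == p[j - d] for j in range(max(i + 1, d), m))
        char_ok = d > i or p[i - d] == c
        if suffix_ok and char_ok:
            return d
    return m


def bm_search(p, t):
    """Find p in t; return its location or -1 if p is not a substring of t."""
    m = len(p)
    skip = [{c: bm_skip(p, i, c) for c in set(p) if c != p[i]} for i in range(m)]
    # "other" row: None never equals a pattern character.
    other = [bm_skip(p, i, None) for i in range(m)]
    k = 0
    while k <= len(t) - m:
        i = m - 1
        while i >= 0 and p[i] == t[k + i]:   # compare right to left
            i -= 1
        if i == -1:
            return k
        k += skip[i].get(t[k + i], other[i])
    return -1


print(bm_search("ABCD", "ABCEFGABCDE"))   # 6
```

On this input the pattern is tried at k = 0, k = 4 and k = 6 only, matching the three-step example for ABCD.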
The Karp-Rabin Algorithm Idea
Karp & Rabin found an algorithm which is:
almost as fast as Boyer-Moore
simple enough to understand easily
can be adapted for 2-dimensional searches for
patterns in pictures
Go back to the brute force idea, but now use a
single number to represent the word you are
searching for, and a single number for the current
portion of the document you are comparing
against.
The Karp-Rabin Algorithm
Suppose we are searching for 4-letter words. Then the
whole (English) word fits in one (computer) word w of 4
bytes. If the current 4 bytes of the document are also in
one word d, a single comparison can match the two in
one step. To move along the document, shift d and add
in the next character.
For longer words, use hashing. The characters of the
word and the document are combined into single hash
numbers wh and dh. The hash number dh can be
updated by subtracting the contribution of the character that
leaves the window and adding in the code for the next character.
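The hashing idea can be sketched in Python with a polynomial rolling hash. This is a minimal sketch, not necessarily the exact hash Karp and Rabin used. The `base` and `modulus` values are arbitrary assumptions, and the names `wh` and `dh` follow the text above.

```python
def karp_rabin_search(word, doc, base=256, modulus=1_000_003):
    """Return the first index of word in doc, or -1, using a rolling hash."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)   # weight of the window's leftmost character
    wh = dh = 0
    for j in range(m):
        wh = (wh * base + ord(word[j])) % modulus
        dh = (dh * base + ord(doc[j])) % modulus
    for k in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a real comparison.
        if wh == dh and doc[k:k + m] == word:
            return k
        if k < n - m:
            # Drop doc[k] from the window, shift, and add in doc[k + m].
            dh = ((dh - ord(doc[k]) * high) * base + ord(doc[k + m])) % modulus
    return -1


print(karp_rabin_search("the", "Do the first then do the other one"))   # 3
```

Updating dh takes constant time per position, so the work per step does not grow with the length of the word unless the hashes happen to agree.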