# String Searching
• String searching is a critical algorithm used
– by grep in Unix/Linux to search the files in a directory for a given text sequence
– in word processors
– in bioinformatics to locate matching gene sequences
• Because the source file or array might be large, we want an efficient algorithm
– how efficient? is O(m*n) reasonable, where m = length of substring and n = length of string? No: consider a 100 Kbyte file searched for a 20-character sequence – this could require 100,000 * 20 = 2 million comparisons!
– we would prefer O(m + n)
• We look at four algorithms, where we assume
– we have a master string of length n and a substring of length m
– we want to find the index of the first occurrence of the substring in master, or return -1 if the substring is not found
Simple Search Algorithm
• The first algorithm is a straightforward search
– index j moves across the substring and the current segment of the master string starting at index i
– if(substring[j] == master[i+j]) then increment j
– if we have a mismatch, reset j to 0 and i++
• that is, start the search over again at the next character in master
```
search(sub, master) {          // find sub in master or return -1
    m = sub.length; n = master.length; i = 0;
    while (i + m <= n) {       // is there enough of master left to search?
        j = 0;                 // start at sub[0] and master[i]
        while (j < m && master[i + j] == sub[j])
            j++;               // while we are still matching, continue
        if (j >= m) return i;  // did we reach the end of sub? if so, return i
        else i++;              // otherwise move on to the next character
    }                          // in master and try again
    return -1;
}
```
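The pseudocode above maps directly onto a runnable sketch. This Python version (the function name is mine, not from the slides) checks `j < m` before indexing so it can never read past the end of either string:

```python
def simple_search(sub, master):
    """Return the index of the first occurrence of sub in master, or -1."""
    m, n = len(sub), len(master)
    i = 0
    while i + m <= n:                           # enough of master left to hold sub?
        j = 0
        while j < m and master[i + j] == sub[j]:
            j += 1                              # still matching, keep going
        if j >= m:
            return i                            # matched all of sub
        i += 1                                  # mismatch: restart one position later
    return -1
```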
Simple String Search Analysis
• The outer loop will iterate as many as n – m + 1 times, depending on when the substring is found (if found at all)
– we have "– m" because we don't have to search the final characters, since we would run out of master string before we had a complete match
• The inner loop will iterate based strictly on how closely the substring matches this portion of the master string
– consider matching "12345" against "020304050612345"
• we never look at more than a single character in master for the first 10 characters
– we only start working through the inner while loop of the algorithm once we reach the actual placement of the substring (O(n))
– in the worst case though, we might have to look at m – 1 items before we find a mismatch
– consider matching "AAAB" in "AAAAAAAAAAB": at each position in the master string, we match 3 'A' characters before mismatching
• O(n*m)
• The algorithm's best case is O(n), worst case is O(m * (n – m + 1))
– but since we will probably have a large master string (such as a text file) and a fairly short substring, m << n, so our worst case comes out to O(m * n)
• The average case probably won't have too many close matches, so the complexity will be closer to O(n)
Rabin-Karp
• In our simple search algorithm, we start checking each segment of master until we mismatch with the substring
– however, if there is evidence that the substring will not match at this position, we never have to start that inner-loop search
– to accomplish this, we compute the substring's "fingerprint" and the fingerprint of each segment of the master string
– if a segment's fingerprint does not equal the substring's fingerprint, there is no need to search that segment and we move on to the master string's next segment
– this algorithm has a complexity that ranges from O(m*n) in the worst case to O(m + n) in the average and best case
• What can we use for a fingerprint?
– we need to be able to determine a fingerprint quickly (O(1)) or else this approach will not improve over the straightforward approach
• If few of the segments have the same fingerprint as the substring, then instead of O(m*n) we have O(m*n*k + m), where k is the fraction of segments whose fingerprint matches and the O(m) at the end is the cost of computing the substring's fingerprint
Using Parity for a Fingerprint
• Assume that we are searching a binary string for a binary substring
– even parity exists if the binary string has an even number of 1 bits
• for instance, 010111 has even parity while 000001, 001011, and 111101 all have odd parity – note that 000000 has even parity
• Compute the parity of the substring (takes O(m)) and the parity of each length-m segment of the master string (takes O(n)), then use the straightforward string matching algorithm but precede the inner while loop with an if statement:
– if(fingerprint[i] == substring_fingerprint) while …
• we only bother to start searching at index i in the master string if the fingerprint of this portion of the master string matches the fingerprint of the substring

Example: Search for 010111 in 0010110101001010011
Substring fingerprint = 1 (even parity)
Fingerprints of master's length-6 segments:
Character:   0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 0 1 1
Fingerprint: 0 0 1 0 1 0 1 0 1 0 1 1 0 0 x x x x x
x indicates non-applicable starting points (too little of master remains)
the locations whose fingerprint equals 1 (the substring's fingerprint) are the only ones where we have to start searching
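The parity idea can be sketched in Python (function names are mine; for clarity each window's parity is recomputed in O(m) here, though it could be maintained in O(1) as the window slides, which is what the slides assume):

```python
def parity(bits):
    # fingerprint 1 when the number of 1 bits is even (the slides' convention)
    return 1 if bits.count("1") % 2 == 0 else 0

def parity_search(sub, master):
    """Simple search, but skip any window whose parity differs from sub's."""
    m, n = len(sub), len(master)
    fp = parity(sub)
    i = 0
    while i + m <= n:
        if parity(master[i:i + m]) == fp:       # fingerprints agree: worth comparing
            if master[i:i + m] == sub:
                return i
        i += 1
    return -1
```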
Using non-Parity Fingerprints
• Parity is either even or odd, so we might wind up having as many locations to search as to skip
– k = 50%, so our average case complexity is O(m * n * .5 + m), which is still O(m*n) – no improvement
• We want a better fingerprint, one in which we can be more certain as to whether to search or not
– we use the same idea, but rather than even or odd parity, we do a more discriminating computation for the segment starting at position i:

  f[i] = ( Σ_{j=0}^{m-1} s_{i+j} · 2^{m-1-j} ) mod q

  where s_j is the jth character of the string, q is some prime number (for instance 7), and m is the length of the substring
– if the strings are binary, this is actually computing the decimal equivalent of the m-bit binary number starting at position i, mod-ed by q, a prime number
Example
Si (bit):         0  0  1  0  1  1  0  1  0  1  0  0  1  0  1  0  0  1  1
S (window value):11 22 45 26 53 42 20 41 18 37 10 20 41 19  x  x  x  x  x
f (S mod 7):      4  1  3  5  4  0  6  6  4  2  3  6  6  5  x  x  x  x  x
• Here, the substring is 001011; since its length is 6, we compute fingerprints in master of length 6, and we mod the summations by 7
• Examples:
– f[0] (fingerprint starting at location 0): 001011 = 0*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 1*2^0 = 11, and 11 % 7 = 4
– f[5]: 101010 = 42, and 42 % 7 = 0
– f[14] isn't computed because we would run out of string before we could match the substring
• Our substring has a fingerprint of 11 % 7 = 4
– we only need to search at locations i = 0, 4 and 8
• Computing the fingerprints for the entire master string takes 6*n operations, but we can reduce this to 1*n if we are clever
– how? each fingerprint can be derived from the previous one in O(1) by a "rolling" computation
• Whether O(6*n) or O(1*n), it is still O(n)
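The "clever" 1*n computation rolls the fingerprint forward: subtract the contribution of the bit leaving the window, double what remains, add the entering bit, and reduce mod q. A Python sketch (names are mine) that reproduces the f row of the table above:

```python
def fingerprints(bits, m, q=7):
    """Fingerprints of every length-m window of a bit string, in O(n) total.

    f[i] = int(bits[i:i+m], 2) % q, computed by rolling rather than from scratch.
    """
    n = len(bits)
    high = pow(2, m - 1, q)              # weight of the bit leaving the window, mod q
    f = []
    val = int(bits[:m], 2) % q           # first window, computed directly
    for i in range(n - m + 1):
        f.append(val)
        if i + m < n:
            # drop the leading bit, shift left, bring in the next bit (all mod q)
            val = ((val - int(bits[i]) * high) * 2 + int(bits[i + m])) % q
    return f
```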
Rabin-Karp Algorithm
• Assume that fingerprints are evenly distributed across possible substring sequences; then the inner while loop will only be entered about (n – m) / q times, where q is the selected mod value
– by selecting a prime number for q, we further avoid patterns in the data (consider if q were 10 or 16 – we would find a lot more false matches)
• The inner loop only iterates while this portion of the master string and the substring continue to match
– for instance, if we are searching for "ABCDE" and the segment is "ABEDE", we stop at the third character
• Thus, we can expect in the average case a complexity of O(m + n)

```
search(substring, master) {
    compute fingerprint f (f[i] is the fingerprint starting at index i of master)
    compute the fingerprint of the substring
    i = 0; m = substring.length; n = master.length;
    while (i + m <= n) {
        j = 0;
        if (f[i] == fingerprint)    // only search where the fingerprints match
            while (j < m && master[i + j] == substring[j]) j++;
        if (j >= m) return i;
        else i++;
    }
    return -1;
}
```
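Combining the rolling fingerprints with the verification step gives a complete Rabin-Karp sketch for bit strings (names are mine; q = 7 as in the example):

```python
def rabin_karp(sub, master, q=7):
    """Rabin-Karp over bit strings: compare a window only when fingerprints agree."""
    m, n = len(sub), len(master)
    if m == 0:
        return 0
    if m > n:
        return -1
    fp = int(sub, 2) % q                 # the substring's fingerprint
    high = pow(2, m - 1, q)              # weight of the bit leaving the window
    val = int(master[:m], 2) % q         # fingerprint of the first window
    for i in range(n - m + 1):
        if val == fp and master[i:i + m] == sub:   # fingerprint hit: verify
            return i
        if i + m < n:                    # roll the fingerprint forward
            val = ((val - int(master[i]) * high) * 2 + int(master[i + m])) % q
    return -1
```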
Knuth-Morris-Pratt
• Consider string matching where the substring partially matches master before a mismatch occurs
• Because of the structure of the substring, after a mismatch we often do not have to back up and restart the comparison at the next character of master; we can slide the substring further – why?
– by taking advantage of this idea, we can reduce the complexity of the string search algorithm in the worst case to O(n + m)
– the Knuth-Morris-Pratt algorithm does this by creating an align array that determines where to resume our string searching once we find a mismatch
– the align array is based on the structure, or repetitiveness, of the substring
• We therefore solve the problem in two parts: compute align (O(m)) and string search through master (O(n))
Computing Align
• The idea of the align array is to tell us where to realign the substring against the master string after a mismatch
– with a mismatch partway through the substring, we don't want to restart j at 0 and back i up in master – we might be able to do better than that
• The algorithm for computing align is given below; align[i] is the length of the longest proper prefix of substring[0..i] that is also a suffix of it, with align[0] = -1 used as a sentinel

```
align[0] = -1;
align[1] = (substring[1] == substring[0]) ? 1 : 0;
i = 2; j = align[1];
while (i < m) {
    if (substring[i] == substring[j]) {
        align[i] = j + 1; i++; j++; }
    else if (j > 0) j = max(align[j - 1], 0);   // fall back to a shorter border
    else { align[i] = 0; i++; }
}
```
Example
• Compute the align array for "papaya"
• Initialize values
– align[0] = -1
– align[1] = 0 (since 'a' != 'p')
– j = 0, i = 2
• Compute for i = 2 to 5
– align[2] = j + 1 = 1 (since character at i = character at j), i = 3, j = 1
– align[3] = j + 1 = 2 (since character at i = character at j), i = 4, j = 2
– align[4]: character at i != character at j, so fall back: j = max(align[1], 0) = 0; still character at i != character at j and now j = 0, so align[4] = 0, i = 5
– align[5]: character at i != character at j and j = 0, so align[5] = 0, i = 6
• Done: align = [-1, 0, 1, 2, 0, 0]
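The align computation can be rendered as a runnable Python sketch (the function name is mine; it keeps the slides' align[0] = -1 sentinel convention and falls back through shorter borders on a mismatch):

```python
def compute_align(sub):
    """Align array in the slides' convention: align[0] = -1, and align[i] is
    the length of the longest proper prefix of sub[0..i] that is also a
    suffix of it."""
    m = len(sub)
    align = [-1] * m
    if m > 1:
        align[1] = 1 if sub[1] == sub[0] else 0
    i, j = 2, (align[1] if m > 1 else 0)
    while i < m:
        if sub[i] == sub[j]:
            align[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = max(align[j - 1], 0)   # fall back to the next shorter border
        else:
            align[i] = 0
            i += 1
    return align
```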
KMP Algorithm

```
compute align array
m = 0; s = 0; done = false;    // m indexes master, s indexes substring
while (s < length of substring && !done) {
    if (master[m] == substring[s]) {
        m++; s++; }
    else if (s == 0) m++;               // mismatch on the first character: advance in master
    else s = max(align[s - 1], 0);      // realign the substring; m never moves backward
    done = (length of substring – s > length of master – m);
}
if (s >= length of substring) return (m – length of substring);
else return -1;
```

• As shown before, the align computation is O(m)
• Here, the while loop iterates once per character in the master string until a match is found, but may iterate more than once per character position if the substring needs to be realigned
– however, the substring will never be realigned to an earlier point in the master string than the current location
• The search itself is therefore no worse than O(n), and with the O(m) align computation we have an overall complexity of O(n + m)
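Putting the align array and the search loop together, a self-contained Python sketch (names are mine):

```python
def kmp_search(sub, master):
    """KMP search: on a mismatch, realign sub using the align array; the
    index into master never moves backward."""
    msub, n = len(sub), len(master)
    if msub == 0:
        return 0
    # align[i] = longest proper border of sub[0..i]; align[0] = -1 is a sentinel
    align = [-1] + [0] * (msub - 1)
    j = 0
    for i in range(1, msub):
        while j > 0 and sub[i] != sub[j]:
            j = max(align[j - 1], 0)
        if sub[i] == sub[j]:
            j += 1
        align[i] = j
    # search phase
    m = s = 0
    while msub - s <= n - m:             # enough of master left to finish a match?
        if master[m] == sub[s]:
            m += 1
            s += 1
            if s == msub:
                return m - msub          # full match
        elif s == 0:
            m += 1                       # mismatch on first character: advance
        else:
            s = max(align[s - 1], 0)     # realign the substring
    return -1
```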
Boyer-Moore
• Another string searching algorithm tries to take advantage of what happens in a mismatch between a substring and master string when the character in master does not exist at all in the substring
– specifically, the substring-to-master-string match occurs right-to-left rather than left-to-right
– consider:
• master: abcabdabeacd
• substring: abe
– we start by comparing "abe" to "abc", but rather than scanning these left-to-right, we scan them right-to-left
– since "c" does not exist at all in the substring, we know that the substring does not match the master string at positions 0, 1, or 2
– in KMP, we would realign the substring to the master string at index 1, but here, we know that "c" is not in any part of the substring, so we can realign the substring to start at master string index 3
• Question: how do we know that "c" is not in the substring?
– we need an O(1) means of checking this or else the complexity of Boyer-Moore will not improve over the simple string search
Creating a Jump Array
• In order to determine if a given substring contains a character, we prepare a "jump" array
• We scan the entire alphabet of characters that are permissible for a string in the given programming language
– for each character, we store how far to shift the substring based on whether the character exists in the substring or not
– for all of the characters in the language that do not appear in the substring, the jump value is m
• that is, move m locations in the master string, since none of the m characters currently aligned with the substring can be part of a match
– for any character that does exist in the substring, we can only jump far enough to line that character up with its last occurrence in the substring

```
char ch; int k;
for (ch = 0; ch < alpha; ch++)       // assume all characters are not in the substring
    jump[ch] = m;
for (k = 0; k < m - 1; k++)          // for those in the substring (except its last
    jump[sub.charAt(k)] = m - 1 - k; // character), use the distance from the end
```
Boyer-Moore Search
• Our search algorithm works as follows:
– start comparing at location m-1 of the current window of master and location m-1 of the substring
– if the two characters do not match and the character in master does not exist in the substring, then use the jump array to move to the next starting location in master, which lets us skip over m characters of master!
– if the two characters match, then keep working toward the left in both master and substring until either we have a complete match or a mismatch
• on a mismatch, again use the jump array to shift the substring to the right in the master string
– if the two characters do not match but the character in master exists in the substring, then we shift only far enough to line that character up with its last occurrence in the substring, as determined by the previous algorithm
• Because a window may partially match before failing, we might ultimately make O(m*n) comparisons
• More likely though, we use the jump array to skip ahead, and we might make as few as O(m + n / m) comparisons, which is a good deal less than our KMP algorithm
– HOWEVER, the Boyer-Moore complexity is misleading; to create the jump array, we need to look at |S| characters (|S| is the size of the language's alphabet – for instance, 256 for ASCII, 65,536 for Unicode!)
– so our complexity is really O(|S| + m + n / m)
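A runnable sketch of the bad-character idea, here in the simplified Horspool form where the shift is always taken from the master character under the window's last position (names are mine; full Boyer-Moore adds a second, "good suffix" table, which this sketch omits):

```python
def horspool_search(sub, master):
    """Boyer-Moore-Horspool sketch: compare right-to-left; on a failed window,
    shift by the jump value of the master character under the window's end."""
    m, n = len(sub), len(master)
    if m == 0:
        return 0
    if m > n:
        return -1
    jump = {}                          # characters absent from sub default to a shift of m
    for k in range(m - 1):             # every character but the last: distance to the end
        jump[sub[k]] = m - 1 - k
    i = 0
    while i + m <= n:
        j = m - 1
        while j >= 0 and master[i + j] == sub[j]:
            j -= 1                     # scan right-to-left while matching
        if j < 0:
            return i                   # matched the whole substring
        i += jump.get(master[i + m - 1], m)
    return -1
```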
Approximate String Matching
• As useful as string matching is, there are many reasons for wanting to perform approximate string matching (such as in spell checking and bioinformatics)
• How do we measure how closely one string matches another?
– two forms of mismatch
• string1.charAt(j) != string2.charAt(j)
• string1.charAt(j) does not appear in string2 (or vice versa)
– the first case is known as a revise difference
– the second is a delete or insert difference (a character has been deleted or inserted)
• what about two characters that appear out of position?
– consider approximate vs. apporximate – is this 1 error or 2?
– we will compute a matrix d that stores the cost (mismatch) of comparing two strings (d[i][0] is initialized to i and d[0][j] to j, for all i and j)
• matchCost = d[i-1][j-1] if string1.charAt(i) == string2.charAt(j)
• reviseCost = d[i-1][j-1] + 1 if string1.charAt(i) != string2.charAt(j)
• insertCost = d[i-1][j] + 1
• deleteCost = d[i][j-1] + 1
• d[i][j] = minimum(matchCost, reviseCost, insertCost, deleteCost)
• The value at d[n][m] is the cost of comparing a string of size n to a string of size m
– the complexity of this algorithm is O(m*n)
Example
• We use a 2-D table to compute d[i][j]
– d[0][j] is initialized to j and d[i][0] is initialized to i
– we now fill in the rest of the table up through d[length1][length2], where length1 is the length of the first string and length2 is the length of the second string
• The "cost" of matching string1 to string2 is d[length1][length2]
– the larger the value of the cost, the less likely that we have a match
– as an example, compare "hello" to "hell", "hallo" and "hall"
• Since "hall" scores 2, it is less of a match than either "hell" or "hallo", each of which scores 1 ("hell" is missing 1 letter, "hallo" is off by 1 letter)

Visit http://www-apparitions.ucsd.edu/~rmckinle/string/ if you want to play with the algorithm through a Java applet
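The cost matrix can be sketched as follows (the function name is mine); it reproduces the scores quoted for "hell", "hallo" and "hall":

```python
def edit_cost(s1, s2):
    """Cost matrix d per the slides: d[i][j] is the edit distance between
    the first i characters of s1 and the first j characters of s2."""
    n1, n2 = len(s1), len(s2)
    d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        d[i][0] = i                      # delete i characters
    for j in range(n2 + 1):
        d[0][j] = j                      # insert j characters
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            # match costs nothing; a revise difference costs 1
            match = d[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            insert = d[i - 1][j] + 1
            delete = d[i][j - 1] + 1
            d[i][j] = min(match, insert, delete)
    return d[n1][n2]
```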
