String Searching - PowerPoint

					                  String Searching
• String searching is a critical algorithm used
   – by grep in unix/linux to search a directory for a given text
   – in word processors
   – in bioinformatics to locate matching gene sequences
• Because the source file or array might be large, we want
  to have an efficient algorithm
   – how efficient? O(m*n), where m = length of the substring and n = length
     of the string, is not good enough: examining a 100 Kbyte file for a
     20-character sequence could require 100,000 * 20 = 2 million comparisons!
   – we would prefer O(m + n)
• We look at four algorithms where we will assume
   – we have a master string of length n, and a substring of length m
   – we want to find the index of the first occurrence of the
     substring in master, or return -1 if the substring is not found
             Simple Search Algorithm
• The first algorithm is a straightforward search
   – index j moves across the substring and the
     current segment of the master string starting
     at index i
   – if(substring[j] == master[i+j]) then
     increment j
   – if we have a mismatch, reset j to 0 and i++
        • that is, start the search over again at the next
          character in master
   [Figure: index i points into master; index j points into sub]
    search(sub, master) { // find sub in master or return -1
        m = sub.length; n = master.length; i = 0;
        while(i + m <= n) {   // is there enough of master to search?
            j = 0;            // start at sub[0] and master[i]
            while(j < m && master[i + j] == sub[j])   // test j < m first so sub[j] stays in range
                j++;          // while we are still matching, continue
            if(j >= m) return i;   // did we reach the end of sub? If so, return i
            else i++;              // otherwise move on to the next character
        }                          // in master and try again
        return -1;
    }
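The pseudocode above can be sketched as runnable Python. This is a minimal illustration rather than a tuned implementation; the function name `simple_search` is an assumption made for this example.

```python
def simple_search(sub: str, master: str) -> int:
    """Return the index of the first occurrence of sub in master, or -1."""
    m, n = len(sub), len(master)
    i = 0
    while i + m <= n:                  # enough of master left to hold sub?
        j = 0
        # test j < m first so sub[j] is never read out of range
        while j < m and master[i + j] == sub[j]:
            j += 1
        if j == m:                     # reached the end of sub: match starts at i
            return i
        i += 1                         # mismatch: retry at the next character
    return -1


print(simple_search("abe", "abcabdabeacd"))   # 6
```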
      Simple String Search Analysis
• The outer loop will iterate as many as n – m + 1 times depending
  on when the substring is found (if found at all)
   – we have “– m” because we don't have to search the final characters since
     we would run out of master string before we had a complete match
• The inner loop will iterate based strictly on how closely the
  substring matches this portion of the master string
   – consider matching “12345” against “020304050612345”
       • we never look at more than a single character in master for the first 10 characters
   – we only start working through the inner while loop of the algorithm once we
     reach the actual placement of the substring (O(n))
   – in the worst case though, we might have to match m – 1 characters before we
     find a mismatch
   – consider matching “AAAB” in “AAAAAAAAAAB”: at each position in the
     master string, we match 3 'A' characters before the fourth comparison
       • O(n*m)
• The algorithm's best case is O(n), worst case is O(m * (n – m + 1))
   – but since we will probably have a large master string (such as a text file) and
     a fairly short substring, m << n and so our worst case comes out to O(m * n)
• The average case probably won't have too many close matches so
  the complexity will be closer to O(n)
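To make the two cases above concrete, the following sketch counts the character comparisons the simple search performs. The helper name `count_comparisons` is an assumption for this illustration, and the second master string is written as ten 'A' characters followed by 'B'.

```python
def count_comparisons(sub: str, master: str) -> int:
    """Run the simple search and return how many character comparisons it made."""
    m, n = len(sub), len(master)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if master[i + j] != sub[j]:
                break                  # mismatch ends this starting position
            j += 1
        if j == m:
            break                      # full match found, the search would return here
    return comparisons


print(count_comparisons("12345", "020304050612345"))   # 15, close to n
print(count_comparisons("AAAB", "A" * 10 + "B"))        # 32 = m * (n - m + 1)
```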
• In our simple search algorithm, we start checking each segment of
  master until we mismatch with the substring
   – however, if there is evidence that the substring will not match the master
     string, we never have to start that inner-loop search
   – in order to accomplish this, we will compute the substring's “fingerprint”
     and we will compute the fingerprint of each segment of the master string
   – if our segment's fingerprint does not equal the substring's fingerprint, there
     is no need to start searching the segment and we move on to the master
     string's next segment
   – this algorithm has a complexity that ranges from O(m*n) in the worst case
     to O(m + n) in the average and best case
• What can we use for a fingerprint?
   – we need to be able to determine a fingerprint quickly (O(1)) or else this
     approach will not improve over the straightforward approach
• If only a few of the master string's segments share the substring's fingerprint, then
  instead of O(m*n) we have O(m*n*k + m), where k is the fraction of
  segments whose fingerprint matches and the O(m) at the end is to
  determine the substring's fingerprint
        Using Parity for a Fingerprint
• Assume that we are searching a binary string for a binary substring
   – even parity exists if the binary string has an even number of 1 bits
       • for instance, 010111 has even parity while 000001, 001011, and 111101 all have
         odd parity – note that 000000 has even parity
• Compute the parity of the substring (takes O(m)) and the parity of
  each segment of length m in the master string (takes O(n)), and
  now use the straightforward string matching algorithm but precede
  the inner while loop with an if statement:
   – if(fingerprint[i] == substring_fingerprint) while …
       • we only bother to start searching at index i in the master string if the fingerprint
         of this portion of the master string matches the fingerprint of the substring
 Example: Search for 010111 in 0010110101001010011
  Substring fingerprint = 1 (even parity)
  Fingerprints of master's segments of length 6:
          Index:       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
          Character:   0  0  1  0  1  1  0  1  0  1  0  0  1  0  1  0  0  1  1
          Fingerprint: 0  0  1  0  1  0  1  0  1  0  1  1  0  0  x  x  x  x  x
                   x indicates non-applicable starting points
                   locations with fingerprint 1 (indices 2, 4, 6, 8, 10 and 11) are
                   those where we have to start searching
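The fingerprint row in this example can be reproduced with a short sketch. It keeps a running count of 1 bits so that each new segment costs O(1), which gives the O(n) total stated above. The name `parity_fingerprints` is an assumption, and 1 stands for even parity as in the example.

```python
def parity_fingerprints(master: str, m: int) -> list[int]:
    """Parity fingerprint (1 = even, 0 = odd) of every length-m segment of master."""
    bits = [int(c) for c in master]
    if len(bits) < m:
        return []
    ones = sum(bits[:m])                      # 1 bits in the first segment
    result = [1 - ones % 2]
    for i in range(m, len(bits)):
        ones += bits[i] - bits[i - m]         # slide the window by one position
        result.append(1 - ones % 2)
    return result


master = "0010110101001010011"
fp = parity_fingerprints(master, 6)
sub_fp = 1 - "010111".count("1") % 2          # 4 one bits, so even parity (1)
print(fp)
print([i for i, f in enumerate(fp) if f == sub_fp])   # [2, 4, 6, 8, 10, 11]
```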
      Using non-Parity Fingerprints
• Parity is either even or odd so we might wind up having as
  many locations to search as to not search
   – k = 50%, or our average case complexity is O(m * n * .5 + m) which is
     still O(m*n) – no improvement
• We want a better fingerprint, one in which we can be more
  certain as to whether to search or not
   – we use the same idea, but rather than even or odd parity, we will
     compute a fingerprint that can take on more than two values:

       f = \left( \sum_{j=0}^{m-1} s_j \cdot 2^{\,m-1-j} \right) \bmod q

       s_j is the jth character of the segment being fingerprinted
       q is some prime number, for instance 7
       m is the length of the substring (and so of each segment)

   – If we assume binary strings, then this is actually computing the
     decimal equivalent of the m-bit binary number starting at position i
     of the master string and then mod-ing it by q, a prime number
Index  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Si     0  0  1  0  1  1  0  1  0  1  0  0  1  0  1  0  0  1  1
S     11 22 45 26 53 42 20 41 18 37 10 20 41 19  x  x  x  x  x
f      4  1  3  5  4  0  6  6  4  2  3  6  6  5  x  x  x  x  x
• Here, the substring is 001011; since its length is 6, we compute fingerprints of
  length-6 segments of master, and we use 7 to mod the summations by
• Examples:
    – f[0] (fingerprint starting at location 0) is 001011 % 7 = (0*2^5 + 0*2^4 + 1*2^3 + 0*2^2 +
      1*2^1 + 1*2^0) % 7 = 11 % 7 = 4
    – f[5] = 101010 % 7 = 42 % 7 = 0
    – f[14] isn't computed because we would run out of string before we could match the
      whole substring
• Our substring has a fingerprint of 001011 % 7 = 11 % 7 = 4
    – we only need to search at locations i = 0, 4 and 8
• Computing the fingerprints for the entire master string takes 6*n operations, but
  we can reduce this to 1*n if we are clever
    – how?
• Whether O(6*n) or O(1*n), it is still O(n)
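One way to answer the "how?" above is a rolling update: drop the contribution of the bit that leaves the segment, double what remains, and add the bit that enters. The sketch below assumes binary input and q = 7 as in the example; the function name `fingerprints` is made up for this illustration.

```python
def fingerprints(master: str, m: int, q: int = 7) -> list[int]:
    """Fingerprint (binary value mod q) of every length-m segment, one step per segment."""
    bits = [int(c) for c in master]
    n = len(bits)
    if n < m:
        return []
    high = pow(2, m - 1, q)               # weight of the leading bit, reduced mod q
    f = 0
    for b in bits[:m]:                    # fingerprint of the first segment, O(m)
        f = (2 * f + b) % q
    result = [f]
    for i in range(1, n - m + 1):
        # remove the old leading bit, shift left by one, append the new trailing bit
        f = (2 * (f - bits[i - 1] * high) + bits[i + m - 1]) % q
        result.append(f)
    return result


print(fingerprints("0010110101001010011", 6))
# [4, 1, 3, 5, 4, 0, 6, 6, 4, 2, 3, 6, 6, 5]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here; in languages where `%` can be negative, adding q before reducing would be needed.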
                    Rabin-Karp Algorithm
• Assume that fingerprints are evenly distributed across possible substring
  sequences, then the inner while loop will only be examined (n – m) / q times
  where q is the selected mod value
    – by selecting a prime number for q, we will further avoid patterns in the data (consider
      if q was 10 or 16, we would find a lot more matches)
• The inner loop only iterates while this portion of the master string and the
  substring continue to match
    – for instance, we stop iterating if we are searching for “ABCDE” and we have
• Thus, we can expect in the average case a complexity of O(m + n)
            compute fingerprint f (f[i] is fingerprint starting at index i of master)
            compute fingerprint of substring
            i = 0; m = substring.length; n = master.length;
            while(i + m <= n) {
                j = 0;
                if(f[i] == fingerprint)       // only compare characters when fingerprints agree
                    while(j < m && master[i + j] == substring[j]) j++;
                if(j >= m) return i; else i++;
            }
            return -1;
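A runnable sketch of the same idea for ordinary text follows. Instead of storing the whole array f, it keeps one rolling fingerprint. The base of 256 and the prime q = 101 are assumptions chosen for illustration (the binary example above uses base 2 and q = 7), and `rabin_karp` is a hypothetical name.

```python
def rabin_karp(sub: str, master: str, base: int = 256, q: int = 101) -> int:
    """Return the first index of sub in master, or -1, comparing fingerprints first."""
    m, n = len(sub), len(master)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, q)            # weight of the leading character, mod q
    fp_sub = fp_win = 0
    for k in range(m):                    # fingerprints of sub and of master[0:m]
        fp_sub = (fp_sub * base + ord(sub[k])) % q
        fp_win = (fp_win * base + ord(master[k])) % q
    for i in range(n - m + 1):
        # characters are compared only when the fingerprints agree
        if fp_win == fp_sub and master[i:i + m] == sub:
            return i
        if i + m < n:                     # roll the fingerprint to the next segment
            fp_win = ((fp_win - ord(master[i]) * high) * base + ord(master[i + m])) % q
    return -1


print(rabin_karp("abe", "abcabdabeacd"))   # 6
```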
                 Knuth-Morris-Pratt
• Consider string matching on the following example:
   – Substring: ABADA
• Because of the structure of the substring, when we reach
  a mismatch at index 3 of Master, we actually do not have
  to restart at index 1, but instead at index 2 – why?
   – By taking advantage of this idea, we can reduce the complexity
     of the string search algorithm in the worst case to be O(n + m)
   – The Knuth-Morris-Pratt algorithm does this by creating an
     align array which determines where we need to start our string
     searching from once we find a mismatch
   – The align array is based on the structure, or repetitiveness, of
     the substring
• We therefore solve the problem in two parts, compute
  align (O(m)) and string search through master (O(n)).
                  Computing Align
• The idea of the align array is to tell us where to align
  the substring to the master string after a mismatch
   – Consider as an example a mismatch partway through the substring: we don't want
     to restart j at 0 and i at 8; we might be able to do better than that
     [Figure: example alignment of substring and master with the mismatch marked]

• The algorithm for computing align is given below; align[i] is the length of
  the longest proper prefix of substring[0..i] that is also a suffix of it
              i=1, j=0;
              align[0] = 0;
              while(i<m) {
                  if(substring[i] == substring[j]) {
                       align[i]=j+1; i++; j++; }
                  else if(j>0) j=align[j-1];
                  else { align[i]=0; i++;}
              }
• Compute the align array for “papaya”
• Initialize values
   – align[0] = 0
   – j = 0, i = 1
• Compute for i=1 to 5
   – align[1]: character at i ('a') != character at j ('p') and j is not > 0 so
     align[1] = 0, i = 2
   – align[2] = j + 1 = 1 (since character at i = character at j), i = 3, j = 1
   – align[3] = j + 1 = 2 (since character at i = character at j), i = 4, j = 2
   – align[4]: since character at i != character at j, reset j to align[j-1]
     = align[1] = 0, and still character at i != character at j, so
     align[4] = 0, i = 5, j = 0
   – align[5]: character at i != character at j and j is not > 0 so
     align[5] = 0, i = 6
• Done: align = [0, 0, 1, 2, 0, 0]
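The walkthrough can be checked with a short sketch of the standard prefix-function computation. It stores 0 in align[0], since a single character has no shorter prefix to fall back to; `compute_align` is a name assumed for this example.

```python
def compute_align(sub: str) -> list[int]:
    """align[i] = length of the longest proper prefix of sub[:i+1] that is also its suffix."""
    m = len(sub)
    align = [0] * m
    i, j = 1, 0                    # j = length of the overlap matched so far
    while i < m:
        if sub[i] == sub[j]:       # overlap grows by one character
            align[i] = j + 1
            i += 1
            j += 1
        elif j > 0:                # fall back to the next shorter overlap
            j = align[j - 1]
        else:                      # no overlap ends at position i
            align[i] = 0
            i += 1
    return align


print(compute_align("papaya"))     # [0, 0, 1, 2, 0, 0]
```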
                       KMP Algorithm
      Compute align array
      m = 0, s = 0, done = false       // m indexes master, s indexes substring
      while(s < length of substring && !done) {
            if(master[m] == substring[s]) {
                   m++; s++; }
            else if(s == 0) m++;
            else s = max(align[s - 1], 0);   // keep the longest prefix of substring that still matches
            done = (length of substring - s > length of master - m);
      }
      if(s >= length of substring) return (m - length of substring);
      else return -1;
• As shown before, the align algorithm is O(m)
• Here, the while loop iterates once per character in the master string until a
  match is found, but may iterate more than once per character if the substring
  needs to be realigned
    – however, the substring will never be realigned to an earlier point in the master
      string than the current location
• Each iteration either advances m or realigns the substring to the right, so the loop
  runs at most about 2n times; the search is O(n) and, together with align, the
  overall complexity is O(n + m)
KMP Example
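As a small runnable illustration of the two phases (build the align array, then scan master once), here is a hedged sketch using the standard prefix-function form of KMP. The names `compute_align` and `kmp_search` and the sample strings are assumptions for this example, not taken from the original slide.

```python
def compute_align(sub: str) -> list[int]:
    """align[i] = length of the longest proper prefix of sub[:i+1] that is also its suffix."""
    align = [0] * len(sub)
    j = 0
    for i in range(1, len(sub)):
        while j > 0 and sub[i] != sub[j]:
            j = align[j - 1]       # fall back to a shorter overlap
        if sub[i] == sub[j]:
            j += 1
        align[i] = j
    return align


def kmp_search(sub: str, master: str) -> int:
    """Return the first index of sub in master, or -1, never moving backwards in master."""
    if not sub:
        return 0
    align = compute_align(sub)
    s = 0                          # number of characters of sub currently matched
    for pos, ch in enumerate(master):
        while s > 0 and ch != sub[s]:
            s = align[s - 1]       # realign sub without re-reading master
        if ch == sub[s]:
            s += 1
        if s == len(sub):
            return pos - len(sub) + 1
    return -1


print(kmp_search("ABADA", "ABABADA"))     # 2
print(kmp_search("papaya", "papapaya"))   # 2
```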
• Another string searching algorithm, Boyer-Moore, tries to take advantage of what
  happens in a mismatch between a substring and master string
  when the character in master does not exist at all in the substring
   – specifically, the substring to master string match occurs right-to-left rather
     than left-to-right
   – consider:
       • master: abcabdabeacd
       • substring: abe
   – we start by comparing “abe” to “abc” but rather than scanning these left-to-
     right, we scan them right-to-left
   – since “c” does not exist at all in the substring, we know that the substring
     does not match the master string at positions 0, 1, or 2
   – in the simple search we would realign the substring to the master string at
     index 1 (and in KMP at index 2), but here, we know that “c” is not in any
     part of the substring, so we can realign the substring to start at master
     string index 3
• Question: how do we know that “c” is not in the substring?
   – we need an O(1) means of checking this or else the complexity of Boyer-
     Moore will not improve over the simple string search
                Creating a Jump Array
• In order to determine if a given substring contains a character, we
  prepare a “jump” array
• We scan the entire alphabet of characters that are permissible for
  a string in the given programming language
   – for each character, we store where to realign our substring to based on
     whether the character exists in the substring or not
   – for all of the characters in the language that do not appear in the substring,
     the jump value will be m
        • that is, move m locations in the master string since we no longer have to look
          at the m characters that align our substring to the master string
   – for any character that does exist in the substring, we can only jump to a
     location to where the characters matched

   char ch; int k;
    for(ch=0; ch<alpha; ch++)             // assume all characters are not in substring
         jump[ch] = m;
    for(k=0; k<m; k++)                    // for those in substring, use the distance from
         jump[sub.charAt(k)] = m - 1 - k; // position k to the end of the substring (0-based k)
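A hedged sketch of the jump table in Python follows. It uses a dictionary with a default of m, which gives the O(1) "is this character in the substring?" check without scanning the whole alphabet. With 0-based indices, the distance from position k to the last position of the substring is m - 1 - k; the names `build_jump` and `jump_for` are assumptions for this example.

```python
def build_jump(sub: str) -> dict[str, int]:
    """Map each character of sub to the distance from its rightmost occurrence to sub's end."""
    m = len(sub)
    jump = {}
    for k, ch in enumerate(sub):       # later (rightmost) occurrences overwrite earlier ones
        jump[ch] = m - 1 - k
    return jump


def jump_for(jump: dict[str, int], ch: str, m: int) -> int:
    """Characters that never occur in sub get the full jump of m."""
    return jump.get(ch, m)


table = build_jump("abe")
print(table)                           # {'a': 2, 'b': 1, 'e': 0}
print(jump_for(table, "c", 3))         # 3, because 'c' is not in "abe"
```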
                 Boyer-Moore Search
• Our search algorithm works as follows:
   – Start searching in master at location m-1 and substring at location m-1
   – If the two characters are not a match and the character in master does not
     exist in the substring, then use the jump array to move to the next starting
     location in master, which will allow us to skip over m characters in master!
   – If the two characters match, then start working toward the left in both
     master and substring until either we have a complete match or a mismatch
       • on a mismatch, again use the jump array to shift to the right in the master string
   – If the two characters do not match but the character in master exists in the
     substring, then we have to shift not to the right but to the left by some
     amount as determined by the previous algorithm
• Because we might have to move our substring to the left, we might
  ultimately have O(m*n) comparisons
• More likely though, we use the jump array to move to the right,
  and we might have as few as O(m + n / m) comparisons, which is
  a good deal less than our KMP algorithm
   – HOWEVER, the Boyer-Moore complexity is misleading: to create the jump
     array, we need to look at |S| characters (|S| is the size of the language's
     alphabet, for instance 256 for extended ASCII, 65,536 for 16-bit Unicode!)
   – So our complexity is really O(|S| + m + n/m)
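The sketch below is not the full algorithm described above but the Horspool simplification of Boyer-Moore, named plainly as a swapped-in variant: it compares right-to-left and always shifts by the jump value of the master character under the substring's last position, so the substring never moves left. The table is a dictionary, so its cost depends on m rather than |S|. The name `horspool_search` is an assumption.

```python
def horspool_search(sub: str, master: str) -> int:
    """Return the first index of sub in master, or -1 (Boyer-Moore-Horspool variant)."""
    m, n = len(sub), len(master)
    if m == 0:
        return 0
    # Shift for every character except the last one; each shift is at least 1.
    shift = {ch: m - 1 - k for k, ch in enumerate(sub[:-1])}
    i = 0                                  # current alignment of sub[0] in master
    while i + m <= n:
        k = m - 1
        while k >= 0 and master[i + k] == sub[k]:
            k -= 1                         # compare right-to-left
        if k < 0:
            return i                       # every character matched
        # characters absent from sub[:-1] allow a full jump of m
        i += shift.get(master[i + m - 1], m)
    return -1


print(horspool_search("abe", "abcabdabeacd"))   # 6
```

On this example the substring is tried only at alignments 0, 3 and 6, since 'c' and 'd' do not occur in "abe".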
       Approximate String Matching
• As useful as string matching is, there are many reasons for
  wanting to perform approximate string matching (such as in spell
  checking and bioinformatics)
• How do we approximate how close a string matches another?
   – two forms of mismatch
       • if the string1.charAt(j) != string2.charAt(j)
       • if string1.charAt(j) does not appear in string2 (or vice versa)
             – the first case is known as a revise difference
             – the second is a delete or insert difference (a character has been deleted or inserted)
       • what about two characters that appear out of position?
             – consider approximate vs. apporximate? -- is this 1 error or 2?
   – we will compute a matrix d that will store the cost (mismatch) when
     comparing two strings (d[i][0] is initialized to i and d[0][j] to j); for
     i, j >= 1, row i corresponds to string1.charAt(i-1) and column j to
     string2.charAt(j-1)
       •   matchCost = d[i-1][j-1] if string1.charAt(i-1) == string2.charAt(j-1)
       •   reviseCost = d[i-1][j-1] + 1 if string1.charAt(i-1) != string2.charAt(j-1)
       •   insertCost = d[i-1][j] + 1
       •   deleteCost = d[i][j-1] + 1
       •   d[i][j] = minimum(matchCost, reviseCost, insertCost, deleteCost)
• The value at d[n][m] is the cost when comparing a string of
  size n to a string of size m
   – the complexity of this algorithm is O(m*n)
• We use a 2-D table to compute d[i][j]
    – d[0][j] is initialized to j and d[i][0] is initialized to i
    – we now fill in the rest of the table up through d[length1][length2] where length1 is
      the length of the first string and length2 is the length of the second string
• The “cost” of matching string1 to string2 is d[length1][length2]
    – the larger the value of the cost, the less likely that we have a match
    – below, as an example, we compare “hello” to “hell”, “hallo” and “hall”

  Since “hall” scores 2, it is less of a match than either “hell” or “hallo”, each of which
  score 1 (hell is missing 1 letter, hallo is off by 1 letter)
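The table fill described above can be sketched as follows; it reproduces the three scores quoted for “hello”. The function name `edit_distance` is an assumption for this example, and each revise, insert or delete is assumed to cost 1.

```python
def edit_distance(string1: str, string2: str) -> int:
    """Cost of turning string1 into string2 with unit-cost revise, insert and delete."""
    rows, cols = len(string1) + 1, len(string2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                        # i deletions against an empty string
    for j in range(cols):
        d[0][j] = j                        # j insertions from an empty string
    for i in range(1, rows):
        for j in range(1, cols):
            same = string1[i - 1] == string2[j - 1]
            diagonal = d[i - 1][j - 1] + (0 if same else 1)   # match or revise
            d[i][j] = min(diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[rows - 1][cols - 1]


for word in ["hell", "hallo", "hall"]:
    print(word, edit_distance("hello", word))   # 1, 1 and 2 respectively
```

Under this cost model two characters that appear out of position, as in “approximate” vs. “apporximate”, count as 2 errors, because the model has no single swap operation.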

  visit if you want to play with the
  Algorithm through a Java applet