A Fast String Matching Algorithm by pptfiles

									  A Fast String Matching Algorithm

The Boyer Moore Algorithm
      The obvious search algorithm

Considers each character position of str and
determines whether the successive patlen characters
of str starting at that position match pat.
In the worst case, the number of comparisons is on the
order of i * patlen, where i is the position in str at
which pat is found.
Ex. pat: aab ; str: ..aaaaac .
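A minimal Python sketch of the obvious algorithm on the example above, counting character comparisons. The function name naive_search and the 0-based return convention are illustrative choices, not from the slides.

```python
def naive_search(pat, text):
    """Return (index, comparisons); index is 0-based, or -1 if pat is absent."""
    comparisons = 0
    for start in range(len(text) - len(pat) + 1):
        matched = 0
        # Compare pat against text left to right from this start position.
        while matched < len(pat):
            comparisons += 1
            if text[start + matched] != pat[matched]:
                break
            matched += 1
        if matched == len(pat):
            return start, comparisons
    return -1, comparisons


# Every alignment matches "aa" before failing on the third character.
print(naive_search("aab", "aaaaac"))  # (-1, 12)
```

Each of the 4 alignments costs patlen = 3 comparisons here, which is the multiplicative behaviour the worst-case bound describes.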
          Knuth-Morris-Pratt Algorithm
 Linear search algorithm.
 Preprocesses pat in time linear in patlen and searches
 str in time linear in i + patlen (i = position of the match).

Ex. pat: EXAMPLE
    str: HERE IS A SIMPLE EXAMPLE
      Characteristics of the Boyer Moore Algorithm
Basic idea: pat is matched against str from the
right rather than from the left.
Preprocess pat to compute two tables, delta1 and
delta2, used for shifting pat and the pointer into str.
Ex. pat : AT-THAT
    str : …WHICH-FINALLY-HALTS.--AT-THAT-POINT
                 Informal Description
 Compare the last char of pat with the
 patlen-th char of str:
        AT-THAT                (next alignment)
 AT-THAT                       (first alignment)
 WHICH-FINALLY-HALTS.--AT-THAT-POINT

Observation 1: if char is known not to occur in pat, skip
                 patlen chars of str.
               Informal Description
Observation 2: if char does occur in pat, slide pat down
delta1(char) positions so that char is aligned with its
  rightmost occurrence in pat.
            AT-THAT
 WHICH-FINALLY-HALTS.--AT-THAT-POINT
delta1(char) = if char does not occur in pat, then patlen; else
          patlen - j, where j is the maximum integer such that
          pat(j) = char.
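The delta1 definition can be sketched in Python as follows. The helper names build_delta1 and delta1_shift are hypothetical, and a dictionary stands in for a table indexed by the whole alphabet: characters missing from it get the "else" value patlen.

```python
def build_delta1(pat):
    """Map each char of pat to patlen - j, where j is its rightmost 1-based position."""
    patlen = len(pat)
    table = {}
    for j, char in enumerate(pat, start=1):
        # Later (more rightward) occurrences overwrite earlier ones.
        table[char] = patlen - j
    return table


def delta1_shift(table, char, patlen):
    """delta1(char): patlen when char does not occur in pat."""
    return table.get(char, patlen)


table = build_delta1("AT-THAT")
print(table)                        # {'A': 1, 'T': 0, '-': 4, 'H': 2}
print(delta1_shift(table, "F", 7))  # 7
```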
              Informal Description

Observation 3a: str matches the last m chars of pat,
 then comes to a mismatch at some new char. Move
 strptr by delta1(char) (pat is shifted by delta1(char) - m).

            AT-THAT
      AT-THAT
…FINALLY-HALTS.--AT-THAT-POINT
               Informal Description
Observation 3b: when the final m chars of pat (a subpat) are
 matched, find the rightmost plausible reoccurrence
 of the subpat in pat and align it with the matched m chars of
 str (slide pat delta2(j) - m positions, j being the position
 of the mismatch in pat).

                 AT-THAT
            AT-THAT
…FINALLY-HALTS.--AT-THAT-POINT
              The delta1 & delta2 tables
  The delta1 table has as many entries as there are chars in
  the alphabet.
Ex. pat:    a b c d e        ; a t - t h a t
    delta1: 4 3 2 1 0 else 5 ; 1 0 4 0 2 1 0 else 7

  The delta2 table has as many entries as there are chars in
  pat.

Ex. pat:    a b c d e ;  a  t  -  t  h  a  t
    delta2: 9 8 7 6 1 ; 11 10  9  8  7  4  1
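Ex: we compute delta2(j) for j = 5 (subpat "b c" matched, mismatch at pat(5) = a)
 j =      1  2  3  4  5  6  7
 pat:     e  d  b  c  a  b  c

 before:    e  d  b  c  a  b  c
 after:              e  d  b  c  a  b  c
 index:    -2 -1  0  1  2  3  4  5  6  7
Then the rightmost plausible reoccurrence of "b c" is at
positions 3-4 (preceded by d, not a), so pat slides 3
positions and delta2(5) = 3 + 2 = 5.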
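A brute-force Python sketch of delta2, written directly from Observation 3b rather than as an efficient preprocessing routine. It assumes that "plausible" means the reoccurrence is not preceded by the pattern character that just mismatched, and that positions falling off the left end of pat match anything. The name build_delta2 is illustrative.

```python
def build_delta2(pat):
    """delta2 as a list; entry j-1 holds delta2(j) for 1-based mismatch position j."""
    patlen = len(pat)
    table = []
    for j in range(1, patlen + 1):
        m = patlen - j                      # chars already matched: pat(j+1..patlen)
        for k in range(1, patlen + 1):      # try the smallest slide k first
            # The matched suffix must agree with pat slid k to the right.
            suffix_ok = all(p - k < 1 or pat[p - k - 1] == pat[p - 1]
                            for p in range(j + 1, patlen + 1))
            # The char now facing the mismatch must differ from pat(j).
            differs = j - k < 1 or pat[j - k - 1] != pat[j - 1]
            if suffix_ok and differs:
                table.append(k + m)         # strptr moves by the slide plus m
                break
    return table


print(build_delta2("abcde"))       # [9, 8, 7, 6, 1]
print(build_delta2("edbcabc")[4])  # 5, i.e. delta2(5) for pat e d b c a b c
```

The slide k = patlen always satisfies both conditions, so every j gets an entry.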
                                The algorithm
      stringlen ← length of string.
      i ← patlen.
top:  if i > stringlen then return false.
      j ← patlen.
loop: if j = 0 then return i + 1.
      if string(i) = pat(j)
      then
          j ← j - 1.
          i ← i - 1.
          goto loop.
      close;
      i ← i + max(delta1(string(i)), delta2(j)).
      goto top.
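The pseudocode above might be transcribed into Python as below. This is a sketch under stated assumptions: the tables are built by simple helpers (delta2 by brute force from Observation 3b), indices are kept 1-based to mirror the pseudocode, and None stands in for "false". All function names are illustrative.

```python
def build_delta1(pat):
    return {c: len(pat) - j for j, c in enumerate(pat, start=1)}


def build_delta2(pat):
    patlen, table = len(pat), []
    for j in range(1, patlen + 1):
        for k in range(1, patlen + 1):      # smallest admissible slide of pat
            if all(p - k < 1 or pat[p - k - 1] == pat[p - 1]
                   for p in range(j + 1, patlen + 1)) and \
               (j - k < 1 or pat[j - k - 1] != pat[j - 1]):
                table.append(k + patlen - j)
                break
    return table


def boyer_moore(pat, string):
    """Return the 1-based position of the first occurrence of pat, or None."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 1
    d1, d2 = build_delta1(pat), build_delta2(pat)
    i = patlen
    while i <= stringlen:                               # top
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:    # loop: back up right to left
            j -= 1
            i -= 1
        if j == 0:
            return i + 1
        i += max(d1.get(string[i - 1], patlen), d2[j - 1])
    return None


print(boyer_moore("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 23
```

On this input the string pointer visits positions 7, 14, 18 and 24 before the final alignment ending at 29, matching the alignments drawn on the Informal Description slides.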
Implementation Consideration
              Loops: fast, undo, slow
Fast: scans down string, effectively looking for the last
character pat(patlen) in pat, skipping according
to delta1.
– About 80% of the time is spent in it.
Undo: decides whether this situation arose because all
of string has been scanned or because pat(patlen)
was hit.
Slow: backs up checking for matches.
It is easy to implement on a byte-addressable machine
– char ← string(i), etc.
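One way the three loops could fit together, sketched in Python. The slides do not say how the fast loop detects a hit; the sentinel value large and the modified table delta0 are assumptions here, based on the usual description of this implementation. For brevity the slow loop shifts by max(delta1, m + 1) instead of consulting delta2, which is a simplification and not the full algorithm.

```python
def search_fast_undo_slow(pat, string):
    """Return the 1-based position of the first occurrence of pat, or None."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 1
    large = stringlen + patlen + 1          # assumed sentinel, beyond any real index
    delta1 = {c: patlen - j for j, c in enumerate(pat, start=1)}
    delta0 = dict(delta1)
    delta0[pat[-1]] = large                 # hitting pat(patlen) pushes i past stringlen
    i = patlen
    while True:
        # fast: one table lookup and one add per character inspected
        while i <= stringlen:
            i += delta0.get(string[i - 1], patlen)
        # undo: did we run off the end, or hit pat(patlen)?
        if i <= large:
            return None
        i = i - large - 1                   # step back to just before the hit
        j = patlen - 1
        # slow: back up, checking the rest of pat right to left
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i + 1
        m = patlen - j                      # chars matched so far
        i += max(delta1.get(string[i - 1], patlen), m + 1)


print(search_fast_undo_slow("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 23
```

The point of the design is that the fast loop needs no comparison against pat at all: the end-of-string test doubles as the hit test.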
  Measured the cost of each search

Three strings: binary alphabet, English, random
alphabet.

Fig. 1: the number of references made to string.
Fig. 2: the total number of machine instructions
that actually got executed.
Performance (empirical evidence)
                Boyer Moore
                      vs.
       Knuth, Morris, and Pratt algorithm
for English text.
Boyer Moore:
– Every reference to string passes about 4 characters for a
  pattern of length 5.
– For sufficiently large alphabets and sufficiently long patterns,
  executes fewer than 1 instruction per character passed.
K.M.P.:
– References string about 1.1 times per character passed.
– Can be expected to execute at least 3.3 instructions per
  character passed.
                    Conclusion

Requires fewer CPU cycles.
Works most efficiently on a byte-addressable machine.
Unadvisable: for finding the first of several possible
substrings or for identifying a location in string
defined by a regular expression.
– The Aho and Corasick algorithm is more suitable.
                     Conclusion

Improve: by fetching larger bytes in the fast
loop and using a hash array to encode the
extended delta1 table.
– Exponentially increases the effective size of the
  alphabet and reduces the frequency of common
  characters.

								