A Fast String Matching Algorithm
The Boyer-Moore Algorithm
The obvious search algorithm
Considers each character position of str and
determines whether the successive patlen characters
of str starting there match pat.
In the worst case, the number of comparisons is on the
order of i * patlen, where i is the position of the match.
Ex. pat: aab ; str: ..aaaaac .
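To make the worst-case count concrete, here is a minimal sketch of the obvious algorithm that also counts character comparisons. The function name and the test string are illustrative assumptions, not taken from the slides.

```python
def naive_search(pat, text):
    """Try every start position; return (1-based match position or None,
    number of character comparisons performed)."""
    comparisons = 0
    patlen = len(pat)
    for start in range(len(text) - patlen + 1):
        matched = 0
        while matched < patlen:
            comparisons += 1
            if text[start + matched] != pat[matched]:
                break
            matched += 1
        if matched == patlen:
            return start + 1, comparisons
    return None, comparisons


# pat = aab against a run of a's: every alignment fails only on its last char.
position, comparisons = naive_search("aab", "aaaaaaab")
print(position, comparisons)  # 6 18, i.e. i * patlen comparisons with i = 6
```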
Linear search algorithm.
Preprocesses pat in time linear in patlen and searches
str in time linear in i + patlen.
HERE IS A SIMPLE EXAMPLE
Characteristics of Boyer-Moore
Basic idea: match the pattern against the string from the
right end of the pattern rather than from the left.
Preprocess pat to compute two tables,
delta1 and delta2, for shifting pat and the pointer into str.
Ex. pat : AT-THAT;
str : …WHICH-FINALLY-HALTS.—
Compare the last char of pat with the
patlen-th char of str:
Observation 1: if char does not occur in pat, skip
patlen chars of str.
Observation 2: if char occurs in pat, slide pat down
delta1(char) positions so that char is aligned with its
rightmost occurrence in pat.
delta1(char) = patlen if char does not occur in pat; else
patlen - j, where j is the maximum integer such that
pat(j) = char.
Observation 3a: str matches the last m chars of pat,
then comes to a mismatch at some new char. Move
strptr by delta1(char) (pat is shifted by delta1(char) - m).
Observation 3b: the final m chars of pat (a subpat) are
matched; find the rightmost plausible reoccurrence
of the subpat and align it with the matched m chars of
str (slide pat k positions, so strptr moves by delta2(j) = k + m).
The delta1 & delta2 tables
The delta1 table has as many entries as there are chars in the alphabet.
Ex. pat:    a b c d e         ; a t - t h a t
    delta1: 4 3 2 1 0, else 5 ; 1 0 4 0 2 1 0, else 7
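The delta1 definition can be sketched in a few lines. This is a minimal illustration, assuming a dictionary keyed by character with patlen as the default for characters absent from pat; the name `make_delta1` is hypothetical.

```python
def make_delta1(pat):
    """Map each char of pat to patlen - j, where j is the 1-based position of
    its rightmost occurrence. Chars not in pat are absent from the dict and
    should be looked up with a default of patlen."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        table[ch] = patlen - j  # a later (more rightward) occurrence overwrites
    return table


for pat in ("abcde", "at-that"):
    d1 = make_delta1(pat)
    print(pat, [d1[ch] for ch in pat], "else", d1.get("?", len(pat)))
```

Looking up each character of the pattern reproduces the rows of the example: 4 3 2 1 0 (else 5) and 1 0 4 0 2 1 0 (else 7).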
The delta2 table has as many entries as there are chars in pat.
Ex. pat:    a b c d e ; a  t  - t h a t
    delta2: 9 8 7 6 1 ; 11 10 9 8 7 4 1
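Ex.: we compute delta2(j) for j = 5
j:              1  2  3  4  5  6  7
pat:            e  d  b  c  a  b  c
pat:   e  d  b  c  a  b  c
pos:  -2 -1  0  1  2  3  4  5  6  7
The matched subpat b c (positions 6-7) reoccurs at positions 3-4, preceded
by d rather than a, so pat slides 3 positions and delta2(5) = 3 + 2 = 5.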
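The delta2 entries can be checked mechanically. The sketch below assumes the usual definition: rpr(j) is the greatest k such that the last patlen - j chars of pat unify with pat(k), pat(k+1), ... (positions left of pat unify with anything) and either k <= 1 or pat(k-1) differs from pat(j); then delta2(j) = patlen + 1 - rpr(j). The function names are illustrative.

```python
def rpr(pat, j):
    """Rightmost plausible reoccurrence (1-based, may be <= 0) of the terminal
    substring pat(j+1..patlen)."""
    patlen = len(pat)

    def plausible(k):
        for offset in range(patlen - j):
            pos = k + offset  # 1-based position in pat being compared
            if pos >= 1 and pat[pos - 1] != pat[j + offset]:
                return False  # positions < 1 fall off pat and unify freely
        # The char before the reoccurrence must differ from pat(j).
        return k <= 1 or pat[k - 2] != pat[j - 1]

    k = min(patlen, j + 1)  # largest k that keeps the copy inside pat
    while not plausible(k):
        k -= 1
    return k


def make_delta2(pat):
    """Return [delta2(1), ..., delta2(patlen)]."""
    patlen = len(pat)
    return [patlen + 1 - rpr(pat, j) for j in range(1, patlen + 1)]


print(rpr("edbcabc", 5), make_delta2("edbcabc")[4])  # 3 5
print(make_delta2("abcde"))    # [9, 8, 7, 6, 1]
print(make_delta2("at-that"))  # [11, 10, 9, 8, 7, 4, 1]
```

Under this definition the sixth entry for a t - t h a t is 4: the t at position 4 is preceded by -, not a, so it counts as a plausible reoccurrence. A larger shift there would skip the match in a string such as xxxat-that.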
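stringlen <- length of string; i <- patlen.
top:  if i > stringlen then return false.
      j <- patlen.
loop: if j = 0 then return i + 1.
      if string(i) = pat(j) then j <- j - 1; i <- i - 1; goto loop.
      i <- i + max(delta1(string(i)), delta2(j)).
      goto top.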
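The top/loop structure translates directly into two nested loops. The following is a minimal Python sketch, not a tuned implementation: it keeps the 1-based indices i and j of the pseudocode, returns None where the pseudocode returns false, and builds the tables with hypothetical helpers `make_delta1` and `make_delta2`. The text after HALTS. in the sample string is an assumed continuation used only for illustration.

```python
def make_delta1(pat):
    # patlen - j for the rightmost occurrence j of each char (1-based).
    return {ch: len(pat) - j for j, ch in enumerate(pat, start=1)}


def make_delta2(pat):
    # table[j] = patlen + 1 - rpr(j); index 0 is unused.
    patlen = len(pat)

    def plausible(k, j):
        for offset in range(patlen - j):
            pos = k + offset
            if pos >= 1 and pat[pos - 1] != pat[j + offset]:
                return False
        return k <= 1 or pat[k - 2] != pat[j - 1]

    table = [0] * (patlen + 1)
    for j in range(1, patlen + 1):
        k = min(patlen, j + 1)
        while not plausible(k, j):
            k -= 1
        table[j] = patlen + 1 - k
    return table


def boyer_moore(pat, string):
    """Return the 1-based position of the first occurrence of pat, or None."""
    patlen, stringlen = len(pat), len(string)
    d1, d2 = make_delta1(pat), make_delta2(pat)
    i = patlen
    while i <= stringlen:                      # top
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1                             # loop: match right to left
            i -= 1
        if j == 0:
            return i + 1
        i += max(d1.get(string[i - 1], patlen), d2[j])
    return None


print(boyer_moore("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 23
```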
Loops: fast, undo, slow
Fast: scans down string, effectively looking for the last
character in pat, skipping according to delta1.
– About 80% of the time is spent in it.
Undo: decides whether this situation arose because all
of string has been scanned or because the last
character of pat was found.
Slow: backs up checking for matches.
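One way to arrange the three loops is sketched below. It assumes a sentinel technique that the slides do not spell out: a table `delta0` equal to delta1 except that the last character of pat maps to a value `large` greater than stringlen + patlen, so the fast loop exits both when string is exhausted and when the last character is hit, and undo tells the two cases apart. The names `delta0`, `large` and the helper functions are assumptions for illustration; pat must be non-empty.

```python
def make_delta1(pat):
    return {ch: len(pat) - j for j, ch in enumerate(pat, start=1)}


def make_delta2(pat):
    patlen = len(pat)

    def plausible(k, j):
        for offset in range(patlen - j):
            pos = k + offset
            if pos >= 1 and pat[pos - 1] != pat[j + offset]:
                return False
        return k <= 1 or pat[k - 2] != pat[j - 1]

    table = [0] * (patlen + 1)  # index 0 unused
    for j in range(1, patlen + 1):
        k = min(patlen, j + 1)
        while not plausible(k, j):
            k -= 1
        table[j] = patlen + 1 - k
    return table


def search_fast_undo_slow(pat, string):
    """Return the 1-based position of the first match, or None."""
    patlen, stringlen = len(pat), len(string)
    d1, d2 = make_delta1(pat), make_delta2(pat)
    large = stringlen + patlen + 1
    delta0 = dict(d1)
    delta0[pat[-1]] = large          # sentinel for the last char of pat
    i = patlen
    while True:
        # fast: skip along string until it is exhausted or the last char hits
        while i <= stringlen:
            i += delta0.get(string[i - 1], patlen)
        # undo: a plain overrun stays at or below large
        if i <= large:
            return None
        i = (i - large) - 1          # char just left of the matched last char
        j = patlen - 1
        # slow: back up, checking the remaining chars of pat
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i + 1
        i += max(d1.get(string[i - 1], patlen), d2[j])


print(search_fast_undo_slow("AT-THAT", "xxxAT-THAT"))  # 4
```

The fast loop touches one table entry and one addition per step, which is why it suits a machine that can fetch a single character of string cheaply.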
It is easy to implement on a byte-addressable machine
– char <- string(i), etc.
Measured the cost of each search
Three strings: binary alphabet, English, random
Fig. 1: the number of references made to string.
Fig. 2: the total number of machine instructions
that actually got executed.
Performance (empirical evidence)
Knuth, Morris, and Pratt algorithm
for English text.
– Every reference to string passes about 4 characters for a
pattern of length 5.
– For sufficiently large alphabets and sufficiently long patterns,
executes fewer than 1 instruction per character passed.
– References string about 1.1 times per character passed.
– The cost per character passed can be expected to be at least 3.3 instructions.
Requires fewer CPU cycles.
Runs most efficiently on a byte-addressable machine.
Not advisable: for finding the first of several possible
substrings or for identifying a location in string
defined by a regular expression.
– The Aho-Corasick algorithm is more suitable.
Possible improvement: fetch larger bytes in the fast
loop and use a hash array to encode the delta1 table.
– Exponentially increases the effective size of the
alphabet and reduces the frequency of common