 Factor oracle definition
 Construction methods
 Suffix oracle
 Factor oracle for a set of words
 Applications in string matching

     Data Structures that Represent the
             Factors of a String

    Suffix trie – tree
    representing all the suffixes
    of the string.
                                               Figure 1. Suffix Trie of the string abbc
   Suffix automaton (DAWG)
    – the minimal automaton
    recognizing all the suffixes
    of the string.
                                           Figure 2. Suffix Automaton of the string abbc

    Both the suffix automaton
    and the factor oracle can
    be obtained from the suffix
    trie.
                                           Figure 3. Factor Oracle for the string abbc

    Factor Oracle – Basic Ideas
 The factor oracle is a data structure used
  for indexing all the factors of a given word.
 An automaton built on a string p that acts
  like an oracle on the factors of the string.
  If a string is accepted by the automaton, it
  may be a factor of p (weak factor recognition).
 All the correct factors are accepted.

         Factor Oracle – Example

                       Figure 4. Factor oracle for abbbaab
          ba is a factor of baab so a transition from 2 to 5 by a is added

   Factor oracle for the string abbbaab.
   All states are considered final.
   The word abba is accepted although it is not a
    factor of abbbaab.
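
The weak recognition in this example can be checked with a few lines of Python. This is only an illustrative sketch: the transition table is transcribed by hand from the construction steps given later in these slides, and the names `ORACLE_ABBBAAB` and `accepts` are hypothetical, not part of any library.

```python
# Transitions of the factor oracle for "abbbaab", transcribed by hand
# from the construction steps in these slides (an assumption of this
# sketch): state -> {letter: target state}.
ORACLE_ABBBAAB = {
    0: {"a": 1, "b": 2},
    1: {"b": 2, "a": 6},
    2: {"b": 3, "a": 5},
    3: {"b": 4, "a": 5},
    4: {"a": 5},
    5: {"a": 6},
    6: {"b": 7},
    7: {},
}


def accepts(transitions, word):
    """Return True if `word` can be read from state 0 (all states are final)."""
    state = 0
    for letter in word:
        if letter not in transitions[state]:
            return False
        state = transitions[state][letter]
    return True


print(accepts(ORACLE_ABBBAAB, "abba"))  # True: accepted by the oracle
print("abba" in "abbbaab")              # False: not a real factor
```

A rejected word, on the other hand, is certainly not a factor, which is the property the matching algorithm relies on.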

Factor Oracle – Formal Definition

         Figure 5. High level construction algorithm of Oracle(p). The algorithm
                            has a quadratic time complexity.
Definition 1. The factor oracle of a string $p = p_1 p_2 \dots p_m$
is the automaton built by the algorithm Build_Oracle,
   where all the states are terminal.
      Factor Oracle – Properties

1. Acyclic homogeneous deterministic automaton (homogeneous: all
   transitions entering a given state carry the same label).
2. Recognizes at least the factors of p, the string
   that it was built for.
3. Has the fewest states possible (for a string p of
   length m there are precisely m+1 states).
4. Has a linear number of transitions (the total
   number ranges between m and 2m-1).

      Factor Oracle – Construction

    In the sequential construction the letters of the word
    are read from left to right and the automaton is
    updated at each step.

    We denote by $\mathrm{repet}_p(i)$ the longest suffix of
     $\mathrm{pref}_p(i) = p_1 p_2 \dots p_i$ that appears at least twice in it.

    We define $S_p$, a function on the states of the automaton
    called the supply function, that maps each state $i$ of
    Oracle($p$) to the state $j$ where the reading of
     $\mathrm{repet}_p(i)$ ends.
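
The definition of the longest repeated suffix can be made concrete with a naive computation. The sketch below is a direct, quadratic-or-worse reading of the definition, not the way the oracle computes it; the function name `repet` simply mirrors the notation above, and occurrences are allowed to overlap (an assumption consistent with the supply values listed in the construction example).

```python
def repet(p: str, i: int) -> str:
    """Longest suffix of p[:i] that occurs at least twice in p[:i].

    Occurrences may overlap. Returns the empty string when no
    non-empty suffix is repeated.
    """
    prefix = p[:i]
    # Try candidate suffixes from longest to shortest.
    for length in range(i - 1, 0, -1):
        start = i - length
        suffix = prefix[start:]
        # The suffix always occurs at `start`; an earlier first
        # occurrence means it appears at least twice.
        if prefix.find(suffix) < start:
            return suffix
    return ""


print(repet("abbbaab", 4))  # bb
print(repet("abbbaab", 7))  # ab
```

Reading `bb` from state 0 of the oracle for abbbaab ends in state 3, and reading `ab` ends in state 2, which agrees with the supply values of states 4 and 7.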

      Factor Oracle – Construction Algorithm

Build_Oracle_Sequential($p = p_1 p_2 \dots p_m$)
    create initial state 0, set $S_p(0) \leftarrow -1$
    for $i = 1$ to $m$ do
        create new state $i$
        add a new transition from $i-1$ to $i$ by $p_i$
        set $k \leftarrow S_p(i-1)$
        while $k > -1$ and there is no transition from $k$ by $p_i$ do
            add a new transition from $k$ to $i$ by $p_i$
            set $k \leftarrow S_p(k)$
        endwhile
        if $k = -1$ then set $S_p(i) \leftarrow 0$
        else set $S_p(i) \leftarrow \delta(k, p_i)$
    endfor
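
The sequential construction translates almost line by line into Python. The following is a minimal sketch under a few assumptions of its own: states are the integers 0..m, transitions are stored as one dictionary per state, and the names `build_oracle`, `transitions` and `supply` are illustrative rather than taken from the source.

```python
def build_oracle(p: str):
    """Build the factor oracle of p with the sequential construction.

    Returns (transitions, supply) where transitions[q] maps a letter
    to a target state and supply[q] is the supply function value.
    """
    m = len(p)
    transitions = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)          # supply[0] = -1 by convention
    for i in range(1, m + 1):
        letter = p[i - 1]            # p_i in the 1-based notation
        transitions[i - 1][letter] = i
        k = supply[i - 1]
        # Walk down the supply path, adding the missing transitions.
        while k > -1 and letter not in transitions[k]:
            transitions[k][letter] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][letter]
    return transitions, supply


transitions, supply = build_oracle("abbbaab")
print(supply)  # [-1, 0, 0, 2, 3, 1, 1, 2]
```

Each iteration adds exactly one state, so a string of length m yields m+1 states; for abbbaab the sketch produces 11 transitions, within the stated range of m to 2m-1.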

   Construction of the Factor Oracle for the string abbbaab

    $S_p(0) = -1$, $S_p(1) = 0$
    $S_p(2) = 0$: add a new transition from 0 to 2 by b
    $S_p(3) = 2$, $S_p(4) = 3$: no new transition is needed
    $S_p(5) = 1$: add new transitions from 3 and 2 to 5 by a
    $S_p(6) = 1$: add a new transition from 1 to 6 by a
    $S_p(7) = 2$: no new transition is needed

          Suffix Oracle - Definition

    We mark some states in the factor oracle for the string
    p as final in order to recognize suffixes of p. The resulting
    structure is called the suffix oracle.

   A state q of the suffix oracle is terminal if and only if
    there is a path labeled by a suffix of p from the initial
    state leading to q.

    Terminal states are determined by following the
    supply function from state $m$ of Oracle($p = p_1 p_2 \dots p_m$).
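
Marking the terminal states amounts to one walk along the supply path. The sketch below repeats a small oracle builder so that it is self-contained; the names `suffix_oracle` and `accepts_as_suffix` are hypothetical, and including state 0 among the terminal states (for the empty suffix) is an assumption of this sketch.

```python
def build_oracle(p: str):
    """Sequential factor oracle construction (transitions, supply)."""
    m = len(p)
    transitions = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        letter = p[i - 1]
        transitions[i - 1][letter] = i
        k = supply[i - 1]
        while k > -1 and letter not in transitions[k]:
            transitions[k][letter] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][letter]
    return transitions, supply


def suffix_oracle(p: str):
    """Return (transitions, terminals) for the suffix oracle of p."""
    transitions, supply = build_oracle(p)
    terminals = set()
    state = len(p)
    # Follow the supply function from state m down to state 0.
    while state != -1:
        terminals.add(state)
        state = supply[state]
    return transitions, terminals


def accepts_as_suffix(transitions, terminals, word):
    """True if reading `word` from state 0 ends in a terminal state."""
    state = 0
    for letter in word:
        if letter not in transitions[state]:
            return False
        state = transitions[state][letter]
    return state in terminals


transitions, terminals = suffix_oracle("abbbaab")
print(sorted(terminals))  # [0, 2, 7]
```

As with factors, acceptance is one-sided: every suffix of p ends in a terminal state, but the structure is not guaranteed to reject every non-suffix.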

         Suffix Oracle – Example

               Figure 6. Suffix oracle for the string abbbaab.
                    Double circled states are terminal.

   The suffix oracle is a little more complicated to
    implement than the factor oracle. Also, it
    requires more memory space.
    Factor Oracle for a Set of Words
   The factor oracle can be extended for a set of
    words so that it contains at least all the factors of
    the words from the set.
    We fix an order on the words of the set so that
    the resulting oracle is uniquely defined.
    The oracle is built on a trie of all the words, which
    is updated similarly to the factor oracle for one word.
   The supply function maps each state i of the
    oracle to the state j where the reading of the
    longest repeated suffix that appears in one of
    the words ends.


Figure 7. Trie for the set {abbba, baaa}              Figure 8. Intermediate phase in the
                                                     construction of the factor oracle for the
                                                                set {abbba, baaa}

                 Figure 9. Factor oracle for the set {abbba, baaa}

Backward Oracle Matching Algorithm
    Version of the BDM (Backward DAWG Matching) algorithm
    using the factor oracle instead of the suffix automaton.

    Fast in practice for very long patterns and small
    alphabets.

    Preprocessing phase linear in time and space.

    Optimal on average (conjectured).
             BOM – Main Idea

  The search uses the oracle of the reversed pattern: the window is
  read from right to left, and the scan stops when the word read is no
  longer recognized by the oracle (which shows it is certainly not a
  factor of the reversed pattern).

  The search window is then shifted beyond the point where the scan
  failed (safe shift).
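
The main idea can be sketched in Python as follows. This is a simplified illustration of the basic BOM search, assuming the sequential oracle construction from earlier in these slides; the names `build_oracle` and `bom_search` are hypothetical, and after a full match the window is shifted by one position only, which is safe but not the longest possible shift.

```python
def build_oracle(p: str):
    """Sequential factor oracle construction (transitions, supply)."""
    m = len(p)
    transitions = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        letter = p[i - 1]
        transitions[i - 1][letter] = i
        k = supply[i - 1]
        while k > -1 and letter not in transitions[k]:
            transitions[k][letter] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][letter]
    return transitions, supply


def bom_search(pattern: str, text: str):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    transitions, _ = build_oracle(pattern[::-1])  # oracle of reversed pattern
    matches = []
    pos = 0
    while pos <= n - m:
        state = 0
        j = m - 1
        # Read the window from right to left until the oracle fails.
        while j >= 0:
            nxt = transitions[state].get(text[pos + j])
            if nxt is None:
                break
            state = nxt
            j -= 1
        if j < 0:
            # m letters were read: the only word of length m that the
            # oracle accepts is the reversed pattern itself.
            matches.append(pos)
            pos += 1
        else:
            # Safe shift: restart just past the failing character.
            pos += j + 1
    return matches


print(bom_search("abbbaab", "xxabbbaabyyabbbaab"))  # [2, 11]
```

The shift after a failure is safe because the rejected word is certainly not a factor of the pattern, so no occurrence can cover the failing character together with the part of the window already read.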

                 BOM – Facts
    The suffix oracle of the reversed pattern can be
    used instead of the factor oracle. The shifts are
    longer, but more operations are needed.
   Worst case complexity of BOM is O(mn), where
    m is the length of the pattern, and n the total
    length of the text.
    Because the factor oracle accepts some words
    that are not really factors of the pattern, in some
    cases the total number of inspections is greater
    than in BDM.
    TurboBOM combines BOM with KMP to obtain
    an algorithm linear in the worst case.

     Factor Oracle – Applications

    Finding the repeats in a string

    Data compression

    Bioinformatics

    Machine improvisation

    Factor Oracle – Open Problems

      Figure 10. The factor oracle for the string abbb accepts exactly all the
                               factors of the string.

    What is the automaton-independent
    characterization of the language recognized
    by the oracle?


       Figure 11. The factor oracle for the string abcacdace has 8 extra transitions

               Figure 12. A similar automaton with 7 extra transitions

    The factor oracle is not the minimal
    homogeneous automaton which recognizes at
    least the factors of the string.
