Factor Oracle Suffix Oracle by lqh68203


      Factor Oracle
      Suffix Oracle



Factor Oracle, Suffix Oracle   1
                 Outline
 Factor oracle definition
 Construction methods
 Suffix oracle
 Factor oracle for a set of words
 Applications in string matching




     Data Structures that Represent the
             Factors of a String

   Suffix trie – a tree representing all the suffixes
    of the string.
   Suffix automaton (DAWG) – the minimal automaton
    recognizing all the suffixes of the string.
   Both the suffix automaton and the factor oracle can
    be obtained from the suffix trie.

   Figure 1. Suffix Trie of the string abbc
   Figure 2. Suffix Automaton of the string abbc
   Figure 3. Factor Oracle for the string abbc


    Factor Oracle – Basic Ideas
 The factor oracle is a data structure used
  for indexing all the factors of a given word.
 An automaton built on a string p that acts
  like an oracle on the factors of the string.
 If a string is accepted by the automaton it
  may be a factor of p – weak factor
  recognition.
 All the correct factors are accepted.


         Factor Oracle – Example



                       Figure 4. Factor oracle for abbbaab
          ba is a factor of baab so a transition from 2 to 5 by a is added

   Factor oracle for the string abbbaab.
   All states are considered final.
   The word abba is accepted although it is not a
    factor of abbbaab.
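
The weak recognition in this example can be checked with a minimal sketch. The transition table below is an assumption: it is transcribed by hand from the construction steps for abbbaab given in this deck, not taken from Figure 4 itself, and the names ORACLE_ABBBAAB and accepts are illustrative only.

```python
# Transitions of the factor oracle for "abbbaab" (states 0..7).
# Hand-transcribed from the construction steps; every state is final.
ORACLE_ABBBAAB = {
    0: {"a": 1, "b": 2},
    1: {"b": 2, "a": 6},
    2: {"b": 3, "a": 5},
    3: {"b": 4, "a": 5},
    4: {"a": 5},
    5: {"a": 6},
    6: {"b": 7},
}


def accepts(transitions, word):
    """Return True if `word` can be read from state 0 (all states are final)."""
    state = 0
    for letter in word:
        state = transitions.get(state, {}).get(letter)
        if state is None:
            return False
    return True


text = "abbbaab"
for word in ("bbaa", "abba", "aaa"):
    # Columns: word, accepted by the oracle, really a factor of the text.
    print(word, accepts(ORACLE_ABBBAAB, word), word in text)
# bbaa True True
# abba True False
# aaa False False
```

The second row is the false positive mentioned above: abba is read along 0, 1, 2, 3, 5 even though it never occurs in abbbaab.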

Factor Oracle – Formal Definition




         Figure 5. High level construction algorithm of Oracle(p). The algorithm
                            has a quadratic time complexity.
Definition 1. The factor oracle of a string $p = p_1 p_2 \cdots p_m$
is the automaton built by the algorithm Build_Oracle,
   where all the states are terminal.

      Factor Oracle – Properties

1. Acyclic homogeneous deterministic automaton (homogeneous: all
   transitions entering a state carry the same letter).
2. Recognizes at least the factors of p, the string
   that it was built for.
3. Has the fewest states possible (for a string p of
   length m there are precisely m+1 states).
4. Has a linear number of transitions (the total
   number ranges between m and 2m-1).



      Factor Oracle – Construction

   In the sequential construction the letters of the word
    are read from left to right and the automaton is
    updated at each step.

   We denote by $repet_p(i)$ the longest suffix of
    $pref_p(i) = p_1 p_2 \cdots p_i$ that appears at least twice in it.

   We define a function $S_p$ on the states of the automaton,
    called the supply function, that maps each state i of
    Oracle(p) to the state j where the reading of
    $repet_p(i)$ ends.
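
The definition of the longest repeated suffix can be illustrated by a brute-force sketch. This is not how the oracle computes it (the sequential construction obtains the supply function in linear time); the function name repet and the quadratic search are assumptions made only for this example.

```python
def repet(p, i):
    """Longest suffix of p[:i] that occurs at least twice in p[:i].

    Occurrences may overlap. Returns the empty string if no non-empty
    suffix is repeated.
    """
    prefix = p[:i]
    for length in range(i - 1, 0, -1):  # try the longest candidates first
        suffix = prefix[i - length:]
        # str.find gives the first occurrence; if it starts before the
        # suffix position itself, the suffix occurs at least twice.
        if prefix.find(suffix) < i - length:
            return suffix
    return ""


p = "abbbaab"
print([repet(p, i) for i in range(1, len(p) + 1)])
# ['', '', 'b', 'bb', 'a', 'a', 'ab']
```

Reading each of these words from state 0 of the oracle for abbbaab ends in states 0, 0, 2, 3, 1, 1, 2, which are the supply values listed in the construction example of this deck.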

      Factor Oracle – Construction Algorithm

Build_Oracle_Sequential($p = p_1 p_2 \cdots p_m$)
    create initial state 0, set $S_p(0) \leftarrow -1$
    for i = 1 to m do
        create new state i
        add a new transition from i-1 to i by $p_i$
        set $k \leftarrow S_p(i-1)$
        while $k > -1$ and there is no transition from k by $p_i$ do
            add new transition from k to i by $p_i$
            set $k \leftarrow S_p(k)$
        endwhile
        if k = -1 then set $S_p(i) \leftarrow 0$
        else set $S_p(i) \leftarrow \delta(k, p_i)$
    endfor
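
The pseudocode above can be transcribed into Python as a minimal sketch. The data layout is an assumption chosen for brevity: transitions is a list of per-state dictionaries and supply is a list holding $S_p$; neither name comes from the original slides.

```python
def build_oracle(p):
    """Sequential construction of the factor oracle of the string p.

    Returns (transitions, supply): transitions[k] maps a letter to the
    target state, supply[k] is S_p(k), with supply[0] == -1.
    """
    m = len(p)
    transitions = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        letter = p[i - 1]
        transitions[i - 1][letter] = i        # transition from i-1 to i
        k = supply[i - 1]
        # Walk down the supply links, adding the missing transitions.
        while k > -1 and letter not in transitions[k]:
            transitions[k][letter] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][letter]
    return transitions, supply


transitions, supply = build_oracle("abbbaab")
print(supply)                             # [-1, 0, 0, 2, 3, 1, 1, 2]
print(sum(len(t) for t in transitions))   # 11
```

For m = 7 the sketch yields 8 states and 11 transitions, which lies within the stated range of m to 2m-1.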



   Construction of the Factor Oracle for the string abbbaab




   $S_p(0) = -1$, $S_p(1) = 0$.
   $S_p(2) = 0$. Add a new transition from 0 to 2 by b.
   $S_p(3) = 2$, $S_p(4) = 3$. No new transition is needed.



         Construction of the Factor Oracle for the string abbbaab




   $S_p(5) = 1$. Add new transitions from 3 and 2 to 5 by a.
   $S_p(6) = 1$. Add new transition from 1 to 6 by a.
   $S_p(7) = 2$. No new transition is needed.

          Suffix Oracle - Definition

   We mark some states in the factor oracle for the string
    p as final in order to recognize suffixes of p. The new
    structure is called suffix oracle.

   A state q of the suffix oracle is terminal if and only if
    there is a path labeled by a suffix of p from the initial
    state leading to q.

   Terminal states are determined by following the
    supply function from state m of Oracle($p = p_1 p_2 \cdots p_m$).
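
Following the supply function from state m can be sketched as below. The supply values are the ones listed for abbbaab in this deck; the function name terminal_states is hypothetical, and treating state 0 as terminal (it corresponds to the empty suffix) is an assumption of this sketch.

```python
def terminal_states(supply):
    """Collect the states reached from state m by iterating the supply function."""
    state = len(supply) - 1      # state m
    terminals = []
    while state > -1:            # supply[0] == -1 ends the walk
        terminals.append(state)
        state = supply[state]
    return terminals


# S_p(0..7) for p = abbbaab, as given in the construction example.
SUPPLY_ABBBAAB = [-1, 0, 0, 2, 3, 1, 1, 2]
print(terminal_states(SUPPLY_ABBBAAB))   # [7, 2, 0]
```

State 7 is reached by the whole word and state 2 by the suffixes b and ab, which agrees with the definition of a terminal state given above.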


         Suffix Oracle – Example




               Figure 6. Suffix oracle for the string abbbaab.
                    Double circled states are terminal.


   The suffix oracle is a little more complicated to
    implement than the factor oracle. Also, it
    requires more memory space.

    Factor Oracle for a Set of Words
   The factor oracle can be extended for a set of
    words so that it contains at least all the factors of
    the words from the set.
   We fix an order on the words of the set so that
    the oracle is uniquely defined.
   The oracle is built on a trie of all the words which
    is updated similarly to the factor oracle for one
    word.
   The supply function maps each state i of the
    oracle to the state j where the reading of the
    longest repeated suffix that appears in one of
    the words ends.

 Factor Oracle for a Set of Words
            Example



Figure 7. Trie for the set {abbba, baaa}              Figure 8. Intermediate phase in the
                                                     construction of the factor oracle for the
                                                                set {abbba, baaa}




                 Figure 9. Factor oracle for the set {abbba, baaa}


Backward Oracle Matching Algorithm
   Version of the BDM algorithm using the factor
    oracle instead of the suffix automaton.

    Fast in practice for very long patterns and small
    alphabets.

   Preprocessing phase linear in time and space
    complexity.

   Optimal on average (conjecture).

             BOM – Main Idea




 The search uses the oracle of the reversed pattern: the window is read
 from right to left, and the search stops when the word read is no longer
 recognized by the oracle (which shows it is certainly not a factor of the
 reversed pattern).

 The search window is shifted beyond the point the search failed (safe shift).
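
The main idea can be put together in a short sketch of the search loop. It is a simplified illustration under stated assumptions, not the authors' implementation: the names build_oracle and bom_search are hypothetical, the window is shifted by one position after a full read, and an explicit comparison is kept as a defensive check before reporting a match.

```python
def build_oracle(p):
    """Factor oracle of p via the sequential construction (transitions only)."""
    m = len(p)
    transitions = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        letter = p[i - 1]
        transitions[i - 1][letter] = i
        k = supply[i - 1]
        while k > -1 and letter not in transitions[k]:
            transitions[k][letter] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][letter]
    return transitions


def bom_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    transitions = build_oracle(pattern[::-1])   # oracle of the reversed pattern
    matches = []
    j = 0                                       # left end of the window
    while j <= n - m:
        state, i = 0, m - 1
        while i >= 0:                           # read the window right to left
            nxt = transitions[state].get(text[j + i])
            if nxt is None:                     # not a factor: stop reading
                break
            state = nxt
            i -= 1
        if i < 0:
            if text[j:j + m] == pattern:        # defensive check
                matches.append(j)
            j += 1
        else:
            j += i + 1                          # shift past the failing letter
    return matches


print(bom_search("abbbaab", "xxabbbaabbbaabyy"))   # [2, 7]
```

The shift i + 1 is safe because the word text[j+i .. j+m-1] was rejected, so it is not a factor of the pattern, and every window starting at or before position j+i would have to contain it.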

                 BOM – Facts
   The suffix oracle of the reversed pattern can be
    used instead of the factor oracle. The shifts are
    longer, but more operations are needed.
   The worst-case complexity of BOM is O(mn), where
    m is the length of the pattern and n the total
    length of the text.
   Because the factor oracle accepts some words
    that are not really factors of the pattern, the
    total number of inspections is in some cases
    greater than in BDM.
   TurboBOM combines BOM with KMP to obtain
    an algorithm that is linear in the worst case.

     Factor Oracle – Applications


   Finding the repeats in a string

   Data compression

   Bioinformatics

   Machine improvisation

    Factor Oracle – Open Problems




      Figure 10. The factor oracle for the string abbb accepts exactly all the
                               factors of the string.


   What is an automaton-independent
    characterization of the language recognized
    by the oracle?

    Factor Oracle – Open Problems



       Figure 11. The factor oracle for the string abcacdace has 8 extra transitions




               Figure 12. A similar automaton with 7 extra transitions

   The factor oracle is not minimal, in number of
    transitions, among the homogeneous automata that
    recognize at least the factors of the string.

References
1.   Cyril Allauzen, Maxime Crochemore, Mathieu Raffinot. Efficient
     Experimental String Matching by Weak Factor Recognition. In
     Proceedings of the 12th Conference on Combinatorial Pattern Matching, 2001.

2.   Cyril Allauzen, Mathieu Raffinot. Oracle des facteurs d'un ensemble de
     mots. Technical report 99-11, Institut Gaspard Monge, Université de
     Marne-la-Vallée, 1999.

3.   Loek Cleophas, Gerard Zwaan, Bruce Watson. Constructing Factor
     Oracles. In Proceedings of the Prague Stringology Conference 2003,
     2003.

4.   Arnaud Lefebvre, Thierry Lecroq. Computing repeated factors with a
     factor oracle. In Proceedings of the 11th Australian Workshop on
     Combinatorial Algorithms, 2000.

5.   G. Assayag, S. Dubnov. Using Factor Oracles for Machine
     Improvisation. Soft Computing, 2004.


