# ppt

```
Algorithms in Bioinformatics (生物資訊相關演算法)

http://www.iis.sinica.edu.tw/~hil/

2003/10/28     Algorithms in Bioinformatics, Lecture 6   1
Today – Adding wings to a tiger (如虎添翼)

    A fundamental query that significantly
strengthens suffix trees
– Range Minima Query (RMQ)
     – First wing: RMQ for ±sequences.
     – Second wing: RMQ for general sequences.

    Intermission – 小巨’s magic show
– “Amazing!”

[Roadmap: ±RMQ → LCA; LCA → LCE and RMQ; LCE → wildcard matching and fuzzy matching; RMQ → document listing]

RMQ: Range Minima Query

    S: a sequence of numbers.
    小(S, i, j) = k if
– i ≤ k ≤ j, and
– S[k] = min(S[i], S[i+1], …, S[j]).

    Example
    index: 1 2 3 4 5 6 7 8 9
    S    = 3 4 0 1 4 1 9 3 2
– 小(S, 2, 6) = 3
– 小(S, 4, 9) = 4 (or 6).

The RMQ challenge

    Input: a sequence S of numbers
    Output: a data structure D for S
    Time complexity
– Constant query time
   Each query 小(S, i, j) for S can be answered
from D and S in O(1) time.
– Linear preprocessing time
   D can be computed in O(|S|) time.

Naïve approach

    Storing the answer of 小(S, i, j) in a table
for all index pairs i and j with 1 ≤ i ≤ j ≤ |S|.
    Query time = O(1).
    Preprocessing time = Ω(|S|^2).

Faster Preprocessing

    Assumption (without loss of generality)
– |S| = 2^k for some positive integer k.
    Idea:
– Precomputing the values of 小(S, i, j) only for
those indices i and j with j – i + 1 = 1, 2, 4,
8, …, 2^k = |S|.
    Preprocessing time
– O(|S| log |S|).


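    Let k be the (unique) integer that satisfies
2^k ≤ j – i + 1 < 2^(k+1).
    Then, 小(S, i, j) is
– x = 小(S, i, i + 2^k – 1)
or
– y = 小(S, j – 2^k + 1, j),
whichever of S[x] and S[y] is smaller.
[Figure: positions i ≤ j – 2^k + 1 ≤ i + 2^k – 1 ≤ j; the two length-2^k windows overlap and together cover S[i…j].]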

As a result

    RMQ
– Input: O(n) numbers
– Preprocessing: O(n log n) time
– Query: O(1) time
    RMQ
– Input: O(n/log n) numbers
– Preprocessing: O(n) time
– Query: O(1) time

The RMQ Challenge for ±sequences

±sequences

    S is a ±sequence if S[i] – S[i – 1] = ±1 for
each index i with 2 ≤ i ≤ |S|.
    For example,
–S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7
–     + - - - - + - + + + + - + +

– S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1
–      + - - - - - - + + + + -


The RMQ challenge for ±sequences
    Input: a ±sequence S of numbers
    Output: a data structure D for S
    Time complexity
– Constant query time
   Each query 小(S, i, j) for S can be answered from
D and S in O(1) time.
– Linear preprocessing time
   D can be computed in O(|S|) time under the
unit-cost RAM model.

Unit-Cost RAM model

    Operations such as addition, subtraction, and
comparison on O(log n) consecutive bits
can be performed in O(1) time.

Idea: compression
   Breaking S into blocks of length L = ½ log |S|.
(Any constant c < 1 in place of ½ is OK.)
– There are B = 2|S|/log |S| blocks.
   Let 縮[t] be the minimum of the t-th block of S.
– 縮[t] = min {S[j] | (t – 1)L < j ≤ tL} for t = 1, 2, …, B.
– Computable in O(|S|) time.
   RMQ on 縮: 小(縮, x, y)
– O(1) query time.
– O(|S|) preprocessing time. (Why?)


    Suppose S[i] is in the α-th block of S.
– (α–1)L < i ≤ αL.
    Suppose S[j] is in the γ-th block of S.
– (γ–1)L < j ≤ γL.
    β = 小(縮, α+1, γ–1).
    小(S, i, j) is one of
– 小(S, i, αL)
– 小(S, (β–1)L+1, βL)
– 小(S, (γ–1)L+1, j)
    Note that each of these three is a query
within a length-L block.

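Illustration
[Figure: S divided into length-L blocks; i lies in block α, j lies in block γ, and β is a block between them.]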


    It remains to show how to answer 小(S, i, j)
in O(1) time for any indices i and j such
that (t–1)L < i ≤ j ≤ tL for some positive
integer t with the help of some linear time
preprocessing.

Difference sequence

    The difference sequence 差 of S is defined
as follows: 差[i] = S[i+1] – S[i].
– Since S is a ±sequence, each 差[i] = ±1.
– 小(S, i, j) can be determined from 差[i…j].
– The number of distinct patterns of a length-L
difference sequence is exactly 2^L = |S|^½.

Preprocessing all patterns

pattern  (1,1)  …  (2,2)  (2,3)  (2,4)  (2,5)  …  (L,L)
++++       1    …    2      2      2      2    …    L
+++–       1    …    2      2      2      2    …    L
++–+       1    …    2      2      2      2    …    L
++––       1    …    2      2      2      5    …    L
+–++       1    …    2      3      3      3    …    L
+–+–       1    …    2      3      3      3    …    L
+––+       1    …    2      3      4      4    …    L
+–––       1    …    2      3      4      5    …    L
…          1    …    2      3      …      …    …    L
––––       1    …    2      3      4      5    …    L

    Each row is one ± pattern; each column is one index pair (i, j) within a block.
– #row = |S|^½
– #col = ¼ log^2 |S|
– Each entry is computable in O(log |S|) time.
– Total: o(|S|) time.
    With this table, 小(S, i, j) within a block takes O(1) time.

LCA: Lowest Common Ancestor
An application of RMQ for ±sequences

Lowest Common Ancestor

    T is a rooted tree.
    祖(x, y) is the lowest (i.e., deepest) node of
T that is an ancestor of both node x and
node y.

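For example, …
[Figure: a rooted tree with root 1; node 1 has children 2 and 7; node 2 has children 3 and 4; node 4 has children 5 and 6.]
– 祖(5,7) = 1
– 祖(3,6) = 2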

The challenge for 祖(x, y)

    Input: an n-node rooted tree T.
    Output: a data structure D for T.
    Requirement:
– D can be computed in O(n) time.
– Each query 祖(x, y) for T can be answered
from D in O(1) time.

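Idea: depth-first traversal

    index: 1234567890123
    V    = 1232454642171   (nodes in depth-first visiting order)
    L    = 1232343432121   (level of each visited node; the root has level 1)

    If V[i]=x and V[j]=y,
then 祖(x, y)=V[小(L, i, j)]
[Figure: the example tree with root 1, whose children are 2 and 7; 2 has children 3 and 4; 4 has children 5 and 6.]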

Idea: depth-first traversal

    index: 1234567890123
    V    = 1232454642171
    L    = 1232343432121

    node x: 1 2 3 4 5 6 7
    I     = 1,2,3,5,6,8,12

    O(n)-time Preprocessing
– Computing V and L
– Preprocessing L for queries 小(L, i, j).
– Precomputing an array I such that V[I[x]] = x for
each node x.

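Idea: depth-first traversal

    index: 1234567890123
    V    = 1232454642171
    L    = 1232343432121

    node x: 1 2 3 4 5 6 7
    I     = 1,2,3,5,6,8,12

    Query time is clearly O(1).
[Figure: the example tree with root 1, whose children are 2 and 7; 2 has children 3 and 4; 4 has children 5 and 6.]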

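Example

    index: 1234567890123
    V    = 1232454642171
    L    = 1232343432121

    node x: 1 2 3 4 5 6 7
    I     = 1,2,3,5,6,8,12

– 祖(5,7) = V[小(L, I[5], I[7])] = V[小(L, 6, 12)] = V[11] = 1
– 祖(3,6) = V[小(L, I[3], I[6])] = V[小(L, 3, 8)] = V[4] = 2
[Figure: the example tree with root 1, whose children are 2 and 7; 2 has children 3 and 4; 4 has children 5 and 6.]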

LCE: Longest Common Extension
An application of LCA queries 祖(i, j).

Longest Common Extension

    Suppose A and B are two strings.
    Let 延(i, j) be the largest number d + 1
such that A[i…i+d] = B[j…j+d], or 0 if A[i] ≠ B[j].
    Example
–     A=ababba
–     B=bbaabbb
–     延(1,1) = 0, 延(2,1) = 1,
–     延(2,2) = 2, 延(3,4) = 3.

The challenge for 延(i, j)

    Input: two strings A and B.
    Objective: output a data structure D for A
and B in O(|A|+|B|) time such that each
query 延(i, j) can be answered from D in
O(1) time.

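Idea: Suffix Tree for A#B$

      x is the i-th leaf, i.e., the leaf of the suffix of A#B$ that starts at A[i].
      y is the (j+|A|+1)-st leaf, i.e., the leaf of the suffix that starts at B[j].
      The string depth (path-label length) of 祖(x, y)
is exactly 延(i, j).
[Figure: the string A#B$ with an A-suffix starting at position x and a B-suffix starting at position y; 祖(x, y) is the lowest common ancestor of the two corresponding leaves.]

Wildcard Matching
An application of longest common
extension 延(i, j)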

Wildcard Matching

    Input: two strings P and S,
– where P has k wildcard characters ‘?’, each of
which can match any character of S.
    Output: all occurrences of P in S.

Naïve algorithm

    Suppose S has t distinct characters.
    Naïve algorithm:
Construct the suffix tree of S;
For each of the t^k possibilities of P do
Output the occurrences of P in S;
    Time complexity = Ω(|S| + t^k |P|).

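Wildcard Matching via
longest common extension
   Suppose j_1 < j_2 < … < j_k are the
indices such that
– P[j_1] = P[j_2] = … = P[j_k] = ‘?’.
   Here 延(a, b) compares S[a…] with P[b…].
   P matches S[i…i+|P|–1] if and
only if
–   延(i, 1) ≥ j_1 – 1;
–   延(i + j_1, j_1 + 1) ≥ j_2 – j_1 – 1;
–   延(i + j_2, j_2 + 1) ≥ j_3 – j_2 – 1;
–   …
–   延(i + j_(k–1), j_(k–1) + 1) ≥ j_k – j_(k–1) – 1; and
–   延(i + j_k, j_k + 1) ≥ |P| – j_k.
[Figure: P aligned under S starting at position i of S, with the wildcard positions j_1, j_2, …, j_k marked between 1 and |P|.]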

O(k|S|) time

    O(|P|+|S|) = O(|S|) time: preprocessing for
supporting each 延(i, j) query in O(1) time.
    O(|S|) iterations, each takes time O(k).

Fuzzy Matching
Another application of longest
common extension 延(i, j).

Fuzzy Matching

    Input: an integer k and two strings P and S
    Output: all “fuzzy occurrences” of P in S,
where each “fuzzy occurrence” allows at
most k mismatched characters.

Fuzzy occurrences

    Whether P occurs in S[i…i+|P|-1] with k
or fewer errors can be determined by…
– j = 延(i, 1); error = 0;
– while (j < |P|)
 If (++error > k) then return “no”;
 j += 1 + 延(i + j + 1, j + 2);

– return “yes”.

O(k|S|) time

    O(|P|+|S|) = O(|S|) time: preprocessing for
supporting each 延(i, j) query in O(1) time.
    O(|S|) iterations, each takes time O(k).


    Amazing


The RMQ challenge for general sequences
Another application of lowest
common ancestor

The RMQ challenge

    Input: a sequence S of numbers
    Output: a data structure D for S
    Time complexity
– Constant query time
   Each query 小(S, i, j) for S can be answered
from D and S in O(1) time.
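– Linear preprocessing time
   D can be computed in O(|S|) time.

Idea: Minima Tree

    index: 123456789
    S    = 432417363
[Figure: the minima tree of S, with nodes labelled by index: root 5 with children 3 and 7; node 3 has children 2 and 4; node 2 has left child 1; node 7 has children 6 and 9; node 9 has left child 8.]
    小(S,i,j)=祖(i,j).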

Homework 3
    Problem 1: Grow the suffix tree for “b a a b c
b b a b c”. Draw the intermediate tree with
growing point and suffix links for each step of
Ukkonen’s algorithm, as we did in the class.
(You may turn in a ppt file with animation for
this problem. )
    Problem 2: Show how to construct a minima tree
for any sequence S of numbers in O(|S|) time.
    Due
– 100%: 11:59pm, Nov 4, 2003
– 50%: 1:10pm, Nov 11, 2003

    Don’t turn in code for homework unless
you are explicitly asked to do so.
    As for extra-credit implementations, they have
to be demo-able on the Web, so plain code
does not count.

Listing source strings that
contain a pattern string
[Muthukrishnan, SODA’02]

An application of RMQ for general sequences

The problem

    Input:
– Strings S1, S2, …, Sm, which can be
preprocessed in linear time.
– A string P.
    Output:
– The index j of each Sj that contains P.

Preliminary attempts

    Obtaining the suffix tree for S1#S2#…#Sm$.
– Find all occurrences of P.
     – I.e., exact string matching for S1#S2#…#Sm$ and P.
     – Time = O(|P| + total number of occurrences of P).

    Obtaining the suffix tree for each Si.
– Determining whether P occurs in Si.
     – I.e., substring problem for each pair Si and P.
     – Time = O(|P|m).

The challenge

    Input:
– Strings S1, S2, …, Sm, which can be preprocessed in
linear time.
– A string P.
    Output:
– The index j of each Sj that contains P.
    Objective
– O(|P| + 現(P)) time, where 現(P) is the number of
output indices.

The second attempt

      Constructing the
suffix tree for
S1#S2#…#Sm$.
      Keeping the distinct
descendant leaf colors
for each internal node.
      Query time?
      Preprocessing time?

The second attempt
      Each query takes O(|P| + 現(P)) time. (Why?)

      The preprocessing may need
Ω(m |S1#S2#…#Sm$|) time.
(Why?)

      Q: Any suggestions for
resolving this problem?

Compact Representation

      Keeping the list 彩 of leaf colors from left to right.
      Each internal node keeps the indices of its leftmost
and rightmost descendant leaves.
[Figure: a tree over leaves 1 to 8; the internal nodes are labelled with their leaf ranges (1,8), (1,4), (5,8), (2,4), (6,8), (3,4), and (6,7).]

The challenge of listing
distinct colors
    Input: a sequence 彩 of colors.
    Output: a data structure D for 彩 such that
– D is computable in O(|彩|) time.
– Each query 顏(i, j) = {彩[i], …, 彩[j]} can be
answered from D in O(|顏(i, j)|) time.

An auxiliary index array

    Let 前[i] = 0 if 彩[j] ≠ 彩[i] for all j < i.
    Otherwise, let 前[i] be the largest index j with j < i
such that 彩[i] = 彩[j].

index  1   2   3   4   5   6   7   8
彩     (colored cells in the original slide; by 前, positions 1, 6, 8 share one color, positions 2, 4 a second, and positions 3, 5, 7 a third)
前     0   0   0   2   3   1   5   6

An observation

    A color c is in 顏(i, j) if and only if there is
an index k in [i, j] such that
– 彩[k] = c and 前[k] < i.

index  1   2   3   4   5   6   7   8
彩     (colored cells in the original slide)
前     0   0   0   2   3   1   5   6

The algorithm 解(i, j)
    Just recursively call 破(i, j, i);

    Subroutine 破(p, q, 左界):
–     If (p > q) then return;
–     Let k = 小(前, p, q);
–     If (前[k] ≥ 左界) then return;
–     Output 彩[k];
–     Call 破(p, k – 1, 左界);
–     Call 破(k + 1, q, 左界);


index  1   2   3   4   5   6   7   8
彩     (colored cells in the original slide)
前     0   0   0   2   3   1   5   6

Time = O(|顏(i, j)|)

    Why?


```
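The "Faster Preprocessing" slides precompute 小(S, i, j) for every window whose length is a power of two and then answer any query from two overlapping windows. A minimal Python sketch of that idea follows. It is not the lecturer's code: the names `build_sparse_table` and `range_min_index` are made up for illustration, queries use the slides' 1-based inclusive indices, and ties are broken toward the leftmost minimum (an assumption, since the slides allow either answer).

```python
from typing import List, Sequence


def build_sparse_table(s: Sequence[int]) -> List[List[int]]:
    """table[k][i] is the 0-based index of the leftmost minimum of s[i : i + 2**k]."""
    n = len(s)
    table = [list(range(n))]  # windows of length 1
    k = 1
    while (1 << k) <= n:
        half = 1 << (k - 1)
        prev = table[-1]
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if s[left] <= s[right] else right)
        table.append(row)
        k += 1
    return table  # O(n log n) entries in total


def range_min_index(s: Sequence[int], table: List[List[int]], i: int, j: int) -> int:
    """Return 小(S, i, j) for 1-based inclusive indices i <= j."""
    lo, hi = i - 1, j - 1
    k = (hi - lo + 1).bit_length() - 1  # 2**k <= length < 2**(k+1)
    x = table[k][lo]                    # window starting at i
    y = table[k][hi - (1 << k) + 1]     # window ending at j
    return (x if s[x] <= s[y] else y) + 1


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]  # the example sequence from the RMQ slide
T = build_sparse_table(S)
print(range_min_index(S, T, 2, 6))  # 3
print(range_min_index(S, T, 4, 9))  # 4
```

Building the table takes O(|S| log |S|) time, and each query reads two table entries and does one comparison.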
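The depth-first traversal slides reduce 祖(x, y) to a range-minimum query on the level sequence L. The Python sketch below rebuilds V, L, and I for the example tree and answers the two example queries. The names `euler_tour` and `lca` and the child-list dictionary are illustrative assumptions, positions are 0-based (so each I value is one less than on the slides), and the range minimum is found by a plain linear scan as a stand-in for a constant-time RMQ structure.

```python
from typing import Dict, List, Tuple


def euler_tour(children: Dict[int, List[int]], root: int) -> Tuple[List[int], List[int], Dict[int, int]]:
    """Return (V, L, I): visited nodes, their levels, and first-visit positions (0-based)."""
    V, L, I = [root], [1], {root: 0}
    stack = [(root, 1, iter(children.get(root, [])))]
    while stack:
        node, level, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:  # coming back up: the parent is visited again
                parent, parent_level, _ = stack[-1]
                V.append(parent)
                L.append(parent_level)
        else:
            I[child] = len(V)
            V.append(child)
            L.append(level + 1)
            stack.append((child, level + 1, iter(children.get(child, []))))
    return V, L, I


def lca(x: int, y: int, V: List[int], L: List[int], I: Dict[int, int]) -> int:
    """祖(x, y) = V[小(L, I[x], I[y])], with the two positions put in increasing order."""
    i, j = sorted((I[x], I[y]))
    # Stand-in for 小(L, i, j): a linear scan, NOT the O(1) structure from the lecture.
    k = min(range(i, j + 1), key=L.__getitem__)
    return V[k]


# The example tree: 1 has children 2 and 7; 2 has children 3 and 4; 4 has children 5 and 6.
children = {1: [2, 7], 2: [3, 4], 4: [5, 6]}
V, L, I = euler_tour(children, 1)
print("".join(map(str, V)))  # 1232454642171
print("".join(map(str, L)))  # 1232343432121
print(lca(5, 7, V, L, I))    # 1
print(lca(3, 6, V, L, I))    # 2
```

Since adjacent entries of L differ by exactly 1, L is a ±sequence, which is why the ±RMQ structure is enough for this step.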
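The color-listing procedure 解(i, j) relies on the array 前 and on range-minimum queries over it. A small Python sketch under stated assumptions: the letters 'a', 'b', 'c' are placeholders for the slide's colors, chosen so that 前 comes out as 0 0 0 2 3 1 5 6; `build_prev` and `list_colors` are hypothetical names; and the range minimum is a linear scan standing in for the O(1) RMQ structure. The return test compares 前[k], not k, against the left bound.

```python
from typing import List, Sequence


def build_prev(colors: Sequence[str]) -> List[int]:
    """prev[k-1] is 前[k]: the previous 1-based position with the same color, or 0."""
    last = {}
    prev = []
    for pos, color in enumerate(colors, start=1):
        prev.append(last.get(color, 0))
        last[color] = pos
    return prev


def list_colors(colors: Sequence[str], prev: Sequence[int], i: int, j: int) -> List[str]:
    """Return the distinct colors among positions i..j (1-based, inclusive)."""
    out: List[str] = []

    def solve(p: int, q: int) -> None:
        if p > q:
            return
        # Stand-in for 小(前, p, q): a linear scan, NOT the O(1) structure.
        k = min(range(p, q + 1), key=lambda t: prev[t - 1])
        if prev[k - 1] >= i:  # every color in p..q already occurs earlier within i..j
            return
        out.append(colors[k - 1])
        solve(p, k - 1)
        solve(k + 1, q)

    solve(i, j)
    return out


colors = ["a", "b", "c", "b", "c", "a", "c", "a"]  # placeholder colors
prev = build_prev(colors)
print(prev)                              # [0, 0, 0, 2, 3, 1, 5, 6]
print(list_colors(colors, prev, 4, 8))   # ['a', 'b', 'c']
```

Each call either outputs a new color or returns immediately, so with a constant-time RMQ the number of calls is proportional to the number of colors reported.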