CS262: Computational genomics
CS 262 – Problem Set 3
(due at the beginning of class on Thursday, Feb 26)
Collaboration is allowed, but you must submit separate writeups. Please also
write the names of all your collaborators on your submissions.
1. Sequence assembly
(a) Let F be a collection of fragments. The overlap multigraph of F, denoted
OM(F), is a directed, weighted multigraph. The set V of nodes of this
structure is just F itself. A directed edge from a ∈ F to a different
fragment b ∈ F with weight t >= 0 exists if the suffix of a with t characters
is a prefix of b.
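To make the definition concrete, the following is a minimal sketch that enumerates the edges of OM(F) by brute force. The function name overlap_multigraph is hypothetical, and the sketch assumes that t ranges from 0 up to the length of the shorter fragment; the problem statement only requires t >= 0.

```python
def overlap_multigraph(fragments):
    """Return the edges (a, b, t) of the overlap multigraph of `fragments`.

    Assumption: t ranges over 0..min(len(a), len(b)). Every ordered pair
    of distinct fragments has at least the zero-weight edge, because the
    empty suffix of a is a prefix of b.
    """
    edges = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue  # edges only join different fragments
            for t in range(min(len(a), len(b)) + 1):
                # the suffix of a with t characters must be a prefix of b
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges


if __name__ == "__main__":
    for edge in overlap_multigraph(["ATAT", "TATA"]):
        print(edge)
```

On the fragments ATAT and TATA each ordered pair has three parallel edges (weights 0, 1 and 3), which is why the structure is a multigraph rather than a simple graph.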
i. Explain how directed paths in this graph give rise to a multiple alignment of
sequences belonging to this path. Also explain how a consensus sequence can
be derived, providing a common superstring of the involved sequences.
ii. Let P be a path in OM(F) that goes through every vertex (P is any
Hamiltonian path). Let S(P) be the common superstring derived from P. Let
w(P) be the weight of P. Prove that minimizing |S(P)| is equivalent to
maximizing w(P). (Note: S(P) is the sequence of the target DNA molecule to
be assembled.)
iii. A collection of fragments F is said to be substring-free if there are no
two distinct strings a and b in F such that a is a substring of b. Prove that if
S is a shortest common superstring of F, there is a Hamiltonian path P such
that S = S(P).
iv. If F is a collection of strings, prove that there is a unique substring-free
collection G equivalent to F (i.e., having the same superstring). How does this
result help you?
v. Prof. Kotovsky suggests a greedy approach to solve the sequence assembly
problem formulated as a shortest common superstring problem. He says,
“We know that looking for shortest common superstrings is the same as
looking for Hamiltonian paths of maximum weight in a directed multigraph.
To maximize the weight, we can simplify the multigraph and consider only
the heaviest edge between every pair of nodes, discarding other parallel
edges of smaller weight. To compute the heaviest path, continuously add the
heaviest available edge, which is one that does not upset the construction of
a Hamiltonian path given the previously chosen edges. Because the graph is
complete (zero-weight edges are also assumed to be present), this process
stops only when a path containing all vertices is formed.” Prove that this
greedy strategy does not always produce the best result.
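The greedy procedure described in part v, together with the superstring S(P) and weight w(P) of a path from part ii, can be sketched as follows. This is only an illustration under stated assumptions: the helper names max_overlap, greedy_hamiltonian_path and superstring are hypothetical, fragments are referred to by their index in the input list, and ties between edges of equal weight are broken by input order, which the problem statement does not specify.

```python
def max_overlap(a, b):
    """Weight of the heaviest edge from a to b (0 if only the empty overlap)."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_hamiltonian_path(fragments):
    """Repeatedly add the heaviest edge that still allows a Hamiltonian path."""
    n = len(fragments)
    # keep only the heaviest edge between every ordered pair of nodes
    edges = sorted(
        ((max_overlap(fragments[i], fragments[j]), i, j)
         for i in range(n) for j in range(n) if i != j),
        key=lambda e: -e[0],
    )
    succ, pred = {}, {}
    parent = list(range(n))  # union-find over partial paths

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for t, i, j in edges:
        if len(succ) == n - 1:
            break  # a path through all vertices has been formed
        if i in succ or j in pred:
            continue  # i already has a successor or j a predecessor
        if find(i) == find(j):
            continue  # the edge would close a cycle
        succ[i], pred[j] = j, i
        parent[find(i)] = find(j)

    start = next(i for i in range(n) if i not in pred)
    path = [start]
    while path[-1] in succ:
        path.append(succ[path[-1]])
    return path


def superstring(fragments, path):
    """Return (S(P), w(P)) using the heaviest edge between consecutive nodes."""
    s, weight = fragments[path[0]], 0
    for u, v in zip(path, path[1:]):
        t = max_overlap(fragments[u], fragments[v])
        weight += t
        s += fragments[v][t:]  # append only the non-overlapping part of v
    return s, weight


if __name__ == "__main__":
    frags = ["ACGT", "GTTA", "TACC"]
    p = greedy_hamiltonian_path(frags)
    print(p, superstring(frags, p))
```

Running the sketch on small hand-made fragment sets and comparing its output against an exhaustive search over all orderings is one way to explore how the greedy choice behaves.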
2. Chaining Local Alignments
(a) Consider the following problem:
Let S be a sequence of numbers, n1, ..., nk. Let each number have an
associated weight, w1, ..., wk. Find the heaviest increasing subsequence of S,
that is, a subsequence n_i1, ..., n_im such that i1 = 1.
i. What string will maximize the number of nodes (edges) in our suffix tree?
ii. What string will minimize the number of nodes (edges) in our suffix tree
if m >= n? Extra Credit: What string will minimize the number of nodes
(edges) in our suffix tree if m < n?
(c) A maximal pair in a string S is a pair of identical substrings α and β in S
such that the character to the immediate left (right) of α is different from
the character to the immediate left (right) of β.
Give a linear-time algorithm that takes in a string S and finds the longest
maximal pair in which the two copies do not overlap. That is, if the two
copies begin at positions p1 < p2 and are of length n’, then p1+n’ < p2. This is
exercise #40 of Chapter 7 of the Gusfield book (page 173).
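As a reference point for checking a faster solution, here is a brute-force sketch that applies the definition directly. It is not the linear-time algorithm the exercise asks for. The function name is hypothetical, positions are 0-based, and the sketch assumes that the start and end of S count as characters different from every other character.

```python
def longest_nonoverlapping_maximal_pair(s):
    """Brute force: return (length, p1, p2) of the longest maximal pair
    whose copies satisfy p1 + length < p2, or None if there is none.

    Assumption: the boundaries of s act as unique characters, so a copy
    touching the start (end) of s is always left (right) maximal.
    """
    n = len(s)
    best = None
    for p1 in range(n):
        for p2 in range(p1 + 1, n):
            # left characters must differ (p1 == 0 means the string boundary)
            if p1 > 0 and s[p1 - 1] == s[p2 - 1]:
                continue
            # extend right until the characters differ or s ends, which
            # makes the right characters of the two copies differ
            length = 0
            while p2 + length < n and s[p1 + length] == s[p2 + length]:
                length += 1
            if length == 0:
                continue
            # keep only pairs meeting the non-overlap condition as stated
            if p1 + length < p2 and (best is None or length > best[0]):
                best = (length, p1, p2)
    return best


if __name__ == "__main__":
    print(longest_nonoverlapping_maximal_pair("xabcyabcz"))
```

The sketch takes cubic time in the worst case, so it is only suitable for validating answers on short strings.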