5     SUPPLEMENT
5.1     The Minimal Mosaic Partition algorithm snips closest to the root.
We prove here the statement of section 2.3 that the partition constructed by the MinimalMosaic algorithm has the desirable additional property that its highest snip is closest to the root among all minimal mosaic partitions of the h-tree. In passing we also prove that the algorithm indeed finds a minimal mosaic partition.
Denote by T(v) the subtree rooted at v, by N(v) the minimal number of snips needed to create a mosaic partition of T(v), and let A(v) be the set of annotations such that a ∈ A(v) if and only if there is a minimal mosaic partition of T(v) in which the component of v has annotation a.
Theorem. Let left and right be the two children of the vertex v. Then N(left) + N(right) ≤ N(v) ≤ N(left) + N(right) + 1. Furthermore,
1. N(v) = N(left) + N(right) iff A(left) ∩ A(right) ≠ ∅.
   In this case A(v) = A(left) ∩ A(right).
2. N(v) = N(left) + N(right) + 1 iff A(left) ∩ A(right) = ∅.
   In this case A(v) = A(left) ∪ A(right).
Proof. Observe that a minimal mosaic partition of T(v) induces (not necessarily minimal) mosaic partitions of T(left) and T(right), implying N(left) + N(right) ≤ N(v). If there is equality, the two induced mosaic partitions are in fact minimal, and the annotations assigned to left and right in these partitions are both equal to the annotation assigned to v. Thus A(v) ⊆ A(left) ∩ A(right). Conversely, for any a in A(left) ∩ A(right), a minimal mosaic partition of T(v) in which v is assigned annotation a can be constructed by adjoining the minimal mosaic partitions of T(left) and T(right) in which the vertices left and right are assigned annotation a. This proves that N(v) = N(left) + N(right) and A(v) = A(left) ∩ A(right).
If, on the other hand, A(left) ∩ A(right) = ∅, so that N(v) = N(left) + N(right) + 1, then a minimal mosaic partition of T(v) can be constructed from minimal mosaic partitions of T(left) and T(right) by giving v the annotation of either left or right and snipping the edge to the other child. Thus A(v) = A(left) ∪ A(right).
Corollary. The mosaic partition found by the MinimalMosaic algorithm has the property that the induced mosaic partition of any subtree T(v) is minimal for that subtree: minSnips(v) = N(v). Moreover, the partition snips an edge closest to the root, among all minimal mosaic partitions.
Proof. That minSnips(v) = N(v) is easily proven by induction. For the second statement, let us inspect the vertices of the tree by levels, down from the root. If at a vertex v case 1 of the theorem applies, no minimal mosaic partition of T(v) can introduce a snip between v and its immediate children, for otherwise the total number of snips of the partition would exceed the minimum. Consequently, as long as the MinimalMosaic algorithm does not snip an edge, no minimal mosaic partition can have a snip there.
Supplementary figure 6 presents an example of a minimal mosaic partition that induces a non-minimal partition on one of its subtrees and whose highest snip is lower than the one found by the MinimalMosaic algorithm.

[Sup. Fig. 6. Two minimal mosaic partitions. A) A minimal mosaic partition whose highest snip is not as high as possible. B) A minimal mosaic partition whose highest snip is as high as possible, as found by the MinimalMosaic algorithm.]
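The theorem translates directly into a bottom-up recursion over the tree. The following is a minimal sketch in Python (the Node encoding and function name are our own illustration, not the authors' implementation); it computes N(v) and A(v) exactly as the theorem prescribes, leaving the recovery of a concrete partition to a subsequent top-down pass that fixes one annotation per component.

```python
class Node:
    """Binary h-tree node: a leaf carries an annotation; an internal node has two children."""
    def __init__(self, annotation=None, left=None, right=None):
        self.annotation = annotation
        self.left = left
        self.right = right

def mosaic_stats(v):
    """Return (N(v), A(v)) for the subtree T(v), per the theorem."""
    if v.left is None and v.right is None:
        return 0, {v.annotation}          # leaf: zero snips, its own annotation
    n_l, a_l = mosaic_stats(v.left)
    n_r, a_r = mosaic_stats(v.right)
    common = a_l & a_r
    if common:                            # case 1: children can share v's annotation
        return n_l + n_r, common
    return n_l + n_r + 1, a_l | a_r       # case 2: one extra snip below v
```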
5.2     Snipping to minimize misclassification – Proof of correctness.
The minimum misclassification algorithm is based on dynamic programming. Its correctness follows from the following easily verified optimal substructure. Let P be a partition of T(v) with the minimum number of misclassified leaves when node v must be assigned the label l and it is permitted to snip k edges (creating a (k+1)-partition). Let left and right be the two children of v. Then the partitions of T(left) and T(right) induced by P are optimal for the corresponding induced parameters.

5.3     Snipping to minimize misclassification – Time complexity
In order to calculate minMis(v,l,k), four cases have to be considered (see figure 3 for the recurrence formula). In each case there are up to k+1 possibilities to apportion the k snips between the two subtrees. Consequently the time complexity for calculating minMis(v,l,k), and minNum(v,k) = min over all labels l of minMis(v,l,k), is O(k). The algorithm computes minMis(v,l,k) for each node, each label, and each 0 ≤ k < K; hence the time complexity of the algorithm is O(nLK²), where n is the number of genes (tree leaves), L is the total number of possible labels, and K is the requested number of snips.
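The recurrence itself appears in figure 3 of the main text. To illustrate where the O(k) factor comes from, here is a hedged reconstruction in Python of one plausible form of the dynamic program; the four cases (no snip below v, snip the left edge, snip the right edge, snip both) and the exact snip-budget accounting are inferred from the traceback rules of section 5.4, and the tuple encoding of the tree is our own assumption.

```python
import math
from functools import lru_cache

def make_tables(labels):
    """Hedged reconstruction of the minMis/minNum tables (not the authors'
    figure-3 recurrence verbatim).  A leaf is ('leaf', observed_label); an
    internal node is ('node', left_subtree, right_subtree)."""

    @lru_cache(maxsize=None)
    def minMis(v, l, k):
        """Minimum misclassified leaves in T(v), with v's component labeled l
        and exactly k snips spent inside T(v)."""
        if v[0] == 'leaf':
            if k > 0:
                return math.inf               # a single leaf has no edges to snip
            return 0 if v[1] == l else 1      # misclassified iff labels differ
        _, left, right = v
        best = math.inf
        for r in range(k + 1):                # case 1: no snip below v
            best = min(best, minMis(left, l, r) + minMis(right, l, k - r))
        for r in range(k):                    # case 2: snip the (v,left) edge (one snip)
            best = min(best, minNum(left, r) + minMis(right, l, k - 1 - r))
        for r in range(k):                    # case 3: snip the (v,right) edge
            best = min(best, minNum(right, r) + minMis(left, l, k - 1 - r))
        for r in range(k - 1):                # case 4: snip both child edges (two snips)
            best = min(best, minNum(left, r) + minNum(right, k - 2 - r))
        return best

    @lru_cache(maxsize=None)
    def minNum(v, k):
        """Best over all labels for the component containing the root of T(v)."""
        return min(minMis(v, l, k) for l in labels)

    return minMis, minNum
```

Each of the four loops scans at most k+1 apportionments, giving the O(k) cost per (v,l,k) entry and the O(nLK²) total stated above.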
5.4     Snipping to minimize misclassification – Traceback
Once minNum(root,K-1) is computed, the appropriate snips can be found by a traceback, from the root of the tree down to the leaves. Let left and right be the two children of root, and let l* be a label such that minMis(root,l*,K-1) = minNum(root,K-1). Then the following cases are considered:
1. Case 1: minMis(root,l*,K-1) = minMis(left,l*,r) + minMis(right,l*,K-1-r) for some 0 ≤ r ≤ K-1.
   In this case there are no snips between root and either of its children. The traceback continues recursively from both minMis(left,l*,r) and minMis(right,l*,K-1-r).
2. Case 2: minMis(root,l*,K-1) = minNum(left,r) + minMis(right,l*,K-2-r) for some 0 ≤ r ≤ K-2.
   In this case there is a snip between root and left. The traceback searches for a label l' such that minMis(left,l',r) = minNum(left,r) and continues recursively from both minMis(left,l',r) and minMis(right,l*,K-2-r).
3. Case 3: minMis(root,l*,K-1) = minNum(right,r) + minMis(left,l*,K-2-r) for some 0 ≤ r ≤ K-2.
   In this case there is a snip between root and right. The traceback searches for a label l' such that minMis(right,l',r) = minNum(right,r) and continues recursively from both minMis(right,l',r) and minMis(left,l*,K-2-r).
Note that in each of the cases there may be more than one value of r that yields an optimal solution. We chose the partition in which the numbers of snips allotted to T(left) and T(right) are as close to equal as possible. In other words, in the traceback of minMis(v,l,k), r = k/2 is the first index to be examined, while r = 0 and r = k are the last.
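This middle-out examination order can be made concrete with a small helper (our own illustration; the authors do not specify an implementation):

```python
def split_order(k):
    """Yield the candidate apportionments r = 0..k in the order the traceback
    examines them: the most balanced split first, r = 0 and r = k last."""
    mid = k // 2
    yield mid
    for offset in range(1, k + 1):
        if mid + offset <= k:
            yield mid + offset
        if mid - offset >= 0:
            yield mid - offset

# For example, list(split_order(5)) == [2, 3, 1, 4, 0, 5].
```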
5.5     Simulation study
The purpose of the simulation study was to evaluate the success of
the algorithm, as measured by its predictive ability, in settings of
variable difficulty determined by various “noise” parameters. To
this end we constructed datasets of points in a d-dimensional
space. Each dataset consisted of g sets of points, each set generated
by a different independent Gaussian, as defined below:
$$p(x) = \sum_{i=1}^{g} p(i)\, p(x \mid i), \qquad
p(x \mid i) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left( -\frac{(x_j - \mu_{i,j})^2}{2\sigma^2} \right).$$
Here p(i)=1/g.
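For concreteness, the density can be evaluated as follows (a short sketch under the reconstruction above; the function name and array shapes are our own choices):

```python
import numpy as np

def mixture_density(x, means, sigma):
    """p(x) for g equally weighted isotropic Gaussians with common std sigma.
    `x` has shape (d,); `means` has shape (g, d), one row per Gaussian mean."""
    g, d = means.shape
    sq = np.sum((x - means) ** 2, axis=1)   # ||x - mu_i||^2 for each component i
    # product of d univariate normals = (2*pi*sigma^2)^(-d/2) * exp(-sq / (2*sigma^2))
    comp = (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2 * sigma ** 2))
    return comp.mean()                      # uniform weights p(i) = 1/g
```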
The g Gaussians had identical standard deviations (5 representative standard deviations were used in our analysis). The means of the Gaussians were positioned randomly in the d-dimensional cube [-1,1]^d. The points generated from the first Gaussian will be referred to as the first cluster, and the points generated from all other Gaussians together will be referred to as the second cluster.
The points were labeled as follows: of the n points, 80% were labeled with one of two labels, according to the cluster they originated from: points from the first cluster (first Gaussian) were labeled "1", while points from the second cluster (other g-1 Gaussians) were labeled "2". To simulate errors in the labeling of the clusters, 10% of the labeled points were given a wrong label: "2" instead of "1" and vice versa. The true labels of the remaining 20% of the points were withheld from the algorithms, and the success of the algorithms in correctly classifying these points forms the basis for the evaluation of their clustering performance.
For each standard deviation, s such datasets were generated independently. Because the means were selected at random, the overlap between the different Gaussians and the location of the first Gaussian (first cluster) vary between the different datasets. The classification accuracy is measured as the average over the different datasets.
In our simulation we used the following numbers: n=5000, g=10,
d=10, s=10.
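A sketch of the generation procedure, under the description above (the function name, label conventions, and use of numpy are our own; the sigma default is arbitrary, since the five representative standard deviations are not listed here):

```python
import numpy as np

def make_dataset(n=5000, g=10, d=10, sigma=0.5, rng=None):
    """One simulated dataset: n points from g isotropic Gaussians in [-1,1]^d.
    Returns points, observed labels (0 = withheld), and true labels."""
    rng = np.random.default_rng() if rng is None else rng
    means = rng.uniform(-1, 1, size=(g, d))     # means placed randomly in the cube
    comp = rng.integers(0, g, size=n)           # p(i) = 1/g: pick a Gaussian per point
    points = means[comp] + rng.normal(0.0, sigma, size=(n, d))
    true = np.where(comp == 0, 1, 2)            # first Gaussian -> "1", all others -> "2"
    observed = true.copy()
    idx = rng.permutation(n)
    observed[idx[: n // 5]] = 0                 # withhold 20% of the labels (test points)
    labeled = idx[n // 5 :]
    flipped = rng.choice(labeled, size=len(labeled) // 10, replace=False)
    observed[flipped] = 3 - observed[flipped]   # flip 10% of the labeled points: 1 <-> 2
    return points, observed, true

# Per the text, s = 10 such datasets are generated for each tested standard
# deviation, and accuracy on the withheld 20% is averaged over them.
```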