Abstract

Using molecular phylogenetics, researchers build family trees to see which genes should be
grouped into families according to sequence similarity. Phylogenetic trees are currently
created from multiple alignments. Multiple alignment suffers from NP-hard complexity and
requires that the compared sequences have high similarity. With the novel method presented
here, developed by Tobias Hill and Robert Fredriksson at the Department of Neuroscience,
Unit of Pharmacology, Uppsala University, an alignment was created using pairwise
comparisons of all the sequences. This resulted in a distance matrix, which in turn was used
by a neighbor joining method to create a phylogenetic tree. Test data with known
characteristics were generated using an artificial evolutionary process and used to test the
novel method against some existing methods.
Novel molecular phylogenetics using pairwise alignments








Contents

1   Introduction
    1.1 Outline

2   Biological background
    2.1 The cell
    2.2 DNA
    2.3 RNA
    2.4 Proteins
    2.5 The central dogma
    2.6 The genetic code
    2.7 Mutations
    2.8 Alignment
         2.8.1 Pairwise alignment
         2.8.2 Multiple alignment
    2.9 Phylogenetics
         2.9.1 Neighbor joining
         2.9.2 Comparing trees
    2.10 File formats
         2.10.1 FASTA sequence format
         2.10.2 PHYLIP alignment format
         2.10.3 The Newick tree format

3   Generating biological sequences
    3.1 TreeGen algorithm
        3.1.1 Parameters to TreeGen
        3.1.2 Detailed algorithm
    3.2 Design and implementation

4   Phylogenetics using pairwise alignments
    4.1 PhyloPair algorithm
         4.1.1 Parameters
         4.1.2 Detailed algorithm
    4.2 Design and implementation

5   Tests and comparisons
    5.1 Comparing the trees
          5.1.1 Performing the tests
          5.1.2 Creating the trees
          5.1.3 Comparing the created trees
          5.1.4 Increasing the delete probability
          5.1.5 Increasing the insert probability
          5.1.6 Increasing the delete and insert probability simultaneously
          5.1.7 Increasing the delete length
          5.1.8 Increasing the insert length
          5.1.9 Increasing the delete and insert length simultaneously
          5.1.10 Increasing the mutation matrix
    5.2 Comparing performance between PhyloPair and ClustalW

6   Discussion and future work
    6.1 Comparing the trees
    6.2 Comparing performance
    6.3 Future work

Acknowledgments

Bibliography








List of Figures

 1    Cell types: Eukaryotes and prokaryotes.
 2    The DNA molecule.
 3    The central dogma of molecular biology.
 4    The genetic code.
 5    Example of a rooted and an unrooted phylogenetic tree.
 6    Example of the neighbor joining method.
 7    Tree T1 used in the webbing matrix example.
 8    Tree T2 used in the webbing matrix example.
 9    Tree generated from the Newick file.
 10   A simplified UML class diagram for the TreeGen program.
 11   The TreeShop environment with the TreeGen program running.
 12   Schematic view of the algorithm creating a phylogenetic tree based on pairwise alignments.
 13   The resulting tree after the Alignment and Make tree buttons were pressed.
 14   Increasing the delete probability. First chart.
 15   Increasing the delete probability. Second chart.
 16   Increasing the insert probability. First chart.
 17   Increasing the insert probability. Second chart.
 18   Increasing the delete and insert probability. First chart.
 19   Increasing the delete and insert probability. Second chart.
 20   Increasing the delete length probability. First chart.
 21   Increasing the delete length probability. Second chart.
 22   Increasing the insert length probability. First chart.
 23   Increasing the insert length probability. Second chart.
 24   Increasing the delete and insert length probability. First chart.
 25   Increasing the delete and insert length probability. Second chart.
 26   Increasing the mutation matrix. First chart.
 27   Increasing the mutation matrix. Second chart.
 28   Comparing the performance of PhyloPair with ClustalW.








Chapter 1

Introduction

A huge amount of sequence data from different biological projects is available to researchers
and is waiting to be processed. This material comes from large-scale projects aimed at
determining the whole genome sequence of humans and other species. The sequence material
can be used to find preserved regions containing genes. Finding genes and identifying their
function is vital in the development of new drugs. This interpretation of the gathered data
can be seen as the third stage in the research on the basic organic building blocks. In the
first stage, the different parts of the cell and the mechanisms of inheritance were explored,
and in the second stage, the information stored in the cell's gene pool was sequenced [3].
The interpretation is a challenge for both biologists and computer scientists. Both
disciplines need to exchange methods and experiences.
     One way to identify the probable function of newly discovered genes is to identify their
closest relatives. Genes with a high degree of sequence similarity tend to have similar
functions. One method for modelling evolutionary history is to draw a tree where the
nodes represent the sequences involved in the study, and the distances between the nodes
represent the evolutionary distances. This is called a phylogenetic tree. The sequences we
see today represent the leaves of this tree. To be able to calculate the distances between
sequences, the leaves are compared to produce an alignment. An alignment is the result of
comparing two or more sequences and trying to make them as equal as possible by
inserting spaces into, at the ends of or before the sequences. When an alignment has been
created, a tree can be approximated from the leaves. This allows researchers to see which
genes should be grouped into families according to sequence similarity.
     Many of the computational problems involved in alignment and in the construction of
phylogenetic trees are hard to solve in polynomial time. For example, the multiple alignment
problem, where every sequence in a list of sequences is compared and aligned to the rest of
the list, is NP-complete [13]. It follows that generating phylogenetic trees from multiple
alignments can be a very time-consuming process. Another weakness of multiple alignment
is that the information it provides is reduced if the sequence material differs greatly.
This means that sequences need to have high similarity to be meaningfully compared.
The novel method presented here uses a pairwise alignment method, instead of a multiple
alignment, to generate a distance matrix. The tree is then created using a neighbor joining
method. The method is presented in Chapter 4 and the implementation is called PhyloPair.
The method was developed by Tobias Hill and Robert Fredriksson at the
Department of Neuroscience, Unit of Pharmacology, Uppsala University.






     To be able to test and compare the performance of this and other methods, controlled
test data reflecting a biological development had to be created. The creation of
this type of test data is another part of this work. Using an artificial evolutionary process,
test data with known characteristics were generated and used to test and compare the novel
method against some existing methods.


1.1     Outline
The project was divided into three parts. The first part was to develop controlled,
synthetic test data reflecting biological behaviour. The implementation of this part was called
TreeGen. The second part was to develop a novel method for phylogeny using pairwise
alignments. This implementation was called PhyloPair. Both programs were added as
modules to an existing program called TreeShop [21] [20], which is used for viewing
and comparing phylogenetic trees. The third part was to use the generated data to test the
PhyloPair method against some existing methods. This report first gives a short
biological background, briefly explaining the concepts used in the following parts and how
they fit into a computer science perspective. Next, the method for generating the test data,
TreeGen, is described, followed by the PhyloPair method. The third part of the report deals
with the tests and comparisons, and the last chapter presents conclusions
and future work.








Chapter 2

Biological background

2.1     The cell
Cells are the structural and functional units of all living organisms [11]. They are divided
into two general categories.

      • Prokaryotes: Cells that lack a nuclear membrane and organelles. This was the first
        type of cell to evolve, about four billion years ago. Bacteria are an example of
        prokaryotic organisms [11].

      • Eukaryotes: Cells that have membrane-bound compartments such as a nucleus. They
        also contain organelles, which are structures within the cell that perform specific
        functions. Eukaryotes appeared about 1.5 billion years ago and were a milestone
        in the evolution of life. Animals, plants and fungi are examples of eukaryotic
        organisms [11].

    Eukaryotes have a number of organelles controlling the behaviour of the cell.
Organelles can be seen as the equivalent of organs in the body. The nucleus is an organelle
where the cell's chromosomes reside and where almost all DNA replication and RNA synthesis
occur. Another important organelle in the cell is the ribosome, which is responsible for
protein synthesis. Mitochondria are self-replicating organelles and are involved in
generating energy in the eukaryotic cell [11]. Figure 1 is an overview of eukaryotes and
prokaryotes. There are unicellular organisms, such as bacteria, that consist of a single cell,
and multicellular organisms that consist of several cells. A human is a multicellular organism
consisting of an estimated 100,000 billion cells [11].
    Apart from organelles there are other structures building up the cell. The plasma
membrane serves to separate and protect a cell from the surrounding environment. In the
membrane there are a variety of molecules that act as channels and pumps, transporting
different molecules into and out of the cell [11]. The cytoskeleton acts as the cell's scaffold:
it organizes and maintains the cell's shape, anchors organelles in place, etc. [11]. The
cytoplasm, or cytosol, is a large fluid-filled space inside the cell. This is where the organelles
reside in eukaryotic cells. The cytoplasm contains dissolved nutrients, breaks down waste
products, moves material around the cell and is the perfect environment for the mechanics
of the cell [11].
    DNA and RNA, explained below, are the two genetic materials in the cell. Most
organisms have DNA as genetic material, but a few viruses have RNA instead. The biological
information in an organism is encoded in its DNA or RNA sequence [11]. Multicellular
organisms replace damaged or worn out cells through a replication process called mitosis,
which is the division of a eukaryotic cell nucleus to produce two identical daughter
nuclei. Humans, for example, produce new skin cells through mitosis, in which the DNA of an
existing cell is replicated. For eukaryotes to reproduce, they must first create special cells
called gametes, that is eggs and sperm, in a process called meiosis. The sperm and egg join to
make a single cell that then divides and differentiates to form the beginning of a new organism
[11].

      Figure 1: [11] Eukaryotes and prokaryotes are the two categories of cells. Eukaryotes have
                a nucleus; prokaryotes lack a nucleus and organelles.


2.2     DNA
The DNA molecule controls the shape, structure and function of the entire organism by
controlling the production of proteins [4]. It is responsible for inheritance in the organism
[13]. DNA is an abbreviation for deoxyribonucleic acid, and the molecule has a
three-dimensional structure in the form of a double helix [8]. The four basic elements of DNA
are called nucleotides. Each nucleotide has one of the four bases adenine, guanine, cytosine
and thymine, abbreviated as A, G, C, T [16]. Each strand of the helix consists of nucleotides
held together by phosphodiester bonds, and the two strands are held together by hydrogen
bonds [13]. The hydrogen bonds connect the nucleotide A to T, and G to C. Figure 2 shows a
simplified version of the DNA molecule.
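
The pairing rule can be illustrated with a few lines of Python. This is only a sketch for the reader: the names `PAIRING`, `complement` and `is_paired` are hypothetical and are not part of any program described in this report.

```python
# Base pairing as described above: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of every nucleotide in a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def is_paired(strand_a: str, strand_b: str) -> bool:
    """Check that two equally long strands pair position by position."""
    return len(strand_a) == len(strand_b) and complement(strand_a) == strand_b.upper()


print(complement("ATGCCG"))           # TACGGC
print(is_paired("ATGCCG", "TACGGC"))  # True
```

Because each base has exactly one partner, one strand fully determines the other, which is what makes the replication described in Section 2.5 possible.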
     Each DNA molecule is packaged in a chromosome. The total genetic information stored
in the chromosomes is called the genome of an organism [13]. The human genome has two
distinct components, the nuclear genome and the mitochondrial genome. The nuclear genome is
divided into 24 linear DNA molecules, each contained in a different chromosome. The
mitochondrial genome is a circular DNA molecule; although small, it codes for some very
important proteins [11]. A region of the DNA that controls a discrete hereditary
characteristic, such as eye color, is called a gene [13]. Most eukaryotic genes have their coding
sequences, called exons, interrupted by non-coding sequences called introns. In humans,
genes constitute approximately 2-3% of the genome [11]. The role of the remaining part of
the sequence is unknown, though several theories have been suggested. One theory is that
some of the introns are involved in a process called recombination, where the parents' genes
combine into novel genes with new combinations of exons in the child [8].







      Figure 2: The DNA molecule in the characteristic shape of a double helix with the bases
                paired as A to T and G to C.



2.3     RNA
Ribonucleic acid (RNA) is a long chain of nucleotides, similar to DNA. Some of the differences
are that RNA is usually single stranded, differs in chemical composition, and can occur
outside the nucleus. The thymine base in DNA is replaced by uracil in RNA, which makes
the base abbreviations A, G, C, U [16]. RNA is the main component in the transcription phase.
Messenger RNA (mRNA) is the type of RNA that results from a process called splicing,
where the introns are removed. mRNA carries the information encoded in DNA to the
protein assembly machinery, called the ribosome. The ribosome uses mRNA as a template
to create the exact protein coded for by the gene [8]. Transfer RNA (tRNA) is a set of small
RNA molecules, each about 80 nucleotides in length, working as adaptor molecules that
recognize both an amino acid and a codon, which is a triplet of nucleotides. tRNA enforces
the universal genetic code as described below [11].


2.4     Proteins
Proteins are large, complex molecules that determine the shape and structure of the cell.
They are made up of one or more chains of amino acids, small molecules that are the
building blocks of proteins. There are typically 20 different amino acids that build up
proteins. They are abbreviated A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V
[16]. The average protein is around 200 amino acids long. Large proteins can reach
over a thousand amino acids [13].


2.5     The central dogma
        DNA makes RNA, RNA makes protein, and proteins make us. Francis Crick






The central dogma of molecular biology can be described as a process with three phases.
DNA replicates, DNA is transcribed to RNA, and RNA is translated to proteins [3]. The three
phases of the central dogma are:
      • Replication: DNA replicates by splitting the double helix into two parts, much like a
        zipper that unzips. The enzyme helicase opens the double helix by breaking
        the hydrogen bonds between the bases [4]. The two single strands then act as
        templates from which the double helix is recreated. Due to the pairing requirements
        of the DNA structure, each strand will create a double helix identical to the original.
        The new nucleotides are assumed to come from a pool of free nucleotides present in
        the cell [13].

      • Transcription: The information in the DNA molecule is transferred to RNA. The entire
        length of the gene is transcribed into a very large RNA molecule. Both introns and
        exons are part of this RNA molecule. Before the RNA molecule leaves the nucleus, a
        process called splicing takes place, where all the intron sequences are removed. This
        results in a much shorter RNA molecule. The splicing pattern can vary depending on
        the tissue. After this step, the RNA molecule moves out of the nucleus as a
        messenger RNA (mRNA) molecule [18].

      • Translation: Protein is created from the mRNA. The translation process is controlled
        by a set of small RNA molecules called transfer RNA (tRNA). The task of tRNA is
        to enforce the universal genetic code described below. The ribosome uses the mRNA
        and tRNA to produce new proteins.
      Figure 3 shows how the central dogma converts a gene to protein.
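
The transcription and splicing steps can be sketched in Python as follows. This is a simplified illustration, not part of the programs described in this report. The function names, the toy gene and the exon intervals are all hypothetical, and the DNA string is assumed to be the coding strand, so that transcription reduces to replacing T with U.

```python
def transcribe(dna: str) -> str:
    """Copy a DNA coding sequence into RNA by replacing thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (start inclusive, end exclusive); everything else is intron."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy gene: two exons separated by the intron "CCC" at positions 6 to 8.
gene = "ATGAAACCCGGGTAA"
exons = [(0, 6), (9, 15)]

pre_mrna = transcribe(gene)     # introns and exons are both transcribed
mrna = splice(pre_mrna, exons)  # the intron is removed before leaving the nucleus
print(pre_mrna)  # AUGAAACCCGGGUAA
print(mrna)      # AUGAAAGGGUAA
```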


2.6     The genetic code
The rules by which the nucleotide sequence of a gene is translated into the amino acid
sequence of the corresponding protein are called the genetic code [13]. The nucleotides in
the mRNA molecule are read serially in groups of three. Each triplet is called a codon and
specifies one amino acid. There are 64 possible codons, but only 20 different amino acids
commonly found in proteins. This means that most amino acids are specified by several
codons. Three of the 64 codons specify the end of translation [13]. Figure 4 shows the
genetic code.
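
As an illustration, the standard genetic code can be written as a lookup table and used to translate an mRNA string. The sketch below is not taken from the programs described in this report: the names `CODON_TABLE` and `translate` are hypothetical, the one-letter amino acid abbreviations are those of Section 2.4, and `*` is used here as an assumed marker for the three stop codons.

```python
from itertools import product

# Standard genetic code with codons enumerated in the order U, C, A, G
# for the first, second and third base. "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base U
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))                    # 64
print(len(set(AMINO_ACIDS) - {"*"}))       # 20
print(translate("AUGAAAGGGUAA"))           # MKG
```

The two counts reproduce the numbers given above: 64 codons map onto 20 amino acids plus a stop signal, so several codons must share an amino acid.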


2.7     Mutations
A mutation is a change in the DNA nucleotide sequence caused by a faulty replication
process. A mutation may take place within a single gene, which is called a point mutation, or
it may occur as segments being interchanged within or between chromosomes, which is
called a chromosomal mutation. There are three kinds of point mutations that can occur in
the DNA sequence.
      • Substitution: A change of a nucleotide.

      • Deletion: A removal of one or more nucleotides.

      • Insertion: An addition of one or more nucleotides.

Figure 3: The central dogma of molecular biology. Replication, transcription and
          translation are the three phases that are involved in creating protein.

Figure 4: The genetic code. Most amino acids are specified by several codons, as there are
          64 possible codons but only 20 different amino acids.

    Although a mutation may change the amino acid sequence of a protein, it does not
necessarily change the function of the protein [13]. Sequence motifs are areas of the sequence
where the mutation rate is higher or lower than in the rest of the sequence. Non-coding regions
typically have a higher mutation rate than coding regions [22].
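
The three kinds of point mutation listed above can be expressed as simple string operations. The following Python sketch is illustrative only. The function names and the zero-based positions are assumptions and do not describe how the test data in this report are generated.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Substitution: replace the nucleotide at position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Deletion: remove length nucleotides starting at position pos."""
    return seq[:pos] + seq[pos + length:]


def insert(seq: str, pos: int, fragment: str) -> str:
    """Insertion: add one or more nucleotides before position pos."""
    return seq[:pos] + fragment + seq[pos:]


original = "ACGTAC"
print(substitute(original, 2, "T"))  # ACTTAC
print(delete(original, 1, 2))        # ATAC
print(insert(original, 3, "GG"))     # ACGGGTAC
```

Substitution keeps the sequence length unchanged, whereas deletion and insertion shift every following position, which is why alignments need spaces to line related sequences up again.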


2.8     Alignment
The concept of a sequence alignment is essential for a large area of molecular biology. An
alignment can be used to estimate the biological difference between two or more DNA or
protein sequences [13]. Similarity is an observed quantity that can be expressed as, for
example, percent identity. Two genes that share a common evolutionary history are said to be
homologous. One goal when producing an alignment is to determine whether two or more
sequences show sufficient similarity to infer homology between them [1]. It is generally
more informative to compare protein sequences, for two reasons. First, many changes
in a DNA sequence do not change the amino acid that is specified. Second, many amino acids
share related properties, and mismatched amino acids in an alignment can be accounted for
using scoring systems. Because of this, when a nucleotide coding sequence is analyzed it
is often preferable to study its translated protein [18]. Protein sequence comparisons can
identify homologous sequences from organisms that last shared a common ancestor over 1
billion years ago. DNA sequence comparisons typically allow lookback times of up to 600
million years [18]. As it is known that certain amino acids substitute more easily for
one another, a matrix of substitution probabilities can be created. An example of this is
the Point Accepted Mutation (PAM) matrix [5]. One PAM is the evolutionary distance at which
1% of the amino acids have been changed. Another example is BLOSUM [12].

2.8.1    Pairwise alignment
Definition 1 [13] An alignment of two sequences S and T is obtained by first inserting
chosen spaces either into, at the ends of or before S and T, and then placing the two resulting
sequences one above the other so that every character or space in either sequence is opposite
a unique character or a unique space in the other sequence.

Example
Given sequences S = acgctttg and T = catgtat a possible pairwise alignment is
ac--gctttg
-catg-tat-

    Each aligned pair of characters, and each character aligned with a space, is given a score.
Usually insert and delete operations, that is the alignment of a character with a space, are
given the same score. Alignment algorithms search for the alignment with the minimal score,
representing the minimal difference or maximum similarity between two sequences [13]. An
alignment is called global if the maximum similarity between two sequences S and T of roughly
the same length is calculated. An alignment is called local if the maximum similarity between
a subsequence of S and a subsequence of T is calculated [13].
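
To make the scoring concrete, the sketch below scores the two rows of the example alignment column by column. The cost values (0 for a match, 1 for a mismatch, 1 for a character opposite a space) and the function name are assumptions chosen for illustration. They are not the scores used by any method in this report.

```python
GAP = "-"


def score_alignment(row_s: str, row_t: str, mismatch: int = 1, gap: int = 1) -> int:
    """Sum the column costs of a pairwise alignment given as two equally long rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for a, b in zip(row_s, row_t):
        if a == b:
            continue                # identical characters cost nothing
        if GAP in (a, b):
            cost += gap             # a character opposite a space
        else:
            cost += mismatch        # two different characters
    return cost


# The rows of the example alignment above.
print(score_alignment("ac--gctttg", "-catg-tat-"))  # 6
```

The example alignment has five columns with a space and one mismatch, which gives a cost of 6 under these assumed costs. A lower cost corresponds to a smaller difference between the sequences.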






    Given two sequences, S of length n and T of length m, the space complexity for an
optimal global alignment is O(min(m, n)) and the time complexity is O(nm). These bounds
can be achieved using a dynamic programming approach [13]. Pairwise
alignments are often created using a heuristic method, e.g. BLAST [9].
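
The dynamic programming approach can be sketched as follows. The code computes only the cost of an optimal global alignment, not the alignment itself, and keeps two rows of the table so that the extra space grows with the shorter sequence. The unit costs and the function name are assumptions for illustration; with these costs the result is the ordinary edit distance.

```python
def global_alignment_cost(s: str, t: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimal cost of a global alignment of s and t, in O(len(s) * len(t)) time."""
    if len(t) > len(s):
        s, t = t, s  # the costs are symmetric, so keep the shorter sequence as the row

    # previous[j] is the cost of aligning the first i - 1 characters of s
    # with the first j characters of t.
    previous = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        current = [i * gap]
        for j in range(1, len(t) + 1):
            diagonal = previous[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            current.append(min(diagonal,               # align s[i-1] with t[j-1]
                               previous[j] + gap,      # s[i-1] opposite a space
                               current[j - 1] + gap))  # t[j-1] opposite a space
        previous = current
    return previous[-1]


print(global_alignment_cost("ACGT", "AGT"))   # 1
print(global_alignment_cost("AAAA", "TTTT"))  # 4
```

Every cell of the table is filled once, which gives the O(nm) running time, and only two rows of length min(m, n) + 1 are alive at any moment.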

2.8.2    Multiple alignment
Definition 2 [13] A multiple alignment of sequences $S_1, S_2, \dots, S_k$ is a series of
sequences $S'_1, S'_2, \dots, S'_k$ with spaces such that $|S'_1| = |S'_2| = \dots = |S'_k|$
and $S'_j$ is an extension of $S_j$ obtained by insertion of spaces.

Example
Given sequences S = acgctttg, T = catgtat and U = acgtgtc a possible multiple
alignment is
ac--gctttg
-catg-tat-
acgtg-tc--

    A multiple alignment of sequences $S_1, S_2, \dots, S_k$ with lengths
$n_1 = n_2 = \dots = n_k = n$ has space complexity $O(n^k)$ and time complexity $O(2^k n^k)$
multiplied by the cost of computing the cost function. As in pairwise alignment, these results
are obtained using dynamic programming. The space and time complexity implies that the exact
solution is practical only for a small number of sequences [13]. Therefore, multiple
alignments are created using different heuristic methods, e.g. ClustalW [23].
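
Definition 2 can be checked mechanically: all rows must have the same length, and removing the spaces from a row must give back the original sequence. The sketch below does this for the rows of the example alignment. The function name is hypothetical, and the original sequences listed in the code are the ones spelled out by the three rows once the spaces are removed.

```python
GAP = "-"


def is_multiple_alignment(rows: list[str], sequences: list[str]) -> bool:
    """Check the two conditions of Definition 2 for rows against the original sequences."""
    if len(rows) != len(sequences):
        return False
    same_length = len({len(row) for row in rows}) <= 1
    extends_originals = all(
        row.replace(GAP, "") == sequence for row, sequence in zip(rows, sequences)
    )
    return same_length and extends_originals


rows = ["ac--gctttg",
        "-catg-tat-",
        "acgtg-tc--"]
sequences = ["acgctttg", "catgtat", "acgtgtc"]
print(is_multiple_alignment(rows, sequences))  # True
```

The check runs in time linear in the size of the alignment. The expensive part of multiple alignment is finding good rows, not verifying them.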


2.9     Phylogenetics
Phylogenetics is the study of ancestry between different species. By using a tree structure,
different families of species can be grouped together. Earlier, the study relied on
morphological differences, such as the number of legs, to create family trees
[14]. Nowadays scientists study the sequences of amino acids or nucleotides in different
organisms. Important functions of an unknown gene may be discovered from its relationship
to a known gene. If two genes are closely related they are likely to have similar functions.
Molecular phylogeny is used to group genes into families according to sequence similarity.
Figure 5 shows a phylogenetic tree [18].
    External nodes in the tree are the nodes under comparison. They are also called operational
taxonomic units, OTUs. Internal nodes are called hypothetical taxonomic units, HTUs,
and represent inferred ancestral units [17]. In rooted trees the path from the root to a node
defines an evolutionary path and the root is the common ancestor of all OTUs. In unrooted
trees the relationships among the OTUs are specified but not the evolutionary paths [14].
The major methods for creating phylogenetic trees are the following.

      • Distance based methods. Evolutionary distances computed for all OTUs are used to
        create trees. Neighbor joining, described below, and the Fitch-Margoliash method are
        two examples [14]. They assume a molecular clock.




                    Figure 5: Example of a rooted and an unrooted phylogenetic tree.



    • Character based methods derive trees that optimize the distribution of actual data
      patterns for each character [1]. They assume that each character substitution is
      independent of its neighbors. Maximum parsimony and maximum likelihood are two
      examples [17].

    Both classes of methods have advantages and disadvantages. Distance based methods all
assume a molecular clock, which means that they assume that all mutations happen at a random
but clocklike rate. This assumption is generally not true because different environmental
conditions affect the rate of mutation and selection pressures differ between time
periods [17]. The major advantage of distance based methods is that they are
not as computationally intensive as, for example, maximum likelihood [1]. Character based
methods suffer from the major drawback that they are very computationally intensive.
    A phylogenetic data analysis consists of the following four steps [1].

    1. Alignment. A typical alignment procedure is to use a program such as ClustalW [23]
       to produce an alignment, and then to do some manual editing of the alignment.

    2. Determining the substitution model. This involves determining substitution rates be-
       tween bases and among different sites in the sequence [1].

    3. Tree building. Using one of the previously mentioned tree building methods, trees
       are generated from the alignment data.

    4. Tree evaluation. Some procedures involved in tree evaluation are skewness tests,
       permutation tests and bootstrapping [1].

A general recommendation in phylogenetic analysis is to build a tree using a distance based
method and if possible also a tree using a character based method, then examine the trees to
see if they have a high similarity [17].




2.9.1   Neighbor joining
The method for creating phylogenetic trees used in this work was neighbor joining. It is a
heuristic method that uses a distance matrix representing the distances between the OTUs,
and sequentially finds neighbors that minimize the total length of the tree [14]. Many trees
need to be created and compared, so the performance of the neighbor joining method was the
decisive factor in choosing between the methods. It is also the most widely used distance
based method. The algorithm is as follows.
    Given a distance matrix M and n OTUs.

   1. Start with a star tree of the n OTUs.

   2. Find the two nodes with the smallest common distance, using the distance matrix, and
      combine them into a new internal node.

   3. Insert a branch between the new internal node and the centre of the star. The new
      branch value is the mean of the two original ones. The new internal node represents
      a merged OTU, and the distances between the OTUs are recomputed to form a new distance
      matrix.

   4. Continue from step 2 until all nodes have been connected.

Figure 6 shows an example of the first steps in the method.




    Figure 6: Example of the first steps in the neighbor joining method. The nodes 1 and 2
              have the smallest common distance and are combined into a new internal node.
              The neighbor join method then recalculates the distance matrix and continues
              with the rest of the nodes in the star.
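
The steps above summarise the method. As a minimal sketch, the following Python code implements neighbor joining in its standard formulation by Saitou and Nei, where the pair to join is chosen with a corrected distance criterion and the branch lengths are derived from the net divergence of each node. This is more detailed than the summary in steps 1 to 4 and is not the TreeShop or Phylip implementation. The function name, the frozenset-keyed distance table, the generated node names and the five-OTU example matrix are illustrative assumptions.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (internal_node, child, branch_length) edges.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of the distances from a node to all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Choose the pair that minimises the corrected distance criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined nodes to the new internal node.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((new, i, li))
        edges.append((new, j, dij - li))
        # Distances from the new internal node to the remaining nodes.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(list("abcde"), dist)
```

In this example a and b are joined first, with branch lengths 2 and 3 to the new internal node.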




2.9.2   Comparing trees
The three methods for tree comparison implemented in the TreeShop program are the webbing
matrix, least common subtrees (LCS) and subtree prune and regraft (SPR) [21]. As a
large number of trees were to be created and compared, the webbing matrix method was
chosen because of its speed and its ability to return values reflecting overall tree similarity.

The webbing matrix method
The webbing matrix method [24] is a method for calculating the overall similarity between
trees.


Definition 3 [24] A subtree is a part of a tree in which the root of the subtree may be an
internal node or the root, and its leaves must be all leaves which belong to the node in the
tree assigned as a root of the subtree. A complete set of subtrees involves all non-terminal
nodes of the tree.

    For two subtrees, let n be the total number of leaves in the two subtrees, a be the number
of leaves common to the two subtrees, and b be the number of leaves not common to the
two subtrees.

Definition 4 [24] Given two trees T1 and T2 that can be partitioned into M and N number
of complete subtree sets, an overall similarity measure between the trees is
\[
S(T_1, T_2) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \alpha(S_{ij})\, S_{ij},
\qquad i = 1, 2, \ldots, M, \quad j = 1, 2, \ldots, N,
\]
where \( S_{ij} = \frac{2a - b}{n} \) and \( \alpha(S_{ij}) \) is a weight function related to \( S_{ij} \) and described below.
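
The definitions of n, a and b leave some room for interpretation. The following Python sketch assumes that n is the sum of the leaf counts of the two subtrees, a is the number of leaves found in both subtrees, and b is the number of leaves found in only one of them. Under that reading identical subtrees score 1 and disjoint subtrees score −1. The function name is illustrative.

```python
def subtree_similarity(leaves_a, leaves_b):
    """Similarity (2a - b) / n between two subtrees given by their leaf sets.

    Assumed reading: n = |A| + |B|, a = |A & B|, b = |A ^ B|.
    """
    set_a, set_b = set(leaves_a), set(leaves_b)
    n = len(set_a) + len(set_b)
    if n == 0:
        raise ValueError("both subtrees are empty")
    a = len(set_a & set_b)      # leaves common to the two subtrees
    b = len(set_a ^ set_b)      # leaves found in only one subtree
    return (2 * a - b) / n

identical = subtree_similarity({"x", "y", "z"}, {"x", "y", "z"})
disjoint = subtree_similarity({"x", "y"}, {"u", "v"})
partial = subtree_similarity({"x", "y", "z"}, {"y", "z", "w"})
```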

Definition 5 The webbing matrix method is a method where two complete subtree sets of
the two trees compared are written as the column and row headings of a matrix. Elements
which are not one will be given actual values of Si j in the matrix. When a one appears, all
values in the same row and column will be replaced by an asterisk. This means that if two
subtrees Ai and B j are exactly matched, other subtrees do not need to be compared to either
one of the two subtrees. The weight function can be defined as

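\[
\alpha(S_{ij}) =
\begin{cases}
1 & \text{if } S_{ip} \neq 1 \text{ for all } p \neq j \text{ and } S_{qj} \neq 1 \text{ for all } q \neq i \\
0 & \text{if } S_{ip} = 1 \text{ and } j \neq p, \text{ or } S_{qj} = 1 \text{ and } i \neq q
\end{cases}
\qquad p = 1, 2, \ldots, N, \quad q = 1, 2, \ldots, M
\]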

The result from the method will be a value in the interval [−1, 1].
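
Definitions 4 and 5 can be read as the following Python sketch, which gives weight 0 to every entry that shares a row or a column with an exact match and then averages over all M·N entries. This is a direct reading of the definitions and not the TreeShop implementation, which may combine the values differently. The function names and the 2×2 input matrix are illustrative.

```python
def webbing_weights(s):
    """Weight 0 for entries sharing a row or column with an exact match
    (a value of 1), weight 1 for all other entries."""
    rows, cols = len(s), len(s[0])
    alpha = [[1] * cols for _ in range(rows)]
    for i in range(rows):
        for p in range(cols):
            if s[i][p] == 1:
                for j in range(cols):       # rest of the same row
                    if j != p:
                        alpha[i][j] = 0
                for q in range(rows):       # rest of the same column
                    if q != i:
                        alpha[q][p] = 0
    return alpha

def overall_similarity(s):
    """Weighted mean of the subtree similarities over all M * N entries."""
    alpha = webbing_weights(s)
    rows, cols = len(s), len(s[0])
    total = sum(alpha[i][j] * s[i][j]
                for i in range(rows) for j in range(cols))
    return total / (rows * cols)

s = [[1.0, 0.2],
     [0.2, 0.5]]
weights = webbing_weights(s)
similarity = overall_similarity(s)
```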

Example from the TreeShop implementation of the webbing matrix
Given two trees T1 and T2 as in Figures 7 and 8 the resulting matrix will be as follows.

                                   1      ∗      ∗      ∗
                                   ∗    0.43   0.33   0.2
                                   ∗     0    −0.2     0
                                   ∗    0.33   0.6    0.5

    From the matrix the overall similarity value can be calculated to S(T1 , T2 ) = 0.57 using
the above formula. The output from the webbing matrix method implemented in the
TreeShop program was converted to a percentage scale, where 0% means no similarity and
100% means full similarity [21]. In the example this means that the similarity between the
trees will be 79%.
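
The conversion to a percentage scale is not spelled out here. A linear rescaling of the interval [−1, 1] to [0, 100] is consistent with the example, since 0.57 maps to 78.5, which rounds to 79. The following Python sketch shows this assumed mapping. The function name is illustrative.

```python
def similarity_to_percent(s):
    """Linearly rescale a similarity in [-1, 1] to a percentage in
    [0, 100]. The linear mapping is an assumption."""
    if not -1.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [-1, 1]")
    return (s + 1.0) / 2.0 * 100.0

example_percent = similarity_to_percent(0.57)
```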


Figure 7: Tree T1 used in the webbing matrix example.

Figure 8: Tree T2 used in the webbing matrix example.


2.10     File formats
The vast amount of information stored in biological databases has led to a number of different
formats. In this work there was a need for formats representing and storing sequences,
alignments and trees. The following formats were chosen because they seem to be the most
frequently occurring in the literature and other sources, and the existing software, TreeShop
[21][20], had methods for handling them. The three formats described here are a small part
of all existing formats.

2.10.1     FASTA sequence format
The FASTA [10] format is used to store sequence data. Each entry starts with a > character
which denotes the start of a description line. The following lines contain the sequence data.
Several sequences can be stored in one FASTA file, separated by their description lines.
Example
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
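
A minimal sketch of reading this format in Python is given below. It is not the TreeShop or BioJava reader. The function name read_fasta and the two short test entries are illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous entry.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)

example = """>seq1 first test entry
ACGT
ACG
>seq2 second test entry
TTGA
"""
records = list(read_fasta(example.splitlines()))
```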



2.10.2     PHYLIP alignment format
The PHYLIP [7] format is used to store an alignment of several sequences. The first line of
the file contains the number of sequences and the number of characters separated by blanks.
The information for each sequence follows, starting with a sequence name, and continues
with the characters for that sequence. The data can be represented in interleaved or sequen-
tial format. In an interleaved format, the first part of the file should contain the first part
of each sequence, then the second part of each sequence, and so on. Only the first parts of
the sequences should be preceded by names. In a sequential format all of one sequence is
given, possibly on multiple lines, before the next starts.
Example of an interleaved PHYLIP file
5 42
Turkey AAGCTNGGGC ATTTCAGGGT
Salmo AAGCCTTGGC AGTGCAGGGT
HSapiens ACCGGTTGGC CGTTCAGGGT
Chimp AAACCCTTGC CGTTACGCTT
Gorilla AAACCCTTGC CGGTACGCTT

GAGCCCGGGC      AATACAGGGT       AT
GAGCCGTGGC      CGGGCACGGT       AT
ACAGGTTGGC      CGTTCAGGGT       AA
AAACCGAGGC      CGGGACACTC       AT
AAACCATTGC      CGGTACGCTT       AA
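
The interleaved layout can be read block by block, as in the following Python sketch. It assumes that names are separated from the sequence data by whitespace, as in the example above, whereas strict PHYLIP files reserve a fixed-width name field. The function name is illustrative.

```python
def read_interleaved_phylip(text):
    """Return {name: sequence} from an interleaved PHYLIP alignment
    with whitespace-delimited names (an assumption, see above)."""
    lines = [line for line in text.splitlines() if line.strip()]
    count, length = (int(x) for x in lines[0].split())
    names, sequences = [], {}
    for index, line in enumerate(lines[1:]):
        fields = line.split()
        if index < count:
            # First block: a name followed by the first part of the sequence.
            names.append(fields[0])
            sequences[fields[0]] = "".join(fields[1:])
        else:
            # Later blocks: sequence data only, in the same order.
            sequences[names[index % count]] += "".join(fields)
    for name, sequence in sequences.items():
        if len(sequence) != length:
            raise ValueError(f"{name}: expected {length} characters, "
                             f"got {len(sequence)}")
    return sequences

example = """5 42
Turkey AAGCTNGGGC ATTTCAGGGT
Salmo AAGCCTTGGC AGTGCAGGGT
HSapiens ACCGGTTGGC CGTTCAGGGT
Chimp AAACCCTTGC CGTTACGCTT
Gorilla AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""
alignment = read_interleaved_phylip(example)
```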




2.10.3   The Newick tree format
The Newick [2] tree format describes a tree using characters. A comma indicates a branch
and parentheses are used to separate the nodes. A colon followed by a numerical value
indicates a distance.
Example
((prot2:10.0,(prot4:10.0,prot5:15.0):9.0):7.0,
(prot6:17.0,prot7:14.0):10.0):0.0;
The corresponding graphical representation is shown in Figure 9.




                          Figure 9: Tree generated from the Newick file
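
The nesting of parentheses maps directly onto a recursive parser. The following Python sketch parses the example above into nested dictionaries. It handles only names, branch lengths, commas and parentheses, without quoted labels or comments. The function names and the dictionary layout are illustrative assumptions.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with the keys
    'name', 'length' and 'children'."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                       # skip '('
            children.append(node())
            while text[pos] == ",":
                pos += 1                   # skip ','
                children.append(node())
            pos += 1                       # skip ')'
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

def leaf_names(tree):
    """Return the leaf names from left to right."""
    if not tree["children"]:
        return [tree["name"]]
    return [name for child in tree["children"] for name in leaf_names(child)]

example = """((prot2:10.0,(prot4:10.0,prot5:15.0):9.0):7.0,
(prot6:17.0,prot7:14.0):10.0):0.0;"""
tree = parse_newick(example)
leaves = leaf_names(tree)
```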




Chapter 3

Generating biological sequences

An important aspect of the project is to generate data used for testing the algorithms against each
other. The goal is to generate synthesized data reflecting true biological data. This is done using
an approach from a tool called Rose (Random-model Of Sequence Evolution) [22]. This approach
uses a probabilistic model of the evolution of DNA and protein sequences to simulate an artificial
evolutionary process. The data created can be used in testing different phylogenetic and alignment
methods. It can also be used as a simulated evolution and help students gain a better understanding
of the concepts involved in phylogenetic research.
     Phylogenetics today tries to build trees based on molecular sequences. As the sequences we
observe today have evolved during a period of millions of years, it is impossible to deduce which
phylogenetic tree representation is in accordance with the tree of sequences created by nature.
This poses a problem when testing methods that create such trees. If researchers had access to
the tree created as these sequences evolved, they could evaluate the methods against this data.
Creating artificial sequence data with known characteristics is therefore necessary. Using the method
presented here with a starting sequence and probability parameters, a tree is generated. This tree is
called the true tree. The leaves represent the sequences seen today and the inner nodes represent
ancestor sequences. An alignment of the leaves is created based on all insertions and deletions that
have occurred during the creation of the tree. This alignment is true with respect to
the events at the nodes [22], and is called the true alignment. The data generated (the true tree, the
true alignment and the sequences) can be used by researchers to compare and evaluate methods for
phylogenetics and alignment.


3.1   TreeGen algorithm
TreeGen is a program implemented as a module to an existing program called TreeShop. When given
a starting sequence and probability parameters, TreeGen simulates an artificial evolutionary process
and creates controlled, synthetic test data. The resulting test data consists of a tree, sequences and
an alignment. This is an overview of the algorithm used in TreeGen. The algorithm builds a tree
starting from a user specified root sequence down to a user specified tree depth. As long as the tree is
within the tree depth, each node will be expanded, creating two new nodes at a new tree depth. The
branch length to the root determines which node is expanded first: the node closest to the root
is always expanded first. When the tree has reached the desired depth, nodes with only one leaf
will have an additional leaf created. This is done to create a balanced tree. During the tree creation
process a true alignment reflecting the development of the sequences is created. When the algorithm
is finished the result will be a tree, a family of sequences derived from the original sequence and a
true alignment of these sequences. This data can then be used to test different algorithms.
     The user can specify regions of the sequence where the rates of mutation should be increased
or decreased. These regions are called sequence motifs, and this functionality is added to reflect
the mutation rate of genomic sequences found in nature. This allows the user to specify regions of
particularly high mutation rate, and to suppress mutations in other regions [22].

3.1.1    Parameters to TreeGen
The number of sequences to create is entered in the dialog box of the TreeGen program. The other
parameters can be entered manually in the dialog box or loaded from an ordinary text file.

     • Alphabet. Currently the DNA alphabet A, C, G, T and the 20 character amino acid alphabet
       are supported.
     • Root sequence. Currently nucleotide- or amino acid sequence.
     • Number of leaves to create.
     • Character frequencies. This is a vector of the same length as the alphabet with probabilities
       for each symbol. This will be calculated based on the root sequence if no value is given.
     • Mutation matrix. This controls the probability by which different symbols can mutate to each
       other. For amino acid sequences it is assumed that the PAM matrices are used [5].
     • Deletion and insertion probabilities in the range 0 . . . 1.
     • Deletion length and insertion length probabilities. These are vectors of length equal to the
       maximum possible length of a deletion or insertion. Each length holds a probability value.
     • Mutation probability vector. This is a vector of the same length as the sequence. It is used to
       specify regions of different mutation rate. A value between 0 and 1 suppresses mutation, a
       value > 1 increases the mutation rate. A value of 1 at all positions means that the vector
       does not affect the mutation rate.
     • Max branch length. Sets an upper bound on how far a parent and an adjacent child node can
       be from each other.
     • Mutation probability in the range 0 . . . 1. This value controls how often a mutation should
       occur.

Example of a parameter file for TreeGen
alphabet:protein
sequence:arndcqeghilkmfpstwyv
matrix:0.9867 0.0002 0.0009 ... rest of PAM1 matrix, every row ending with ;
insprob:0.0
inslen:1
delprob:0.0
dellen:0.1 0.1 0.1 0.1 0.1 0.1 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
mutvector:1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
charfreq:0.087 0.041 0.040 0.047 0.033 0.038 0.050 0.089 0.034 0.037 0.085
0.081 0.015 0.040 0.051 0.070 0.058 0.010 0.030 0.065
mutprob:0.0
maxbranchlen:5

    Given the above parameters the program generates the requested number of sequences, a tree
and a multiple alignment of the sequences which is correct with respect to the creation process.
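
The parameter file consists of key:value lines, where a long value such as charfreq may continue on the next line. The following Python sketch reads such a file under that assumption. It is not the TreeGen loader; the function name is illustrative and the example is a shortened version of the file above, without the matrix entry.

```python
def read_parameters(text):
    """Parse 'key:value' lines into a dict. A line without a key is
    treated as a continuation of the previous value (an assumption)."""
    params = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, separator, value = line.partition(":")
        if separator and key.isalpha():
            current = key
            params[current] = value.strip()
        elif current is not None:
            params[current] += " " + line
    return params

example = """alphabet:protein
insprob:0.0
inslen:1
charfreq:0.087 0.041 0.040
0.081 0.015 0.040
maxbranchlen:5
"""
params = read_parameters(example)
charfreq = [float(x) for x in params["charfreq"].split()]
```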




3.1.2    Detailed algorithm
The method TreeGenerate creates new tree nodes. The Evolve method of each new node is then called,
which performs mutations, deletions and insertions on the sequence. When a deletion is performed, the
same number of gaps as the number of deleted symbols is inserted into the sequence. When an
insertion is performed, the same number of gaps as the number of inserted symbols is inserted
in all the other leaf nodes. This is done to keep the leaves aligned during the tree building process.

TreeGenerate
   1. Create the root from the start sequence.
   2. Repeat until the requested number of leaves is reached.
   3. Create two new nodes based on the root node, each with a random distance from 1 to the parameter
      max branch length. The new nodes will have a mutation matrix equal to the root's mutation matrix
      multiplied distance number of times. Run Evolve, see below, on each of the new nodes.
   4. Set the node closest to the original root as the new root. Repeat from step 2.
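
One way to read steps 2 to 4 is as a loop that always expands the current leaf closest to the original root. The following Python sketch builds only the topology and the branch lengths under that reading. The sequences and the Evolve step are left out, and the function name grow_tree, the heap-based frontier and the integer branch lengths are assumptions.

```python
import heapq
import random

def grow_tree(num_leaves, max_branch_len, rng):
    """Return (parent, leaves), where parent maps a node id to
    (parent id, branch length) and leaves lists the leaf ids."""
    parent = {0: None}        # node 0 is the root
    frontier = [(0, 0)]       # heap of (distance to root, node id) for leaves
    next_id = 1
    while len(frontier) < num_leaves:
        # Expand the leaf that is closest to the root.
        depth, node = heapq.heappop(frontier)
        for _ in range(2):
            length = rng.randint(1, max_branch_len)
            parent[next_id] = (node, length)
            heapq.heappush(frontier, (depth + length, next_id))
            next_id += 1
    return parent, sorted(node for _, node in frontier)

parent, leaves = grow_tree(num_leaves=5, max_branch_len=5,
                           rng=random.Random(1))
```

Each expansion replaces one leaf by two, so a tree with k leaves has 2k − 1 nodes in total.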
Evolve
   1. Run mutate on each symbol of the sequence.
   2. Perform deletions on the mutated sequence.
   3. Perform insertions on the sequence resulting from step 2.
Mutate
   1. Check if mutation should occur. This is done by choosing a random value between 0 and 1.
      This value is then multiplied with the value at the symbol's position in the mutation
      probability vector. If the resulting value is larger than the parameter mutation probability,
      mutation will occur.
   2. Choose a random value between 0 and 1.
   3. Use the mutation matrix to get the symbol to mutate to. This is done by summing up the
      probability values in the mutating symbol's column one step at a time. When the sum
      becomes larger than the random value, the corresponding symbol is chosen.
   4. If the symbol to mutate to is different from the existing one, change the symbol.
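
Step 3 of Mutate, and the corresponding length selection for deletions and insertions, can be read as sampling by cumulative sums: the probabilities are added one at a time and the first entry whose running sum exceeds the random value is chosen. The following Python sketch shows that reading. The function name, the rng argument and the example probabilities are illustrative assumptions.

```python
import random

def sample_index(probabilities, rng=random):
    """Return an index chosen with the given probabilities by
    accumulating them until the running sum exceeds a random value."""
    threshold = rng.random()
    running = 0.0
    for index, probability in enumerate(probabilities):
        running += probability
        if running > threshold:
            return index
    # Guard against rounding when the probabilities sum to just under 1.
    return len(probabilities) - 1

class FixedRandom:
    """Stand-in random source that always returns the same value."""
    def __init__(self, value):
        self.value = value

    def random(self):
        return self.value

column = [0.2, 0.3, 0.5]
chosen = sample_index(column, FixedRandom(0.25))
```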
Perform deletions
   1. For each symbol in the sequence.
   2. Check if deletion should occur. This is done by choosing a random value between 0 and 1.
      This value is then multiplied with the value at the symbol's position in the mutation
      probability vector. If the resulting value is larger than the parameter deletion probability,
      deletion will occur.
   3. Choose a random value between 0 and 1.
   4. Use the deletion length vector to get the length of the deletion. This is done by summing up the
      probability values in the deletion length vector one step at a time. When the sum becomes
      larger than the random value, the position in the vector represents the deletion length.
   5. Choose a random position on the sequence.
   6. Delete symbols at the random position with the deletion length from item 4.
   7. Insert deletion length number of gaps at the deletion position in the sequence.
Perform insertions
   1. For each symbol in the sequence.

    2. Check if insertion should occur. This is done by choosing a random value between 0 and 1.
       This value is then multiplied with the value at the symbol's position in the mutation
       probability vector. If the resulting value is larger than the parameter insertion probability,
       insertion will occur.
    3. Choose a random value between 0 and 1.
    4. Use the insertion length vector to get the length of the insertion, in the same way as for deletion.
    5. Choose a random position on the sequence.
    6. Insert symbols at the random position with the insertion length from item 4. Symbols are chosen
       based on the character frequencies of the initial sequence.
    7. Insert insertion length number of gaps at the insertion position in the sequences of all other leaves.
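
The gap bookkeeping that keeps the leaves aligned can be sketched as follows in Python. A deletion replaces the removed symbols by gaps in the affected row, and an insertion adds equally many gaps to every other row. The function names, the dictionary of rows and the example sequences are illustrative assumptions and not the TreeGenSequence implementation.

```python
GAP = "-"

def delete_symbols(row, position, length):
    """Replace the deleted symbols by gaps so the row keeps its length."""
    length = min(length, len(row) - position)
    return row[:position] + GAP * length + row[position + length:]

def insert_symbols(alignment, target, position, symbols):
    """Insert symbols into the target row and equally many gaps
    into every other row."""
    padding = GAP * len(symbols)
    return {
        name: row[:position] + (symbols if name == target else padding)
              + row[position:]
        for name, row in alignment.items()
    }

alignment = {"leaf1": "ACGTAC", "leaf2": "ACGTTC"}
alignment["leaf1"] = delete_symbols(alignment["leaf1"], 2, 2)
alignment = insert_symbols(alignment, "leaf2", 4, "GG")
```

After both events all rows still have the same length, which is the property the true alignment relies on.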


3.2     Design and implementation
The TreeGen program was incorporated into TreeShop, an existing program for modelling phylogenetic
trees. TreeShop has methods for drawing and comparing trees [21][20]. As the TreeShop
program is developed using the Java programming language [15], the TreeGen program was also
written in Java. A separate menu was added and an input dialog where the user can specify the
different parameters was implemented. The dialog uses different tabs for the parameters. The first
tab, labeled Probabilities, is used to input the different probabilities. The second tab, labeled Base
sequences, allows the user to input a start sequence or load a start sequence from a file. The third
tab, labeled Mutation matrix, displays the mutation matrix used and the user can change the existing
matrix or load a matrix from a file. The last tab, shown in Figure 11, displays the resulting data
after the Generate tree button is pressed and has buttons for saving the sequences and the alignment
in the standard formats described in 2.10. The created tree shown in Figure 11 can be saved using
TreeShop's own method, found under the File menu.
     Existing software packages such as the tree package from Sun [15], BioJava [6] and the Java
Matrix Package [19] were used to save development time. The implementation of the TreeGen
algorithm was packaged in an object called EvoGenerate. This object holds the tree and methods for
retrieving the data generated. Although software for phylogenetic trees already existed,
there was a need for another model when creating the trees according to the algorithm
described. Instead of developing a new tree model, the existing tree model from the javax.swing.tree
[15] package called DefaultTreeModel was used. The DefaultMutableTreeNode in the package was
used as a placeholder for the Evolve objects. The Evolve objects contain the sequence data and all
probability values for the node. The sequence is represented by a TreeGenSequence object which
inherits from the AbstractSymbolList in the BioJava [6] package. The TreeGenSequence makes it
possible to add gaps in the sequence. The MutationMatrix object contained in the Evolve object
inherits from the Matrix object in the Java Matrix Package [19]. This made it possible to use the
existing matrix operations, such as multiplication, in the program. After the tree is generated, it
is converted and viewed in the TreeShop environment. Figure 10 shows a simplified UML class
diagram for the program where the attributes and operations are left out.




         Figure 10: A simplified UML class diagram for the TreeGen program




Figure 11: The TreeShop environment with the TreeGen program running. A tree, sequences
           and an alignment have been created. Not all of the resulting data is
           visible in this picture. The text field is scrollable.




Chapter 4

Phylogenetics using pairwise
alignments

A major problem in creating phylogenetic trees is the need for a multiple alignment of the
sequences involved. As introduced in 2.8.2, a multiple alignment has high space and time
complexity. This affects phylogenetic tree building, as the trees that can be created are limited
by the number of sequences at leaf level. With the method presented in this chapter, huge trees
could be created and the time and space complexity for generating the trees would be reduced. The
implementation of this method is called PhyloPair and is presented below. The method was developed
by Tobias Hill and Robert Fredriksson at the Department of Neuroscience, Unit of Pharmacology,
Uppsala University.


4.1     PhyloPair algorithm
PhyloPair is a program implemented as a module to an existing program called TreeShop. This
is an overview of the algorithm used in PhyloPair. Pairwise comparisons of all the sequences are
performed. These comparisons are made with an alignment program such as BLAST [9]. BLAST
is an abbreviation for Basic Local Alignment Search Tool. The resulting scores represent the
evolutionary distance between the sequences. A distance matrix based on the scores is created. The
distance matrix is then used as input to a neighbor joining method to create a phylogenetic tree.

4.1.1    Parameters
The program needs as input a path to an external alignment program, a path to the database required
by the alignment program and a path to the FASTA file containing the sequences.

4.1.2    Detailed algorithm
The program is currently using the BLAST alignment program [9].


   1. Create a database containing all sequences. In this case, when BLAST is used, a program
      called formatdb [9] formats the input sequences so they can be searched by the alignment
      programs in the BLAST package.
   2. For each sequence, run BLAST with the sequence as query against the database.
   3. Parse the result from BLAST into a distance matrix. Use only the second best scoring value,
      as the best result will be when the query sequence is compared to itself.


    4. Run neighbor join with the distance matrix as input to build the tree.
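
Step 3 can be sketched in Python as follows, assuming that BLAST is run with tab-separated output in the standard twelve-column layout, where the first two columns hold the query and subject identifiers and the last column holds the bit score. The sketch collects the best score for every pair of different sequences and skips self hits. How the scores are turned into distances is not specified here, so the sketch stops at the score table. The function name and the sample lines are illustrative and not actual BLAST output.

```python
def best_scores(tabular_lines):
    """Return {(query, subject): best bit score} from tab-separated
    BLAST output, ignoring hits of a sequence against itself."""
    scores = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        if query == subject:
            continue          # the best hit of a query is the query itself
        key = (query, subject)
        scores[key] = max(bit_score, scores.get(key, 0.0))
    return scores

sample = [
    "seq1\tseq1\t100.00\t50\t0\t0\t1\t50\t1\t50\t1e-25\t101",
    "seq1\tseq2\t80.00\t50\t10\t0\t1\t50\t1\t50\t1e-15\t70.5",
    "seq1\tseq2\t60.00\t20\t8\t0\t5\t24\t5\t24\t0.001\t30.1",
    "seq2\tseq1\t80.00\t50\t10\t0\t1\t50\t1\t50\t1e-15\t70.5",
]
scores = best_scores(sample)
```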


4.2     Design and implementation
A separate dialog in TreeShop was written where the user can input a path to the external alignment
program, the path to the database and the path to the sequence file. The scores returned from the
alignment program are then used to build a matrix with distance values. The matrix is used as input
to the neighbor joining method. An object called PhyloPair holds the distance matrix and calls the
alignment program used. The result from the alignment program is parsed using an object called
TabularBlastParser which returns a List object. PhyloPair uses this List to create the distance matrix.
The distance matrix is then used by a neighbor joining method to create the tree. Figure 12 shows
a schematic view of the flow in the process. The numbers correspond to the numbers used in the
detailed algorithm above.




      Figure 12: Schematic view of the algorithm creating a phylogenetic tree based on pairwise
                 alignments.


     When the user has pressed the Alignment button, the Make tree button will be visible, letting
the user create the tree. The result of this process can be seen in Figure 13 where the created tree is
visible to the right.




Figure 13: The resulting tree after the Alignment and Make tree buttons were pressed.




Chapter 5

Tests and comparisons

The main part of the tests handles the comparisons between different trees. Another test was made to
compare the practical performance of the pairwise phylogenetic method, PhyloPair, against a method
based on the ClustalW [23] multiple alignment. The ClustalW program is briefly explained below.
The two test programs were written in the Java [15] programming language and both run TreeGen,
described in Chapter 3, to create the test data required: the true tree used as reference, the sequences
which are the leaves of the true tree, and the true alignment. The true alignment is the alignment
reflecting the deletion and insertion events that occurred at each node.
     Two neighbor joining methods were involved in the tests: the TreeShop neighbor joining method
and the Phylip [7] neighbor joining method. Phylip is one of the standard packages for phylogenetics.
It consists of a number of small programs for phylogenetic analysis. The parts used here were protdist,
which is a program used for calculating distances between sequences, and neighbor, which is a program
that uses the output from protdist to generate a tree using the neighbor joining method. The TreeShop
neighbor joining method was called as a Java method from the test programs. The Phylip programs
neighbor and protdist are external programs and were called and executed in separate processes
from the test program that compares trees.
     The alignments involved in the tests were, apart from the true alignment, the alignments from
PhyloPair and the ClustalW alignments. ClustalW [23] is a multiple alignment program. Given a number
of sequences it produces a multiple alignment using a hierarchical method. Because the TreeGen program
uses PAM matrices when mutations are involved, the ClustalW program was also set to use PAM
matrices when creating the alignments. PhyloPair was called as a Java method from the test programs.
ClustalW was called and executed in a separate process from the test programs. Note that in the tests
performed, the graphical environment of TreeShop was not used. Only the methods for creating and
comparing trees in TreeShop were used.


5.1    Comparing the trees
When comparing the trees, the data was created using the TreeGen program with a fixed start
sequence and a fixed number of leaves to generate. Each parameter in the TreeGen program was
increased separately and the trees were compared using the webbing matrix method described in
2.9.2. The parameters involved in the tests were deletion probabilities, insertion probabilities, a
combination of deletion and insertion probabilities, deletion length probabilities, insertion length
probabilities, a combination of deletion and insertion length probabilities, and the mutation matrix.
The mutation matrix was increased for each step by multiplying it with itself; for example,
in the second step the matrix was multiplied with itself one time. The implementation was called
TreeGenTester.
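
Read this way, the matrix used in a given step is the starting matrix raised to the power of the step number. The following Python sketch shows that reading with a small 2×2 matrix whose rows sum to 1. The function names and the matrix values are illustrative and are not PAM values.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matrix_for_step(base, step):
    """Return base multiplied with itself (step - 1) times,
    that is, base raised to the power step."""
    result = base
    for _ in range(step - 1):
        result = mat_mul(result, base)
    return result

base = [[0.9, 0.1],
        [0.2, 0.8]]
second = matrix_for_step(base, 2)
```

The rows of the resulting matrix still sum to 1, so it can be used as a mutation matrix in the same way as the starting matrix.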




5.1.1    Performing the tests
The fixed parameters in the tests were:
     • Number of iterations. This was set to 11 so that the deletion and insertion probabilities increase
       from 0.5 to 1.0 with a step size of 0.05, the deletion and insertion length probabilities increase from
       1 to 10 and the starting mutation matrix is multiplied with itself 1 to 10 times.
     • Repeats per iteration. To be able to calculate a standard deviation this was set to 10.
     • Number of leaves. To be able to use relatively realistic trees and still be able to run the tests
       within a reasonable amount of time, this was set to 50.

5.1.2    Creating the trees
For each step in the test, six trees were created.
    1. True tree. This tree is one of the results from running the TreeGen program.
    2. TreeShop neighbor join tree based on the true alignment. This tree is created from the true
       alignment that the TreeGen program creates as the true tree is built. The true alignment is
       used to create a tree using TreeShop's neighbor joining method.
    3. TreeShop neighbor join tree based on the PhyloPair alignment. This tree is created using the
       sequences from the TreeGen program. The sequences are used to create a database, and each
       sequence is then pairwise aligned against the database. The resulting second-best scores are
       used to form a distance matrix. This matrix is then used to create a tree using TreeShop's
       neighbor joining method.
    4. Phylip neighbor join tree based on the true alignment. Using the true multiple alignment from
       the TreeGen program, a distance matrix is created using a program from the Phylip package
       called protdist. Another program from the Phylip package, neighbor, is then used to create
       the tree.
    5. TreeShop neighbor join tree based on the ClustalW alignment. This tree is created from an
       alignment of the sequences generated by the TreeGen program using the multiple alignment
       program ClustalW [23]. The alignment is then used by TreeShop's neighbor joining method to
       build a tree.
    6. Phylip neighbor join tree based on the ClustalW multiple alignment. This tree is created from
       an alignment of the sequences generated by the TreeGen program using the multiple alignment
       program ClustalW [23]. The alignment is then used by the program neighbor from the Phylip
       package to create a tree.
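
All five reconstructed trees above end with a neighbor joining step on a distance matrix. As a
rough illustration of that step, the following is a minimal sketch of the standard neighbor joining
algorithm. It is not the TreeShop or Phylip implementation; the function name, the edge-list output
format and the four-leaf distances are assumptions made up for the example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    d = dict(dist)          # working copy, extended with internal nodes
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r_i: sum of distances from i to every other node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        length_j = dij - length_i
        counter += 1
        u = f"node{counter}"            # label of the new internal node
        edges.append((i, u, length_i))
        edges.append((j, u, length_j))
        # Distance from the new node to each remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))]
                                        + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Made-up additive distances for the tree ((A:2, B:3):4, (C:1, D:5)).
leaves = ["A", "B", "C", "D"]
pairs = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 11,
         ("B", "C"): 8, ("B", "D"): 12, ("C", "D"): 6}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = neighbor_joining(leaves, dist)
```

Because the example distances are additive, the sketch joins A with B and C with D and recovers the
branch lengths used to construct the matrix.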

5.1.3 Comparing the created trees
The trees created were compared against each other using the webbing matrix method. The true tree
was compared against the trees created from the following alignments and neighbor joining methods.
     • True tree versus TreeShop neighbor join tree based on the true alignment.
     • True tree versus TreeShop neighbor join tree based on the PhyloPair alignment.
     • True tree versus Phylip neighbor join tree based on the true alignment.
     • True tree versus TreeShop neighbor join tree based on the ClustalW multiple alignment.
     • True tree versus Phylip neighbor join tree based on the ClustalW multiple alignment.








     For each test, two charts were created, showing the webbing matrix result in percent on the
Y-axis and the increasing parameter values on the X-axis. The first chart displays the first three
comparisons in the list above. The second chart displays the comparison of the true tree versus the
TreeShop neighbor join tree based on the PhyloPair alignment (the second comparison above) together
with the last two comparisons in the list. For each step in the tests, ten trees were created and
the mean value and standard deviation of the webbing matrix results were calculated. The results
are presented in the following charts. The bars in the charts represent the mean value, and the
standard error of the mean is displayed on top of each bar. The standard error of the mean is
calculated as SEM = s/√n, where s is the standard deviation and n is the number of samples. The
trees created from the true alignment are expected to have a high similarity measure against the
true tree.
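
As a small worked example of the statistic shown on the charts, the sketch below computes the mean
and the standard error of the mean for ten values. The scores are made up, the function name is
hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption,
since the text does not state which form of the standard deviation was used.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, SEM), where SEM = s / sqrt(n) and s is the sample standard deviation."""
    n = len(values)
    s = statistics.stdev(values)   # sample standard deviation, n - 1 denominator
    return statistics.mean(values), s / math.sqrt(n)


# Made-up webbing matrix results (percent) for the ten repeats of one step.
scores = [28.0, 31.0, 30.0, 27.0, 33.0, 29.0, 30.0, 32.0, 28.0, 32.0]
mean, sem = mean_and_sem(scores)
```

For these values the mean is 30 and s is 2, so the SEM is 2/√10, roughly 0.63 percentage points.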

5.1.4   Increasing the delete probability
The delete probability was increased from 0.5 to 1 with a step size of 0.05. Figure 14 shows the
webbing matrix results from comparing the true trees with the trees created by the TreeShop
neighbor joining method based on the true alignment, the trees created by the TreeShop neighbor
joining method based on the PhyloPair alignment and the trees created by the Phylip neighbor joining
method based on the true alignment. The webbing matrix scale, the Y-axis, goes from 0% to 100%,
where 100% means that the trees compared are identical. Figure 15 shows the webbing matrix results
from comparing the true trees with the trees created by the TreeShop neighbor joining method based
on the ClustalW alignments, the trees created by the TreeShop neighbor joining method based on the
PhyloPair alignments and the trees created by the Phylip neighbor joining method based on the
ClustalW alignments. The trees created by the PhyloPair alignments are the same as in Figure 14.
     The two comparisons that are based on the true alignments, Figure 14, are expected to have a
high similarity with the true tree, because the true alignment reflects the history of the evolved
sequences. This is not the case in this test. Both the TreeShop neighbor joining method and the
Phylip neighbor joining method, when based on the true alignment, show a low similarity with the
true tree. The reason for this may be incorrect behaviour of the methods for creating the trees,
here the TreeShop neighbor join and the Phylip neighbor join. There may also be a problem with how
the true alignments are created. The PhyloPair method performs even worse, with a similarity of
only 30%. The tree comparisons that produce the best results are the TreeShop neighbor joining
method based on the true alignment and the TreeShop neighbor joining method based on the ClustalW
multiple alignment.

5.1.5   Increasing the insert probability
The insert probability was increased from 0.5 to 1 with a step size of 0.05. Figure 16 shows the
webbing matrix results, on a percentage scale on the Y-axis, from comparing the true trees with the
trees created by the TreeShop neighbor joining method based on the true alignment, the trees created
by the TreeShop neighbor joining method based on the PhyloPair alignment and the trees created by
the Phylip neighbor joining method based on the true alignment. Figure 17 shows the webbing matrix
results from comparing the true trees with the trees created by the TreeShop neighbor joining method
based on the ClustalW alignments, the trees created by the TreeShop neighbor joining method based on
the PhyloPair alignments and the trees created by the Phylip neighbor joining method based on the
ClustalW alignments. The trees created by the PhyloPair alignments are the same as in Figure 16.
     The two comparisons that are based on the true alignments, Figure 16, are expected to have a
high similarity with the true tree because the true alignment reflects the history of the evolved
sequences. The trees created by the TreeShop neighbor joining method based on the true alignment
show a rather high similarity with the true trees. This is not the case when the Phylip neighbor
joining method based on the true alignment is used. These comparisons all lie under 30%, which
indicates that something is wrong when using this method. The similarity should be close to that of
the TreeShop neighbor joining method, as they are only two different implementations of the same
method. Perhaps some parameter needs to be adjusted in Phylip to achieve this result. The PhyloPair
method continues to perform poorly in this test. The tree comparisons that produce the best results
are the TreeShop neighbor joining method based on the true alignment and the TreeShop neighbor
joining method based on the ClustalW multiple alignment. The Phylip neighbor joining method based on
the ClustalW alignment produces results that are close to those from the TreeShop neighbor joining
method based on the ClustalW alignment, and significantly better than when the Phylip neighbor
joining method is based on the true alignment. It seems that the Phylip neighbor joining method has
problems when creating trees from the true alignments, but performs better when using alignments
from ClustalW.

     Figure 14: Increasing the delete probability. First chart. The true tree is compared to the
                TreeShop neighbor join tree based on the true alignment, the TreeShop neighbor
                join tree based on the pairwise alignment distance matrix and the Phylip
                neighbor join tree based on the true alignment. The percentage scale shows the
                webbing matrix results from the comparisons. 100% means that the trees are
                identical to the true tree.

     Figure 15: Increasing the delete probability. Second chart. The true tree is compared to the
                TreeShop neighbor join tree based on the ClustalW alignment, the TreeShop
                neighbor join tree based on the pairwise alignment distance matrix and the Phylip
                neighbor join tree based on the ClustalW multiple alignment. The percentage
                scale shows the webbing matrix results from the comparisons. 100% means
                that the trees are identical to the true tree.




     Figure 16: Increasing the insert probability. First chart. The true tree is compared to the
                TreeShop neighbor join tree based on the true alignment, the TreeShop neigh-
                bor join tree based on the pairwise alignment distance matrix and the Phylip
                neighbor join tree based on the true alignment. The percentage scale shows the
                webbing matrix results from the comparisons. 100% means that the trees are
                identical to the true tree.




5.1.6    Increasing the delete and insert probability simultaneously
The delete and insert probabilities were simultaneously increased from 0.5 to 1.0 with a step size
of 0.05. The TreeShop neighbor joining method based on the true alignment and on the PhyloPair
alignment gave values similar to those in 5.1.5. Figure 18 and Figure 19 show the two charts with
the webbing matrix percentage scale on the Y-axis and the increasing probabilities of deletes and
inserts on the X-axis. One interesting difference in this test is that the trees created with the
TreeShop neighbor joining method and the Phylip neighbor joining method, when both methods were
based on the ClustalW alignment, had a higher similarity with the true trees than in 5.1.5. This
means that the ClustalW alignment method produced better results when both deletes and inserts were
made to the sequences. This is a good result for the ClustalW method, as the evolution of sequences
in nature involves both deletes and inserts. The PhyloPair method continues to give low results,
which indicates that something may be wrong with the method or the implementation.







Figure 17: Increasing the insert probability. Second chart. The true tree is compared to the
           TreeShop neighbor join tree based on the ClustalW alignment, the TreeShop
           neighbor join tree based on the pairwise alignment distance matrix and Phylip
           neighbor join tree based on the ClustalW multiple alignment. The percentage
           scale shows the webbing matrix results from the comparisons. 100% means
           that the trees are identical to the true tree.








     Figure 18: Increasing the delete and insert probability. First chart. The true tree is com-
                pared to the TreeShop neighbor join tree based on the true alignment, the
                TreeShop neighbor join tree based on the pairwise alignment distance matrix
                and the Phylip neighbor join tree based on the true alignment. The percentage
                scale shows the webbing matrix results from the comparisons. 100% means
                that the trees are identical to the true tree.








Figure 19: Increasing the delete and insert probability. Second chart. The true tree is
           compared to the TreeShop neighbor join tree based on the ClustalW alignment,
           the TreeShop neighbor join tree based on the pairwise alignment distance matrix
           and Phylip neighbor join tree based on the ClustalW multiple alignment. The
           percentage scale shows the webbing matrix results from the comparisons. 100%
           means that the trees are identical to the true tree.








5.1.7    Increasing the delete length
The delete length probability was increased from 1 to 10 with a step size of 1. The delete
probability was fixed at 1.0. The difference between this test and the test in 5.1.4 is that a
deletion is always performed as the sequences evolve and that the length of the deletion may vary
from 1 to 10 depending on which step is performed. Figure 20 shows the webbing matrix results from
comparing the true trees with the trees created by the TreeShop neighbor joining method based on the
true alignment, the trees created by the TreeShop neighbor joining method based on the PhyloPair
alignment and the trees created by the Phylip neighbor joining method based on the true alignment.
The Y-axis shows the webbing matrix scale in percent, where 100% means that the trees compared are
identical. Figure 21 shows the webbing matrix results from comparing the true trees with the trees
created by the TreeShop neighbor joining method based on the ClustalW alignments, the trees created
by the TreeShop neighbor joining method based on the PhyloPair alignments and the trees created by
the Phylip neighbor joining method based on the ClustalW alignments. The trees created by the
PhyloPair alignments are the same as in Figure 20.
     As in 5.1.4, the two comparisons that are based on the true alignments, Figure 20, are expected
to have a high similarity with the true tree, because the true alignment reflects the history of the
evolved sequences. This is not the case in this test. Both the TreeShop neighbor joining method and
the Phylip neighbor joining method, when based on the true alignment, show a low similarity with the
true tree. The reason for this may be incorrect behaviour of the methods for creating the trees,
here the TreeShop neighbor join and the Phylip neighbor join. There may also be a problem with how
the true alignments are created. The PhyloPair method has a low similarity of only 30%. The tree
comparisons that produce the best results are the TreeShop neighbor joining method based on the true
alignment and the TreeShop neighbor joining method based on the ClustalW multiple alignment.




     Figure 20: Increasing the delete length probability. First chart. The true tree is compared
                to the TreeShop neighbor join tree based on the true alignment, the TreeShop
                neighbor join tree based on the pairwise alignment distance matrix and the
                Phylip neighbor join tree based on the true alignment. The percentage scale
                shows the webbing matrix results from the comparisons. 100% means that the
                trees are identical to the true tree.








Figure 21: Increasing the delete length probability. Second chart. The true tree is com-
           pared to the TreeShop neighbor join tree based on the ClustalW alignment, the
           TreeShop neighbor join tree based on the pairwise alignment distance matrix
           and Phylip neighbor join tree based on the ClustalW multiple alignment. The
           percentage scale shows the webbing matrix results from the comparisons. 100%
           means that the trees are identical to the true tree.








5.1.8    Increasing the insert length
The insert length probability was increased from 1 to 10 with a step size of 1. The insert
probability was fixed at 1.0. The difference between this test and the test in 5.1.5 is that an
insertion is always performed as the sequences evolve and that the length of the insertion may vary
from 1 to 10 depending on which step is performed. Figure 22 shows the webbing matrix results from
comparing the true trees with the trees created by the TreeShop neighbor joining method based on the
true alignment, the trees created by the TreeShop neighbor joining method based on the PhyloPair
alignment and the trees created by the Phylip neighbor joining method based on the true alignment.
The percentage scale shows the webbing matrix values. Figure 23 shows the webbing matrix results
from comparing the true trees with the trees created by the TreeShop neighbor joining method based
on the ClustalW alignments, the trees created by the TreeShop neighbor joining method based on the
PhyloPair alignments and the trees created by the Phylip neighbor joining method based on the
ClustalW alignments. The trees created by the PhyloPair alignments are the same as in Figure 22.
     As in 5.1.5, the two comparisons that are based on the true alignments, Figure 22, are expected
to have a high similarity with the true tree, because the true alignment reflects the history of the
evolved sequences. The trees created by the TreeShop neighbor joining method based on the true
alignment show a very high similarity with the true trees. This is not the case when the Phylip
neighbor joining method based on the true alignment is used. As in 5.1.5, these comparisons all lie
under 30%. The similarity should be close to that of the TreeShop neighbor joining method, as they
are only two different implementations of the same method. Perhaps some parameter needs to be
adjusted in Phylip to achieve this result. The PhyloPair method continues to perform poorly in this
test. The tree comparisons that produce the best results are the TreeShop neighbor joining method
based on the true alignment and the TreeShop neighbor joining method based on the ClustalW multiple
alignment. The Phylip neighbor joining method based on the ClustalW alignment produces results that
are far from those of the TreeShop neighbor joining method based on the ClustalW alignment. The idea
from 5.1.5 that the Phylip neighbor joining method performs better when using alignments from
ClustalW does not seem to hold here.

5.1.9    Increasing the delete and insert length simultaneously
The delete and insert length probabilities were simultaneously increased from 1 to 10 with a step
size of 1. The delete and insert probabilities were fixed at 1.0. The difference between this test
and the test in 5.1.6 is that a deletion and an insertion are always performed as the sequences
evolve and that the length of the deletion and insertion may vary from 1 to 10 depending on which
step is performed. Figure 24 and Figure 25 show the two charts with the webbing matrix percentage
scale on the Y-axis and the increasing delete and insert length probabilities on the X-axis.
     The trees created by the neighbor joining methods based on the true alignments are expected to
have a high similarity with the true trees. As can be seen in Figure 24, this is only true for the
TreeShop neighbor joining method. The trees created with the Phylip neighbor joining method have a
very low similarity with the true tree, under 30%. This indicates that there is a problem when using
this method, because it should perform as well as the TreeShop neighbor joining method. The
PhyloPair method continues to perform poorly in this test. The tree comparisons that produce the
best results are the TreeShop neighbor joining method based on the true alignment and the TreeShop
neighbor joining method based on the ClustalW multiple alignment. The Phylip neighbor joining method
based on the ClustalW alignment produces results that are close to those from the TreeShop neighbor
joining method based on the ClustalW alignment, and significantly better than when the Phylip
neighbor joining method is based on the true alignment. It seems that the Phylip neighbor joining
method has problems when creating trees from the true alignments, but performs better when using
alignments from ClustalW. This is the same conclusion as in 5.1.5.
     As in 5.1.6, the trees created with the TreeShop neighbor joining method and the Phylip
neighbor joining method, when both methods are based on the ClustalW alignment, have a higher
similarity with the true trees than in 5.1.8, where only the insert length is changed. This means
that the ClustalW alignment method produces better results when both deletes and inserts of varying
lengths are made to the sequences. This is a good result for the ClustalW method, as the evolution
of sequences in nature involves both deletes and inserts, and of varying lengths. The PhyloPair
method continues to give low results.

     Figure 22: Increasing the insert length probability. First chart. The true tree is compared
                to the TreeShop neighbor join tree based on the true alignment, the TreeShop
                neighbor join tree based on the pairwise alignment distance matrix and the
                Phylip neighbor join tree based on the true alignment. The percentage scale
                shows the webbing matrix results from the comparisons. 100% means that the
                trees are identical to the true tree.

     Figure 23: Increasing the insert length probability. Second chart. The true tree is compared
                to the TreeShop neighbor join tree based on the ClustalW alignment, the
                TreeShop neighbor join tree based on the pairwise alignment distance matrix
                and the Phylip neighbor join tree based on the ClustalW multiple alignment. The
                percentage scale shows the webbing matrix results from the comparisons. 100%
                means that the trees are identical to the true tree.




    Figure 24: Increasing the delete and insert length probability simultaneously. First chart.
               The true tree is compared to the TreeShop neighbor join tree based on the true
               alignment, the TreeShop neighbor join tree based on the pairwise alignment
               distance matrix and the Phylip neighbor join tree based on the true alignment.
               The percentage scale shows the webbing matrix results from the comparisons.
               100% means that the trees are identical to the true tree.




5.1.10   Increasing the mutation matrix
Increasing the mutation matrix means raising it to a higher power. At each step, the starting
mutation matrix is multiplied with itself so that it is raised to the power of the step number. If,
for example, the tests are started with a PAM1 matrix, step 2 uses a PAM2 matrix, and so on. Note
that in TreeGen the mutation matrix is additionally multiplied with itself according to the length
of the branch during the creation of new nodes. Figure 26 shows the webbing matrix results from
comparing the true trees with the trees created by the TreeShop neighbor joining method based on the
true alignment, the trees created by the TreeShop neighbor joining method based on the PhyloPair
alignment and the trees created by the Phylip neighbor joining method based on the true alignment.
The percentage scale shows the webbing matrix results from the comparisons. Figure 27 shows the
webbing matrix results from comparing the true trees with the trees created by the TreeShop neighbor
joining method based on the ClustalW alignments, the trees created by the TreeShop neighbor joining
method based on the PhyloPair alignments and the trees created by the Phylip neighbor joining method
based on the ClustalW alignments. The trees created by the PhyloPair alignments are the same as in
Figure 26. In this test, the two comparisons that are based on the true alignments, Figure 26, are
expected to have a low similarity with the true tree. This is because the true alignment, which
reflects the history of the evolved sequences, is created only from the delete and insert operations
performed at each node.






     Figure 25: Increasing the delete and insert length probability simultaneously. Second
                chart. The true tree is compared to the TreeShop neighbor join tree based on
                the ClustalW alignment, the TreeShop neighbor join tree based on the pairwise
                alignment distance matrix and Phylip neighbor join tree based on the ClustalW
                multiple alignment. The percentage scale shows the webbing matrix results
                from the comparisons. 100% means that the trees are identical to the true tree.








In this test no delete or insert operations are performed; the true alignment therefore consists
only of the resulting sequences stacked above each other, with no gaps inserted to improve the
alignment. The results from the tests in which the true alignments are involved show the expected
outcome. The methods that are expected to perform best in this test are the two neighbor joining
methods based on the ClustalW alignments. ClustalW was set to use PAM matrices when creating the
alignments, and as all mutations made in this test were based on the PAM matrices, the results from
the two neighbor joining methods based on ClustalW should be the best. This is not the case, as can
be seen in Figure 27. The trees created from the ClustalW alignments show as low a similarity as the
trees created from the true alignments. The method that performs best in this test is the PhyloPair
method. In fact, of all the tests performed, it shows its highest similarity values when the
mutation matrix is increased. The low similarity between the true tree and the trees based on the
ClustalW alignment may be explained by a need to modify some of the ClustalW parameters.
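
The way the mutation matrix is increased can be sketched as repeated matrix multiplication. The
sketch below is illustrative only: the function names are hypothetical, and the 2 x 2 matrix is a
made-up toy transition matrix standing in for a 20 x 20 PAM1 matrix. It assumes that step k uses the
k-th power of the starting matrix, as in the PAM1 to PAM2 example above.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(m, step):
    """Return m raised to the power `step` (step >= 1) by repeated multiplication."""
    result = m
    for _ in range(step - 1):
        result = mat_mul(result, m)
    return result


# Toy two-state transition matrix; each row sums to 1. Not real PAM values.
toy = [[0.99, 0.01],
       [0.02, 0.98]]

step2 = mat_pow(toy, 2)                                    # analogous to PAM2 from PAM1
matrices = {step: mat_pow(toy, step) for step in range(1, 11)}
```

Each power is still a transition matrix whose rows sum to 1, while the diagonal entries shrink as
the step number grows, which corresponds to more mutations per branch.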




      Figure 26: Increasing the mutation matrix. First chart. The true tree is compared to the
                 TreeShop neighbor join tree based on the true alignment, the TreeShop neigh-
                 bor join tree based on the pairwise alignment distance matrix and the Phylip
                 neighbor join tree based on the true alignment. The percentage scale shows the
                 webbing matrix results from the comparisons. 100% means that the trees are
                 identical to the true tree.




     Figure 27: Increasing the mutation matrix. Second chart. The true tree is compared to the
                TreeShop neighbor join tree based on the ClustalW alignment, the TreeShop
                neighbor join tree based on the pairwise alignment distance matrix and the Phylip
                neighbor join tree based on the ClustalW multiple alignment. The percentage
                scale shows the webbing matrix results from the comparisons. 100% means
                that the trees are identical to the true tree.


5.2     Comparing performance between PhyloPair and ClustalW
This test was done to compare the time performance of the PhyloPair method, which uses pairwise
alignments, with that of the ClustalW method, which uses multiple alignments. The number of leaves
was the only parameter increased in the test, and the time the two methods spent creating the trees
was measured in milliseconds. Both methods use the TreeShop neighbor joining method to create the
trees from the alignments. The implementation of the time test was called PhyloPairTimeTest. The
resulting graph is found in Figure 28. The Y-axis displays the time in milliseconds and the X-axis
displays the number of leaves for each tree created. One of the ideas behind the PhyloPair method
was to increase the performance when creating phylogenetic trees. As can be seen in Figure 28, the
PhyloPair method spends significantly less time creating the trees than the ClustalW method. This
difference can be expected to be even larger when trees with more leaves are created.
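
A timing test of this kind can be sketched as follows. This is not the PhyloPairTimeTest
implementation: the helper names are hypothetical, and the tree builder is replaced by a stand-in
that merely visits every pair of sequences once, so the measured times say nothing about PhyloPair
or ClustalW themselves.

```python
import time


def time_ms(build_tree, sequences):
    """Return the wall-clock time in milliseconds that build_tree(sequences) takes."""
    start = time.perf_counter()
    build_tree(sequences)
    return (time.perf_counter() - start) * 1000.0


def all_pairs_stub(sequences):
    # Hypothetical stand-in for a tree builder: counts every pair of sequences once.
    n = len(sequences)
    return sum(1 for i in range(n) for j in range(i + 1, n))


# Only the number of leaves is varied, as in the test described above.
timings = {n: time_ms(all_pairs_stub, ["ACDE"] * n) for n in (10, 20, 40)}
```

Replacing the stand-in with calls to the two real pipelines would give one timing series per method
that could be plotted against the number of leaves.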




    Figure 28: The result from comparing the time performance of the pairwise phylogenetic
               method, PhyloPair, against ClustalW, both using the TreeShop neighbor joining
               method.








Chapter 6

Discussion and future work

The novel method for molecular phylogeny using pairwise alignments was implemented as a program
called PhyloPair. To evaluate PhyloPair, a program called TreeGen, which generates controlled test
data, was implemented. TreeGen uses an artificial evolutionary process to create trees, sequences
and alignments, which were used to compare the novel method against some existing tree building
methods. The time performance of PhyloPair was compared with that of ClustalW when creating trees
with an increasing number of leaves. Software to perform the tests was also written, and the
results were presented as charts.


6.1    Comparing the trees
The results from using only delete operations and varying the delete length in TreeGen showed low
similarity values between the true tree and the trees built by the methods involved in the study.
Even the neighbor joining methods based on the true alignment showed low similarity values. The
results from using the true alignment were expected to be higher. The low similarity values for the
trees based on the true alignments probably come from how the TreeGen software creates the true
alignment when only deletes are performed. Perhaps the sequences in the true alignments, when only
delete operations are involved, are too similar for the neighbor joining methods to be able to
create a tree that is close to the true tree. When a delete is performed, a gap is inserted only in
the affected sequence, leaving the rest of the sequences as they were. In contrast, when an insert
is performed, all other sequences have gaps inserted. This may be the reason for the low similarity
values from comparing the true trees with the trees created from the true alignment when only
deletes are involved.
     The results from using only insert operations in TreeGen showed that trees created from the
true alignment had a high similarity with the true tree, except when the Phylip neighbor join method
was used. ClustalW using the TreeShop neighbor joining method had a similarity measure slightly
below that of the TreeShop neighbor join based on the true alignment. The Phylip neighbor joining
method based on the ClustalW alignment showed low similarity values, but slightly higher than the
PhyloPair method. The lowest results came from using the Phylip neighbor join based on the true
alignment. A strange result is that the Phylip neighbor join method performs better using the
ClustalW alignment than using the true alignment, as seen in Figures 16 and 17. When increasing the
insert length probability and keeping the insert probability fixed, the comparisons made with the
TreeShop neighbor join method based on the true alignment showed trees almost identical to the true
tree, as seen in Figures 22 and 23. The Phylip neighbor join, however, had very low similarity
values, as did the TreeShop neighbor join based on PhyloPair. The TreeShop neighbor join based on
ClustalW had high similarity values, and the Phylip neighbor join based on the ClustalW alignment
had slightly better similarity values than PhyloPair.
     Increasing both delete and insert probabilities, Figures 18 and 19, made the tree comparisons
based on ClustalW alignments outperform the tree comparisons based on the true alignment.
This was the case for both the TreeShop neighbor join and the Phylip neighbor join methods. The
comparisons made using the Phylip neighbor join based on the true alignment and the TreeShop
neighbor join based on PhyloPair showed values similar to those obtained when increasing the
insert probability.
     When both the delete and insert length probabilities were increased, Figures 24 and 25, the
tree comparison using the TreeShop neighbor join based on the ClustalW alignment outperformed the
TreeShop neighbor join based on the true alignment. A similar situation occurred when increasing
the delete and insert probabilities simultaneously. The similarity values for the TreeShop neighbor
join based on the true alignment decrease compared to when only the insert length probability is
increased. Both the Phylip neighbor join based on the true alignment and the TreeShop neighbor join
based on PhyloPair have low values. The values for the Phylip neighbor join based on the ClustalW
alignment are actually better than those acquired when increasing the insert length only.
     Increasing the PAM matrix resulted in low similarity values for all the tree building methods.
The method that performed best was the TreeShop neighbor join based on PhyloPair. The true
alignments created by TreeGen are based only on deletes and inserts, which means that no account
is taken of mutations made to symbols. This explains why the similarity values are low when the
trees are created from the true alignments. Unfortunately, the other methods also have low
similarity values.
     According to the comparisons made, the PhyloPair method as implemented here shows a low
similarity measure. The similarity values from the comparisons all lie at about 30%, except for the
PAM matrix tests, where the values are about 40%. These values are too low to give any information
regarding the usefulness of the method, and they are nowhere near the similarity measures achieved
using ClustalW. The Phylip neighbor join method has very low similarity values when the true
alignment is used. One possibility is that this method expects a different kind of alignment than
the TreeShop neighbor join method does. There may also be some parameters that need to be adjusted
in the neighbor and protdist parts of the Phylip package. As can be seen in Figure 22, the Phylip
neighbor join method has very low values when the length probability is increased. When based on
the ClustalW alignment, in Figure 23, the method performs slightly better.
     The poor results from PhyloPair and from the Phylip neighbor joining method based on the true
alignment suggest that something goes wrong when these two methods are used to create the trees.
The exact reasons for this are not yet known. Possible causes include the sorting order of the
trees, the order of the sequences in the alignments involved, or other programming issues.
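     One way to rule out the sorting order of the trees as a cause is to compare them with a
measure that ignores the order of the children. The sketch below is not the similarity measure
used in this study. It is a simpler clade-overlap (Jaccard) comparison, shown only as an
illustration, and it assumes rooted trees written as nested tuples with leaf names as strings. All
names in it are hypothetical.

```python
def clades(tree):
    """Return the set of leaf-name sets (clades) found under each internal node."""
    found = set()

    def leaves(node):
        if isinstance(node, str):      # a leaf is represented by its name
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)             # record the clade of this internal node
        return members

    leaves(tree)
    return found


def clade_similarity(tree_a, tree_b):
    """Shared clades divided by all clades of two rooted trees (Jaccard index)."""
    a, b = clades(tree_a), clades(tree_b)
    return len(a & b) / len(a | b)


true_tree = ((("A", "B"), "C"), ("D", "E"))
reordered = (("E", "D"), ("C", ("B", "A")))   # same tree, children listed in another order
different = ((("A", "C"), "B"), ("D", "E"))   # B and C have swapped places

same_score = clade_similarity(true_tree, reordered)
diff_score = clade_similarity(true_tree, different)
```

Here `same_score` is 1.0 because the clade sets are identical regardless of child order, while
`diff_score` is 0.6 because three of the five distinct clades are shared.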
     It is generally inadvisable to use a purely computer-generated alignment as input to a tree
building method [1]. The automated alignments and trees created in the tests performed are not
intended to be a complete phylogenetic analysis of the data. However, the test programs written may
be developed and refined to better reflect a real phylogenetic analysis.


6.2     Comparing performance
The comparison of performance between PhyloPair and ClustalW, as the number of leaves in the tree
to be created was increased, showed that PhyloPair performed significantly better as the trees grew
larger. As the curves in Figure 28 seem to diverge more and more, PhyloPair is expected to perform
much better than ClustalW for very large trees. This result is of limited importance, however,
until it can be shown that the method also achieves an acceptable similarity measure with the true
tree.


6.3     Future work
Researchers today tend to abandon the PAM matrices in favor of BLOSUM [12] matrices. This
option could also be made available in TreeGen. The TreeGen algorithm currently evolves amino
acids in the same way as nucleotides. This does not reflect biological behaviour very well.
Nucleotide sequences should instead be evolved codon by codon rather than symbol by symbol. This
makes it possible to prevent some unwanted behaviour, for example that an evolved sequence receives
a stop codon in the middle. The TreeGen algorithm may be developed and refined further, for example
by adding user parameters. Different processes and rules for creating the tree may also be
developed. The test programs could be developed to better reflect a real phylogenetic analysis,
which could be done in cooperation with phylogenetic researchers. The PhyloPair method has only
been tested with BLAST, and no parameters in BLAST have been modified. A further study may involve
using different alignment programs and modifying the parameters of these programs.
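
The codon-based evolution suggested above could look roughly like the following sketch. It is not
part of TreeGen. The function names are hypothetical, the uniform choice of substituted nucleotide
is an assumption made for brevity (a real implementation would draw from a substitution model), and
the sequence length is assumed to be a multiple of three.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
NUCLEOTIDES = "ACGT"


def mutate_codon(codon, rng):
    """Substitute one nucleotide, redrawing until the result is not a stop codon."""
    while True:
        pos = rng.randrange(3)
        new_base = rng.choice([n for n in NUCLEOTIDES if n != codon[pos]])
        candidate = codon[:pos] + new_base + codon[pos + 1:]
        if candidate not in STOP_CODONS:
            return candidate


def evolve_coding_sequence(seq, rate, rng):
    """Mutate each codon except the last with probability `rate`.

    The final codon (normally the stop codon) is copied unchanged.
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    evolved = [mutate_codon(c, rng) if rng.random() < rate else c
               for c in codons[:-1]]
    return "".join(evolved + codons[-1:])


rng = random.Random(1)               # fixed seed so a run can be repeated
original = "ATGTATTGGCAAGGATAA"      # hypothetical gene ending in the stop codon TAA
evolved = evolve_coding_sequence(original, 0.5, rng)
```

Because every mutated codon is checked before it is accepted, the evolved sequence keeps its
length and its terminal stop codon, and no stop codon can appear in the middle.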




Acknowledgments

I wish to thank Pedher Johansson at the Department of Computing Science, Umeå University, and
Tobias Hill, Robert Fredriksson and Helgi Schiöth at the Department of Neuroscience, Unit of
Pharmacology, Uppsala University.




Bibliography

 [1] Andreas D. Baxevanis and B.F. Francis Ouellette. Bioinformatics. John Wiley and Sons, Inc, 2nd
     edition, 2001.
 [2] James Archie, William H.E. Day, Wayne Maddison, Christopher Meacham, F. James Rohlf,
     David Swofford, and Joseph Felsenstein. The newick tree format.
     http://evolution.genetics.washington.edu/phylip/newick_doc.html, Aug. 1990.
 [3] Bryan Bergeron. Applied bioinformatics computing: An introduction.
     http://www.informit.com/articles/article.asp?p=30121, Nov. 2002.
 [4] David Eliot Brody and Arnold R. Brody. The Science Class You Wish You Had... The Seven Greatest
     Scientific Discoveries in History and the People Who Made Them. Wahlström & Widstrand, 1st
     edition, 1997.
 [5] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins.
     Atlas of protein sequence and structure, supplement 3, pages 345–352, 1978.
 [6] Thomas Down et al. Biojava. http://www.biojava.org/.
 [7] Joseph Felsenstein. Phylip, phylogeny inference package.
     http://evolution.genetics.washington.edu/phylip.html, Dec. 2004.
 [8] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/, 2004.
 [9] National Center for Biotechnology Information. Blast: Pairwise alignment heuristic.
     http://www.ncbi.nlm.nih.gov/BLAST/, Nov. 2004.
[10] National Center for Biotechnology Information. Fasta format description.
     http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml, Nov. 2004.
[11] National Center for Biotechnology Information. A science primer: What is a cell.
     http://www.ncbi.nlm.nih.gov/About/primer/genetics.html, Mar. 2004.
[12] S. Henikoff and J.G. Henikoff. Automated assembly of protein blocks for database searching.
     Nucleic Acids Research 19, pages 6565–6572, 1991.
[13] G Kimmel, A Farkash, and Ron Shamir et al. Algorithms for molecular biology, Oct. 2001.
[14] Ching Law, Casim A. Sarkar, and Mona Singh. Topics in computational molecular biology,
     Oct. 1999.
[15] Sun Microsystems. http://java.sun.com/, Dec. 2004.
[16] IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature and
     symbolism for amino acids and peptides.
     http://www.chem.qmul.ac.uk/iupac/AminoAcid/index.html, 1983.
[17] Irit Orr. Introduction to phylogenetic analysis.
     http://bip.weizmann.ac.il/course/introbioinfo/lect12/phylogenetics.pdf, Jun. 2003.
[18] Jonathan Pevsner. Bioinformatics and functional genomics. John Wiley and Sons Inc., 1st
     edition, 2003.
[19] Joe Hicklin, Cleve Moler, Peter Webb, Ronald F. Boisvert, Bruce Miller, Roldan Pozo, and Karin
     Remington. Java matrix package. http://math.nist.gov/javanumerics/jama/, Jun. 1999.
[20] Tomas Söderlund. Visualization of multiple phylogenetic trees and their similarities. Master’s
     thesis, Uppsala University, Information Technology, Computing Science Department, 2003.
[21] Tommy Säfström and Andreas Vernersson. Methods for objective comparison of phylogenetic
     trees. Master’s thesis, Uppsala University, Information Technology, Computing Science
     Department, Jan. 2004.
[22] J Stoye, D Evers, and F Meyer. Rose: Generating sequence families. Bioinformatics, Vol 14
     no.2, pages 157–163, Oct. 1998.
[23] J.D. Thompson, D.G. Higgins, and T.J. Gibson. CLUSTAL W: improving the sensitivity of
     progressive multiple sequence alignment through sequence weighting, position-specific gap
     penalties and weight matrix choice. Nucleic Acids Research 22, pages 4673–4680, 1994.
[24] Yang Zhong, Christopher A. Meacham, and Sakti Pramanik. A general method for tree-comparison
     based on subtree similarity and its use in a taxonomic database. Biosystems 42, pages 1–8,
     1997.



