PowerPoint Presentation by MfT85YYG


									 Gene Families and Functional Annotation

Once genes have been identified, they need to be functionally
annotated
A computational first step is to group genes with other genes -
some of which will hopefully have known functions
Once genes are classified, we can begin to examine whether
certain genes are missing or overrepresented in the given
genome - possibly reflecting the niche of the organism
As with earlier computational analyses, functional annotation
based solely on in silico analyses is only a first step




  15:51
 Gene Families and Functional Annotation

Sequence-similarity searches are a first pass in
classification
BLAST - Basic Local Alignment Search Tool
BLASTn - nucleotide query against a nucleotide database
BLASTp - protein query against a protein database
BLASTx - translates a nucleotide sequence in all six
reading frames and scans these against a protein database
All give an Expectation (E) value - the number of matches with
at least that score expected by chance - to evaluate the
significance of the match
In both eukaryotes and prokaryotes, 1/3 to 1/2 of searched
genes do not match a known protein = orphan genes
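
As a rough illustration of how an E value behaves, the sketch below uses the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw alignment score. The function name and the parameter values are assumptions chosen for illustration; they are not taken from the slides or from any particular BLAST release.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameters only (assumed, not from the slides).
LAMBDA, K = 0.3, 0.1

e_small_db = expect_value(80, 300, 1_000_000, LAMBDA, K)
e_large_db = expect_value(80, 300, 10_000_000, LAMBDA, K)
e_better_hit = expect_value(100, 300, 1_000_000, LAMBDA, K)

print(f"small database:  E = {e_small_db:.3g}")
print(f"large database:  E = {e_large_db:.3g}")
print(f"higher score:    E = {e_better_hit:.3g}")
```

The same score is less significant in a larger database (E grows linearly with database size), while a higher score lowers E exponentially.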

             Protein Structural Domains
Proteins are made up of combinations of
distinct structural units or domains
Genes can be grouped based on the
domains they contain
These groupings depend on structural
similarity - sequence similarity alone may
be insufficient




      Gene clustering by seq. similarity




BLAST searches generally return matches from more than
one protein from more than one species
This happens if the query protein is part of a gene (protein)
family or contains multiple domains found in other proteins
BLAST output can be interpreted as a match to one or more
protein domains - searches of closely related species often identify
genes/proteins with similar domain structure
Domains shuffle over evolutionary time and are often found in
different combinations across more distant comparisons
Domains do tend to follow biologically reasonable patterns - DNA
binding domains with other DNA binding domains, transmembrane
domains with intra- and extracellular domains
          Gene clustering by seq. similarity

Genes can be classified by domain content
The Enzyme Commission (EC) hierarchical classification of
enzymes - each enzyme is assigned a four-part number that reflects
successive sub-classification of function, e.g. alcohol
dehydrogenase (ADH) is EC 1.1.1.1
Other classification schemes are not as obvious - protein
function is often context-specific
Pfam - protein family database that allows access to biochemical
properties of predicted proteins
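
The hierarchical nature of EC numbers can be made concrete with a small sketch: two enzymes that share more leading fields are more closely related in function. The helper names `parse_ec` and `shared_levels` are hypothetical, and the second EC number is used only to illustrate the comparison.

```python
def parse_ec(ec):
    """Split an EC number such as 'EC 1.1.1.1' into its four levels."""
    fields = ec.upper().replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields, got {ec!r}")
    return tuple(fields)

def shared_levels(ec_a, ec_b):
    """Count how many leading EC levels two enzymes have in common."""
    shared = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a != b:
            break
        shared += 1
    return shared

adh = "EC1.1.1.1"                 # alcohol dehydrogenase, as on the slide
print(parse_ec(adh))
print(shared_levels(adh, "EC 1.1.1.2"))   # differ only at the last level
print(shared_levels(adh, "EC 2.7.1.1"))   # differ at the top level
```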




          Gene clustering by seq. similarity

InterPro - classifies individual protein domains




          Gene clustering by seq. similarity

Protein functional prediction ≠ assignment of genes to
families
Protein function prediction allows general conclusions
about protein function and genome content based on
protein domains
Classification of gene families involves distinguishing
between paralogs and orthologs




          Major Classes of Protein Function

Enzymes
Signal transduction (receptors and kinases)
Nucleic acid binding (transcription factors, nucleic acid
enzymes)
Structural (cytoskeletal, extracellular matrix, motor proteins)
Channel (voltage and chemically gated)
Immunoglobulins
Calcium-binding proteins
Transporters
Subclasses vary - as does their representation within each
genome
                    Gene Clusters
Alignment searches (BLAST) identify genes with similar
sequence to the query
If searches identify a single gene, or genes with a single function,
then functional assignment to the query sequence is simple - but
searches often identify large numbers of sequences with multiple functions
The most similar sequence is not necessarily the sequence with which
the query sequence shares a function




                     Gene Clusters
One approach is to try and define as large a protein family as
possible (including many possible functions)
PSI-BLAST can be used to identify a large set of potential
protein family members
A BLAST search is conducted to create an initial protein
sequence alignment - which is then used to initiate a fresh
search
The process is then iterated until no further matches are
identified - this reduces the degree of sequence similarity required for
inclusion in the family
A “true” family of genes ought to be bounded by a
significance cut-off to limit the proteins included


                    Gene Clusters
Clusters of orthologous genes, COGs, can be used to
classify proteins
COGs are created by identifying the best hit for each gene in
complete pairwise comparisons across a set of genomes




                    Gene Clusters
185,000 proteins from 66 microbial genomes identified 4,873
COGs - covering 75% of all predicted microbial proteins
50% of 110,000 proteins from fly, nematode, human,
Arabidopsis, yeasts and a microsporidian form 4,852 COGs



                                                 COG0837




                    Gene Clusters
COGs include both orthologs and paralogs
In (a) HuA and HuA’ are paralogs - distinguishing which
retains the ancestral function is not as simple as determining
which has the most similar sequence




                     Gene Clusters
HuA and MmA differ in 5 amino acids, none of which affect function
HuA’ and MmA differ in 4 amino acids, but one of these changes the
charge of a critical residue
Clustering based on similarity alone would lead to erroneous functional
classification




                  Gene Phylogenies
Clustering groups genes by sequence similarity
Phylogenetic analyses ascertain how groups of similar genes
are related by descent
In the HuA, MmA example, the 2 A’ genes can either result
from one (orthologs) or two (paralogs) duplication events
Paralogs are less likely to share a function




                   Gene Phylogenies
Often gene function can be inferred from phylogenetic analysis
The first step is aligning the sequences
A gene tree is then constructed using some algorithm
Duplications and gene relatedness are then ascertained
In the example on the left, an ancient duplication splits 2 functional
groups; on the right, protein 2 likely has the same function as 5 and 6




                      Gene Ontology
Molecular function alone may not predict/describe biological function
(think crystallins)
The Gene Ontology (GO) annotates and groups genes using a
multi-character approach including cell biological and molecular
function and/or subcellular localization
The GO project uses defined vocabulary and a hierarchical
structure to classify genes and includes links indicating the type of
evidence for the classification




                        GO network
In this example, the gene INNER NO OUTER is at the center with the
3 separate classifications radiating out from it




                       Gene Ontology
The GO vocabulary includes 7000 terms describing molecular function and
5000 describing biological process; some annotations include as
many as 12 levels within the hierarchy of terms
This is too deep for efficient computational searches - other
simplified systems are also being developed to allow genes to be
computationally screened and classified




             Molecular Phylogenetics
• Homology = similarity due to common ancestry

• The Gpdh gene sequences from two different species are
  homologous sequences

• All comparisons made in molecular evolution (biology) are
  based on comparing homologous sequences = apples to
  apples

• Sequences must be aligned to allow comparison =
  homologous bases lined up in columns
The cow and sheep β globin proteins are 2 a.a. shorter than the other
sequences, so gaps are added to align the sequences

Unaligned            Aligned (. = same as Human, - = gap)
Human     MVHLTP     Human    MVHLTP
Baboon    MVHLTP     Baboon   ......
Cow       MLTP       Cow      .--...
Sheep     MLTP       Sheep    .--...
Mouse     MVHLTP     Mouse    ......
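
The dot notation in the aligned β globin example can be generated mechanically once the gaps are in place. This is a minimal sketch; the function name `dot_notation` is hypothetical, and the gapped Cow and Sheep sequences (`M--LTP`) are inferred from the dot pattern shown on the slide.

```python
def dot_notation(reference, aligned):
    """Show residues identical to the reference as '.', keep gaps and differences."""
    if len(reference) != len(aligned):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "." if res == ref and res != "-" else res
        for ref, res in zip(reference, aligned)
    )

alignment = {
    "Human": "MVHLTP",
    "Baboon": "MVHLTP",
    "Cow": "M--LTP",
    "Sheep": "M--LTP",
    "Mouse": "MVHLTP",
}
reference = alignment["Human"]
for name, seq in alignment.items():
    shown = seq if name == "Human" else dot_notation(reference, seq)
    print(f"{name:<8}{shown}")
```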

                        Gene Trees
• Accumulation of sequence differences through time is the basis
  of molecular systematics, which analyses them in order to infer
  evolutionary relationships
• A gene tree is a diagram of the inferred ancestral history of a
  group of sequences
• A gene tree is only an estimate of the true pattern of evolutionary
  relations
• UPGMA and Neighbor joining = simple ways to estimate a gene
  tree
• Bootstrapping = sampling with replacement, a common technique
  for assessing the reliability of a node in a gene tree
• Taxon = the source of each sequence




             Rooted and Unrooted Trees
Analysis of a set of genes produces an unrooted tree
Trees can be rooted, assigned polarity, by assignment of an
outgroup - a sequence that is known to be more distantly related
than any within the rest of the analysis (the ingroup)
Tree branch length denotes the amount
of change along that branch in some
tree building methods




 3 distinct unrooted trees




                 Tree Building methods

The 3 primary methods (algorithms) for building gene trees are:
1. Parsimony - a character-based approach that surveys every
possible tree topology. The most parsimonious tree is the
topology that requires the minimum # of steps (changes) in a data
set
At position 1 of this example, tree 1 requires 1 change, tree 2 requires
2 changes and tree 3 requires 2 changes. When the 4 positions are summed,
tree 3 is found to be the best (shortest)
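
Counting the minimum number of changes per position, and summing over positions for each candidate topology, can be sketched as follows. The counting rule used here is Fitch's algorithm, which the slides do not name, and the four-taxon alignment is hypothetical (it is not the slide's example, so a different tree wins).

```python
def fitch_changes(tree, states):
    """Return (possible_states, changes) for one character on a nested-tuple tree."""
    if isinstance(tree, str):                    # leaf: its observed state
        return {states[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], states)
    right_set, right_cost = fitch_changes(tree[1], states)
    common = left_set & right_set
    if common:                                   # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Sum the minimum number of changes over all aligned positions."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_changes(tree, column)[1]
    return total

# Hypothetical four-taxon data (assumed for illustration).
alignment = {"A": "AAGT", "B": "AAGC", "C": "GCGT", "D": "GCTC"}
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
lengths = {name: tree_length(t, alignment) for name, t in trees.items()}
best = min(lengths, key=lengths.get)
print(lengths, "most parsimonious:", best)
```

With four taxa there are only three topologies to score; the number of topologies grows very quickly with more taxa, which is why exhaustive searches become impractical.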




                 Tree Building methods

The 3 primary methods (algorithms) for building gene trees are:
2. Maximum Likelihood - also a character-based approach, surveys
every possible tree topology and assigns all topologies a
maximum likelihood estimate (score) based on a model of
evolution describing the probability of changes (mutation) through
time. The ML tree is the one with the highest likelihood
This method can be accurate, but is computationally expensive




                Tree Building methods

The 3 primary methods (algorithms) for building gene trees are:
3. Distance Methods - are not character based; instead they
calculate pairwise distances across entire aligned sequences and
construct distance matrices. Trees are built by grouping pairs with the
shortest distances between them. These methods can also
incorporate complex evolutionary models
These methods are computationally cheap and will always return an
answer, but are not always accurate.

The simplest distance method, Unweighted Pair Group Method
with Arithmetic Mean, UPGMA, simply counts the number of
sequence changes in all pairwise comparisons
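
A minimal sketch of building the pairwise distance matrix that a distance method starts from. It counts differences as described above and, as one example of an evolutionary model, applies the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3) for nucleotide data. The sequences and function names are hypothetical, and skipping gapped columns is an assumption of this sketch.

```python
import math
from itertools import combinations

def count_differences(a, b):
    """Number of aligned positions that differ, skipping gapped columns."""
    return sum(1 for x, y in zip(a, b) if x != y and "-" not in (x, y))

def jukes_cantor(a, b):
    """Jukes-Cantor corrected distance for aligned nucleotide sequences."""
    compared = sum(1 for x, y in zip(a, b) if "-" not in (x, y))
    p = count_differences(a, b) / compared      # proportion of differing sites
    return -0.75 * math.log(1 - 4 * p / 3)      # undefined once p reaches 0.75

# Hypothetical aligned nucleotide sequences.
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTTC", "s3": "ACATACGATC"}
matrix = {
    (a, b): count_differences(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)
}
print(matrix)
print(round(jukes_cantor(seqs["s1"], seqs["s2"]), 4))
```

The corrected distance is always a little larger than the raw proportion of differences, because it allows for multiple changes at the same site.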




        UPGMA Trees




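                  UPGMA Tree Construction

Pairwise distance matrix:

      Ba    Co    Sh    Mo    Ha    Ch
Hu     2     6     9     8     9    13
Ba           7    10     7    10    13
Co                 3    11    12    16
Sh                      12     9    15
Mo                             7    16
Ha                                  14

Smallest distance: Hu-Ba = 2. Join Hu and Ba; each branch is 2/2 = 1.0

Distances after joining Hu and Ba (each new entry is the mean of the
two merged rows, e.g. HuBa-Co = (6 + 7)/2 = 6.5):

        Co    Sh    Mo    Ha    Ch
HuBa   6.5   9.5   7.5   9.5    13
Co             3    11    12    16
Sh                  12     9    15
Mo                         7    16
Ha                              14

Smallest distance: Co-Sh = 3. Join Co and Sh; each branch is 3/2 = 1.5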
         UPGMA Tree Construction

Distances after joining Co and Sh:

        CoSh    Mo     Ha     Ch
HuBa       8   7.5    9.5     13
CoSh          11.5   10.5   15.5
Mo                      7     16
Ha                            14

Smallest distance: Mo-Ha = 7. Join Mo and Ha; each branch is 7/2 = 3.5




    UPGMA Tree Construction

Distances after joining Mo and Ha:

        CoSh   MoHa     Ch
HuBa       8    8.5     13
CoSh             11   15.5
MoHa                    15

Smallest distance: HuBa-CoSh = 8. Join HuBa and CoSh at height 8/2 = 4
New internal branches: 4 - 1.0 = 3.0 (to HuBa) and 4 - 1.5 = 2.5 (to CoSh)

              UPGMA Tree Construction

Distances after joining HuBa and CoSh:

                  MoHa      Ch
((HuBa)(CoSh))    9.75   14.25
MoHa                        15

Smallest distance: ((HuBa)(CoSh))-MoHa = 9.75. Join at height 9.75/2 = 4.875
New internal branches: 4.875 - 4 = .875 (to ((HuBa)(CoSh))) and
4.875 - 3.5 = 1.375 (to MoHa)


             UPGMA Tree Construction

Distances after joining ((HuBa)(CoSh)) and MoHa:

                            Ch
((HuBa)(CoSh))(MoHa)    14.625

Ch joins last at height 14.625/2 = 7.3125
New internal branch: 7.3125 - 4.875 = 2.4375; the Ch branch is 7.3125

                Final UPGMA Tree

((((a:1.0,b:1.0):3.0,(c:1.5,d:1.5):2.5):.875,(e:3.5,f:3.5):1.375):2.44,g:7.3125);
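
The construction above can be reproduced with a short clustering sketch. By default it averages the two merged rows equally, which is what the slides do; textbook UPGMA weights each row by cluster size, and for this matrix that changes only the last step (14.5 rather than 14.625, so the root height becomes 7.25). The function name `cluster` and the `weight_by_size` flag are hypothetical.

```python
from itertools import combinations

def cluster(distances, weight_by_size=False):
    """Repeatedly join the closest pair; return a list of (a, b, height) joins."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    size = {name: 1 for pair in distances for name in pair}
    joins = []
    while len(size) > 1:
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2        # each side of the join gets half
        merged = f"({a},{b})"
        wa, wb = (size[a], size[b]) if weight_by_size else (1, 1)
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    wa * d[frozenset((a, other))] + wb * d[frozenset((b, other))]
                ) / (wa + wb)
        size[merged] = size.pop(a) + size.pop(b)
        joins.append((a, b, height))
    return joins

names = ["Hu", "Ba", "Co", "Sh", "Mo", "Ha", "Ch"]
rows = [[2, 6, 9, 8, 9, 13], [7, 10, 7, 10, 13], [3, 11, 12, 16],
        [12, 9, 15], [7, 16], [14]]
distances = {
    (names[i], names[i + 1 + j]): value
    for i, row in enumerate(rows)
    for j, value in enumerate(row)
}

slide_heights = [height for _, _, height in cluster(distances)]
weighted_heights = [height for _, _, height in cluster(distances, weight_by_size=True)]
print(slide_heights)
print(weighted_heights)
```

The branch length leading to each new node is its height minus the height of the cluster below it, e.g. 4.0 - 1.0 = 3.0 for the branch above Hu and Ba.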



                Phylogenetic Trees


Phylogenetic trees are representations summarizing a
reconstructed evolutionary history
A phylogenetic tree is a diagram that proposes a
hypothesis for reconstructed evolutionary relationships
between a set of objects (taxa or OTUs)
Phylogenetic trees can represent relationships between
species or genes
                 Phylogenetic Trees




OTUs are connected by a set of lines - branches or edges
External nodes or leaves are existing OTUs or extinct
objects that did not give rise to descendants
Internal nodes represent ancestral states hypothesized to
have occurred during evolution
[Figure: tree of Human, Monkey, Rat, Mouse, Chicken, Sturgeon, Platy,
Zebrafish, Lamprey and Hagfish]


Internal nodes can represent speciation or gene duplication
events
A gene tree does not necessarily coincide with a species
tree
Gene duplications will cause a gene tree to differ from a
species tree
                         Resolution
[Figure: two trees of Human, Monkey, Rat, Mouse, Chicken, Sturgeon, Platy,
Zebrafish, Lamprey and Hagfish]
Trees may be fully or only partially resolved
Every node in a fully resolved tree is bifurcating or
dichotomous
Some nodes in unresolved trees are multifurcating or
polytomous
                             Rooting
[Figure: tree of Human, Monkey, Rat, Mouse and Chicken, with the 3 unrooted
trees of Human, Monkey, Rat and Mouse]
Unrooted trees establish the relationships among taxa, but
not the evolutionary pathway
For 4 taxa there are 3 unrooted trees, but 15 rooted trees
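
The counts quoted above follow from the standard formulas for fully resolved trees: (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n taxa. A short sketch (function names are hypothetical) shows how quickly these numbers grow:

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def unrooted_trees(taxa):
    """Number of fully resolved unrooted trees: (2n - 5)!! for n >= 3."""
    return double_factorial(2 * taxa - 5)

def rooted_trees(taxa):
    """Number of fully resolved rooted trees: (2n - 3)!! for n >= 2."""
    return double_factorial(2 * taxa - 3)

for n in (4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```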
                  Types of Trees

[Figure: two cladograms of Human, Monkey, Rat and Mouse]




Cladograms show the genealogy of taxa, but do not include
timing or divergence (branch lengths have no meaning)
                   Types of Trees
[Figure: additive tree of Human, Monkey, Rat and Mouse]


Additive trees show the genealogy of taxa and branch
lengths represent divergence between taxa
Comparison of branch lengths gives a meaningful estimate
of evolutionary divergence
                     Types of Trees
[Figure: ultrametric tree of Human, Monkey, Rat and Mouse drawn along a
time axis]



Ultrametric trees are similar to additive trees, but assume a
constant rate of change between characters used to build
the tree - a molecular clock
Comparison of branch lengths gives a meaningful estimate
of evolutionary divergence
Ultrametric trees are always rooted
                      Outgroups
[Figure: tree of Human, Monkey, Rat, Mouse and Chicken drawn along a
time axis]




The most accurate way to root a tree is to use an
“outgroup” - a taxon or group of taxa more distantly related
than any member of the “ingroup”
           Representing Phylogenies


Phylogenetic relationships can be represented as graphical trees,
tables or parenthetical statements (Newick or New Hampshire format)



  ((raccoon, bear),((sea_lion, seal), ((monkey,cat), weasel)), dog);


((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08,
      ((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);
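
To show how a parenthetical statement maps onto a tree, here is a minimal sketch of a Newick reader. It handles only names, nested parentheses and optional branch lengths (no quoted labels or comments), and the function names are hypothetical; established phylogenetics libraries provide more complete parsers.

```python
def parse_newick(text):
    """Parse a Newick string into nested (children, name, length) tuples."""
    s = "".join(text.split()).rstrip(";")    # drop whitespace and the final ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                          # consume '('
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                          # consume ')'
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return children, name, length

    return node()

def leaves(tree):
    """List leaf names from left to right."""
    children, name, _ = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]

tree = parse_newick(
    "((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08,"
    "((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);"
)
print(leaves(tree))
```

Internal nodes without a branch length in the string (two of them in this example) are returned with a length of `None` rather than a guessed value.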
                     Bootstrapping
Many tree building algorithms will give a single, fully
resolved, tree from any data set.
Nodes will all be equally represented even if one is
supported by many characters and another by very few.
How to quantify support for any given tree? We can’t re-
run evolution. We can sample many different genes and we
can bootstrap our data.
Bootstrapping is sampling a data set (here, the columns of an
alignment), with replacement, to generate a new data set of the
same size. We then use this new set in a phylogenetic analysis -
and repeat this process hundreds or thousands of times.
We can then present bootstrap scores at each node, the %
of bootstrap trees that contained that specific node
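
The resampling step can be sketched as follows, using the four-sequence alignment from the bootstrapping example. The support measure here is deliberately simplified (how often sequence 1 is closer to 3 than to 2, instead of rebuilding a full tree for every replicate), and the function names and the fixed random seed are assumptions of this sketch.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def differences(a, b):
    """Number of differing positions (a gap counts as a difference)."""
    return sum(x != y for x, y in zip(a, b))

alignment = {
    "1": "GADDYTTKLP",
    "2": "GVEDYTTK-P",
    "3": "GADDYTTRLP",
    "4": "CVEDYTTR-P",
}
rng = random.Random(0)                      # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]

supporting = sum(
    differences(rep["1"], rep["3"]) < differences(rep["1"], rep["2"])
    for rep in replicates
)
print(f"1 groups with 3 in {100 * supporting / len(replicates):.0f}% of replicates")
```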
                                     Bootstrapping
Original data set and three resampled data sets, each with its pairwise
differences (gaps counted as differences) and the order of taxa in the
resulting tree:

Original data set                   Resampled data set 1
1- G A D D Y T T K L P              1- T K L L T P D A D G
2- G V E D Y T T K - P              2- T K - - T P E V D G
3- G A D D Y T T R L P              3- T R L L T P D A D G
4- C V E D Y T T R - P              4- T R - - T P E V D C
    1  2  3                             1  2  3
2   3                               2   4
3   1  4                            3   1  5
4   5  2  4                         4   6  2  5
Tree order: 1, 3, 2, 4              Tree order: 1, 3, 2, 4

Resampled data set 2                Resampled data set 3
1- L P Y D A D D P T G              1- G P K D K K T P D P
2- - P Y E V D E P T G              2- G P K D K K T P E P
3- L P Y D A D D P T G              3- G P R D R R T P D P
4- - P Y E V D E P T C              4- C P R D R R T P E P
    1  2  3                             1  2  3
2   4                               2   1
3   0  4                            3   3  4
4   5  1  5                         4   5  4  2
Tree order: 1, 3, 2, 4              Tree order: 1, 2, 3, 4
Bootstrapping and Condensed Trees
In this example, bear and raccoon form a pair in 50% of the
data sets
We can choose to present a tree that condenses branches of less
than some threshold bootstrap support - a condensed tree
Consensus Trees
Some tree building methods will produce multiple equally “good”
trees
A consensus tree shows the features that are shared by all or
some trees.
A strict consensus tree only includes features found in all trees
A majority-rule consensus tree includes features found in ≥ a set %
of trees
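
One way to picture strict and majority-rule consensus is to treat each tree as the set of groupings (clades) it contains and keep the groupings that occur often enough. This is a minimal sketch with hypothetical rooted trees written as nested tuples; the function names and the example trees are assumptions.

```python
from collections import Counter

def clades(tree):
    """Collect every grouping of leaves below the root of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    root = walk(tree)
    found.discard(root)            # the full set of taxa is in every tree
    return found

def consensus_clades(trees, threshold):
    """Groupings present in at least `threshold` (0 to 1) of the trees."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade for clade, n in counts.items() if n / len(trees) >= threshold}

# Hypothetical set of three equally good trees.
trees = [
    ((("bear", "raccoon"), "dog"), ("seal", "sea_lion")),
    ((("bear", "dog"), "raccoon"), ("seal", "sea_lion")),
    ((("bear", "raccoon"), "dog"), ("seal", "sea_lion")),
]
strict = consensus_clades(trees, 1.0)
majority = consensus_clades(trees, 0.5)
print(sorted(sorted(c) for c in strict))
print(sorted(sorted(c) for c in majority))
```

The same counting, applied to trees built from bootstrap replicates, gives the % support shown at each node.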
                      Reconciled Trees




[Figure: species tree and tree showing duplications]


Reconciled trees attempt to combine gene trees and species
trees, clearly identifying both speciation and duplication
events
                 Reconciled Trees
[Figure: species tree indicating locations of duplication events, and
tree showing information on speciation, duplication and gene loss]
                   Analogous Genes




Not all proteins with similar function have a common evolutionary
history
Nonhomologous genes can evolve similar function through
convergent evolution
Sequence and structural similarity, outside of functional sites, is
expected to be low - here the catalytic residues and overall
structures of chymotrypsin (yellow) and subtilisin (green) =
analogous enzymes
                      Homoplasy

                                 1-   G   A   D   D   Y   T   T   K   L   P
                                 2-   G   V   E   D   Y   T   T   K   -   P
                                 3-   G   A   D   D   Y   T   T   R   L   P
                                 4-   C   V   E   D   Y   T   T   R   -   P

                                     1  2  3
                                 2   3
                                 3   1  4
                                 4   5  2  4
                                 Tree order: 1, 3, 2, 4

Sequence similarity not due to homology is homoplasy
Homoplasy can result from convergent evolution, parallel
evolution or evolutionary reversal
                      HGT or LGT




Transfer of genes from one species to another, horizontal gene
transfer (HGT) or lateral gene transfer (LGT), will confuse
phylogenetic analysis - it results in a tangled tree, with branches
that join
HGT is more common in bacteria and archaea, but is also
found in eukaryotes
                  Xenologous Genes
After transfer the gene in the donor and recipient species
will be very similar - xenologous genes
Phylogenetic analysis of these sequences will indicate that the
recipient is more closely related to the donor than it truly is.
Here, 80% sequence identity between a eukaryotic gene and its
likely bacterial source
Within the ingroup, all sequences are bacterial except
Trichomonas vaginalis, a parasitic protozoan
Outgroup consists of members of the same gene superfamily
                  Orthologous genes




The # of orthologous, homologous and unique genes in
human, chicken and puffer fish genomes - BLAST analysis
Core orthologs = Single copy orthologs in dark blue, genes
present in all 3 but duplicated in at least 1 are in lighter blue
Pairwise orthologs = orthologs found in only 2 species
               Orthologous genes




The # of orthologous, homologous and unique genes in
human, chicken and puffer fish genomes
Homologous genes for which orthology/paralogy cannot be
determined in yellow
Unique genes in gray
              Duplication within genes




Duplication within a gene can result in complex proteins with
repeated domains
These may be identifiable on a dot-plot
Here BRCA2 is plotted against itself; repeats are visible with window
analysis
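
A windowed self-comparison of this kind can be sketched in a few lines. The sequence below is a made-up protein with a repeated motif, not BRCA2, and the function name, window size and match threshold are assumptions of this sketch.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return (i, j) positions where a window of seq_a matches a window of seq_b."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots

# Hypothetical protein with the motif "GKTF" repeated three times.
protein = "MAGKTFLLSGKTFPQRGKTFW"
dots = dot_plot(protein, protein, window=4, min_matches=4)

# The main diagonal (i == j) is the sequence matching itself;
# dots away from it mark internal repeats.
off_diagonal = sorted((i, j) for i, j in dots if i < j)
print(off_diagonal)
```

Lowering `min_matches` below `window` tolerates mismatches inside a repeat, at the cost of more background dots.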
Synteny
With complete genome sequences we can compare entire genomes to
identify equivalent regions and orthologous genes - syntenic
regions - except that large scale rearrangements are common
Genes are lost and duplicated - and inverted or moved between
chromosomes
The local genomic environment tends to be similar between
orthologs, but the large-scale structures differ
                Comparative Genomics
Synteny is inversely correlated with time since last common ancestor

Of 500 zebrafish genes, 50-80% occur in conserved homology
segments - 2 or more genes in the same order as in humans

Approx. 1/2 of the chromosomes retain near-complete synteny between
cats and humans
           Orthologs and Paralogs




Orthologs must be distinguished from paralogs for
phylogenetic reconstruction and assignment of possible
function
Pseudogenes must be distinguished from both
    Orthologs, Paralogs and Gene Loss

A,B,C,D are species
The two gene lineages are paralogs
[Figure: evolutionary history (left) and the incorrect species tree
based on the gene tree (right)]




Gene loss can eliminate orthologs from two species - this is
especially difficult with large (similar) gene families
Gene trees ≠ species trees, but multiple genes may
    Clustering of Orthologs and Paralogs




BLAST can be used to identify orthologs and paralogs
between 2 genomes
Mask low complexity and commonly occurring domains
All gene sequences from one genome are then
scanned against another, noting best-scoring BLAST hits (BeTs) -
repeat for all possible pairs of genomes
Paralogous genes resulting from a duplication since the
divergence between two species will be each other's BeTs
Orthologs form groups from different genomes with reciprocal
BeTs
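
The reciprocal best-hit step can be sketched as follows for one pair of genomes. The gene names and scores are made up, and the input is assumed to be BLAST scores already parsed into nested dictionaries; the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its best-scoring hit: {query: hit}."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical BLAST scores between genomes A and B.
a_vs_b = {
    "a1": {"b1": 250, "b2": 90},
    "a2": {"b2": 180, "b1": 85},
    "a3": {"b2": 60},
}
b_vs_a = {
    "b1": {"a1": 245, "a2": 80},
    "b2": {"a2": 175, "a3": 55, "a1": 88},
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Gene `a3` has a best hit in genome B, but that hit prefers `a2`, so `a3` is not placed in an ortholog pair. Repeating this for every pair of genomes and linking the reciprocal pairs gives the groups described above.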
    Clustering of Orthologs and Paralogs




Clusters of Orthologous Groups (COG) and euKaryotic
Orthologous Groups (KOG) databases have been
constructed to identify large numbers of orthologs
Here all 3 genes from 3 different genomes are each other's
BeT in pairwise comparisons between the three genomes
Members of COGs or KOGs are assumed to have related functions
This type of analysis is an alternative to exhaustive
phylogenetic trees for large data sets (# species or genes)
     Clustering of Orthologs and Paralogs




This method identifies orthologs and paralogs in this case
With a sufficient # of genomes, 2 COGs will form, one associated
with each of the two paralogous parts of the tree
    Clustering of Orthologs and Paralogs




Gene loss can still be problematic
Comparison of only species A and B would incorrectly group
the two paralogous genes

								