Assembling Protein Sequence Families

Document Sample
Assembling Protein Sequence Families Powered By Docstoc
					  Protein Clustering to Assemble
Families of Homeomorphic Proteins

             Chris Elsik
• Using evolution to infer protein function
• The problem of automatic assembling of
  homeomorphic (identical domain
  organization) protein families
• Agglomerative clustering
• Delineating protein domain boundaries
Evolution Allows us to Infer Function
 • The most powerful method for inferring function of a gene or
   protein is by similarity searching a sequence database.
 • Our ability to characterize biological properties of a protein using
   sequence data alone stems from properties conserved through
   evolutionary time.
 • Homologous (evolutionarily related) proteins always share a
   common 3-dimensional folding structure.
 • They often contain common active sites or binding domains.
 • They frequently share common functions.
 • Predictions made using similar, but non-homologous proteins
   are much less reliable.
• Homologs = genes that are evolutionarily related
• There are two kinds of homologs:
• Orthologs = genes in different species that have diverged from a
  common gene in an ancestral species.
• Paralogs = genes that have diverged due to gene duplication.
• Orthologs are more likely than paralogs to have conserved
• Orthologs cannot be identified using BLAST or FASTA
  sequence comparison alone.
• Reliable ortholog identification requires phylogenetic methods.
Example Gene Tree (with plant genes)
                             Rice-2b      paralogs
             Wheat-1                        The outgroup, Arabidopsis
                              orthologs     is a dicot. The cereals are
            Maize-1                         monocots. Dicots and
                                            monocots diverged ~230
            Sorghum-1                       million years ago. These
                                            monocots diverged from
                                            each other ~60 mya.
     Why shouldn’t we depend on
    inferences based on paralogs?
• Paralogs emerge after a gene duplication.
• Possible fates of duplicated genes:
   – Loss of function for one of the duplicates - lack of
     selective pressure allows gene to mutate beyond
   – Emergence of new functional paralogs - one duplicate
     aquires a new function, so selection favors its
     maintenance in the genome
   – Sub-functionalization - both duplicates are required to
     maintain the function of the original
How do people identify “orthologs”?
 • “Best Hit”
    – Problems - 1) What if the ortholog is not present in the
      database - there is always a best hit. 2) For multidomain
      proteins, the best hit may be a local domain hit.
 • “Reciprocal Best Hit”
    – Problem - this is usually based on E-value, but E-value is
      affected by the length of the match.
    – Predicted genes are often gene fragments (not the true
      length of the gene)
    – This method is a phenetic, not phylogenetic, approach. It
      does not take advantage of a model of protein evolution.
 • Synteny - infer orthology by comparing locations on
     – Can be applied to closely related species, such as mammals, but
       not distantly related species
    – difficult to distinguish tandemly duplicated genes
 • Phylogenetics - build gene trees to distinguish orthologs and
Reciprocal Best Hit & incorrect gene length
 A gene fragment was simulated by replacing a protein in
 the library dataset with a truncated sequence. The library
 was searched with the full-length protein. The best match
 is a paralog. The identical, but truncated, match has a
 less significant E-value. We would not be able to identify
 the ortholog using the reciprocal best hit method.
 1>>>CG2830-PA - 648 aa
 vs fly_test.lseg library

 The best scores are:
           length bits E(18717)      %_id alen
 CG7235-PC       ( 576) 474.6 9.4e-134 0.590 581
 CG7235-PA       ( 576) 474.6 9.4e-134 0.590 581
 CG7235-PB       ( 576) 474.6 9.4e-134 0.590 581
 CG12101-PB       ( 573) 474.4 1.1e-133 0.620 597
 CG12101-PA       ( 573) 474.4 1.1e-133 0.620 597
 CG2830-PA-short ( 240) 339.2 2.2e-93 1.000 240
 CG16954-PB       ( 558) 298.3 1.1e-80 0.450 536
 CG16954-PA       ( 558) 298.3 1.1e-80 0.450 536
     Identifying Orthologs Using
• Challenge - we need to create gene families prior to
  creating phylogenetic gene trees
• One approach is automated sequence clustering
Reasons for Clustering Protein Sequences

• Classifying proteins - structural and functional annotation
• Identifying errors in gene prediction
• Selecting representative proteins
• Creating models of protein domain families to aid
  identification of distantly related proteins
• Studying protein family evolution
Protein Clustering - General Method

 • Perform a protein similarity search of a protein
   database against itself using FASTA or BLAST
 • Use E-value or score as a similarity measure for
   automated clustering
    – E-value is not the optimal linkage criterion due to
      the length effect
Problems in Clustering Proteins

• Multidomain proteins can cause unrelated proteins to
  group together
   – This problem is amplified in large proteomes
• Cluster validation is difficult -
   – We do not know the correct number of clusters
   – We do not have a good test set for large
                   Src Tyrosine Kinase

Agglomerative Clustering Methods
• Single linkage - similarity between closest neighbors
  meets a threshold (BLASTCLUST)
• Complete linkage - similarity between furthest
  neighbors meets a threshold
• Average linkage - average pairwise similarity meets a
• Fractional linkage - fraction of pairwise similarities
  meet a threshold
                    Single linkage
“If A is related to B, and B is related to C, then A is related to C.”

This is true only if proteins are homologous over their entire length.
Multidomain proteins pull unrelated proteins into the same cluster



                    Complete linkage
Each protein must be homologous to every other protein in the cluster.

In the analysis of the human genome, Celera used the criterion that the list
of matches for each protein must be equivalent.

 Query:                  A            B                C             D
 Matches:                A            A                A             C
                         B            B                B             D
                         C            C                C

                     A       B             C       D

 Problem: Does the domain structure of D differ from A and B, or is it
 homologous, but too distantly related to be detected by the BLAST search?
 Complete linkage generates many small clusters.
                       Fractional linkage
Proteins must have a fraction of matches in common.
In analysis of the human genome, Celera used the Lek metric with a cutoff of .75:

Lek = 2 x (MatchesA  MatchesB) / (MatchesA + MatchesB)

    Query:                A             B               C             D
    Matches:              A             A               A             C
                          B             B               B             D
                          C             C               C
   LekAB = 2 x 3 / (3 + 3) = 1
   LekAC = 2 x 3 / (3 + 4) = 0.85
                                       Clusters:       A    B   C           D
   LekCD = 2 x 2 / (4 + 2) = 0.67
    LekAD = 2 x 1 / (3+2) = 0.40
                    Average linkage
The average pairwise similarity between all proteins must meet a threshold.

In hierarchical average linkage, the most closely related proteins are
grouped first. Then the threshold is relaxed, allowing clusters to merge.
Which is method is best for clustering
  full-length protein sequences?

• Objectives
   – Group full-length proteins with similar domain
   – Minimize the problem caused by multidomain
   – Minimize number of singletons
Comparing Protein Clustering Methods
• Four methods - single, complete, average and fractional linkage

• Thresholds range from E  10-10 (permissive) to E  10-200 (stringent)

• Look at cluster number, number of singletons, largest cluster size, CDC
  score for each cluster set

• Test set: the Arabidopsis proteome (~25,000 proteins)
Comparison of cluster sets assembled
  using linkage threshold E  10-10
    Method Clusters    Singletons Largest
           (including             Cluster Size

    SL       7702        4940        5852

    AL       9607         5943         385

    FL       12072        8248         428

    CL       17639       14151          60
Comparing protein cluster sets: CDC Score

 • Cluster Domain Consistency (CDC) score - a metric
   for testing similarity in domain organization among
   clustered proteins
 • CDC = 0: proteins have no domains in common
 • CDC = 1: proteins have identical domain organization
 • The proteins first must be searched against a domain
   database (Pfam) to identify domains
 Step 1. List domains in each cluster.                   Step 2. Domain score = number of pairwise domain
                                                         matches between proteins as fraction of total possible
 Cluster 1
                                                                    npjdi npj     Sdij                npjdi    npj    Sdij
  Protein 1
                                                                     3      4     0.5                  2        2     1
  Protein 2
                                                                     3      4     0.5                  2        2     1
  Protein 3                                                          3      4     0.5
  Protein 4                                                          2      4     0.167
                                                                     2      4     0.167     Sdij = domain score for
  All Domains
                                                                     2      4     0.167     domain i in cluster j =
  Cluster 2
                                                        npjdi = number of proteins in cluster j with domain i
  Protein 5
                                                        npj = number of proteins in cluster j
  Protein 6
  All Domains

                                                      Step 3. Calculate protein score (Spk) for
                                                      each protein.
                                                                                                        Protein Spk
Step 4. Calculate CDC score for the cluster set:
                                                      Spk = ∑ Sdijk / ndj       where
                                                                                                           1        0.278
Mean protein score for proteins in clusters of  1.                                                        2        0.250
                                                      Sdik = domain score of domain i in protein k
                                                                                                           3        0.334
                                                      nd = number of domains in cluster i                  4        0.056
CDC = (.278 + .250 + .334 + .056 + 1 + 1)/6 =                                                              5        1
                                                      Sp1 = (.5 + .5 + .5 +.167) / 6 = 0.278               6        1
     = 0.486
            Pfam CDC Score



0.4                                      CL
0.2                                      AL
 6000   10000    14000    18000      22000    26000
                Number of Clusters
Linkage Criteria for Protein Clustering

 • The linkage criterion should discern full- and
   partial-length matches
    – If proteins are similar in domain
      organization, we expect them to match
      over their entire length
        Linkage Criteria: E-value

• E-value may not be the best linkage criterion
  because it is length-dependent
• Matches to longer proteins have lower E-
  values (more significant)
• Length dependency makes it difficult to
  discern partial and full-length protein matches
           Linkage Criteria:
        Global Alignment Score

• Partial matches will have low scores and full-
  length matches will have high scores
• Disadvantage: global alignments for each
  protein pair must be generated using the
  ALIGN program
          Linkage Criteria:
   Weighted Local Alignment Score
• The local Smith-Waterman score for a pair of similar
  proteins is divided by the self-match score for the
  larger of the two proteins.
• The weighted score is low for partial matches.
• The advantage over global score is that global
  alignments are not required.
      Linkage Criteria Comparison
• Two clustering methods: single and average linkage
• Three linkage criteria: E()-value, weighted local
  score, global score
• Range of thresholds tested for each criterion
• Look at cluster number, number of singletons, largest
  cluster size, CDC score for each cluster set
Cluster set comparison using most permissive thresholds that
resulted in biologically meaningful family sizes (< 600 members).
Method                 Single Linkage                        Average Linkage

Linkage       E()-value     Weighted      Global   E()-value     Weighted      Global
Parameter                   Local Score   Score                  Local Score   Score

Threshold     E()  10-40       0.25        1.1    E()  10-10     0.05          0.5

(including     14570          11469       10905      9607           8087        10073

Singletons     11560            7780       7429      5943           4643         6195

Largest          250            538         563        385           225          171
            Pfam CDC Score


                                     AL,   E()-value
0.4                                  SL,   E()-value
                                     AL,   local
                                     SL,   local
0.2                                  AL,   global
                                     SL,   global
 6000   10000    14000    18000      22000             26000
                Number of Clusters
   Protein Clustering - Conclusions
• Single linkage using E-value is not appropriate for
  clustering eukaryote proteomes.
• Weighted local score and global score are better
  linkage criteria than E-value.
• Single linkage using a weighted local score above
  0.40 performs as well as average linkage and
  generates a moderate number (15,746) of
  Arabidopsis clusters.
• Average linkage generates the smallest cluster
  numbers and reasonable cluster sizes, and should be
  the useful for grouping distantly related proteins for
  function and structure prediction.
  Which clustering method is best prior to
phylogenetic analysis of predicted proteins?
   • Complete linkage? - Too many singletons
   • Fractional linkage? - Also a large number of singletons
   • Average linkage?
       – Problem caused by truncated or overextended gene
         predictions. Average linkage would likely separate orthologs
         with incorrectly predicted protein lengths.
   • Single linkage with a strict alignment criterion?
       – Also has problem caused by truncated or overextended
         gene predictions:
       – A strict alignment criterion separates orthologs due to
         incorrectly predicted protein lengths
   • Single linkage without a strict alignment criterion
       – Problem associated with multidomain proteins
        SL-AL: a compromise

• We have developed the SL-AL algorithm, which uses
  a combination of single linkage and average linkage
• First sequences are clustered using single linkage
  and a permissive alignment criterion
• Percent identity for each protein pair in a single
  linkage cluster is recorded
• Any cluster with < a threshold percent identity is
  reclustered using average linkage
       Advantages of SL-AL

• The multidomain problem is reduced.
• The algorithm is more tolerant to incorrect gene
  predictions, and less likely to separate incorrectly
  predicted orthologs into separate clusters. The
  cluster set can be used to identify gene prediction
• A minimum pairwise percent identity can be set,
  making multiple alignment possible.
Another Alternative for Protein Clustering

• Overcome the multidomain problem by
  dissecting proteins into domains prior to
• Prodom uses protein alignments to identify
  domain families (~129,000 domain families)
• Problem - partial sequences in protein databases
  cause the overfragmenting of domains.
• Our work toward a solution: use an alternative
  approach to identify domain boundaries
  Identifying protein linkers to
 delineate domain boundaries
• Protein structural domain - a unit of a protein
  that can independently fold into a stable
  tertiary structure
• Evolutionary domain - a conserved modular
  unit that can be identified by aligning protein
• Often structural domains are also
  evolutionary domains (not always)
• Linker - a peptide that connects two protein
                   Src Tyrosine Kinase

      Amino acid propensity
• Our goal is to predict protein linker
  regions using amino acid sequence
  alone, in the absence of known
• We take advantage of amino acid
  propensity - the preference of some
  amino acids for linkers or domains
                   Our Dataset
• Pfam-A - a collection of domain families created using profile
  HMMs built on multiple alignments of homologous proteins.
  (Boeckmann et al. Nucleic Acids Research 2003)
• Linker = a sequence segment of 4 to 20 residues that connects
  two adjacent regions identified by Pfam as domains.
• Non-linker regions = sequence segments excluding linker
• Used only protein sequences whose entire length can be
  classified as linker or domain by our criteria, except we allowed
  up to 20 non-domain residues at the N- and C-termini.
• Results - 11968 sequences with at least one linker region
  (14339 linker, 28726 corresponding domains and 824 unique
  domain regions).
   Removing Redundancy in the Dataset
• 11968 proteins were first grouped into homeomorphic families
  (identical domain organization).
• All by all sequence comparison of the 11968 sequences using
• Single-linkage clustering using criteria of E-value10-6 and at
  least 80% alignment coverage.
• Some of the resulting clusters contained sequences with
  different domain organizations, due to the transitive nature of
  single-linkage clustering, so we selected one sequence from
  each domain organization within each cluster.
• Result - 802 sequences with at least one linker region. These
  802 sequences contained 993 linkers and 1988 corresponding
  domain regions from 376 unique Pfam-A domain families.
• The average length of linkers and domains was 11.24 and
  141.38, respectively.
                    Linker Index
• Linker index, yl, reflects the preference of amino acids in the
  linkers relative to the domain region from Suyama and Ohara,
  2003, Bioinformatics.
• yl = - ln ( fllinker / fldomain )
• Where fllinker = relative frequency of the amino acid l in the linker
  (region in the data set.
• yl will be negative if the relative frequency of amino acid l in the
  linker region is greater than its relative frequency in the domain
• To calculate the smoothed linker index, average the linker index
  within each window size w and assign this averaged linker index
  value y to the center amino acid of the window by sliding from
  the N-terminus to the C-terminus of a protein sequence.
• Window size, w = 9, provided the greatest discrimination
  between linker and non-linker regions among the window sizes
  from 3 to 20.
      Relative Amino Acid Frequency
Amino Linker            Domain     yl
   A 7.97 (7.94)        8.10    0.0166
   C 0.89 (1.24)        1.50*   0.5724
   D 6.32 (5.28)        5.60*   -0.1278
   E 7.97 (6.89)        6.60*   -0.1794
   F 2.74 (4.34)        4.03*   0.3561
   G 7.74 (6.14)        7.37    -0.0442
   H 1.91 (2.32)        2.27    0.1643
   I   4.73 (5.13)      6.37*   0.2758
   K 6.97 (5.72)        5.81*   -0.2134
   L   7.51 (9.60)      9.54*   0.2523
                                          Numbers in parentheses
   M 2.13 (2.15)        2.24    0.0197
                                          are from the linker database
   N 4.22 (4.12)        4.08    -0.0786   of George and Heringa,
   P 6.63 (6.07)        4.30*   -0.4188   2002, Protein Engineering.
   Q 3.90 (4.05)        3.33*   -0.1051
   R 5.77 (5.79)        5.24    -0.0762
   S 7.20 (5.55)        6.13*   -0.1629
   T 5.80 (5.66)        5.35    -0.0701
   V 6.24 (6.64)        7.34*   0.1782
   W 0.81 (1.24)        1.32*   0.3836
   Y 2.46 (3.47)        3.38*   0.2500
      Hidden Markov Model to
       predict linker residues
• Sequences are assumed to have a structure
  composed of regions that are homogeneous within a
  region but may differ between regions.
• We assume protein sequence data is produced by a
  hidden Markov model and compositional variation is
  likely to reflect functional or structural differences
  between regions.
• Each region is classified into one of a finite number of
  states (linker and non-linker); we wish to estimate the
  states given the observed protein sequence.
• Instead of recognizing the protein sequence as a
  string of amino acids (categorical variables), we
  recognize the protein sequence as a string of linker
  index values (continuous variables).
        Predicting linker state
• Our objective is to identify linker index values that
  discriminate linker and non-linker regions.
• Used Gibbs sampling to overcome the problem of
  missing data.
• Calculate the probability state (linker or non-linker) for
  each residue i in a protein sequence.
• Predict the state of an amino acid using the
  classification variable
• CV = 1 if the probability of state k is  x.
• CV = 0 if the probability of state k is  x.
            Training and Testing
• We applied our model to the protein sequence dataset
  constructed from Pfam-A.
• 5-fold cross validation was applied to the dataset - we divided
  the dataset into the training dataset and the test dataset
  randomly with the ratio of 4:1.
• We trained the model with the training dataset of 642 sequences
  and tested the trained model with the test dataset of 160
• We ran Gibbs sampling with 40,000 iterations and 10,000 burn-
  in to train the model.
Examples of good predictions
Examples of Over-prediction
      Sensitivity and Selectivity
• Sensitivity = the percentage of actual linker residues
  that were predicted to be linker. Sn=Tp(Tp+Fn)
• Specificity = the percentage of predicted linker
  residues that were truly linker. Sp=Tp/(Tp+Fp)
Sensitivity and Selectivity plotted against cut-off for classification variable

Sensitivity and selectivity are each 67% at their intersection.
• This method will be useful in defining linkers for
  evolutionary domains.
• Our method appears to do well at distinguishing
  linkers from intra-domain loops.
• We must test our approach using a structurally
  defined dataset to fully understand its ability to
  distinguish structural linkers from intra-domain loops.
   – Since Pfam-A identifies domains as evolutionarily conserved
     units, non-conserved intra-domain loops can cause
     structural domains to be annotated as multiple Pfam-A
     domains. Thus, some of our Pfam-A defined linkers may
     actually be loops in structural domains.
   – Conversely, two structural domains that are always found
     together may be defined by Pfam-A as a single evolutionary
     domain; some of our false positives may actually be
     structural linkers.
• William Pearson - University of Virginia

• Bani Mallick - Texas A&M University

• Elsik Lab
   – Kyounghwa Bae
   – Justin Reese
   – Anand Venkatraman
   – Shreyas Murthi
   – Michael Dickens
   – Juan Anzola