PPI network construction and false positive detection

          Jin Chen
         CSE891-002
          2012 Fall

                   Layout
• Protein-protein interaction (PPI) networks

• PPI network construction

• PPI network false-positive detection




                                                         Background
        • Study of interactions between proteins is fundamental to the
          understanding of biological systems

        • PPIs have been studied through a number of high-throughput experiments

        • PPIs have also been predicted through an array of computational methods
          that leverage the vast amount of sequence data generated

        • Comparative genomics at sequence level has indicated that species
          differences are due more to the difference in the interactions between the
          component proteins, rather than the individual genes themselves *




* Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002, 12:368-373.
                          PPI at different levels




[Figure: PPI at different levels: 3D structure, protein folding, protein docking, domain]




Nidhi et al. DSiMB 2009
PPI at different levels




• Node – protein
   – Every node represents a unique protein
• Edge – protein interaction
   – Physical interaction
   – Functional interaction
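
The node/edge model above can be sketched as a small data structure. This is only an illustration, assuming an undirected graph stored as adjacency sets; the class name `PPINetwork`, the protein names, and the `edge_type` labels are hypothetical and not taken from the slides.

```python
from collections import defaultdict


class PPINetwork:
    """Undirected PPI graph: nodes are proteins, edges are interactions."""

    def __init__(self):
        self.neighbors = defaultdict(set)  # protein -> set of interacting proteins
        self.edge_type = {}                # frozenset({a, b}) -> "physical" or "functional"

    def add_interaction(self, a, b, edge_type="physical"):
        # An undirected edge is recorded at both endpoints.
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)
        self.edge_type[frozenset((a, b))] = edge_type

    def degree(self, protein):
        # Number of distinct interaction partners; 0 for unknown proteins.
        return len(self.neighbors.get(protein, ()))


# Hypothetical example data.
net = PPINetwork()
net.add_interaction("P1", "P2", edge_type="physical")
net.add_interaction("P1", "P3", edge_type="functional")
print(net.degree("P1"))
```

Keying the edge labels by `frozenset` makes the lookup independent of the order in which the two proteins are given, which matches the undirected nature of the network.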




     Hawoong Jeong

                 PPI Identification
• The concept of PPI ranges from direct physical interactions
  inferred from experimental methods (e.g. yeast two-hybrid) to
  functional linkages predicted on the basis of computational
  analysis (based on protein sequences and structures)

• Given the difficulties in experimentally identifying PPIs, a wide
  range of computational methods have been used to identify
  functional PPIs




                                        Domain Fusion
         Hypothesis: if domains A and B exist fused in a single polypeptide AB in
         another organism, then A and B are functionally linked




Marcotte EM et al. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science 1999, 285(5428):751-753.
                               Domain Fusion
• Inclusion of eukaryotic sequences increased the robustness of domain
  fusion predictions *

• Eukaryotic cells, with their larger volume, cannot afford to keep A and B
  as separate proteins: the concentrations of A and B required to reach the
  same equilibrium concentration of the complex AB would be prohibitively high.

• Limitation: low coverage
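
The domain fusion hypothesis can be illustrated with a short sketch. It assumes that every protein has already been annotated with a set of domain identifiers; the function name `domain_fusion_links`, the data layout, and the example identifiers are assumptions made for illustration, not the procedure of Marcotte et al.

```python
from itertools import combinations


def domain_fusion_links(query, reference):
    """Predict functional links between separate proteins of a query genome.

    query, reference: dict mapping protein name -> set of domain IDs.
    Proteins A and B are linked if some reference protein contains
    all domains of A and all domains of B (a "fused" protein AB).
    """
    links = set()
    for a, b in combinations(sorted(query), 2):
        dom_a, dom_b = query[a], query[b]
        if not dom_a or not dom_b:
            continue  # no domain annotation, nothing to compare
        if dom_a & dom_b:
            continue  # shared domains suggest homologs, not a fusion event
        for fused in reference.values():
            if dom_a <= fused and dom_b <= fused:
                links.add((a, b))
                break
    return links


# Hypothetical example data.
query = {"A": {"d1"}, "B": {"d2"}, "C": {"d3"}}
reference = {"AB": {"d1", "d2"}}
links = domain_fusion_links(query, reference)
print(links)
```

The low coverage noted above shows up directly here: a pair is only predicted when a fused counterpart happens to exist in one of the reference genomes.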




     * Veitia RA: Rosetta Stone proteins: "chance and necessity"? Genome Biol 2002, 3(2):interactions1001.1-1001.3.
                   Conserved Neighborhood
Hypothesis: If the genes that encode two proteins are neighbors on the
chromosome in several genomes, the corresponding proteins are likely to be
functionally linked




Dandekar T et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328.
                Conserved Neighborhood
• The method has been reported to identify high-quality functional
  relationships

• The method suffers from low coverage, due to the dual requirement of
  identifying orthologues in another genome and then finding those
  orthologues that are adjacent on the chromosome
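
A minimal sketch of the conserved-neighborhood idea follows. It assumes that genes have already been mapped to orthologous groups and that each genome is given as an ordered list of group IDs; it ignores strand, circular chromosomes, and gaps between genes. The function name and the `min_genomes` parameter are hypothetical.

```python
from collections import Counter


def conserved_neighbors(genomes, min_genomes=2):
    """Return ortholog-group pairs that are adjacent in at least min_genomes genomes.

    genomes: dict mapping genome name -> list of ortholog-group IDs in chromosomal order.
    """
    counts = Counter()
    for order in genomes.values():
        # Collect each adjacent pair once per genome, ignoring orientation.
        adjacent = {frozenset(p) for p in zip(order, order[1:]) if p[0] != p[1]}
        counts.update(adjacent)
    return {tuple(sorted(pair)) for pair, n in counts.items() if n >= min_genomes}


# Hypothetical example data.
genomes = {
    "g1": ["a", "b", "c"],
    "g2": ["c", "b", "a", "d"],
    "g3": ["b", "d", "a"],
}
pairs = conserved_neighbors(genomes, min_genomes=2)
print(sorted(pairs))
```

The dual requirement mentioned above is visible in the inputs: a pair can only be counted in a genome where both orthologues were found, and only if they also sit next to each other there.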




  Marcotte EM: Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10:359-365.
               Phylogenetic Profiles
• Hypothesis: functionally linked proteins would co-occur in genomes
• Phylogenetic profile of a protein can be represented as a 'bit string',
  encoding the presence or absence of the protein in each of the genomes
  considered
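
The bit-string representation can be sketched as follows. The genome names, the presence data, and the use of Hamming distance as the comparison measure are assumptions for illustration; other similarity measures are also used in practice.

```python
def profile(present_in, all_genomes):
    """Bit string (as a tuple of 0/1) marking presence of a protein in each genome."""
    return tuple(1 if g in present_in else 0 for g in all_genomes)


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


# Hypothetical example data.
all_genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "A": {"g1", "g3"},
    "B": {"g1", "g3"},
    "C": {"g2", "g3", "g4"},
}
profiles = {name: profile(gs, all_genomes) for name, gs in presence.items()}

# Identical profiles (distance 0) suggest a functional link under the hypothesis.
print(hamming(profiles["A"], profiles["B"]))
print(hamming(profiles["A"], profiles["C"]))
```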




                                         Co-evolution
• Hypothesis: Co-evolution requires the existence of mutual selective
  pressure on two or more species
• The in silico two-hybrid (i2h) method has been proposed, based on the study of
  correlated mutations in multiple sequence alignments



[Figure: protein family A and protein family B]

    Pazos F et al: In silico Two-Hybrid System for the Selection of Physically Interacting Protein Pairs. Proteins 2002, 47:219-227
  Software: Protein Link Explorer (PLEX)




Date, S.V. and E.M. Marcotte, Protein function prediction using the Protein Link EXplorer (PLEX). Bioinformatics, 2005. 21(10): p. 2558-2559.
 Biological Problem → Algorithm → Knowledge

1. Biological hypothesis

2. Mathematical representation

3. Algorithm design

4. Biological verification



    High-throughput PPI Detection
• Booming of biotechnology
   –   Yeast two-hybrid / split-ubiquitin system
   –   Mass spectrometry
   –   Protein microarrays
   –   etc.


• Limitations of computational prediction
   – Low coverage
   – Locally optimized (pair-wise)
   – Super-high negative PPI rates




               Yeast Two-Hybrid
• Two hybrid proteins are generated with transcription
  factor domains
• Both fusions are expressed in a yeast cell that carries a
  reporter gene whose expression is under the control of
  binding sites for the DNA-binding domain



[Figure: bait protein fused to the binding domain, prey protein fused to the activation domain, and the reporter gene]
               Yeast Two-Hybrid
• Interaction of bait and prey proteins localizes the
  activation domain to the reporter gene, thus activating
  transcription
• Since the reporter gene typically codes for a survival
  factor, yeast colonies will grow only when an interaction
  occurs

[Figure: bait protein fused to the binding domain, prey protein fused to the activation domain, and the reporter gene]
Mating-based Split-ubiquitin System




                          Lalonde S et al. Plant J 2008
Yeast Cell Growth Rate

[Figure: the trends for yeast cell growth over time (axis label: biomass)]
                      PPI Databases
• STRING – PPIs derived from high-throughput experimental data, mining of
  databases and literature, analyses of co-expressed genes, and
  computational predictions


• HPRD - Human Protein Reference Database. It integrates information
  relevant to the function of human proteins in health and disease

• DIP - Experimentally derived PPIs with assessments. DIP is generally
  considered a valuable benchmark for verifying the performance of any new
  method for prediction of PPIs

• Many others: MIPS, YGD, BIND, TAIR…


         False-Positive Detection in PPI Networks

    • Background: PPI networks generated with high-throughput
      methods contain a sizeable number of false-positives and
      their reproducibility is not satisfactory*

    • Central to the understanding of PPI is the definition of
      “interaction” itself
           – Binding energy / Interaction / Complex
           – We need to define what we mean by interaction




* von Mering C et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.
  Useful Data for False-Positive Detection
• Functional and localization data (Gene Ontology)

• Indirect high-throughput data (gene and protein
  expression)

• Sequence-related data (protein domains (domain fusion),
  interologs)

• Structure data (protein 3D structure)

• Network topological features (connectivity, network motif)


         Different Hypotheses for Different Data

Data                   Example of Hypothesis
Gene Ontology          Two proteins that share a similar annotation are more likely
                       to interact than proteins with different or null annotations
Gene Expression        Two proteins whose genes have similar expression patterns are
                       more likely to interact
Domain Interaction     If two domains are often found in PPIs, two proteins containing
                       such domains are more likely to interact
PPI network            PPIs whose topologies fit the spoke or matrix models are more
topological analysis   likely to be true

  Other hypotheses include: synthetic lethality, interologs, linear motifs, etc.
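
As one illustration of how such a hypothesis becomes a filter, the gene expression row can be sketched with a Pearson correlation between expression profiles. The expression values, the edge list, and the 0.5 cutoff are made-up assumptions; a real analysis would choose the similarity measure and threshold from data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles (one value per condition) and candidate PPIs.
expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
edges = [("A", "B"), ("A", "C")]

# Flag interactions that are weakly supported by co-expression.
weakly_supported = [
    (a, b) for a, b in edges if pearson(expression[a], expression[b]) < 0.5
]
print(weakly_supported)
```

A low correlation does not prove that an interaction is false; under the hypothesis in the table it only lowers the confidence assigned to that edge.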




   Gold Standard for PPI Networks
• For algorithm evaluation and comparison
• To train a model as positive training data

• Manually annotated databases such as DIP
• Interactions from low-throughput experiments

• True negative set is equally important
   – Co-localized? No?




  Estimate PPI Network Reliability
• Overall index of reliability of a PPI network


$$ precision = \frac{TP}{TP + FP} $$

$$ recall = \frac{TP}{TP + FN} $$
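
A short sketch of how precision and recall could be computed against a gold standard. It assumes that any predicted pair absent from the gold standard counts as a false positive, which overstates the error when the gold standard is incomplete; the function name and example pairs are hypothetical.

```python
def precision_recall(predicted, gold_positive):
    """Precision and recall of predicted PPIs against a gold-standard positive set.

    Pairs are treated as unordered, so ("A", "B") equals ("B", "A").
    """
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold_positive}
    tp = len(predicted & gold)  # predicted and in the gold standard
    fp = len(predicted - gold)  # predicted but not in the gold standard
    fn = len(gold - predicted)  # in the gold standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example data.
predicted = [("A", "B"), ("B", "C"), ("C", "D")]
gold = [("B", "A"), ("C", "D"), ("E", "F"), ("G", "H")]
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```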




   Estimate PPI Network Reliability
• A “capture-recapture” model, reaching back to the raw counts of
  observed bait–prey clones in yeast two-hybrid experiments




   Huang et al. Where Have All the Interactions Gone? Estimating the Coverage of Two-Hybrid Protein
   Interaction Maps. PLoS Computational Biology 2007
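
The general capture-recapture idea can be illustrated with the classical two-sample Lincoln-Petersen estimator. This is a textbook simplification chosen for illustration, not necessarily the model of Huang et al., which works from raw bait-prey clone counts; the numbers below are made up.

```python
def lincoln_petersen(n1, n2, overlap):
    """Estimate the total number of interactions from two independent screens.

    n1, n2: interactions detected by screen 1 and screen 2.
    overlap: interactions detected by both screens.
    """
    if overlap <= 0:
        raise ValueError("the estimator is undefined without overlap")
    return n1 * n2 / overlap


# Hypothetical counts.
estimated_total = lincoln_petersen(100, 80, 20)
coverage_screen1 = 100 / estimated_total  # fraction of the interactome seen by screen 1
print(estimated_total, coverage_screen1)
```

The estimator assumes that both screens sample the same interactome independently and with equal detectability for every interaction, assumptions that high-throughput screens satisfy only approximately.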
                                                PPI Filtering
  • GOAL: To identify reliable protein complexes from two
    existing mass spectrometry (MS) datasets

  • Analyze the data with a purification enrichment (PE) scoring
    system

  • Using gold standard PPIs, the consolidated dataset is of
    greater accuracy than the original sets and is comparable to
    PPIs defined using more conventional small-scale methods



Collins et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007 Mar;6(3):439-50
                            PPI Filtering
$$ e_{observation} = \log_{10} \frac{P(observation \mid TruePPI)}{P(observation \mid FalsePPI)} $$

• e = 0: no evidence for or against the validity of a particular
  interaction was collected
• Two types of observations: bait-prey observations and prey-prey
  observations

$$ PE_{ij} = \sum_k e_{ijk} + \sum_k e_{jik} + M_{ij} $$

• i and j are two proteins (bait & prey). k indicates a distinct
  purification. Mij measures indirect evidence due to co-occurrence of
  proteins i and j as preys in the same purifications
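
A minimal sketch of the scoring scheme, reading the PE score as the sum of the bait-prey evidence in both directions over all purifications plus the prey-prey term M. The dictionary layout, the function names, and all numeric values are assumptions for illustration.

```python
from math import log10


def evidence(p_given_true, p_given_false):
    """Log-likelihood ratio of one observation: log10 P(obs|TruePPI) / P(obs|FalsePPI)."""
    return log10(p_given_true / p_given_false)


def pe_score(e, m, i, j):
    """PE score for proteins i and j.

    e: dict mapping (bait, prey) -> list of evidence scores, one per purification k.
    m: dict mapping frozenset({i, j}) -> prey-prey co-occurrence evidence M_ij.
    """
    bait_i = sum(e.get((i, j), []))  # i used as bait, j recovered as prey
    bait_j = sum(e.get((j, i), []))  # j used as bait, i recovered as prey
    return bait_i + bait_j + m.get(frozenset((i, j)), 0.0)


# Hypothetical example data.
e = {("A", "B"): [1.0, 0.5], ("B", "A"): [2.0]}
m = {frozenset(("A", "B")): 0.25}
print(evidence(0.5, 0.5))       # equally likely under both models: no evidence
print(pe_score(e, m, "A", "B"))
```

Because each term is a base-10 log ratio, summing the terms corresponds to multiplying likelihood ratios, under the assumption that the observations are independent.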

                            PPI Filtering
$$ e_{ijk} = \log_{10} \frac{r + (1 - r)\, p_{ijk}}{p_{ijk}} $$

where r represents the probability that a true association will be preserved
and detected in a purification experiment, and p_{ijk} represents the probability
that a bait-prey pair will be observed for nonspecific reasons


$$ p_{ijk} = 1 - \exp\left(-f_j \, n_{ik}^{prey} \, n_i^{bait}\right) $$

where n_{ik}^{prey} is the number of preys identified in purification k with bait i, n_i^{bait}
is the number of times protein i was used as bait, and f_j is an estimate of the
nonspecific frequency of occurrence of prey j in the dataset
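
The two formulas can be sketched numerically as follows, reading them as e_ijk = log10((r + (1 - r) p_ijk) / p_ijk) and p_ijk = 1 - exp(-f_j n_ik^prey n_i^bait). The minus signs are an assumption based on p_ijk being a probability, and the parameter values are made up.

```python
from math import exp, log10


def nonspecific_prob(f_j, n_ik_prey, n_i_bait):
    """p_ijk: probability that prey j shows up with bait i for nonspecific reasons."""
    return 1.0 - exp(-f_j * n_ik_prey * n_i_bait)


def bait_prey_evidence(r, p_ijk):
    """e_ijk: log-likelihood ratio for observing prey j with bait i in purification k."""
    return log10((r + (1.0 - r) * p_ijk) / p_ijk)


# Hypothetical parameter values.
p = nonspecific_prob(f_j=0.01, n_ik_prey=10, n_i_bait=2)
e = bait_prey_evidence(r=0.5, p_ijk=p)
print(p, e)
```

When p_ijk approaches 1, the observation is fully explained by nonspecific binding and the evidence tends to 0; a rare prey (small f_j) observed with a bait yields a large positive score.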

                     PPI Filtering
• PPI topological analysis

   – First student presentation is about a topological measure
     called “FS-weight”, which was compared with other
     topological measures

   – Suitable for large PPI networks rather than preliminary
     networks




"Most good programmers do programming
not because they expect to get paid or get
adulation by the public, but because it is fun
to program." - Linus Torvalds





				