Protein World

Document Sample
Protein World Powered By Docstoc
					Affymetrix Expression Data

        Comics Group
     12-05-2003 Nijmegen
         Tim Hulsen
      General Information (1)
• Affymetrix oligo microarrays: HG-U133 A and B
  (human) and MG-U74v2 A, B and C (mouse)
• Updated every two months; releases used here:
  november 2002 and january 2003
• UniGene-based
• Probes: 25mer oligos complementary to the
  sequences of interest
• Probe pairs: perfect match (PM) probe and
  mismatch (MM) probe, MM is different from PM
  in the 13th position
       General Information (2)
• Human chips: 3269 samples, 44792 fragments,
  115 tissue categories (114 for nov. 2002
  release), 15 SNOMED tissue categories
• Mouse chips: 859 samples, 36701 fragments, 25
  tissue categories, 12 SNOMED tissue categories
• Results from all samples within a tissue category
  are combined by generating electronic
• For each tissue fragment and each tissue
  category is determined:
  – Median expression value
  – Present call (percentage)
     Median expression value
• Expression value: intensity
• All expression values that have a ‘present
  call’ are used to determine the median
  expression value
• Varies from 0 to ~65,000 in human and
  from 0 to ~97,000 in mouse
     Present Call (Percentage)
• Normalization/scaling procedures (MAS 5.0) are
  used to determine an expression intensity value
  with an associated confidence level to each
• When confidence level p for expression is
  smaller than 0.05, the expression intensity for
  this specific fragment in this particular sample is
  called present (P)
• Call values are used to calculate a present call
  percentage (P calls / total calls)
Snomed category definitions (1)
• SNOMED: Systematised Nomenclature of
• Combines specific categories into more global
  categories, i.e. organ systems
• In human far more useful than in mouse (115-
• Categories like: cardiovascular system, digestive
  organs, endocrine gland, female genital system,
  male genital system, musculoskeletal system,
  nervous system, respiratory system, etc.
Snomed category definitions (2)
• Example: cat. 7: ‘hematopoietic system’:
     Human                          Mouse
     36. Bone marrow                10. Blood
     37. Dendritic reticulum cell   11. Bone marrow
     38. Lymph node                 12. Mesenteric lymph node
     39. Lymphocyte                 13. Spleen
     40. Monocyte                   14. Thymus
     41. Segmented neutrophil
     42. Spleen
     43. Thymus
     44. Tonsil
     45. White blood cell
    Annotation provided
For each fragment, if available:
• title               • unigeneAcc
• geneSymbol          • unigeneId
• geneAlias           • interproId
• exemplarAcc         • pfamId
• omimId              • swissprotId
• snpId               • goId
• refseqId            • goFunction
• refseqprotId        • goProcess
• ncbiNuclId          • goComponent
• ncbiProtId          • comment
         Goals & Problems
• Goal: use data set to see if co-expression
  between orthologous/paralogous gene
  pairs is higher than between ‘unrelated’
  gene pairs, in human & mouse
• Problem 1: limited annotation
• Problem 2: empty expression profiles
• Problem 3: size of data set
             Limited annotation (1)
For example for three of the most used protein ids:
ncbiProtId (in red), refseqProtId (in green), swissprotId (in blue)

Human:                                         Mouse:
       Limited annotation (2)
• Smith-Waterman of (SWX) of all Affymetrix
  sequences to the human & mouse IPI
  sets, for which orthologs and paralogs
  were already defined -> IPI id added to
• Smith-Waterman (SWN) of all Affymetrix
  sequences to each other for better
  orthology/paralogy prediction
     Empty expression profiles
• Lots of genes have no expression at all in any
  tissue category
• Useless for correlation calculation; two genes
  with no expression will have a top correlation!
• For human: 4114 out of 44792 fragments
  completely no expression in all tissue categories
  -> 40678 left
• For mouse: 6791 out of 36701 fragments
  completely no expression in all tissue categories
  -> 29910 left
            Size of data set
• Correlation between gene pairs is calculated:
  the number of pairs is (x2-x)/2 for x genes ->
  millions of data points
• Number of gene pairs is already brought down
  by the ‘no expression gene removal’: in human
  from 1,003,139,236 to 827,329,503, in mouse
  from 673,463,350 to 447,289,095
• For some quick analyses, sets of e.g. 1000
  randomly selected genes were used -> 499,500
  gene pairs
           Uncentered Correlation
• ‘Uncentered’: from 0 to 1
• UC(X,Y)= Σ( X / ( sqrt ( Σ( X2 / N ) ) ) * ( Y / ( sqrt ( Σ( Y2 / N ) ) ) ) / N
• Calculated correlations between gene pairs were used to
  see if the co-expression for orthologous pairs and/or
  paralogous pairs is higher than for ‘unrelated’ pairs
• This was measured by using the KEGG Pathway map
  (release 25)
• The best, however not completely convincing, result was
  found using PCP and not ME:
Correlation KEGG Pathway Check
               • Data points above a correlation
               threshold of 0,9 and 1,0 were left out
               because of very low numbers
               • Only orthologous conserved gene
               pairs have a higher accuracy when
               increasing the correlation threshold
               • May be a combination of PCP and
               ME should be used
               • Another measure could be used:
               same GO category instead of KEGG,
               GO is already annotated by Affymetrix
               • Lots of genes have only an
               expression value in one tissue; this
               correlation method is not really
               suitable -> mutual information analysis
           Mutual Information
• For each tissue category: 0 or 1 (ME/PCP value
  below/above a specified threshold)
• x0 = % of 0s, x1 = % of 1s
• x00 = % of 0-0 pairs, x01 = % of 0-1 pairs, x10 = %
  of 1-0 pairs, x11 = % of 1-1 pairs
• Entropy per gene: -(x0*ln(x0)+x1*ln(x1))
• Entropy per gene pair: -
• MI = Entropy(1) + Entropy(2) – Entropy(1,2)
• 0<=MI<=0,693147
            MI GO Category Check

• Mutual information check using GO Biological Process 3rd level of specification
• Horizontal axis shows log(MI)
• Different lines: different thresholds for defining as a ‘0’ or a ‘1’
• Accuracy indeed seems to be higher for pairs with much mutual information, but
there is also a peak at -9<=log(MI)<-8
• Orthologous/paralogous pairs not checked yet
             Future plans
• Complete mutual information analysis,
  using both KEGG Pathway and GO
  databases; look at orthologous and
  paralogous gene pairs too
• Check alternative splicing
• Speed – license ends at the end of June

Shared By: