					Microarray


Yuki Juan
NTUST
May 26, 2003
Content

   Biology background of microarray
   Design of microarray
   The workflow of microarray
   Image analysis of microarray
   Data analysis of microarray
   Discussion
The Biology Background of Microarray

   The central dogma of life forms
   DNA
   RNA
   Monitoring the expression of genes
        Central Dogma

   DNA Replication
    --ACGCGA--
    --TGCGCT--
   RNA Transcription
    --UGCGCU--
   Protein Translation
    --CYSALA--
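
The example sequences on this slide can be traced in a few lines of Python. This is only a sketch: the codon table is deliberately partial (just the two codons needed here), the function names are illustrative, and strands are written left to right exactly as on the slide, ignoring 5’/3’ orientation.

```python
# Pairing rules for DNA replication and for transcription (T is replaced by U in RNA).
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}
# Partial codon table: only the codons in the slide's example (the full table has 64).
CODON_TABLE = {"UGC": "Cys", "GCU": "Ala"}


def replicate(strand):
    """Return the complementary DNA strand, base by base."""
    return "".join(DNA_PAIR[base] for base in strand)


def transcribe(template):
    """Return the RNA complementary to a DNA template strand."""
    return "".join(RNA_PAIR[base] for base in template)


def translate(mrna):
    """Read the mRNA three bases at a time and look up each codon."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]


template = "ACGCGA"
print(replicate(template))              # TGCGCT
print(transcribe(template))             # UGCGCU
print(translate(transcribe(template)))  # ['Cys', 'Ala']
```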
   [Figure: the central dogma. DNA replicates itself; transcription yields RNA; translation yields Protein]
DNA
   The double helix
        stable
   Nucleotide
        A, T, G, C
   Base pair
        A–T
        G–C
   Oligonucleotide
        short DNA (tens of
         nucleotides, or bps)
    (http://www.nhgri.nih.gov/)
DNA Strand

   DNA has canonical orientation
       read from 5’ to 3’
       antiparallel: one strand has direction
        opposite to its complement’s

        5’ …   TACTGAA … 3’
        3’ …   ATGACTT … 5’
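
A short sketch of the antiparallel rule, using the slide's own sequence. The function names are illustrative. `complement` gives the partner strand as drawn above (read 3’ to 5’), and reversing it gives the same strand in the canonical 5’ to 3’ reading direction.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base partner, written in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Partner strand read in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "TACTGAA"                  # read 5' to 3'
print(complement(top))           # ATGACTT (partner, read 3' to 5')
print(reverse_complement(top))   # TTCAGTA (same partner, read 5' to 3')
```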
Hydrogen Bonds Make DNA Binding Specific

   [Figure: two antiparallel strands, each labeled 5’ to 3’, joined by hydrogen bonds between the bases]

   The force between the two bases of a pair is the
    hydrogen bond. This force lets A pair specifically
    with T (or U in RNA) and C pair specifically with G.
RNA

   Types
       messenger RNA
       ribosomal RNA (rRNA)
       transfer RNA (tRNA)




Gene is expressed by transcribing DNA
into single-stranded mRNA
RNA (Detailed)




                 (http://www.nhgri.nih.gov/)
Reverse Transcription

   [Figure: the central dogma (DNA, RNA, Protein) with an added reverse transcription arrow from RNA back to DNA]

Using reverse transcriptase, we can convert RNA into cDNA.
The Southern Blot
       Basic DNA detection technique that has
        been used for over 30 years, known as the
        Southern blot:
         A “known” strand of DNA is deposited on a solid
          support (e.g. nitrocellulose paper)
         An “unknown” mixed bag of DNA is labeled
          (radioactive or fluorescent)
         The “unknown” DNA solution is allowed to mix with
          the known DNA (attached to the nitrocellulose paper), then
          the excess solution is washed off
         If a copy of the “known” DNA occurs in the “unknown”
          sample, it will stick (hybridize), and the labeled DNA
          will be detected on photographic film
mRNA Represents Gene Function
   When we measure the level of an mRNA, we
    are monitoring the activity of a gene.
   Thus, if we can measure the levels of all
    mRNAs, we can study the expression of
    the whole genome.
   A microarray has the advantage of collecting
    over 10,000 blot-like measurements in a single
    experiment, which makes monitoring
    genome activity possible.
Design of Microarray

   Microarray in different context
   The idea of microarray
   Main type of array chips
    mRNA Levels Compared in Many
    Different Contexts

   Different tissues, same organism (brain v.
    liver)
   Same tissue, same organism (tumor v. non-
    tumor)
   Same tissue, different organisms (wt v.
    mutant)
   Time course experiments (development)
   Other special designs (e.g. to detect spatial
    patterns).
Idea of Microarray

   [Figure: labeled cDNA from geneX, prepared from Cell A and Cell B, is hybridized to the chip.
    The spot of geneX carries the sequence complementary to the colored cDNA; this spot shows red after scanning.]

Over 10,000 Hybridizations Can Be
Done at One Time
Several Types of Arrays
   Spotted DNA arrays
       Developed by Pat Brown’s lab at Stanford
       PCR products of full-length genes (>100 nt)
   Affymetrix gene chips
       Photolithography technology from the
        computer industry allows building many
        25-mers
   Ink-jet microarrays from Agilent
       25-60-mers printed directly on glass
        slides
       Flexible, rapid, but expensive
    Array Fabrication Spotting

•   Use PCR to amplify DNA
•   Robotic "pen" deposits DNA at defined
    coordinates
    •   approximately 1-10 ng per spot
    •   Experimentation with oligos (40, 70 bp)
This machine can make 48 microarrays
simultaneously.
Array Fabrication: Photolithography

•   Light-activated synthesis
    •   synthesize oligonucleotides on glass slides
    •   10^7 copies per oligo in a 24 x 24 µm square
•   Use 20 pairs of different 25-mers per
    gene
    •   Perfect match and mismatch
Array Fabrication: Photolithography
         Affymetrix Microarrays

   [Figure: raw image of a 1.28 cm chip with 50 µm features]

                      ~10^7 oligonucleotides,
                      half perfectly match mRNA (PM),
                      half have one mismatch (MM)
                      Raw gene expression is the intensity
                      difference: PM - MM
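
As a rough illustration of the PM - MM idea, the sketch below averages the difference over the probe pairs of one gene. Averaging across pairs is an assumption made for this example (the slide only states the difference itself), and the intensity values are invented.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene (illustrative summary)."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need the same non-zero number of PM and MM values")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for four probe pairs of a single gene.
pm = [1200.0, 950.0, 1100.0, 1020.0]
mm = [300.0, 280.0, 350.0, 310.0]
print(average_difference(pm, mm))  # 757.5
```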
Agilent cDNA Microarrays and
Oligonucleotide Microarrays

        Agilent delivers printed 60-mer
         microarrays in addition to 25-mer formats.
        The inkjet process uses standard
         phosphoramidite chemistry to deliver
         extremely small volumes (picoliters) of the
         chemicals to be spotted.
The Workflow of Microarray

   Array side:   plate -> plate preparation -> array fabrication -> array
   Sample side:  sample -> RNA extraction -> cDNA synthesis and labeling -> labeled cDNA
   Combined:     array + labeled cDNA -> hybridization -> hybridized array -> scanning
cDNA Synthesis and Direct Labeling
Cy3 and Cy5 cDNA Hybridization onto
the Chip
   e.g. treatment / control
        normal / tumor tissue
Sample Loading
   1. Load from the corner of the cover slip.
      This is time consuming and easily produces bubbles.
   2. Load the sample at the center of the array, then lower the slip
      smoothly.
      Faster, with a lower chance of producing bubbles than method 1.
   3. Load the sample at the side of the array, then put the slip on.
      The solution attaches to the slip as soon as the slip touches it,
      and spreads along with the slip as we slowly lower it.
Scan




       Green: down-regulated
       Red: up-regulated
       Yellow: equal level
Image analysis

   Find the spots
   Convert features into numeric data
   Image normalization
The Algorithms

1. Find spots: find the location of each spot on
  the microarray.

2. Cookie cutter algorithm:
  (1) Assume the pixel intensities of a spot follow
      a Gaussian distribution
  (2) Use the SD or the IQR to identify the feature and
      background of each spot
  (3) Calculate statistics for the pixel population
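
A minimal sketch of IQR-based pixel rejection followed by summary statistics. It assumes that pixels further than 1.42 x IQR beyond the quartiles are rejected; the factor 1.42 is taken from the label in the IQR figure, but how the scanner software applies it is an assumption here, and the function names and pixel values are illustrative.

```python
import statistics


def reject_outlier_pixels(pixels, factor=1.42):
    """Keep pixels within [Q1 - factor*IQR, Q3 + factor*IQR] (assumed rule)."""
    q1, _median, q3 = statistics.quantiles(pixels, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [p for p in pixels if low <= p <= high]


def spot_statistics(pixels):
    """Summary statistics of the pixel population after outlier rejection."""
    kept = reject_outlier_pixels(pixels)
    return {
        "n_pixels": len(kept),
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
        "sd": statistics.stdev(kept),
    }


# Hypothetical spot: eight ordinary pixels and one dust speck at 5000.
pixels = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
print(spot_statistics(pixels))
```

The same function can be applied separately to the feature pixels and to the local background pixels of a spot.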
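Interquartile Range (IQR)

   [Figure: distribution of pixel intensities with the 25%, 50%, and 75% points marked; the IQR spans 25% to 75%.
    Labels: D, K = IQR/2, and 1.42 IQR, with a boundary for rejection on each side.]

   [Figure: a spot of diameter D showing the feature (or cookie), the exclusion zone, and the local background]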
Data Quality
   Irregular size or shape
   Irregular placement
   Low intensity
   Saturation
   Spot variance
   Background variance

   [Figure: example spots labeled indistinguishable, saturated, bad print, misalignment, artifact]
Convert Feature Into Numeric Value

Columns, left to right:
   Spot ID
   Green intensity (Ctrl, D x A - PSL)
   Green background (Ctrl, Bkgd)
   Green b.g.-corrected (Ctrl, sDxA)
   Red intensity (Data, D x A - PSL)
   Red b.g. (Data, Bkgd)
   Red b.g.-corrected (Data, sDxA)
   Ratio (sDxA): Data / Ctrl = (R. b.g.-c) / (G. b.g.-c)
   Systematic name
   Gene function

A_1_1     59358.75        512.92 58845.83 50953.13 1779.913 49173.22 0.835628 YAL003W         translation elongation factor eef1beta
A_1_2      1209.19        512.92   696.271 2522.345 1779.913 742.4323 1.066298 YAR053W        hypothetical protein
A_1_3         1948.2      512.92   1435.28 3100.152 1779.913 1320.239 0.919848 YBL078C        essential for autophagy
A_1_4     4940.806        512.92 4427.886 6670.604 1779.913 4890.691 1.104521 YAL008W         protein of unknown function
A_1_5      1485.59        512.92   972.671 2916.086 1779.913 1136.173 1.168096 YAR062W        putative pseudogene
A_1_6     32642.03        512.92 32129.11 42304.13 1779.913 40524.22 1.261293 YBL087C         60s large subunit ribosomal protein l23.e
A_1_7     6919.441        512.92 6406.521 8540.246 1779.913 6760.333 1.055227 YAL014C
A_1_8     2698.301        512.92 2185.382     4314.47 1779.913 2534.557 1.159778 YAR068W      strong similarity to hypothetical protein yhr214w
A_1_9     7167.958        512.92 6655.038 7379.286 1779.913 5599.373 0.841374 YBL100C         questionable orf
A_1_10    5470.062        512.92 4957.142 6953.799 1779.913 5173.886 1.043724 YAL025C         nuclear viral propagation protein
A_1_11    27879.49        512.92 27366.57     33746.9 1779.913 31966.99 1.168103 YBL002W      histone h2b.2
A_1_12    2589.613        512.92 2076.693 4385.568 1779.913 2605.655 1.254713 YBL107C         hypothetical protein
A_1_13    6196.245        512.92 5683.326 8840.475 1779.913 7060.562 1.242329 YDR044W         coproporphyrinogen iii oxidase
A_1_14     34737.1        512.92 34224.18 36129.62 1779.913     34349.7 1.003668 YDR134C      strong similarity to flo1p, flo5p, flo9p and ylr110
A_1_15    34035.35        512.92 33522.43 27128.53 1779.913 25348.62 0.756169 YDR233C         similarity to hypothetical protein ydl204w
A_1_16    1638.381        512.92 1125.461 2988.042 1779.913 1208.129 1.073453 YDR048C         questionable orf
A_1_17    3873.718        512.92 3360.799 4955.141 1779.913 3175.228 0.944784 YDR139C         ubiquitin-like protein
A_1_18    2433.625        512.92 1920.706 3502.406 1779.913 1722.493 0.896802 YDR252W         strong similarity to egd1p and to human btf3 pro
A_1_19    1800.736        512.92 1287.816 3011.855 1779.913 1231.942 0.956613 YDR053W         questionable orf
A_1_20    1296.689        512.92      783.77 2636.549 1779.913 856.6356 1.092968 YDR149C      questionable orf
A_1_21     3453.24        512.92   2940.32 4968.026 1779.913 3188.113 1.084274 YDR260C        hypothetical protein
A_1_22    10731.55        512.92 10218.63 9307.246 1779.913 7527.333 0.736629 YDR056C         hypothetical protein
A_1_23    6191.309        512.92   5678.39 8808.398 1779.913 7028.485    1.23776 YDR152W      weak similarity to c.elegans hypothetical protein
A_1_24    3589.998        512.92 3077.078 4420.744 1779.913 2640.831 0.858227 YDR269C         questionable orf
A_1_25    27568.34        512.92 27055.42     20856.2 1779.913 19076.29 0.705082 YGL189C      40s small subunit ribosomal protein s26e.c7
A_1_26    1956.182        512.92 1443.262 3150.716 1779.913 1370.803 0.949795 YGL261C         strong similarity to members of the srp1/tip1 fa
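
The ratio column of the table can be reproduced from the four raw columns. The sketch below uses the first two rows of the table; the function name is illustrative.

```python
def background_corrected_ratio(green, green_bg, red, red_bg):
    """(red - red background) / (green - green background)."""
    green_corrected = green - green_bg
    red_corrected = red - red_bg
    if green_corrected <= 0:
        raise ValueError("green signal does not exceed its background")
    return red_corrected / green_corrected


# (green intensity, green background, red intensity, red background) from the table.
spots = {
    "A_1_1": (59358.75, 512.92, 50953.13, 1779.913),
    "A_1_2": (1209.19, 512.92, 2522.345, 1779.913),
}
for spot_id, values in spots.items():
    # Compare with the Ratio column: 0.835628 and 1.066298.
    print(spot_id, background_corrected_ratio(*values))
```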
Data Normalization

   Normalize data to correct for
    variances
       Dye bias
       Location bias
       Intensity bias
       Pin bias
       Slide bias
   Control vs. non-control spots
Data Normalization
   [Figure: two scans side by side. Left: uncalibrated, red light under-detected. Right: calibrated, red and green equally detected.]
Data Normalization

   Assumptions
       The overall mean ratio should be 1
            Most genes are not differentially
             expressed
       The total intensities of the two dyes are equivalent
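
One simple way to apply these assumptions is a global scaling: multiply the red channel by a single constant so that both channels have the same total intensity, then take log ratios. This is a sketch of that idea only, with invented intensities; it is not necessarily the exact procedure used in the experiments shown here.

```python
import math


def global_normalize(red, green):
    """Scale red so total red equals total green; return log2(red/green) per spot."""
    scale = sum(green) / sum(red)
    return [math.log2(scale * r / g) for r, g in zip(red, green)]


# Hypothetical slide where red is under-detected by a factor of two everywhere.
red = [50.0, 100.0, 200.0, 400.0]
green = [100.0, 200.0, 400.0, 800.0]
print(global_normalize(red, green))  # [0.0, 0.0, 0.0, 0.0]
```

After scaling, a spot with equal expression in both samples has a log ratio of 0, i.e. a ratio of 1.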
Intensity Dependent Normalization
After Normalization
Additional Normalization
   Pin dependent
       Similar to intensity dependent fit.
       Compute individual lowess fits for each
        pin group
   Within slide normalization
       After pin dependent normalization, log
        ratios for each pin are centered around
        0
       Scale variance for each pin
            Uses MAD (median absolute deviation)
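
A sketch of the within-slide scaling step. It assumes each pin group is first centered on its median and then divided by its MAD relative to the geometric mean of all the MADs, which is one common convention; the slide itself only says that MAD is used. The pin names and log ratios are invented.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def scale_pin_groups(log_ratios_by_pin):
    """Center each pin group at 0 and give all groups the same MAD."""
    centered = {
        pin: [m - statistics.median(ms) for m in ms]
        for pin, ms in log_ratios_by_pin.items()
    }
    mads = {pin: mad(ms) for pin, ms in centered.items()}  # each must be > 0
    geo_mean = math.exp(sum(math.log(v) for v in mads.values()) / len(mads))
    return {
        pin: [m * geo_mean / mads[pin] for m in ms]
        for pin, ms in centered.items()
    }


# Hypothetical log2 ratios for two pin groups with different offsets and spreads.
data = {
    "pin1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "pin2": [0.8, 1.0, 1.2, 0.6, 1.4],
}
scaled = scale_pin_groups(data)
print({pin: mad(values) for pin, values in scaled.items()})
```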
Additional Normalization

   Dye swap
       Combine relative expression levels
        without explicit normalization
       Compute a lowess fit for
          log2(RR’/GG’)/2 vs. (A + A’)/2
        where A = log2(RG)/2 is the average log intensity of a
        spot and primes mark the dye-swapped slide
       Normalized ratio is
           log2(R/G) - c(A)
        where c(A) is the lowess prediction
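
The per-spot arithmetic behind the dye swap can be sketched without the lowess step. The example assumes that the second slide has the two labels exchanged, so its true log ratio has the opposite sign while the dye bias keeps the same sign. Half the difference of the two log ratios then estimates the expression change, and half the sum, log2(RR’/GG’)/2, estimates the bias that the lowess fit would smooth as a function of intensity. All numbers are invented.

```python
import math


def dye_swap_combine(r, g, r_swap, g_swap):
    """Return (bias-free log2 ratio, estimated dye bias) for one spot."""
    m = math.log2(r / g)                 # original labeling
    m_swap = math.log2(r_swap / g_swap)  # labels exchanged
    log_ratio = (m - m_swap) / 2         # equals log2(R G' / (G R')) / 2
    dye_bias = (m + m_swap) / 2          # equals log2(R R' / (G G')) / 2
    return log_ratio, dye_bias


# Hypothetical spot: true ratio 4 (sample over reference), red over-detected 1.5x.
# Slide 1: sample in red -> R = 4 * 1.5 * 100, G = 100.
# Slide 2: reference in red -> R' = 1.5 * 100, G' = 4 * 100.
log_ratio, dye_bias = dye_swap_combine(600.0, 100.0, 150.0, 400.0)
print(log_ratio, dye_bias)  # approximately 2.0 (= log2 4) and 0.585 (= log2 1.5)
```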
Data analysis

   Data filtering
   Fold change analysis
   Classification
   Clustering
   Future direction
Microarray Data Classification

   [Figure: microarray chips -> images scanned by laser -> gene/value table -> datasets -> data mining and analysis -> prediction for a new sample]

   Gene              Value
   D26528_at         193
   D26561_cds1_at    -70
   D26561_cds2_at    144
   D26561_cds3_at    33
   D26579_at         318
   D26598_at         1764
   D26599_at         1537
   D26600_at         1204
   D28114_at         707

   Datasets
   Class   Sno   D26528   D63874   D63880   …
   ALL     2     193      4157     556
   ALL     3     129      11557    476
   ALL     4     44       12125    498
   ALL     5     218      8484     1211
   AML     51    109      3537     131
   AML     52    106      4578     94
   AML     53    211      2431     209
   …
The Threshold of Spots
   Filtering - remove genes with insufficient
    variation
       Remove poor-quality spots:
        saturated, non-uniform, background too
        high…
       Remove signals with too little variation:
        e.g. MaxVal - MinVal < 500 and
        MaxVal/MinVal < 5
       Statistical filtering (e.g. p-value < 0.01)
       Biological reasons
       Feature reduction for algorithmic reasons
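
The variation filter can be written directly from the example thresholds above. In this sketch a gene is removed only when both conditions hold. The `floor` parameter, which clips very low or negative values before taking the ratio, is an assumption added to keep the fold computation well defined; it is not part of the slide.

```python
def keep_gene(values, min_range=500.0, min_fold=5.0, floor=1.0):
    """Return False for genes with MaxVal - MinVal < 500 and MaxVal/MinVal < 5."""
    clipped = [max(v, floor) for v in values]  # hypothetical floor, see lead-in
    lowest, highest = min(clipped), max(clipped)
    too_flat = (highest - lowest) < min_range and (highest / lowest) < min_fold
    return not too_flat


# Hypothetical expression values across samples.
print(keep_gene([100, 120, 150]))    # False: small range and small fold
print(keep_gene([100, 900, 2000]))   # True: range is 1900
print(keep_gene([10, 100, 300]))     # True: fold is 30
```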
Microarray Data Analysis Types
Differential gene expression
  Fold change analysis
Classification (supervised)
  identify disease
  predict outcome / select best treatment

Clustering (unsupervised)
  find new biological classes / refine
   existing ones
  exploration

…
Differential Gene Expression
   n-fold change
       n typically >= 2
       May hold no biological relevance
       Often too restrictive

   ±2σ expression
       Calculate the standard deviation σ
       Genes with expression more than 2σ
        away from the mean are differentially expressed
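
Both criteria can be sketched in a few lines. Working on log2 ratios for the 2σ rule is an assumption of this example (it makes up- and down-regulation symmetric), and the gene names and ratios are invented.

```python
import math
import statistics


def n_fold_genes(ratios, n=2.0):
    """Genes whose ratio is at least n-fold up or n-fold down."""
    return [gene for gene, r in ratios.items() if r >= n or r <= 1.0 / n]


def two_sigma_genes(ratios):
    """Genes whose log2 ratio lies more than 2 standard deviations from the mean."""
    logs = {gene: math.log2(r) for gene, r in ratios.items()}
    values = list(logs.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [gene for gene, v in logs.items() if abs(v - mean) > 2 * sd]


# Hypothetical ratios: nine unchanged genes and one strongly induced gene.
ratios = {f"g{i}": 1.0 for i in range(1, 10)}
ratios["g10"] = 16.0
print(n_fold_genes(ratios))     # ['g10']
print(two_sigma_genes(ratios))  # ['g10']
```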
Fold Changes: Scatter Plot
   [Figure: log-log scatter plot of 72 (raw) against 72 (control)]
Fold Changes Table
     Description                                                                        GenBank         Fold change
                                                                                        accession No.   6 h      24 h     48 h     72 h
     Group 1
     caspase 10, apoptosis-related cysteine protease                                    U60519          -        -        -        0.471
     CASP8 and FADD-like apoptosis regulator                                            U97075          -        -        -        0.355
     nucleoside diphosphate kinase type 6 (inhibitor of p53-induced apoptosis-alpha)    AF051941        -        -        -        0.376
     Group 2
     caspase 3, apoptosis-related cysteine protease                                     U13738          -        2.301    -        -
     CASP8 and FADD-like apoptosis regulator                                            AF005775        -        2.272    -        -
     Group 3
     caspase 9, apoptosis-related cysteine protease                                     U60521          -        -        2.519    -
     Group 4
     caspase 4, apoptosis-related cysteine protease                                     Z48810          2.615    -        2.796    2.819
     Group 5
     inhibitor of apoptosis protein                                                     AAF19819        -        -        -        5.249
     caspase 7, apoptosis-related cysteine protease                                     U67319          -        -        -        2.19
     caspase 4, apoptosis-related cysteine protease                                     U28976          -        -        -        2.603
     Group 6
     CASP8 and FADD-like apoptosis regulator                                            AF015450        -        -        -        6.912
Classification: Multi-Class
Similar approach:
  select the top genes most correlated to each
   class
  select the best subset using cross-validation
  build a single model separating all classes

Advanced:
  build a separate model for each class vs. rest
  choose the model making the strongest prediction
Popular Classification Methods
Decision trees/rules
  find smallest gene sets, but also false positives
Neural nets
  work well if the number of genes is reduced
SVM
  good accuracy, does its own gene selection,
   hard to understand
K-nearest neighbor
  robust for a small number of genes
Bayesian nets
  simple, robust
Multi-class Data Example

Brain data, Pomeroy et al. 2002,
 Nature (415), Jan 2002
  42 examples, about 7,000 genes, 5
  classes
Selected the top 100 genes most
 correlated to each class
Selected the best subset by testing
 subsets of 1, 2, …, 20 genes, with
 leave-one-out cross-validation for each
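
Leave-one-out cross-validation itself is easy to sketch. The example below uses a 1-nearest-neighbor classifier on invented two-gene data with invented class labels; it is not the model or the data of the study cited above, and it leaves out gene selection, which would normally be repeated inside each fold.

```python
import math


def nearest_neighbor_label(sample, training):
    """Label of the training example closest to the sample (Euclidean distance)."""
    closest = min(training, key=lambda item: math.dist(sample, item[0]))
    return closest[1]


def leave_one_out_accuracy(data):
    """Hold out each example in turn, train on the rest, and count correct calls."""
    correct = 0
    for i, (features, label) in enumerate(data):
        training = data[:i] + data[i + 1:]
        if nearest_neighbor_label(features, training) == label:
            correct += 1
    return correct / len(data)


# Hypothetical expression of two genes in six samples from two classes.
data = [
    ((1.0, 1.2), "class A"), ((0.9, 1.0), "class A"), ((1.1, 0.8), "class A"),
    ((5.0, 5.1), "class B"), ((4.8, 5.3), "class B"), ((5.2, 4.9), "class B"),
]
print(leave_one_out_accuracy(data))  # 1.0
```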
Classification: Other Applications

  Combining clinical and genetic data
  Outcome / treatment prediction
     Age, sex, and stage of disease are useful
     e.g. if the data come from a male patient, the
      diagnosis cannot be ovarian cancer
Clustering

Goals
Find natural classes in the data

Identify new classes / gene
 correlations
Refine existing taxonomies

Support biological analysis /
 discovery
Different Methods
  Hierarchical clustering, SOMs, etc.
SOM clustering

SOM - self-organizing maps
Preprocessing
  filter away genes with insufficient
   biological variation
  normalize gene expression (across
   samples) to mean 0, st. dev. 1, for each
   gene separately
Run the SOM for many iterations
Plot the results
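
The normalization step in the preprocessing list is a per-gene standardization. A minimal sketch, using the sample standard deviation (an assumption; the population standard deviation would serve equally well) and invented values:

```python
import statistics


def standardize(values):
    """Rescale one gene's values across samples to mean 0 and st. dev. 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical expression of two genes across three samples.
expression = {"geneA": [1.0, 2.0, 3.0], "geneB": [100.0, 300.0, 200.0]}
normalized = {gene: standardize(values) for gene, values in expression.items()}
print(normalized["geneA"])  # [-1.0, 0.0, 1.0]
```

After this step, genes with the same shape of profile look alike to the clustering algorithm even when their absolute levels differ.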
SOM & K-Means by GeneSpring
   [Figure: GeneSpring output for SOM and k-means clustering]
Hierarchical Clustering
   The most popular hierarchical clustering
    method in microarray data analysis
    is the agglomerative method
       It works with the data in a bottom-up manner.
            Initially, each data point forms its own cluster. The
             algorithm then repeatedly merges the two clusters that
             are the most similar, i.e. have the shortest distance.
       The algorithm requires computing the
        distance or similarity matrix
            This has O(N^2) complexity and thus is not very
             efficient.
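
A minimal sketch of the bottom-up procedure. It assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and it recomputes distances at every step instead of caching a distance matrix, so it is meant for reading, not for large data sets. The points are invented.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Shortest distance between any member of one cluster and any of the other."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until done."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical two-dimensional expression profiles.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```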
Hierarchical clustering
 Genomic Reprogramming in Response to Oxidant

   [Figure: clustered heat map of 6218 genes at 0, 10, 20, 40, 60, and 120 minutes;
    color scale from fold repression (>9, >6, >3) through 1:1 to fold induction (>3, >6, >9)]

      One-third of genome expression is
      transiently reprogrammed
Future directions
 Algorithms   optimized for small samples (the
  no. of samples will remain small for many
  tasks)
 Integration with other data

    biological networks

    medical text

    protein data

 cost-sensitive classification algorithms

    error cost depends on outcome (don’t
     want to miss treatable cancer), treatment
     side effects, etc.
        Integrate biological knowledge when
        analyzing microarray data (from Cheng
        Li, Harvard SPH)




Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
Microarray Potential Applications
    Biological discovery
         new and better molecular diagnostics
         new molecular targets for therapy
         finding and refining biological pathways
         Mutation and polymorphism detection

    Recent examples
         molecular diagnosis of leukemia, breast
          cancer, ...
         appropriate treatment for genetic signature
         potential new drug targets
                 Microarray Limitations

   Cross-hybridization of sequences with high identity
   Chip-to-chip variation
   True measure of abundance?
   Do mRNA levels reflect protein levels?
       Generally, microarrays do not “prove” new biology; they simply suggest genes
        involved in a process, a hypothesis that will require traditional
        experimental verification.
   What fold change has biological relevance?
   Need cloned ESTs or some sequence knowledge; rare
    messages may go undetected
   Expensive! Not every lab can afford to repeat
    experiments.
   The real limitation is bioinformatics
Additional Information

   Review papers on microarray
       Genomics, gene expression and DNA
        arrays (Nature, June 2000)
       Microarray - technology review (Nature
        Cell Biology, Aug. 2001)
       Magic of Microarray (Scientific
        American, Feb. 2002)
   Molecular biology tutorial
       http://www.lsic.ucla.edu/ls3/tutorials/
     Biological data retrieval systems: Entrez
     http://www.ncbi.nlm.nih.gov/Database/index.html

1.   A retrieval system for searching a number of inter-connected
     databases at the NCBI. It provides access to:
       PubMed: The biomedical literature (Medline)
       Genbank: Nucleotide sequence database
       Protein sequence database
       Structure: three-dimensional macromolecular structures
       Genome: complete genome assemblies
       PopSet: population study data sets
       OMIM: Online Mendelian Inheritance in Man
       Taxonomy: organisms in GenBank
       Books: online books
       ProbeSet: gene expression and microarray datasets
       3D Domains: domains from Entrez Structure
       UniSTS: markers and mapping data
       SNP: single nucleotide polymorphisms
       CDD: conserved domains


2. Entrez allows users to perform various searches.

				