Microarray_data_analysis_WS2004_4spp by SUSB


Microarray Data Analysis

Outline
 • Introduction
 • Overview of technologies
 • Bioinformatics needs
 • Normalization
 • Clustering
 • Promoter Analyses
 • Metabolic Analyses


Why Microarrays?

 • Sequence similarity seems to provide "function" identification of only ~40-60% of genes identified in genome sequencing projects → many lineage- and species-specific genes
 • also, sequence similarity will not identify novel functions of proteins (different functions under different conditions)
 • genes involved in regulation, interaction, or integration of pathways are the most difficult to identify in this context → traditionally identified using genetic (mutant) analysis and biochemically
 • many of these genes are expressed at low levels, or show transient expression, and may have been missed by typical molecular biological/genetic methods




BIO472                                               N. Provart BIO472 Lectures of Mar. 24 + 31, 2004 Slide 1




Functional Genomics

The field of functional genomics seeks to devise and apply technologies that take advantage of the growing body of sequence information to analyze the full complement of genes and proteins encoded by an organism.

Among the major approaches to be used that might provide insight into the possible function of genes are

 • Determining the expression pattern for all genes
 • Determining the expression and the distribution of all proteins
 • "Knocking out" of genes, and subsequent examination of phenotype and/or gene expression patterns
 • Identifying interactions among proteins (two-hybrid analysis and newer bait methods)


Functional Genomics                                        Oliver (2000) Nature 403:601

Level of analysis: Genome
     Definition: Complete set of genes of an organism or its organelles
     Status: Context-independent (modifications to the yeast genome may be made with exquisite precision)
     Method of analysis: Systematic DNA sequencing

Level of analysis: Transcriptome
     Definition: Complete set of messenger RNA molecules present in a cell, tissue or organ
     Status: Context-dependent (the complement of messenger RNAs varies with changes in physiology, development or pathology)
     Method of analysis: Hybridization arrays (Microarrays); SAGE; High-throughput Northern analysis; ESTs

Level of analysis: Proteome
     Definition: Complete set of protein molecules present in a cell, tissue or organ
     Status: Context-dependent
     Method of analysis: Two-dimensional gel electrophoresis, peptide mass fingerprinting; Two-hybrid analysis; peptide/protein microarrays

Level of analysis: Metabolome
     Definition: Complete set of metabolites (low-molecular-weight intermediates) in a cell, tissue or organ
     Status: Context-dependent
     Method of analysis: Nuclear magnetic resonance spectrometry; Mass spectrometry; Infra-red spectroscopy

Overview of Technologies – cDNA Microarrays

[Figure from http://biology.kenyon.edu/courses/biol14/image3/Microarray.gif]


Overview of Technologies – cDNA Microarrays

[Figure from http://www.xenbase.org/xmmr/Marker_pages/CNS/p17-30.gif]







Overview of Technologies – Oligonucleotide Microarrays

[Figure from http://www.med.yale.edu/wmkeck/affymetrix/images/affy.h2.gif]
[Figure from www.tmri.org/gene_exp_web/oligoarray_4.gif]


Overview of Technologies – Oligonucleotide Microarrays

The photolithographic process

[Figure from http://bio5495.wustl.edu/course/week13-microarrays/5]

Application of Bioinformatics w.r.t. Microarrays

 •   probe selection/chip design
 •   image analysis
 •   data filtering and transformation
 •   cluster analysis
 •   functional classification
 •   promoter analyses
 •   metabolic pathway analysis








Pre-manufacture bioinformatics needs

 • Oligonucleotides for selected genes should be designed with similar Tm values, which allows hybridization to occur with similar efficiency
 • If possible, oligonucleotides should discriminate against members of the same family; but caution: for some extremely similar paralogs it is almost impossible to find unique oligonucleotide sequences
 • It should be possible to design oligos having a specific Tm value or a specific length
 • Oligonucleotides should be free from secondary structures and self-annealing tendencies
 • Design oligonucleotides that are unique to the genes of one species and have no homology with another species, e.g. to discriminate bacterial from human genes

http://www.u-vision-biotech.com/english/product_service/service/ServiceArray.htm


Calculation of Tm

For short oligos, less than 20 nts:

     Tm = 2°C (A + T) + 4°C (G + C)

Wallace, R.B., Shaffer, J., Murphy, R.F., Bonner, J., Hirose, T., and Itakura, K. (1979) "Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch." Nucleic Acids Research 6:3543-3547.

For longer pieces of DNA:

     Tm = 81.5 + 0.41(%GC) – 500/L + 16.6 log[M]

Howley, P.M., Israel, M.F., Law, M-F., and Martin, M.A. (1979) "A rapid method for detecting and mapping homology between heterologous DNAs. Evaluation of polyomavirus genomes." J. Biological Chemistry 254:4876-4883.

Most accurate is based on thermodynamic analysis of the melting process:

$$T_m = \frac{\Delta H}{\Delta S + R \ln(C)} - 273.15 + 12.0 \log[\mathrm{Na}^+]$$

Allawi, H.T., SantaLucia, J. Jr. (1997) "Thermodynamics and NMR of internal G.T mismatches in DNA." Biochemistry 36:10581-10594.
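
The first two Tm formulas above are simple enough to evaluate directly. The following is a minimal Python sketch, not a probe-design tool. The function names are hypothetical, L is taken to be the sequence length in nucleotides, [M] is assumed to be the molar monovalent cation concentration, and the logarithm is assumed to be base 10. The thermodynamic formula is left out because it needs ∆H and ∆S values that the slide does not give.

```python
import math


def tm_short_oligo(seq):
    """Tm = 2(A + T) + 4(G + C), the rule quoted for oligos under 20 nt."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def tm_long_dna(seq, molar_cation):
    """Tm = 81.5 + 0.41(%GC) - 500/L + 16.6 log[M].

    Assumptions (not stated on the slide): L is the length in nucleotides,
    [M] is the molar monovalent cation concentration, log is base 10.
    """
    seq = seq.upper()
    length = len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / length
    return (81.5 + 0.41 * gc_percent - 500.0 / length
            + 16.6 * math.log10(molar_cation))


# An 18-mer with 10 A/T and 8 G/C: 2*10 + 4*8 = 52
print(tm_short_oligo("ATGCATGCATGCATGCAT"))
# A 40-mer with 50% GC at 1 M cation concentration
print(tm_long_dna("ATGC" * 10, 1.0))
```

Computing Tm for every candidate oligo with the same formula is what makes it possible to keep only candidates whose Tm falls in a narrow window, as the first design requirement asks.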


Pre-manufacture – cDNA Microarrays

[Figure from http://www.premierbiosoft.com/dnamicroarray/toura/toura.html]


Pre-manufacture – Oligonucleotide Microarrays

[Figure from http://www.premierbiosoft.com/dnamicroarray/toura/toura.html]







Image Processing

This step is required to extract information after a hybridization.

 • The input for this step is scanned images of array fluorescence.
 • A grid must be applied to the image,
 • and then each position and value (with background correction) on the slide must be associated with the appropriate identifier.
 • The output is a table of IDs and their values.

There are many programs, both public and proprietary, for doing image analysis, e.g. ScanAlyze from the Eisen lab. Affymetrix has its own software for reading its chips and generating data files.
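
The grid, background-correction and ID-assignment steps listed above can be illustrated with a toy Python sketch. It is not how ScanAlyze or the Affymetrix software works. It assumes a perfectly regular grid of square cells, uses the border pixels of each cell as the local background, and all names (`quantify_spots`, `spot_size`, the gene IDs) are hypothetical.

```python
from statistics import mean, median


def quantify_spots(image, ids, rows, cols, spot_size):
    """Return {identifier: background-corrected intensity}.

    image     -- 2D list of pixel intensities (scanned fluorescence)
    ids       -- identifiers in row-major grid order
    spot_size -- side length of one square grid cell in pixels (must be >= 3)
    """
    table = {}
    for r in range(rows):
        for c in range(cols):
            top, left = r * spot_size, c * spot_size
            # Cut the grid cell for this spot out of the image
            cell = [image[y][left:left + spot_size]
                    for y in range(top, top + spot_size)]
            edge = (0, spot_size - 1)
            # Border pixels of the cell stand in for local background
            border = [cell[y][x]
                      for y in range(spot_size) for x in range(spot_size)
                      if y in edge or x in edge]
            # Interior pixels are treated as the spot signal
            inner = [cell[y][x]
                     for y in range(1, spot_size - 1)
                     for x in range(1, spot_size - 1)]
            table[ids[r * cols + c]] = mean(inner) - median(border)
    return table


# Toy image: one row of two 4x4 cells; only the first cell has a bright spot
image = [[10] * 4 + [20] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        image[y][x] = 110
print(quantify_spots(image, ["geneA", "geneB"], rows=1, cols=2, spot_size=4))
```

The returned dictionary plays the role of the output table of IDs and values; real software also has to find the grid, handle irregular spots and flag bad features.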




Image Processing

[Figure from http://rana.lbl.gov/manuals/ScanAlyzeDoc.pdf]



Statistical Primer for Microarrays

For Affymetrix chips
 • 0.2% false positives (FP) when the same RNA is hybridized to two different chips
 • 2% FP when replicate RNA preps from independent samples are hybridized to two chips
   (i.e. biological variation is higher than chip-to-chip variation)
For cDNA microarrays the chip-to-chip variation is higher


Type I (FP) and Type II (FN) Errors, and the Fuzzy Blob Model








Type I (FP) and Type II (FN) Errors

[Figure: distributions for H0 (no signal) and H1 (signal) with a cutoff (FP=.01); the Type I Error and Type II Error regions are marked]


Statistical significance in large screening

Types of Error
 • Type I Error (false positives)
 • Type II Error (false negatives)

Error Control in Large Screening
 • Minimize Type I Error to reduce noise
 • Balance this against Type II Error: lowering the Type I Error rate raises the Type II Error rate
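
The trade-off between the two error types can be made concrete with a small Python sketch. It assumes, purely for illustration, that H0 (no signal) and H1 (signal) are normal distributions with standard deviation 1 and means 0 and 2. These numbers are not from the lecture. Only the FP = .01 cutoff is taken from the slide.

```python
from statistics import NormalDist


def type2_error_at_cutoff(h0, h1, false_positive_rate):
    """Place the cutoff so that H0 exceeds it with the given Type I rate,
    then return (cutoff, Type II error), i.e. the H1 mass below the cutoff."""
    cutoff = h0.inv_cdf(1.0 - false_positive_rate)
    type2 = h1.cdf(cutoff)
    return cutoff, type2


# Illustrative distributions (assumed, not from the lecture)
h0 = NormalDist(mu=0.0, sigma=1.0)   # no signal
h1 = NormalDist(mu=2.0, sigma=1.0)   # signal

for fp in (0.05, 0.01, 0.001):
    cutoff, fn = type2_error_at_cutoff(h0, h1, fp)
    print(f"FP={fp}: cutoff={cutoff:.2f}, FN={fn:.2f}")
```

Under these assumed distributions, tightening the false-positive rate pushes the cutoff to the right and lets more of the H1 distribution fall below it.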




Microarray Experimental Design

This aspect is important!

How can we do an array study for 3 biological factors using 4 chips?
 • Answer: don't do it unless it's a general overview study!
   Enormous numbers of false positive findings:
       high quality: false positive error (FPE) 0.5 ~ 2% (x 22K = 110 to 440 false positives)
       medium quality: FPE >6% (x 22K = >1320 FPs)

Can we do without a statistical design?
 • Various statistical factors of variability in microarray experiments:
       gene and variety (types of sample, treatment, time…)
       individual sample, chip array, and dye (microarray)


Microarray Experimental Design

Statistical Validation
 • Baseline distribution: duplicates (or more replicates)
 • Independent chip replicates: independent sample collections and RNA preps are ideal for validating bioinformatic findings, even within biological variability
 • Limited resources: replicates for the most variable biological factors

Replication and Experimental design (blocking)
 • Replicates of genes on a chip
 • Replicated chips for treatments, especially for interaction
 • Blocking errors from individual, array, and dye
   (not interested in identifying them separately, but need to have replicates to "factor them out" together)
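
The false-positive counts quoted for a 22K chip are simply the error rate multiplied by the number of features tested. A minimal Python check of that arithmetic (the function name is hypothetical):

```python
def expected_false_positives(false_positive_rate, n_features):
    """Expected number of false positives when every feature is tested."""
    return round(false_positive_rate * n_features)


N_FEATURES = 22_000  # the "22K" chip from the slide

print(expected_false_positives(0.005, N_FEATURES))  # 110
print(expected_false_positives(0.02, N_FEATURES))   # 440
print(expected_false_positives(0.06, N_FEATURES))   # 1320
```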








What now?

Data are filtered according to various thresholds, e.g. based on ANOVA, t-test, P-A call, expression level, etc.


Selecting significant genes

See paper on this subject:

Cui and Churchill (2003). Genome Biol 4:210. Statistical tests for differential expression in cDNA microarray experiments

 • fold-change: not a statistical test; subject to bias as low expressers have a higher variance.
 • t-test: better, but may suffer from low power due to small sample size. Also, error variance estimates from each gene are not stable
 • S-test (used in Significance Analysis of Microarrays): small positive constant added to denominator of gene-specific t-test. In doing so, genes with small fold-changes will not be selected as significant.
 • regularized t-test: combines information from gene-specific and global average variance estimates by using a weighted average in the denominator of the t-test.
 • B statistic: log posterior odds ratio of differential expression versus non-differential expression.


Test statistics

Fold-change

$$\text{fold-change} = \frac{\mathrm{Average}(\text{treated})}{\mathrm{Average}(\text{control})}$$

t-test

$$t = \frac{\bar{x}_{treated} - \bar{x}_{control}}{\sqrt{\dfrac{s^2_{treated}}{n_{treated}} + \dfrac{s^2_{control}}{n_{control}}}}$$

S-test

$$r = \bar{x}_{treatment} - \bar{x}_{control} \qquad d = \frac{r}{s + s_0}$$

$$s = \sqrt{\left(\frac{1}{n_{control}} + \frac{1}{n_{treatment}}\right) \frac{\sum_{j \in control} (x_j - \bar{x}_{control})^2 + \sum_{j \in treatment} (x_j - \bar{x}_{treatment})^2}{n_{control} + n_{treatment} - 2}}$$


Selecting significant genes

Significance and multiple testing

 • It is convenient to convert the test statistic to a p-value, either by reference to a statistical distribution table or by permutation analysis. For permutation analysis, a minimum of six replicates is recommended for a two-sample comparison. With multiple conditions, fewer replicates are required. Also, if the experiment is too small, permutation analysis can be done by shuffling residual values across genes under the assumption of homogeneous variance.
 • FWER: family-wise error rate is the probability of accumulating one or more false-positive errors over a number of tests;
 • FDR: false-discovery-rate control (or pFDR). Unlike significance level, which is determined before looking at the data, FDR is a post-data measure of confidence, and uses data to estimate the proportion of false positive results that have occurred.
 • For more than two conditions, it is appropriate to use ANOVA with one of three flavours of F test.
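
The three test statistics can be computed for a single gene with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the t statistic is written with the usual square root in the denominator, and the value of s0 in the example is arbitrary and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(treated, control):
    """Ratio of the average treated value to the average control value."""
    return mean(treated) / mean(control)


def t_statistic(treated, control):
    """Difference of means over its standard error (unequal-variance form)."""
    se = sqrt(variance(treated) / len(treated)
              + variance(control) / len(control))
    return (mean(treated) - mean(control)) / se


def d_statistic(treated, control, s0):
    """S-test: r / (s + s0), with s the pooled standard error of r."""
    n_t, n_c = len(treated), len(control)
    m_t, m_c = mean(treated), mean(control)
    sum_sq = (sum((x - m_c) ** 2 for x in control)
              + sum((x - m_t) ** 2 for x in treated))
    s = sqrt((1 / n_c + 1 / n_t) * sum_sq / (n_c + n_t - 2))
    return (m_t - m_c) / (s + s0)


# One gene, three replicates per group (made-up numbers)
treated = [5.0, 6.0, 7.0]
control = [2.0, 3.0, 4.0]
print(fold_change(treated, control))
print(t_statistic(treated, control))
print(d_statistic(treated, control, s0=1.0))  # s0 value is arbitrary here
```

A positive s0 shrinks d relative to t, and the shrinkage is strongest for genes whose s is small.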




Selecting significant genes                                                                                                      And now what?

Tusher et al. (2001) PNAS 98:5116-21. Significance analysis of microarrays                                                       Data analysis is necessary to make sense of large amounts of information.
applied to the ionizing radiation response.                                                                                      Essentially it is an organisational problem.
http://www-stat.stanford.edu/~tibs/SAM/
                                                                                                                                 How can expression data be organised?
 • Convenient Excel Add-in
 • Can be applied to data from Oligo or cDNA arrays, SNP arrays, protein                                                          •   similar expression profiles
     arrays, etc.                                                                                                                 •   groups of genes of interest
 •   Correlates expression data to clinical parameters including treatment,                                                       •   functional classification according to GO or MIPS categories
     diagnosis categories, survival time, paired (before and after), quantitative
     (e.g. tumor volume) and one-class.                                                                                           •   pathway analysis
 •   Correlates expression data with time, to study time trends
 •   Automatic imputation of missing data via nearest neighbor algorithm
 •   Adjustable threshold determines number of genes called significant
 •   Uses data permutations to provide estimate of FDR for multiple testing
 •   Can deal with blocked designs, for example, when treatments are applied
     within different batches of arrays


             Tufte ER, Visual Explanations. 1997. Graphics Press, Cheshire, CT.





Log-transformation, Centering, and Normalization                                                                              When and Why of Log Transformation

Not normalizing or log-transforming the data is acceptable if you really want to                                              “The results of many DNA microarray experiments are fluorescent ratios. Ratio
cluster based on the expression level. Most clustering programs normalize the                                                 measurements are most naturally processed in log space. Consider an
data internally.                                                                                                              experiment where you are looking at gene expression over time, and the
                                                                                                                              results are relative expression levels compared to time 0. Assume at timepoint
If you are working with fold-change, then it is better to log-transform. You can                                              1, a gene is unchanged, at timepoint 2 it is up 2-fold and at timepoint three is
use natural log, log10 or log2. Log2 probably makes most sense, and is                                                        down 2-fold relative to time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most
convenient, as you just use the value as the exponent for 2 to get the actual                                                 applications, you want to think of 2-fold up and 2-fold down as being the same
fold-change. The advantage to using log-values is that values less than 1 are                                                 magnitude of change, but in an opposite direction. In raw ratio space, however,
not compressed in number space, but extend equally in the negative direction                                                  the difference between timepoint 1 and 2 is +1.0, while between timepoint 1
as the positive counterparts do.                                                                                              and 3 is -0.5. Thus mathematical operations that use the difference between
                                                                                                                              values would think that the 2-fold up change was twice as significant as the 2-
Norming and median-centering allow the shape of the change to be better                                                       fold down change. Usually, you do not want this. In log space (we use log base
visualized. The two are usually done sequentially. Median-centering is probably                                               2 for simplicity) the data points become 0,1.0,-1.0. With these values, 2-fold up
better than mean centering, as it is less susceptible to outliers.                                                            and 2-fold down are symmetric about 0. For most applications, we
                                                                                                                              recommend you work in log space.”



                                                                                                                              Michael Eisen, Cluster Manual (http://rana.lbl.gov/index.htm?software/manuals/ClusterTreeView.pdf)
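The timecourse in the quotation above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions, and the ratios are the three values used in the quoted example.

```python
import math

# Raw ratios relative to time 0: unchanged, 2-fold up, 2-fold down
ratios = [1.0, 2.0, 0.5]
log_ratios = [math.log2(r) for r in ratios]

# Differences from timepoint 1 in raw ratio space are asymmetric ...
raw_up = ratios[1] - ratios[0]
raw_down = ratios[2] - ratios[0]

# ... but symmetric about 0 in log2 space
log_up = log_ratios[1] - log_ratios[0]
log_down = log_ratios[2] - log_ratios[0]

# Using each log2 value as the exponent of 2 recovers the fold-change
fold_changes = [2 ** v for v in log_ratios]
```

Here `raw_up` is +1.0 and `raw_down` is -0.5, whereas `log_up` and `log_down` are +1.0 and -1.0.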


Why Log-space is a Good Thing

Raw ratios:

                 t0         t1          t2         t3        t4
  Gene1          1          0.5        0.25    0.125    0.0625
  Gene2          1          2           4          8         16

Log2 ratios:

                 t0         t1          t2         t3        t4
  Gene1          0          -1          -2         -3        -4
  Gene2          0          1           2          3         4

[Figure: Gene1 and Gene2 plotted over t0 to t4. Left panel: raw ratios (y-axis 0 to 18). Right panel: log2 ratios (y-axis -5 to 5).]

When and Why of Mean/Median Centering

“Consider a now common experimental design where you are looking at a large number
of tumor samples all compared to a common reference sample made from a collection of
cell-lines. For each gene, you have a series of ratio values that are relative to the
expression level of that gene in the reference sample. Since the reference sample really
has nothing to do with your experiment, you want your analysis to be independent of the
amount of a gene present in the reference sample. This is achieved by adjusting the
values of each gene to reflect their variation from some property of the series of observed
values such as the mean or median. This is what mean and/or median centering of genes
does. Centering makes less sense in experiments where the reference sample is part of
the experiment, as it is many timecourses.

Centering the data for columns/arrays can also be used to remove certain types of
biases. The results of many two-color fluorescent hybridization experiments are not
corrected for systematic biases in ratios that are the result of differences in RNA
amounts, labeling efficiency and image acquisition parameters. Such biases have the
effect of multiplying ratios for all genes by a fixed scalar. Mean or median centering the
data in log-space has the effect of correcting this bias, although it should be noted that an
assumption is being made in correcting this bias, which is that the average gene in a
given experiment is expected to have a ratio of 1.0 (or log-ratio of 0).
In general, I recommend the use of median rather than mean centering.”
Michael Eisen, Cluster Manual (http://rana.lbl.gov/index.htm?software/manuals/ClusterTreeView.pdf)
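A small sketch of median centering in log space, under the assumption that the data are held as a list of rows (genes) by columns (arrays). The helper names and the toy matrix are hypothetical and are not taken from Cluster itself.

```python
import statistics

def median_center(values):
    """Subtract the median so the series is centered on 0."""
    m = statistics.median(values)
    return [v - m for v in values]

def center_genes(matrix):
    """Center each row (gene) on its own median."""
    return [median_center(row) for row in matrix]

def center_arrays(matrix):
    """Center each column (array) on its own median."""
    centered_cols = [median_center(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*centered_cols)]

# Toy log2 ratios: 3 genes x 2 arrays. The second array carries a 2-fold
# multiplicative bias, which is a constant +1 in log2 space.
log_ratios = [[0.0, 1.0],
              [1.0, 2.0],
              [-1.0, 0.0]]
corrected = center_arrays(log_ratios)
```

After column centering the two arrays agree, which illustrates how a fixed scalar bias in ratio space is removed by subtracting a constant in log space.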








When and Why of Normalization                                                                                                                             Centering and Normalization
                                                                                                                                                     A.                                                                                  B.
“Normalization sets the magnitude (sum of the squares of the values) of a
row/column vector to 1.0.”



   That is, divide each value of the vector by the square root of the sum of the squares of its values.




Michael Eisen, Cluster Manual (http://rana.lbl.gov/index.htm?software/manuals/ClusterTreeView.pdf)
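The normalization described above can be sketched as follows. The function name `normalize` and the guard for an all-zero vector are assumptions made for this illustration.

```python
import math

def normalize(vector):
    """Scale a row/column vector so that the sum of its squared values is 1.0."""
    magnitude = math.sqrt(sum(v * v for v in vector))
    if magnitude == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / magnitude for v in vector]

unit = normalize([3.0, 4.0])           # magnitude of the input is 5
check = sum(v * v for v in unit)       # sum of squares after normalization
```

For the input [3.0, 4.0] the result is approximately [0.6, 0.8], whose squares sum to 1.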


     Centering and Normalization                                                                  Centering and Normalization – continued

[Figures: panels A and B on each slide; images not reproduced.]








     Clustering Methods

      •   Hierarchical
      •   k-means
      •   SOM
      •   Principal Component, SVM

     Hierarchical Clustering – Pearson Correlation Coefficient

     The Pearson correlation coefficient for any two vectors (series of numbers)
     $X = \{X_1, X_2, \ldots, X_N\}$ and $Y = \{Y_1, Y_2, \ldots, Y_N\}$ is defined as

     $r = \dfrac{1}{N} \sum_{i=1}^{N} \left(\dfrac{X_i - \bar{X}}{\sigma_X}\right) \left(\dfrac{Y_i - \bar{Y}}{\sigma_Y}\right)$

     where $\bar{X}$ is the average of the values in X and $\sigma_X$ is the standard deviation of these
     values (and likewise for Y). Equivalently, r is the dot product of the two vectors after each has
     been centered and scaled by its standard deviation, divided by N.




                                                                                                  Michael Eisen, Cluster Manual (http://rana.lbl.gov/index.htm?software/manuals/ClusterTreeView.pdf)
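A direct transcription of the Pearson formula into Python, as a sketch. It assumes the population standard deviation (dividing by N), which is what makes r fall between -1 and 1 with this definition; the function name and example profiles are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by N, an assumption of this sketch)
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / n

# Two hypothetical log2 profiles that mirror each other
gene1 = [0.0, -1.0, -2.0, -3.0, -4.0]
gene2 = [0.0, 1.0, 2.0, 3.0, 4.0]
r_opposite = pearson(gene1, gene2)   # perfectly anti-correlated profiles
r_same = pearson(gene1, gene1)       # a profile compared with itself
```

A constant series has zero standard deviation and would cause a division by zero here, so such rows need to be filtered out beforehand.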


Hierarchical Clustering                                                                                                               K-means clustering

Once this matrix of distances is computed, the clustering begins. The process
used by Cluster is agglomerative hierarchical processing, which consists of
repeated cycles where the two closest remaining items (those with the smallest
distance) are joined by a node/branch of a tree, with the length of the branch
set to the distance between the joined items.




 •   Average Linkage Clustering is UPGMA.
 •   Complete Linkage Clustering: this method tends to produce very tight clusters of similar cases.
 •   Single Linkage Clustering: this method produces long chains which form loose, straggly clusters.


 http://149.170.199.144/multivar/ca_alg.htm                                                                                                      Bergeron, Bioinformatics Computing, p. 255-256 (Fig. 6-23)
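The agglomerative procedure and the three linkage rules can be sketched in plain Python. This is a simplified illustration rather than the implementation used by Cluster; the function name `agglomerate`, the tuple representation of clusters, and the toy distances are assumptions.

```python
def agglomerate(dist, linkage="average"):
    """Repeatedly join the two closest clusters.

    dist is a symmetric matrix of pairwise distances between items.
    Returns a list of merges (members_a, members_b, distance).
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def between(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "single":      # closest pair of members
            return min(pairwise)
        if linkage == "complete":    # farthest pair of members
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average linkage (UPGMA)

    while len(clusters) > 1:
        # Find the pair of remaining clusters with the smallest distance
        d, ia, ib = min((between(a, b), ia, ib)
                        for ia, a in enumerate(clusters)
                        for ib, b in enumerate(clusters) if ia < ib)
        a, b = clusters[ia], clusters[ib]
        merges.append((a, b, d))  # d is the height of the new node
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(a + b)
    return merges

# Four hypothetical items placed on a line at 0, 1, 3 and 7
points = [0, 1, 3, 7]
dist = [[abs(p - q) for q in points] for p in points]
single = [m[2] for m in agglomerate(dist, "single")]
complete = [m[2] for m in agglomerate(dist, "complete")]
average = [m[2] for m in agglomerate(dist, "average")]
```

On this toy input the node heights differ by linkage rule: single linkage joins at the smallest distances, complete linkage at the largest, and average linkage in between.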






Ordering of Genes                                                                                                                     Ordering of Genes – Presorting using SOMs

                                                                                                                                      The Self-Organizing Maps have been used in such applications as:

                                                                                                                                       •   Automatic speech recognition
                                                                                                                                       •   Clinical voice analysis
                                                                                                                                       •   Monitoring of the condition of industrial plants and processes
                                                                                                                                       •   Cloud classification from satellite images
                                                                                                                                       •   Analysis of electrical signals from the brain
                                                                                                                                       •   Organization of and retrieval from large document collections
                                                                                                                                       •   Analysis and visualization of large collections of statistical data (e.g.
                                                                                                                                           macroeconomic data)
                                                                                                                                       •   microarray data analysis


The ordering for any given tree is not unique. There is a family of 2^(N-1) orderings
consistent with any tree of N items; you can flip any node on the tree
(exchange the bottom and top branches) and you will get a new ordering that is
equally consistent with the tree.
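The count of orderings can be verified by brute force on a small tree: a binary tree of N items has N-1 internal nodes, and each can be flipped independently. The nested-tuple representation and the function name below are assumptions made for this sketch.

```python
def orderings(tree):
    """All leaf orderings obtainable by flipping nodes of a binary tree.

    A tree is either a leaf (any non-tuple value) or a pair (left, right).
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    result = []
    for l in orderings(left):
        for r in orderings(right):
            result.append(l + r)  # node as drawn
            result.append(r + l)  # node flipped
    return result

# Hypothetical tree of N = 4 genes: ((a, b), (c, d))
tree = (("a", "b"), ("c", "d"))
all_orders = orderings(tree)
n_items = 4
```

For this tree `len(all_orders)` equals 2 ** (n_items - 1), and every ordering keeps a next to b and c next to d.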

Ordering Genes – Presorting using SOMs                                                                                 Clustering Example




                                                                                                                              log2     cluster


                                                                                                                                       SOM                                    cluster
Mount, Bioinformatics, p. 522. (Fig 10-12)   Chen et al. (2002), Plant Cell 14:559-574








Clustering - Subclusters                                                                                               Promoter Analysis

                                                                                                                       Gibbs Sampler: A common method for motif identification, which finds
                                                                                                                       application in diverse areas (banking, speech recognition, etc.). This is a
                                                                                                                       probabilistic sequence model method. The other major group of motif-finding
                                                                                                                       programs are based on word counting methods.

                                                                                                                       Searches for the statistically most probable motifs in unaligned sequences
                                                                                                                       Lawrence et al. 1993, Neuwald et al. 1994, Liu et al. 1995, Thijs et al. 2002.




                                                                                                                       Example for 29 sequences

                                                                                                                       First, guess a width for the motif (this can be done automatically). In our
                                                                                                                       example we set it to 6.




Promoter Analysis – Gibbs Sampler: Predictive update step

1. Choose a random position for the start of the motif in all but one of the
   sequences. The left-out sequence is not used yet.

2. Estimate the amino acid (or nt) frequencies in the motif columns of all
   but the left-out sequence. Also estimate background frequencies.

Promoter Analysis – Gibbs Sampler: Sampling step

3. Scan the left-out sequence and estimate the probability of finding the motif at
   any position. Calculate an odds ratio score for each position:

   $A_x = \dfrac{P_{\text{observed}}}{P_{\text{background}}}$

4. Add up all of the above odds scores and then divide the odds score for each
   position by the total to obtain the probability that the motif is at that position.

5. These probabilities are used as weights to decide a probable location of the motif
   in the left-out sequence.






Promoter Analysis – Gibbs Sampler                                                                                 Promoter Analysis – Gibbs Sampler

6. Repeat Steps 2-5 >100 times...it works!                                                                        Gibbs Sampler has been modified to

                                                                                                                   •   search for multiple motifs in the same sets of sequences
                                                                                                                   •   seek a pattern in only a fraction of the input sequences
                                                                                                                   •   look for motifs of different widths
                                                                                                                   •   avoid being locked into a suboptimal solution by shifting the current
                                                                                                                       alignments a certain number of positions to the right and left

                                                                                                                  Motif Sampler http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html.
                                                                                                                  Will examine a collection of promoters of e.g. co-regulated genes for common
                                                                                                                  motifs.

                                                                                                                  Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y. (2002) J Comput Biol 9(2):447-64. A Gibbs sampling method to
                                                                                                                  detect overrepresented motifs in the upstream regions of coexpressed genes.
Then, for a new random alignment (i.e. Step 1), repeat the entire procedure all
over again. The goal is to find the most probable pattern common to all of the
sequences by sliding them back and forth until the ratio of the motif probability
to the background probability is at a maximum.
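Steps 1 to 6 can be sketched as a small site sampler. This is a simplified illustration, not the code of any published Gibbs sampler or of Motif Sampler: the function names, the pseudocount, cycling through the sequences as the left-out one, and the toy sequences are all assumptions, and sequences are assumed to contain only letters of `alphabet`.

```python
import random

def normalise(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def gibbs_motif(seqs, width, iterations=100, alphabet="ACGT",
                pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Step 1: random motif start in every sequence
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for step in range(iterations * len(seqs)):
        left_out = step % len(seqs)  # each sequence is left out in turn
        # Step 2: motif-column and background frequencies from the others
        motif = [{a: pseudocount for a in alphabet} for _ in range(width)]
        background = {a: pseudocount for a in alphabet}
        for i, (s, p) in enumerate(zip(seqs, starts)):
            if i == left_out:
                continue
            for k, ch in enumerate(s):
                if p <= k < p + width:
                    motif[k - p][ch] += 1
                else:
                    background[ch] += 1
        motif = [normalise(col) for col in motif]
        background = normalise(background)
        # Step 3: odds ratio A_x for every start x in the left-out sequence
        target = seqs[left_out]
        odds = []
        for x in range(len(target) - width + 1):
            a_x = 1.0
            for k in range(width):
                ch = target[x + k]
                a_x *= motif[k][ch] / background[ch]
            odds.append(a_x)
        # Step 4: turn the odds into probabilities
        total = sum(odds)
        probs = [a / total for a in odds]
        # Step 5: sample a new start using the probabilities as weights
        starts[left_out] = rng.choices(range(len(probs)), weights=probs)[0]
    return starts

# Hypothetical toy sequences sharing the word TTGAC
toy = ["ACGTTGACCA", "TTGACAGGTC", "GGCATTGACA"]
starts = gibbs_motif(toy, width=5, iterations=20, seed=1)
```

Because the procedure is stochastic, a single run can settle on a suboptimal alignment; rerunning with different `seed` values corresponds to restarting from a new random alignment as described above.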

Promoter Analysis – Motif Sampler                                                                     Promoter Analysis – Motif Sampler








Motif Sampler                                                                                         Functional Classification

                                                                                                      Recall GO – Gene Ontologies
                                                                                                      for Molecular Function,
                                                                                                      Biological Process and
                                                                                                      Cellular Component for a
                                                                                                      given Gene Product (insofar
                                                                                                      as known). MIPS also has a
                                                                                                      categorization system.


Seq      Motif   Posn.   Motif Seq.   P-value
 1        1      79      TAGCTACA     0.961
 2        1      79      TAGCTACA     0.961
 3        1      280     CTGCTACC     0.974
 5        1      169     TAGCTACC     0.993
 6        1      186     CAGCTACC     0.997
 7        1      186     CAGCTACC     0.997
 8        1      84      CAGCTGCC     0.987
 9        1      142     TAGCTACC     0.995
 10       1      132     AAGCTACC     0.954
                                                                                                                                          Number of Genes with given Classification

   Pathway Analysis                                                                                                   Pathway Analysis




http://www.arabidopsis.org/tools/aracyc/







   Examples from the Literature

     • DeRisi et al. (1997) Science 278: 680-686. Exploring the Metabolic and
         Genetic Control of Gene Expression on a Genomic Scale.
     • Chen et al. (2002) Plant Cell 14: 559-574. Expression Profile Matrix of
         Arabidopsis Transcription Factor Genes Suggests their putative Function in
         Response to Environmental Stress.
     • Harmer et al. (2000) Science 290: 2110-2113. Orchestrated transcription of
         key pathways in Arabidopsis by the circadian clock.
     • Zhu et al. (2003) Plant Biotech J 1: 59-70. Transcriptional control of nutrient
         partitioning during rice grain filling.
     • Sorlie et al. (2001) PNAS 98: 10869-74. Gene expression patterns of breast
         carcinomas distinguish tumor subclasses with clinical implications.
     • Marton et al. (1998) Nature Medicine 4: 1293-1301. Drug target validation
         and identification of secondary drug target effects using DNA microarrays.
     • Behr et al. (1999) Science 284: 1520-1523. Comparative Genomics of BCG
         Vaccines by Whole-Genome DNA Microarray.

   Example – DeRisi et al. paper on Diauxic Shift

   Background

   Seminal paper using 6K yeast chips to study changes in gene expression upon
   shift from anaerobic (fermentation) to aerobic (respiration) metabolism.

   Experimental

     • 7 timepoints taken over course of growth in a sugar-rich medium.
     • glucose concentration measured at each time point.
     • RNA then isolated and hybridised to 6K chips, and results compared to t0.
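
The comparison to t0 in the DeRisi et al. design is usually expressed as a log2 ratio per gene and timepoint. A minimal sketch, assuming one signal value per timepoint with t0 first; the numbers are made up and the function name is hypothetical.

```python
import math

def log2_ratios_vs_t0(signals):
    """Express each timepoint as log2(signal / signal at t0)."""
    t0 = signals[0]
    return [math.log2(s / t0) for s in signals]

# Made-up signals for one gene: t0 first, then later timepoints.
gene_signals = [100.0, 200.0, 50.0]
print(log2_ratios_vs_t0(gene_signals))
```

On this scale a two-fold induction and a two-fold repression are symmetric around zero (+1 and -1).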




Example – DeRisi et al. paper on Diauxic Shift                                                        Example – DeRisi et al. paper on Diauxic Shift












Pathway Analysis

Example – Circadian Rhythm Analysis (Harmer et al. paper)

Background

Circadian rhythms control large numbers of biological processes in eukaryotes.
The circadian clock allows organisms to anticipate daily changes in the
environment such as the onset of dawn and dusk, providing them with an
adaptive advantage.

Experimental

 • samples of Arabidopsis plants collected every 4 hours over two days.
 • sets of genes whose expression follows a sinusoidal function with a
   24-hour period identified. Peaks identified for every timepoint window.
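
One way to test for a 24-hour sinusoidal pattern is to fit a cosine of fixed period to each gene's time series and read off the amplitude and peak time. The sketch below is an assumption for illustration, not necessarily the procedure used by Harmer et al.; it relies on the sampling scheme described above (every 4 hours over two days), for which the least-squares solution has a closed form.

```python
import math

def fit_24h_cosine(times_h, values, period_h=24.0):
    """Least-squares fit of y = mean + A*cos(2*pi*(t - peak)/period).

    The closed form used here is exact only when samples are evenly
    spaced and cover a whole number of periods.
    """
    n = len(values)
    w = 2.0 * math.pi / period_h
    mean = sum(values) / n
    a = 2.0 / n * sum(y * math.cos(w * t) for t, y in zip(times_h, values))
    b = 2.0 / n * sum(y * math.sin(w * t) for t, y in zip(times_h, values))
    amplitude = math.hypot(a, b)
    peak_h = (math.atan2(b, a) / w) % period_h
    return mean, amplitude, peak_h

# Synthetic gene peaking 8 hours after the start of the series.
times = list(range(0, 48, 4))
synthetic = [1.0 + 0.5 * math.cos(2 * math.pi * (t - 8) / 24) for t in times]
mean, amplitude, peak_h = fit_24h_cosine(times, synthetic)
print(round(mean, 3), round(amplitude, 3), round(peak_h, 3))
```

Genes can then be binned by fitted peak time, which corresponds to identifying peaks for each timepoint window.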




Example – Harmer et al. paper                                                                                   Example – Harmer et al. paper




Photosynthesis genes peak near the middle of the subjective day. Data were
normalized such that the median signal strength for each gene over all time points
was 1. The average signal strength at each time point was then graphed as a ratio
relative to the median signal strength of that gene.

Blue: LHCA genes
Pink: LHCB genes
Red: Photosystem I genes
Green: Photosystem II genes

Phenylpropanoid biosynthesis genes peak before subjective dawn. The gene
encoding the Myb transcription factor PAP1 is in blue. The red traces represent
phenylpropanoid biosynthesis genes. Phenylpropanoid biosynthetic genes
encoding all enzymes indicated in red are clock-controlled.
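
The per-gene median normalisation described for the Harmer et al. plots (median signal of each gene over all time points set to 1) can be sketched as follows. The values are made up and the function name is hypothetical.

```python
from statistics import median

def normalize_to_median(signals):
    """Scale one gene's time series so that its median signal is 1."""
    m = median(signals)
    return [s / m for s in signals]

# Made-up signal strengths for one gene across five time points.
example = [200.0, 400.0, 800.0, 400.0, 100.0]
print(normalize_to_median(example))
```

After this step genes with very different absolute signal strengths can be drawn on a common axis.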





Example – Harmer et al. paper                                                                                   Example – Harmer et al. paper




dark blue: putative starch kinase (R1 homolog)
gold: β-amylase
red: putative fructose-bisphosphate aldolase, plastidic form
red: putative fructose-bisphosphate aldolase, predicted to be plastidic
light blue: a putative sugar transporter
green: a sucrose-phosphate synthase homolog

Genes implicated in cell elongation are circadian-regulated.
Red: genes encoding the auxin efflux carriers PIN3 and PIN7
Green: a putative expansin
Light Blue: a putative polygalacturonase
Dark Blue: an aquaporin -TIP
Gold: Three enzymes implicated in cell wall synthesis (two cellulose synthase
isologs, and a gene similar to dTDP-D-glucose 4,6-dehydratase)

Example – Harmer et al. paper

A cluster of 31 clock-controlled genes containing an AAAATATCT promoter
evening element. Promoters of clock-controlled genes were scanned for
overrepresented elements using AlignACE and ScanACE. The evening element
was not overrepresented in any other circadian phase cluster.

Example – Transcription Factor Analysis (Chen et al. paper)

Background

Use a datamining approach to deduce the function of known or putative
transcription factors in Arabidopsis.

Experimental

 • 57 datasets from plants subjected to different abiotic and biotic stresses.
   RNA samples were submitted by different labs.
 • 402 TFs on the 8K A.th. GeneChip were identified based on annotation and
   the presence of conserved motifs for AP2/EREBPs, Myb proteins, bZIPs,
   and WRKY zinc finger proteins.
 • cluster analysis was performed and interesting clusters were examined
 • pathogen cluster was examined for novel motifs
 • For cold-cluster, promoters analyzed for presence of known motifs.
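
Checking promoters for a known element, such as the AAAATATCT evening element of Harmer et al. or the known motifs examined in the cold cluster, amounts to a string search on both strands. The sketch below uses exact matching only; it is not ScanACE, which scores sites against a weight matrix. The promoter fragments and gene names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_motif(promoter, motif):
    """Return (position, strand) for every exact match on either strand."""
    promoter = promoter.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = promoter.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = promoter.find(pattern, start + 1)
    return sorted(hits)

EVENING_ELEMENT = "AAAATATCT"
# Hypothetical promoter fragments, for illustration only.
promoters = {
    "geneA": "GGGAAAATATCTCCC",
    "geneB": "CCAGATATTTTGG",
    "geneC": "ACGTACGTACGT",
}
for name, seq in promoters.items():
    print(name, find_motif(seq, EVENING_ELEMENT))
```

Counting the promoters with at least one hit in a cluster, and comparing with the count in all promoters on the chip, gives a first impression of overrepresentation.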





                                                                                                                           Alignment of TGAs from Arabidopsis




                            Chen et al. (2002), Plant Cell 14:559-574


Example – Transcription Factor Analysis

Promoters of 41 genes in a pathogen-inducible cluster
examined for presence of novel motif with MotifSampler,
available at PlantCARE. A W-box-like element was
identified [T|C|G][T|C|G][A|T]GAC[C|T]T.

Example – Grain Filling Analysis (Zhu et al. paper)

Background

Examine gene expression profiles from various rice tissues, including
developing seed. Look for co-regulated clusters, isoform expression.

Experimental

 • The mRNA expression levels of 21,000 genes in 33 rice samples, including
   17 from various stages of grain filling, were examined with the Rice GeneChip.
 • genes found to be increasing in expression over the course of grain filling
   were functionally classified.
 • starch biosynthetic genes (isoforms) were examined over the course of
   grain filling.
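
The W-box-like consensus reported for the pathogen-inducible cluster of Chen et al., [T|C|G][T|C|G][A|T]GAC[C|T]T, maps directly onto a regular expression. A minimal sketch that searches the forward strand only; the test sequence is made up and the function name is hypothetical.

```python
import re

# The bracketed consensus rewritten with regular-expression character classes.
W_BOX_LIKE = re.compile(r"[TCG][TCG][AT]GAC[CT]T")

def w_box_sites(promoter):
    """Return (position, matched sequence) for non-overlapping matches."""
    return [(m.start(), m.group()) for m in W_BOX_LIKE.finditer(promoter.upper())]

# Made-up promoter fragment, for illustration only.
print(w_box_sites("aaTTTGACCTaaCGAGACTTaa"))
```

To cover both strands, the same search can be repeated on the reverse complement of each promoter.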




Chen et al. (2002), Plant Cell 14:559-574






Example – Grain Filling Analysis (Zhu et al. paper)                                                                                                    Functional Classification – revisited




                                               from http://www.riceweb.org/Plant.htm                                                               Zhu et al. (2003), PBJ 1:59-70.


     Pathway Analysis                                                                           Example – Sorlie et al. paper




Zhu et al. (2003), PBJ 1:59-70.






Example – Sorlie et al. paper

Example – Drug target validation (Marton et al. paper)

Background

Good drugs are potent and specific: i.e., they must have strong effects on a
specific biological pathway and minimal effects on all other pathways.
Confirmation that a compound inhibits the intended target (drug target
validation) and the identification of undesirable secondary effects are among
the main challenges in developing new drugs.

Experimental

 • yeast cells treated with FK506 and Cyclosporin A, samples taken before
   and after treatment.
 • Also, null mutants of immunophilins examined for expression patterns,
   before and after treatment.




Example – Marton et al. paper                                                               Example – Marton et al. paper








Example – Marton et al. paper                                                               Example – Marton et al. paper




Example – Marton et al. paper                                                                                Example – Marton et al. paper




                                                                                                                          http://www-sequence.stanford.edu/group/yeast_deletion_project/








Example – BCG Vaccine Effectivity (Behr et al. paper)

Background

BCG vaccines are live attenuated strains of Mycobacterium bovis administered
to immunize against TB. However, some daughter strains have become
ineffective at conferring resistance to this disease.

Experimental

 • microarrays covering entire genome of M. tuberculosis H37Rv were
   generated.
 • hybridized with genomic DNA isolated from 13 M. bovis strains used for
   vaccination.

Example – Behr et al. paper

red: M. tuberculosis H37Rv
yellow: equal hybridisation
green: BCG-Danish 1331
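
In a genomic DNA hybridisation of this kind, a spot dominated by the H37Rv signal suggests that the corresponding region is missing from the BCG strain. A minimal sketch of flagging such spots by log2 ratio; the cutoff, the intensities and the ORF names are assumptions for illustration and are not taken from Behr et al.

```python
import math

def flag_candidate_deletions(intensities, min_log2_ratio=1.0):
    """Return ORFs whose reference/test log2 ratio reaches a cutoff.

    intensities maps an ORF name to (reference_signal, test_signal), where
    the reference is M. tuberculosis H37Rv and the test is a BCG strain.
    """
    flagged = {}
    for orf, (reference, test) in intensities.items():
        ratio = math.log2(reference / test)
        if ratio >= min_log2_ratio:
            flagged[orf] = ratio
    return flagged

# Made-up spot intensities and placeholder ORF names, for illustration only.
spots = {"orf_1": (1000.0, 950.0), "orf_2": (1200.0, 150.0), "orf_3": (800.0, 820.0)}
print(flag_candidate_deletions(spots))
```

Runs of adjacent flagged ORFs are stronger evidence for a deleted region than isolated spots.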




Example – Behr et al. paper

Gene Expression Databases

 • Stanford Microarray Database
   http://genome-www5.stanford.edu/MicroArray/SMD/
 • ArrayExpress
   http://www.ebi.ac.uk/microarray/ArrayExpress/arrayexpress.html
 • GEO
   http://www.ncbi.nlm.nih.gov/geo/
 • NASCArrays (Arabidopsis Gene Expression DB)
   http://ssbdjc2.nottingham.ac.uk/narrays/experimentbrowse.pl

Recall that a lot of information, such as source of tissue, age, microarray
element identifiers, identifier annotation, and so on, is needed to be able to
interpret gene expression datasets correctly.
The MIAME (Minimum Information About a Microarray Experiment)
specification was drafted to ensure this information is provided when data are
submitted to such databases. Although the Stanford database currently doesn’t
use MIAME, many journals are now requiring researchers to provide microarray
data along with MIAME-compliant information if they wish to publish results
from microarray experiments. MAGE-ML is a further development of this spec.
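
The idea behind such a specification can be illustrated with a simple completeness check on a dataset record before submission. The field names below are assumptions loosely based on the items listed above; they are not the official MIAME checklist.

```python
# Illustrative field names only; not the official MIAME checklist.
REQUIRED_FIELDS = (
    "tissue_source",
    "age",
    "element_identifiers",
    "identifier_annotation",
)

def missing_annotation(record):
    """Return required fields that are absent or empty in a dataset record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# Hypothetical record with one required item missing.
record = {
    "tissue_source": "rosette leaf",
    "age": "4 weeks",
    "element_identifiers": ["probe_0001"],
}
print(missing_annotation(record))
```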


								