Microarray
Yuki Juan, NTUST
May 26, 2003

Content
• Biology background of microarray
• Design of microarray
• The workflow of microarray
• Image analysis of microarray
• Data analysis of microarray
• Discussion

The Biology Background of Microarray
• The central dogma of life forms
• DNA
• RNA
• Monitoring the expression of genes

Central Dogma
• DNA (replication): --ACGCGA-- paired with --TGCGCT--
• RNA (transcription): --UGCGCU--
• Protein (translation): --CYS-ALA--
• Flow of information: DNA (replication) -> transcription -> RNA -> translation -> Protein

DNA
• The double helix is stable
• Nucleotides: A, T, G, C
• Base pairs: A–T, G–C
• Oligonucleotide: short DNA (tens of nucleotides, or bp)
(http://www.nhgri.nih.gov/)

DNA Strand
• DNA has a canonical orientation: it is read from 5’ to 3’
• Antiparallel: each strand runs in the direction opposite to its complement
  5’ … TACTGAA … 3’
  3’ … ATGACTT … 5’

Hydrogen Bonds Make DNA Binding Specific
(Figure: hydrogen bonds between the two strands, with the 5’ and 3’ ends labeled.)
• The force between the two bases of a pair is the hydrogen bond. This force lets A–T (or A–U) and C–G match specifically.

RNA
• Types: messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA)
• A gene is expressed by transcribing DNA into single-stranded mRNA

RNA (Detailed)
(http://www.nhgri.nih.gov/)

Reverse Transcription
• Using reverse transcriptase, we can convert RNA into cDNA.

The Southern Blot
A basic DNA detection technique that has been used for over 30 years, known as the Southern blot:
• A “known” strand of DNA is deposited on a solid support (i.e.
nitrocellulose paper)
• An “unknown” mixed bag of DNA is labeled (radioactive or fluorescent)
• The “unknown” DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then the excess solution is washed off
• If a copy of the “known” DNA occurs in the “unknown” sample, it will stick (hybridize), and the labeled DNA will be detected on photographic film

mRNA Represents Gene Function
• When we measure the level of an mRNA, we are monitoring the activity of a gene.
• Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome.
• A microarray has the advantage of yielding over 10,000 blot-like measurements in a single experiment, which makes monitoring genome activity possible.

Design of Microarray
• Microarray in different contexts
• The idea of microarray
• Main types of array chips

mRNA Levels Compared in Many Different Contexts
• Different tissues, same organism (brain v. liver)
• Same tissue, same organism (tumor v. non-tumor)
• Same tissue, different organisms (wt v. mutant)
• Time course experiments (development)
• Other special designs (e.g. to detect spatial patterns)

Idea of Microarray
• Cell A, Cell B
• Labeled cDNA from geneX
• Hybridization to chip
• Spot of geneX with the complementary sequence of the colored cDNA
• This spot shows red color after scanning.
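
The hybridization idea above (a labeled cDNA sticks only to the spot that carries its complementary sequence) can be illustrated with a small sketch. This is a deliberate simplification: it assumes perfect Watson-Crick matching only, whereas real hybridization also depends on temperature, probe length and partial mismatches. The helper names, spot names and sequences are hypothetical; the probe TACTGAA is taken from the DNA Strand slide.

```python
# Watson-Crick base pairs, as listed on the DNA slide (A-T, G-C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand of seq, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe, labeled_cdna):
    """Hypothetical helper: True if the labeled cDNA contains a stretch
    that is perfectly complementary to the probe (no mismatches allowed)."""
    return reverse_complement(probe) in labeled_cdna


# Made-up probes spotted on the chip, one per gene.
spots = {"geneX": "TACTGAA", "geneY": "GGCATCC"}

# Made-up labeled cDNA from the sample; it carries the complement of geneX.
labeled_cdna = "CCTTCAGTACC"

lit_spots = [name for name, probe in spots.items() if hybridizes(probe, labeled_cdna)]

print(reverse_complement("TACTGAA"))  # TTCAGTA
print(lit_spots)                      # ['geneX']
```

Only the geneX spot would light up in this toy case, because the sample contains the 5' to 3' complement of its probe and nothing complementary to the geneY probe.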
Over 10,000 Hybridizations Could Be Done at One Time

Several Types of Arrays
• Spotted DNA arrays
  • Developed by Pat Brown’s lab at Stanford
  • PCR products of full-length genes (>100 nt)
• Affymetrix gene chips
  • Photolithography technology from the computer industry allows building many 25-mers
• Ink-jet microarrays from Agilent
  • 25-60-mers “printed” directly on glass slides
  • Flexible, rapid, but expensive

Array Fabrication: Spotting
• Use PCR to amplify DNA
• Robotic "pen" deposits DNA at defined coordinates
• Approximately 1-10 ng per spot
• Experimentation with oligos (40, 70 bp)
• This machine can make 48 microarrays simultaneously.

Array Fabrication: Photolithography
• Light-activated synthesis
• Synthesize oligonucleotides on glass slides
• 10^7 copies per oligo in a 24 x 24 um square
• Use 20 pairs of different 25-mers per gene
• Perfect match and mismatch

Affymetrix Microarrays
(Figure: raw image of a 1.28 cm chip with 50 um features.)
• ~10^7 oligonucleotides; half perfectly match the mRNA (PM), half have one mismatch (MM)
• Raw gene expression is the intensity difference: PM - MM

Agilent cDNA Microarray and Oligonucleotide Microarray
• Agilent delivers printed 60-mer microarrays in addition to 25-mer formats.
• The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.

The Workflow of Microarray
• Array: plate preparation -> plate -> array fabrication -> array
• Sample: sample -> RNA extraction -> cDNA synthesis and labeling -> labeled cDNA
• Array + labeled cDNA -> hybridization -> hybridized array -> scanning

cDNA Synthesis and Direct Labeling
• Cy3 and Cy5

cDNA Hybridization onto the Chip
• e.g. treatment / control, normal / tumor tissue

Sample Loading
1. Loading from the corner of the cover slip: it is time consuming and easily produces bubbles.
2.
Loading the sample at the center of the array, then putting the slip on smoothly: faster, and with a lower chance of producing bubbles than the previous method.
3. Loading the sample at the side of the array, then putting the slip on: the solution attaches to the slip as soon as the slip touches it, and spreads with the slip as we slowly lower it.

Scan
• Green: down-regulated
• Red: up-regulated
• Yellow: equal level

Image Analysis
• To find a spot
• Convert feature into numeric data
• Image normalization

The Algorithms
1. Find spots: finds the location of each spot on the microarray.
2. Cookie cutter algorithm:
   (1) Assume the distribution of pixels vs. intensity is a Gaussian curve
   (2) Use SD or IQR to identify the feature and background of each spot
   (3) Calculate statistics for the pixel population

Interquartile Range (IQR)
(Figure: pixel intensity distribution marked at the 25%, 50% and 75% quantiles, with the IQR and the boundaries for rejection; labels K = IQR/2 and 1.42 IQR.)
(Figure: spot layout showing the feature or cookie of diameter D, the exclusion zone, and the local background.)

Data Quality
• Irregular size or shape
• Saturation
• Spot variance
• Irregular placement
• Background variance
• Low intensity
(Figure: example spots labeled indistinguishable, saturated, bad print, misalignment, artifact.)

Convert Feature Into Numeric Value
G = green channel (Ctrl), R = red channel (Data); int. = intensity; b.g. = background; corr. = background-corrected intensity; R/G = (R b.g.-corrected)/(G b.g.-corrected) = Data / Ctrl; Name = systematic name.

Spot    G int.    G b.g.  G corr.   R int.    R b.g.    R corr.   R/G       Name     Gene function
A_1_1   59358.75  512.92  58845.83  50953.13  1779.913  49173.22  0.835628  YAL003W  translation elongation factor eef1beta
A_1_2   1209.19   512.92  696.271   2522.345  1779.913  742.4323  1.066298  YAR053W  hypothetical protein
A_1_3   1948.2    512.92  1435.28   3100.152  1779.913  1320.239  0.919848  YBL078C  essential for autophagy
A_1_4   4940.806  512.92  4427.886  6670.604  1779.913  4890.691  1.104521  YAL008W  protein of unknown function
A_1_5   1485.59   512.92  972.671   2916.086  1779.913  1136.173  1.168096  YAR062W  putative pseudogene
A_1_6   32642.03  512.92  32129.11  42304.13  1779.913  40524.22  1.261293  YBL087C  60s large subunit ribosomal protein l23.e
A_1_7   6919.441  512.92  6406.521  8540.246  1779.913  6760.333  1.055227  YAL014C
A_1_8   2698.301  512.92  2185.382  4314.47   1779.913  2534.557  1.159778  YAR068W  strong similarity to hypothetical protein yhr214w
A_1_9   7167.958  512.92  6655.038  7379.286  1779.913  5599.373  0.841374  YBL100C  questionable orf
A_1_10  5470.062  512.92  4957.142  6953.799  1779.913  5173.886  1.043724  YAL025C  nuclear viral propagation protein
A_1_11  27879.49  512.92  27366.57  33746.9   1779.913  31966.99  1.168103  YBL002W  histone h2b.2
A_1_12  2589.613  512.92  2076.693  4385.568  1779.913  2605.655  1.254713  YBL107C  hypothetical protein
A_1_13  6196.245  512.92  5683.326  8840.475  1779.913  7060.562  1.242329  YDR044W  coproporphyrinogen iii oxidase
A_1_14  34737.1   512.92  34224.18  36129.62  1779.913  34349.7   1.003668  YDR134C  strong similarity to flo1p, flo5p, flo9p and ylr110
A_1_15  34035.35  512.92  33522.43  27128.53  1779.913  25348.62  0.756169  YDR233C  similarity to hypothetical protein ydl204w
A_1_16  1638.381  512.92  1125.461  2988.042  1779.913  1208.129  1.073453  YDR048C  questionable orf
A_1_17  3873.718  512.92  3360.799  4955.141  1779.913  3175.228  0.944784  YDR139C  ubiquitin-like protein
A_1_18  2433.625  512.92  1920.706  3502.406  1779.913  1722.493  0.896802  YDR252W  strong similarity to egd1p and to human btf3 pro
A_1_19  1800.736  512.92  1287.816  3011.855  1779.913  1231.942  0.956613  YDR053W  questionable orf
A_1_20  1296.689  512.92  783.77    2636.549  1779.913  856.6356  1.092968  YDR149C  questionable orf
A_1_21  3453.24   512.92  2940.32   4968.026  1779.913  3188.113  1.084274  YDR260C  hypothetical protein
A_1_22  10731.55  512.92  10218.63  9307.246  1779.913  7527.333  0.736629  YDR056C  hypothetical protein
A_1_23  6191.309  512.92  5678.39   8808.398  1779.913  7028.485  1.23776   YDR152W  weak similarity to c.elegans hypothetical protein
A_1_24  3589.998  512.92  3077.078  4420.744  1779.913  2640.831  0.858227  YDR269C  questionable orf
A_1_25  27568.34  512.92  27055.42  20856.2   1779.913  19076.29  0.705082  YGL189C  40s small subunit ribosomal protein s26e.c7
A_1_26  1956.182  512.92  1443.262  3150.716  1779.913  1370.803  0.949795  YGL261C  strong similarity to members of the srp1/tip1 fa

Data Normalization
Normalize data to correct for variances:
• Dye bias
• Location bias
• Intensity bias
• Pin bias
• Slide bias
• Control vs. non-control spots

Data Normalization
(Figure, left: uncalibrated, red light under-detected. Right: calibrated, red and green equally detected.)

Data Normalization Assumptions
• Overall mean ratio should be 1
• Most genes are not differentially expressed
• Total intensities of the two dyes are equivalent

Intensity-Dependent Normalization
(Figure: data before and after normalization.)

Additional Normalization: Pin Dependent
• Similar to the intensity-dependent fit
• Compute individual lowess fits for each pin group

Within-Slide Normalization
• After pin-dependent normalization, log ratios for each pin are centered around 0
• Scale the variance for each pin
• Uses MAD (median absolute deviation)

Additional Normalization: Dye Swap
• Combine relative expression levels without explicit normalization
• Compute lowess fit for log2(RR’/GG’)/2 vs.
log2(A + A’)/2
• Normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction

Data Analysis
• Data filtering
• Fold change analysis
• Classification
• Clustering
• Future direction

Microarray Data Classification
• Microarray chips -> images scanned by laser -> datasets -> data mining and analysis
• New sample -> prediction: AML

Gene            Value
D26528_at       193
D26561_cds1_at  -70
D26561_cds2_at  144
D26561_cds3_at  33
D26579_at       318
D26598_at       1764
D26599_at       1537
D26600_at       1204
D28114_at       707

Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…

The Threshold of Spots
Filtering - remove genes with insufficient variation
• Remove insufficient spots: saturated, non-uniform, too high background…
• Remove extreme signals: e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
• Statistical filtering (e.g. p-value < 0.01)
• Reasons: biological; feature reduction for algorithms

Microarray Data Analysis Types
• Differential gene expression
  • Fold change analysis
• Classification (supervised)
  • identify disease
  • predict outcome / select best treatment
• Clustering (unsupervised)
  • find new biological classes / refine existing ones
  • exploration
• …

Differential Gene Expression
• n-fold change
  • n typically >= 2
  • May hold no biological relevance
  • Often too restrictive
• 2σ expression
  • Calculate the standard deviation σ
  • Genes with expression more than 2σ away are differentially expressed

Fold Changes - Scatter Plot
(Figure: log-scale scatter plot of 72 (raw) against 72 (control).)

Fold Changes Table
Each entry gives the description, the GenBank accession No., and the fold change at 6 h, 24 h, 48 h and 72 h.
Group 1
• caspase 10, apoptosis-related cysteine protease (U60519): 6 h -; 24 h -; 48 h -; 72 h 0.471
• CASP8 and FADD-like apoptosis regulator (U97075): 6 h -; 24 h -; 48 h -; 72 h 0.355
• nucleoside diphosphate kinase type 6 (inhibitor of p53-induced apoptosis-alpha) (AF051941): 6 h -; 24 h -; 48 h -; 72 h 0.376
Group 2
• caspase 3, apoptosis-related cysteine protease (U13738): 6 h -; 24 h 2.301; 48 h -; 72 h -
• CASP8 and FADD-like apoptosis regulator (AF005775): 6 h -; 24 h 2.272; 48 h -; 72 h -
Group 3
• caspase 9, apoptosis-related cysteine protease (U60521): 6 h -; 24 h -; 48 h 2.519; 72 h -
Group 4
• caspase 4, apoptosis-related cysteine protease (Z48810): 6 h 2.615; 24 h -; 48 h 2.796; 72 h 2.819
Group 5
• inhibitor of apoptosis protein (AAF19819): 6 h -; 24 h -; 48 h -; 72 h 5.249
• caspase 7, apoptosis-related cysteine protease (U67319): 6 h -; 24 h -; 48 h -; 72 h 2.19
• caspase 4, apoptosis-related cysteine protease (U28976): 6 h -; 24 h -; 48 h -; 72 h 2.603
Group 6
• CASP8 and FADD-like apoptosis regulator (AF015450): 6 h -; 24 h -; 48 h -; 72 h 6.912

Classification: Multi-Class
Similar approach:
• select the top genes most correlated to each class
• select the best subset using cross-validation
• build a single model separating all classes
Advanced:
• build a separate model for each class vs. the rest
• choose the model making the strongest prediction

Popular Classification Methods
• Decision trees/rules - find smallest gene sets, but also false positives
• Neural nets - work well if the number of genes is reduced
• SVM - good accuracy, does its own gene selection, hard to understand
• K-nearest neighbor - robust for a small number of genes
• Bayesian nets - simple, robust

Multi-class Data Example
• Brain data, Pomeroy et al. 2002, Nature (415), Jan 2002
• 42 examples, about 7,000 genes, 5 classes
• Selected the top 100 genes most correlated to each class
• Selected the best subset by testing 1, 2, …, 20 gene subsets, with leave-one-out cross-validation for each

Classification – Other Applications
• Combining clinical and genetic data
• Outcome / treatment prediction
• Age, sex, and stage of disease are useful
  e.g.
if the data are from a male, it is not ovarian cancer

Clustering Goals
• Find natural classes in the data
• Identify new classes / gene correlations
• Refine existing taxonomies
• Support biological analysis / discovery
• Different methods: hierarchical clustering, SOMs, etc.

SOM Clustering
• SOM - self-organizing maps
• Preprocessing
  • filter away genes with insufficient biological variation
  • normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
• Run the SOM for many iterations
• Plot the results

SOM & K-Means by GeneSpring

Hierarchical Clustering
• The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method, which works with the data in a bottom-up manner.
• Initially, each data point forms its own cluster, and the algorithm works through the cluster sets by repeatedly merging the two that are the most similar or have the shortest distance.
• The algorithm involves computing the distance or similarity matrix, which has O(N^2) complexity, and thus is not very efficient.

Hierarchical Clustering: Genomic Reprogramming in Response to Oxidant
(Figure: clustered expression of 6218 genes at 0, 10, 20, 40, 60 and 120 minutes; color scale from fold repression >9, >6, >3 through 1:1 to fold induction >3, >6, >9.)
• One-third of genome expression is transiently reprogrammed

Future Directions
• Algorithms optimized for small samples (the no. of samples will remain small for many tasks)
• Integration with other data
  • biological networks
  • medical text
  • protein data
• Cost-sensitive classification algorithms
  • error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
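
Returning to the agglomerative procedure described under Hierarchical Clustering, the bottom-up merging can be sketched in a few lines. This is a minimal illustration, not the method used by any particular package: it assumes Euclidean distance and single linkage (the slides do not say which distance or linkage is meant), and the function names and toy profiles are made up for the example.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(profiles, n_clusters):
    """Naive bottom-up clustering with single linkage (an assumed choice).

    profiles: list of expression profiles (lists of floats), one per gene.
    Returns a list of clusters, each a sorted list of profile indices.
    """
    n = len(profiles)
    # Pairwise distance matrix: the O(N^2) step mentioned in the slides.
    dist = [[euclidean(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    # Initially, every data point forms its own cluster.
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters with the shortest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair, then repeat.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy log-ratio profiles over three time points (made-up values).
toy = [
    [2.0, 2.1, 1.9],     # gene 0: induced
    [1.8, 2.2, 2.0],     # gene 1: induced
    [-2.0, -1.9, -2.1],  # gene 2: repressed
    [-2.2, -2.0, -1.8],  # gene 3: repressed
    [0.0, 0.1, -0.1],    # gene 4: unchanged
]

result = agglomerative(toy, 3)
print(result)  # [[0, 1], [2, 3], [4]]
```

On the toy data the two induced genes merge first, then the two repressed genes, leaving the unchanged gene in its own cluster. The search for the closest pair is repeated at every merge, which is why this naive version scales poorly to thousands of genes.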
Integrate biological knowledge when analyzing microarray data (from Cheng Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25

Discussion

Microarray Potential Applications
• Biological discovery
  • new and better molecular diagnostics
  • new molecular targets for therapy
  • finding and refining biological pathways
• Mutation and polymorphism detection
• Recent examples
  • molecular diagnosis of leukemia, breast cancer, ...
  • appropriate treatment for genetic signature
  • potential new drug targets

Microarray Limitations
• Cross-hybridization of sequences with high identity
• Chip-to-chip variation
• True measure of abundance?
• Do mRNA levels reflect protein levels?
• Generally, do not “prove” new biology - simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.
• What fold change has biological relevance?
• Need cloned EST or some sequence knowledge -- rare messages may be undetected
• Expensive!! Not every lab can afford to repeat experiments.
• The real limitation is bioinformatics

Additional Information
Review papers on microarray
• Genomics, gene expression and DNA arrays (Nature, June 2000)
• Microarray - technology review (Nature Cell Biology, Aug. 2001)
• Magic of Microarray (Scientific American, Feb. 2002)
Molecular biology tutorial
• http://www.lsic.ucla.edu/ls3/tutorials/
Biological data retrieval systems: Entrez
http://www.ncbi.nlm.nih.gov/Database/index.html
1. A retrieval system for searching a number of inter-connected databases at the NCBI.
It provides access to:
• PubMed: the biomedical literature (Medline)
• GenBank: nucleotide sequence database
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: gene expression and microarray datasets
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• SNP: single nucleotide polymorphisms
• CDD: conserved domains
2. Entrez allows users to perform various searches.