Analysis of microarray data

Introduction
• Microarrays are chips which measure whether genes are switched on or off in cells.
• They can be used to detect sets of genes responsible for genetic diseases such as cancer.
• This lecture:
– introduce microarray technology
– discuss a few applications
– introduce statistical and computational techniques for analysing microarray data

Gene expression
• All cells in an organism have the same genomic DNA.
• Distinct cellular identities are due to differences in gene expression (= transcription & translation of a gene).
• Whether a gene is transcribed is often determined by the presence/absence of other genes' products (esp. proteins) …
• … so genes interact in complex networks: gene A switches on B, which turns off C, which upregulates (increases) A, …
• Hence perturbations to a single gene can lead to changes in the expression of many genes.

Functional genomics
• Next step after sequencing of the human genome: understand the connection between DNA sequence & the phenotypic (observable) characteristics of the organism.
• This is complex, because proteins and genes act in highly connected networks and signalling pathways in an orchestrated manner.
• Traditionally molecular biology has worked on a “one gene, one function” basis & experiments tend to study the effects of a single gene/ a few genes at a time, but...

Microarray chips
• …microarrays can measure many genes at once.
• Microarray chips are commonly glass slides with a matrix of spots printed (using eg. dot matrix technology) on to them.
• A spot contains millions of identical molecules of DNA or oligonucleotide (the probes), which will bind a specific DNA sequence, such as the cDNA of a gene.
• The glass slides can contain 1000s of spots, each recognising a different sequence, eg. one spot for every gene in the human genome.

Microarray experiments
• Since almost all mRNA is translated into protein, the total mRNA of a cell ≈ the genes expressed.
• Mash up cells and extract mRNA.
• Reverse transcribe RNA → cDNA (can be heated to make it single-stranded).
• Label cDNA from reference cells green (Cy3) and cDNA from target cells red (Cy5).
• Hybridise both samples, reference and target, to a single microarray chip (wash on equal amounts of the target & reference samples & allow them to bind to probes which have complementary bases).

Results of microarray experiments
• The spot for gene 1 is:
– red if more mRNA 1 in target cells
– green if more mRNA 1 in reference cells
– yellow if same in both
• Actually, images of red & green fluorescence are taken separately using a laser & scanner, & their intensities are measured using image software.
• Data often expressed as a matrix of relative expression levels = (intensity red)/(intensity green), indexed by genes and target samples.

Microarray data
Red (Cy5) and green (Cy3) images are overlaid; each spot corresponds to a gene.

Microarray data
• Reason for using relative intensities: the process of printing spots on to chips does not give a reliable fixed number of molecules per spot, so the intensity measurements (which correspond to the amount of bound sample cDNA) represent not only the level of expression of the gene, but also the peculiarities of the chip.
• There are some disadvantages to not having the absolute gene expression values: eg. confidence limits on the microarray measurement depend heavily on the actual values.

Principal uses of chips
• Genome-scale expression analysis
– Differentiation
– Response to environmental factors
– Disease states
– Effect of drugs
• Detection of sequence variation

Applications of microarrays - yeast
• The fact that we can only reliably measure relative gene expression means that microarrays tend to be used for comparative experiments.
• Eg. “what changes in gene expression arise when yeast is in anaerobic v. aerobic conditions?” - DeRisi et al. (1997) Science, v. 278, pp680-686
• Spot arrays with DNA complementary to all genes from the yeast genome (the probes).
• Approx.
6400 probes

Applications of microarrays - yeast
• Reverse transcribe mRNA from yeast cells harvested at various time points as conditions are varied from anaerobic to aerobic (start fermentation in sugary solution and allow the yeast to deplete the sugar).
• 7 time points (2 hr intervals, the first at 9 hrs after the yeast are placed in the sugary medium)
• Let the sample from the first time point be the “reference” (totally anaerobic, lots of sugar).
• Label reference cDNA with green dye (Cy3) and the other sample cDNA (later time points) with red dye (Cy5).

Applications of microarrays - yeast
• Hybridise a mixture of equal quantities of the reference sample and one of the later-time samples to a microarray chip: one experiment/ chip per timepoint (also do timepoint 1 against itself as a control test).
• Take images of red and green fluorescence, measure intensities, process (details of this later in lecture) and create a matrix, M, with entries $M_{ij}$ = (intensity red)/(intensity green) at the spot representing gene i in the chip containing sample j (the jth timepoint).

Applications of microarrays - yeast
• Look for genes that are differentially expressed in aerobic and anaerobic conditions.
• Find that when the sample at the initial timepoint is compared to itself, there is 99% correlation between intensity values.
• Timepoint 1 v. timepoint 2: 95% of genes have < 1.5-fold difference in expression; correlation of 98% between the data at the 2 timepoints.
• Timepoint 1 v. timepoint 7: c. 1700 genes out of 6400 had > 2-fold difference in expression; some genes had a much higher ratio.
• Authors could infer properties of the signalling pathways involved in the shift in metabolism.

Applications of microarrays - cancer
• Take a set of patients with a certain type of cancer and a set of control patients with no cancer. Take cells from the tumour in cancer patients/ from the region where the tumour would be in control patients. Extract mRNA, make cDNA and dye one of the samples from a control patient green; all other samples red.
• Make/ buy a chip with human genes: as many as possible/ those thought to be relevant for cancer.
• Hybridise a mixture of the reference sample (green) and one of the other target samples (red) to each chip.

Applications of microarrays - cancer
• Process the data and analyse it statistically to find genes which have significantly higher/ lower expression in cancer cells than in normal cells. These genes are likely to be important in causing cancer/ the effects of cancer.
• Can also cluster data to discover different subclasses of cancer, eg. Alizadeh et al. (2000) Nature, v. 403, pp503-511
• A cancer of the immune cells (lymphoma) is clinically diverse: 40% of patients respond well to therapy and have good survival. Authors used hierarchical clustering (see later) to discover two new subclasses of the cancer, classified based on gene expression profiles.

Applications of microarrays - cancer
• Thinking of the relative gene expression values (in fact intensities) of each sample (patient) as a vector, the authors were able to cluster the data.
• Microarray profiling of tumours can be used to classify tumours into subclasses (with eg. survival implications) of already known tumour types.

Different kinds of microarray
• cDNA versus oligonucleotide
• We have so far discussed gene expression microarrays, but there are also:
– Sequencing chips: contain as probes all possible sequences of a given length k (typ. k = 8-10 bases). Mark the target sample with fluorescent dye and hybridise. The spots with fluorescence are where the target bound. The corresponding sequences are part of the target spectrum (= set of k-base sequences in the target). Then use computers to assemble the whole sequence. The target cannot be too long (eg. 150-200 bp if k = 8).
– Can be used for looking for gene mutations/ polymorphisms.

Analysis of microarray data
• Data is a matrix, $M_{ij}$, of (absolute or, usually,) relative expression values of gene i in condition j. Often presented as log2 values, since this means that downregulation of a gene (eg.
ratio ½) is not squashed into the interval (0,1), but takes values (eg. –1) in (−∞, 0).
• Pre-processing: There are several sources of variation in intensity in microarray experiments other than differences in gene expression between samples. These are thought of as noise and we want to remove them by pre-processing. First subtract the background intensity, which is due to binding to the wrong spot, etc. (this is usually done by the image processing software).

Analysis of microarray data
• Normalization: Another source of noise is differences in the labelling and detection efficiencies of the fluorescent labels and in the amount of RNA between the 2 samples (red/green). Normalization tries to get rid of this by dividing all the ratios by an appropriate constant to make the mean or the median of the ratios = 1 (mean/ median centring respectively). If the data is in log form, simply subtract a constant.
• The assumption is that, on average across all (or a chosen subset of) genes, the levels of mRNA produced will be the same in the two samples.
• Alternatively use a scatter plot of intensity green v. red & normalize to make the slope = 1.

Analysis of microarray data
• Normalized data: $\log_2(R/G) - c = \log_2\big(R/(Gk)\big)$, where R & G are the red & green intensities minus the respective backgrounds and $c = \log_2 k$ is the normalization constant.
• Filtering: This is the process of working out which genes are differentially expressed across the different conditions (eg. timepoints of the yeast experiment or cancer v. non-cancer) and removing from the dataset those genes which don’t vary. We will discuss this in detail later.

Analysis of microarray data
• Clustering: If you view the expression values of a single gene across different samples (rows of the expression matrix) as a vector, then the genes can be clustered based on the similarity of the vectors. Likewise, using the columns of the matrix, the samples can be clustered. This helps eg.
to classify cancers/ find genes which are in the same network as each other or have similar functions.

Conclusions
• We have described microarray chips for analysing gene expression.
• We have mentioned three key areas of analysis:
– Normalization
– Filtering
– Clustering
• In the next session we will cover statistical methods necessary for filtering microarray data.
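
Sketch: normalization and fold-change filtering
The pre-processing and normalization steps described under "Analysis of microarray data" can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure of any particular image-analysis package: the function names, the toy intensities and the `floor` guard against non-positive background-subtracted values are all invented for the example. The 2-fold threshold simply echoes the yeast example; the statistical filtering methods of the next session would normally replace it.

```python
import math
from statistics import median


def normalize_chip(red, green, red_bg=0.0, green_bg=0.0, floor=1.0):
    """Return median-centred log2(R/G) values for the spots on one chip."""
    log_ratios = []
    for r, g in zip(red, green):
        # R and G are the intensities minus the respective backgrounds.
        # 'floor' is an assumed guard so we never take the log of a value <= 0.
        R = max(r - red_bg, floor)
        G = max(g - green_bg, floor)
        log_ratios.append(math.log2(R / G))
    # c = log2(k): subtracting the median log ratio makes the median ratio 1.
    c = median(log_ratios)
    return [x - c for x in log_ratios]


def differentially_expressed(gene_row, fold=2.0):
    """True if the gene changes by more than `fold` (up or down) in any sample."""
    return any(abs(x) > math.log2(fold) for x in gene_row)


# Toy chip with 5 spots (hypothetical raw intensities).
red = [250.0, 150.0, 450.0, 100.0, 250.0]
green = [120.0, 120.0, 120.0, 120.0, 120.0]
m = normalize_chip(red, green, red_bg=50.0, green_bg=20.0)
print(m)  # [0.0, -1.0, 1.0, -2.0, 0.0]
print([differentially_expressed([x]) for x in m])
```

In this toy chip the red channel is systematically twice as bright as the green one (k = 2, so c = 1). After median centring, only the fourth spot (a 4-fold decrease) exceeds the 2-fold threshold.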
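
Sketch: clustering samples by expression profile
The clustering idea (treat each column of the expression matrix as a vector and group similar vectors) can be sketched as follows. This is an illustrative toy and not the method used by Alizadeh et al.: the choice of distance (1 minus Pearson correlation), the average-linkage rule, the function names and the data are all assumptions made for the example. Profiles with zero variance are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def distance(x, y):
    """Assumed dissimilarity: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all pairs drawn from the two clusters.
                total = sum(distance(profiles[i], profiles[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters and continue.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical log2 ratios: one list per sample (column), one entry per gene.
samples = [
    [2.0, 1.0, -1.0, -2.0],
    [1.8, 1.2, -0.9, -2.1],
    [-2.0, -1.0, 1.0, 2.0],
    [-1.5, -1.2, 0.8, 2.2],
]
print(hierarchical_clusters(samples, 2))  # [[0, 1], [2, 3]]
```

The same function clusters genes instead of samples if it is given the rows of the matrix rather than the columns.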