Cloning genomic DNA

If you want to understand how genes are regulated and how they have evolved you need to be able to examine their structure. cDNA can only tell you so much about a gene: an individual cDNA clone is derived from a processed RNA transcript. Compared with the whole gene, it will have lost any upstream regulatory regions and it will have lost the introns. So to study genes in a complete way you need to be able to make clones containing genomic DNA and, from your library of clones representing the whole genome, you must be able to identify those clones which are derived from the gene of interest. As with all molecular biology techniques, the techniques of library construction have evolved during the past twenty years.

Principles

The key principles are:

1. Ensure that the library is representative of the entire genome, i.e. any point in the genome should have an equal chance of appearing in the library compared with any other point. This is not always as easy to accomplish as you might think.
2. Make sure that the library contains enough cloned DNA so that the chance of any point in the genome being contained within at least one of the clones is high.

Subsidiary to this, but still important:

3. Individual clones should contain sufficiently large inserts so that the number of clones needed to ensure that point 2 is satisfied is not so great that the library size becomes too big to handle conveniently.

It goes without saying (I hope) that the vector chosen, as well as helping to satisfy the conditions above, should be able to propagate clones stably in the chosen host so that each cloned fragment of genomic DNA is a true copy of the piece of genome from which it originated and is not rearranged (duplicated, deleted, reshuffled or otherwise mutated) in any way.

How many clones are needed to make a representative library?
Even if every piece of the genome is equally “clonable” (which is far from true and will be discussed below), no library is guaranteed to contain every point in the genome. This is a matter of statistics. The formula which relates the number of clones needed in the library to the size of the clone insert and the size of the target genome, for any given chance that any point in the genome will appear in the library at least once, is:

$$N = \frac{\ln(1 - p)}{\ln\left(1 - \frac{i}{g}\right)}$$

where
N = number of clones in the library,
p = probability that any point in the genome will occur at least once in the library,
i = insert size,
g = genome size.

If this is not easy to envisage, imagine a very large number of necklaces each containing 100 beads numbered from 1 to 100. If we snap all the strings and pour the beads into a box, how many individual beads will we have to pull blindfold from the box to be able to completely reassemble at least one necklace? If the number of necklaces really is “very large” (as it would be if each necklace represents one complete genome isolated from one cell) then we can never be certain that we will pull one copy of each bead. However, the more beads we take the greater the chance that we will in the end have one copy of each number.

How does this work in practice? Several different vectors have been used as methods have evolved. This little spreadsheet shows N calculated for different vectors given various desired values of p and a genome size of 3 × 10^9 bp, which is the approximate size of the human or mouse genome.

genome size = 3000000000 base pairs

                         p
vector   insert size     0.5       0.9       0.99      0.999
lambda   20000           103972    345387    690773    1036160
cosmid   45000           46209     153505    307009    460514
YAC      1000000         2079      6907      13813     20720
BAC      500000          4159      13814     27629     41443

In practice, these are typical values:

Vector   Maximum       Approx. no. of     Advantages                                                    Disadvantages
         insert size   clones required
                       in library
lambda   20 kb         5 × 10^5           easy to construct libraries, relatively stable inserts        many clones required, hard to prepare DNA from clones
cosmid   45 kb         2 × 10^5           easy to construct libraries and to prepare DNA from clones    not always stable
YAC      1 Mb          10^4               few clones required                                           very prone to rearrangement, difficult to construct
PAC      ~120 kb       10^5               fewer clones required than for cosmids, stable                single copy origin of replication therefore harder to prepare DNA
BAC      > 500 kb      2 × 10^4           few clones required, very stable                              single copy origin of replication therefore harder to prepare DNA

Bacteriophage lambda was originally used as a genomic cloning vector because 20 kb of its genome, containing the genes required for lysogeny, could be replaced by insert DNA from another species. It is now only used for genomic library building in exceptional circumstances. Cosmids were popular in the 1980s and early 1990s for genomic cloning. They are a hybrid vector, mostly plasmid with the lambda cohesive end incorporated. This is used to give a very high efficiency of bacterial transformation because the recombinant DNA molecules can be packaged into lambda protein coats which then will infect bacteria. YACs, yeast artificial chromosomes, were designed to propagate in yeast; because of problems with genomic instability they are no longer much used. PACs are another hybrid vector, part plasmid and part phage P1, and BACs are based around the F′ origin of replication. The book Analysis of Genes and Genomes by Richard Reece (Wiley 2004) has an excellent chapter outlining all these vectors which I recommend that you read.

Why are some pieces of genomic DNA more difficult to clone than others?

Some DNA sequences which form normal parts of eukaryotic genomes seem to be effectively unclonable in bacteria.
Some sequences are simply very AT rich; others may contain long inverted repeat sequences which can form stable cruciform structures and which may interfere with the supercoiling of plasmids containing them. Thus the plasmids cannot be maintained, the antibiotic resistance is lost and the host dies. Some sequences are simply “poison”. Possibly they can accidentally code for a polypeptide with detrimental effects; sometimes they accidentally act as promoters driving unwanted transcription into the vector. Modern vectors contain powerful “terminator” sequences flanking the cloning site which prevent insert driven transcription escaping from the cloning site.

A DNA sequence does not have to be absolutely unclonable to be effectively unclonable. If a sequence is present at just 10% of the average frequency in a library and that library is less than 10 deep (that is, it covers the genome fewer than ten times over) then it is likely to be absent.

How do we create the fragments for cloning?

If we want to clone fragments of genomic DNA which are even as small as 20 kb we cannot just cut the genomic DNA to completion with a restriction enzyme. For an average enzyme, the average distance between sites is given by the formula:

$$D = 4^{n}$$

where D = distance in base pairs, n = number of bases in the recognition site, and the base is 4 because there are 4 different bases in DNA. This is shown in this table for the two commonest classes of enzyme.

Number of bases in recognition site    Average distance between sites (bp)
4                                      256
6                                      4096

Neither of these distances is long enough to be useful in a complete digestion.

Two methods have been used to generate suitable large fragments of DNA.

1. Carry out a partial restriction digest, where a small amount of the restriction enzyme is added to the DNA for a short time so that only some sites are cleaved. This can be difficult to control because high molecular weight DNA solutions are very viscous and it is difficult to mix the enzyme with the DNA adequately in the short time available (maybe only 2 min or so) without shearing the DNA by too vigorous stirring.
2. A better method is to mix the restriction enzyme with its corresponding methylase. Then the cleavage and the protective methylation reactions are in a race (see Figure 2). Both enzymes will diffuse into the viscous DNA solution at the same rate, so the proportion of cleaved sites compared to the protected sites remains constant throughout the reaction. The proportion of sites which are cleaved is determined by the relative amounts of the two enzymes, which can be adjusted in trial reactions beforehand.

Figure 1: a partial digestion time course. Genomic DNA has been digested with a small amount of restriction enzyme for from 0 to 25 min. The size markers are digested with HindIII.

Figure 2: competition between a restriction enzyme and its methylase

Restriction sites occur randomly in the genome. But that is not what you want; you would prefer them to occur evenly spaced in the genome. True randomness gives some areas of the genome where the sites for one enzyme are widely spaced and other areas where the sites are tightly clustered. This has the consequence that it requires different digestion conditions to prepare the same length DNA fragments from different regions of the genome. Consequently, a range of conditions is sometimes employed and the resulting digests are mixed before cloning is attempted. This effect has more marked consequences for enzymes with a six base pair site than for those with a four base pair site. So, on the whole, four base pair enzymes such as MboI have been preferred for library construction. (MboI has the added advantage that its sticky end is compatible with the sticky end generated by the six base pair enzyme BamHI, which can be used to prepare the vector.)

No further treatment of the DNA is required after the digestion for lambda and cosmid cloning because the lambda head packaging machinery takes care of the size selection.
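The spacing argument above (D = 4^n, with sites falling at random rather than evenly) can be put into numbers with a minimal sketch. It assumes, purely for illustration, that all four bases are equally frequent and that every position is independent, which real genomes only approximate. The function names are hypothetical and not taken from any library.

```python
def mean_site_spacing(n):
    """Average distance D = 4**n (bp) between sites for an n base recognition site."""
    return 4 ** n


def prob_no_site(n, length_bp):
    """Chance that a stretch of length_bp contains no site at all.

    Assumption: each position starts a site independently with probability 1/D.
    """
    per_position = 1.0 / mean_site_spacing(n)
    return (1.0 - per_position) ** length_bp


for n in (4, 6):
    d = mean_site_spacing(n)
    print(f"{n} base site: mean spacing {d} bp, "
          f"P(no site in {d} bp) = {prob_no_site(n, d):.2f}, "
          f"P(no site in {5 * d} bp) = {prob_no_site(n, 5 * d):.4f}")
```

Under these assumptions a sizeable fraction of gaps is longer than the mean spacing while many others are much shorter, which is the clustering described above. For a six base site the absolute lengths involved are sixteen times larger than for a four base site.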
For YACs or BACs, however, the partial restriction digest is usually size fractionated by pulsed field gel electrophoresis; the region containing the required size of DNA fragments is cut from the gel and the agarose is digested away using the enzyme “β-agarase”. Sometimes, to protect the high molecular weight DNA from breakage by random shearing, the partial restriction digest and the ligation to vector are both carried out before the agarose is removed.

What do we do once the library is made?

A library is formed when a population of recombinant DNA molecules is inserted (by transformation or transfection) into the host E. coli. At this point the experiment will still comprise just a suspension of bacteria in some nutrient broth. Any “clone” will still be represented by just a single bacterium. The suspension will then be plated onto nutrient agar (which in the case of plasmid libraries will contain a selective antibiotic). Colonies (or plaques, but from here on I'll just refer to colonies) form overnight; each will consist of about a million descendants of the original transformant. At this point the library is “unamplified”. If the library could be screened at this point (for instance by making a replica plate and carrying out a “colony hybridisation” using a gene specific probe) then all the signals obtained would result from independent cloning events. (The number of signals would give you some indication of the “depth” of the library.) Usually the library will be plated at a high density onto large square filters (20 cm × 20 cm) placed on nutrient agar plus antibiotic.

It is not easy to store libraries in this unamplified state. They must first be replicated onto fresh filters so that replicas retain the positions of all the colonies. Replica filters can then be stored at -80 °C. Often all the colonies of one replica will be resuspended in a small volume of buffer and 1 ml aliquots frozen at -80 °C.
In this state the library is available for replating if it is desired to screen it. However, there is no longer any guarantee that signals obtained from the replated library will be independent: each colony will have contributed approximately one million genetically identical cells to the suspension, each of which can found a new colony. Also, if two different laboratories each screen the same amplified library with the same probe and each obtain some positive colonies, it will be necessary to test whether they have each obtained exactly the same clone or whether they have obtained overlapping clones which were both originally present in the unamplified library.

To circumvent these problems “gridded” libraries have been introduced. The first stage of library construction is carried out as normal except that the library is plated out at a low colony density so that clones are well separated. (In the old fashioned libraries above, colonies were plated at a very high density to minimise the number of plates which had to be replica plated and to minimise the number of filters which had to be screened.) Then a colony picking robot takes 96 sterile needles and pricks 96 colonies. The 8 × 12 array of needles is used to inoculate small cultures in the wells of a 96 well tray, the needles are sterilised and the robot is then ready to pick another 96 colonies. In this way thousands of individual clones can be picked. (The Sanger Institute has a good information page at http://www.sanger.ac.uk/Teams/Team54/faq.shtml which is worth a visit.) Each clone thus gains an address, its plate, row and column number, e.g. 255A6 would mean tray 255, row A, column 6.

Figure 3: “gridding” a library

The advantages of gridding are several. The trays are easily handled by robots and so replication is straightforward. Trays are easy to store in a deep freeze.
Copies of the same library can be made available to any interested laboratory and information about any individual clone can be placed in a central database. In this way, if two laboratories discover independent pieces of information they will realise that both facts are true of the same clone, for instance that two different genes are in the same clone, which might not have been easy to detect if the two laboratories had probed the same amplified library each with their own unrelated probes.

Figure 3: Colony picking robot at the Sanger Institute. On the right, illuminated by a red lamp, are the nutrient agar plates, in the middle are trays of 70% ethanol for sterilising the pins, on the left are trays containing wells of nutrient broth into which individual colonies will be placed. The pins are in a rack behind the lamp and they can be brought forward individually and positioned very exactly over a single colony.

As time has gone by most labs have upgraded to 384 well (double density) and even to 1536 well (quadruple density) trays. The same 96 pin “hedgehogs” can still be used to transfer between trays; it just takes extra cycles by the robot.

Library screening

Libraries may be screened by one of two methods, colony hybridisation or PCR screening.

1. Colony hybridisation: this is essentially the same as is used to screen a cDNA library. The probe has frequently been a cDNA: a genomic clone for a known gene is identified by screening colonies with the appropriate cDNA clone. The probe will only hybridise to the exons. If the gene contains introns which are bigger than the insert size in the library then you will not be able to identify clones containing them directly. You may need to “walk”. Gridded libraries can be replica plated by robots onto nylon filters so that colonies grow in a very dense array. When the colonies have been lysed and their DNA has stuck to the filter at the site of the colony, they can be probed with a radioactive cDNA probe. Problems may arise if your probe contains a repetitive sequence or if the gene is a member of a gene family, in which case genomic clones for related genes may be found. This is especially a problem if there are “retroposons” of your gene in the genome, which will probably contain most of the cDNA sequence and thus give a much stronger signal than clones derived from the real gene but which contain only a few small exons.

Figure 4 shows a typical array pattern. Each dot represents one colony. Thirty two 96 well trays of colonies have been arrayed here (or more likely, 16 × 96 wells twice, so that each colony is present in duplicate).

Chromosome walking
From an initial clone, “subclone” the end fragments so that each may be used to reprobe the library. The ends may hybridise to genomic clones which overlap and extend the first clone. Identify how the clones overlap, subclone the end fragment of the new clone and continue for as many cycles as may be necessary.
[Diagram labels: original clone; end probe; new clone; original site of hybridisation to cDNA probe]

Figure 4: A typical colony screening filter.

Figure 5: some duplicate signals overlaid with an alignment grid

2. PCR screening: when you already know some DNA sequence in the target gene you can design a PCR assay to ask which genomic clone contains it. It would be time consuming, expensive and foolish to PCR every clone in the library, since this would be thousands of PCR reactions. So it is necessary to subdivide the library into pools which can be used to reveal the coordinates of the positive clones. A library of 20,000 clones will consist of about 200 × 96 well plates. The clones in these can be pooled into groups of 10 plates (see the diagram in Figure 6 of one such “super-pool”). Then, from each super-pool of 10 plates, 10 individual plate pools, 12 column pools and 8 row pools can be made. In the figure this corresponds to 10 yellow pools, 12 red pools and 8 blue pools.
Every clone in the library will thus be found within one plate pool, one row pool and one column pool. The library is screened by first testing just the super-pools (20 PCR reactions plus controls). When a positive signal is obtained in one such pool, its plate / row / column sub-pools are tested (10 + 8 + 12 = 30 PCR reactions). There should be one positive signal in each of these, which together will reveal the coordinates of the positive clone. Finally this can be confirmed with a third round of testing (1 PCR reaction), giving a grand total of about 60 reactions (including controls). This method works well so long as the super-pools are unlikely to contain more than 2 or 3 positive clones each (preferably no more than one). It is also very amenable to robotization and so robots can carry out many such screens simultaneously.

Figure 6: part of a library pooling scheme
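The three rounds of pooled PCR screening described above can be sketched as a small simulation. This is an illustration under stated assumptions, not a description of any particular laboratory's software: the `screen` function and its helpers are hypothetical names, a "PCR" is modelled as a perfect test that is positive whenever a pool contains a target clone, and control reactions are not counted.

```python
from itertools import product

ROWS = "ABCDEFGH"           # 8 rows of a 96 well plate
COLS = range(1, 13)         # 12 columns
PLATES_PER_SUPERPOOL = 10   # group size used in the text


def screen(n_plates, positives):
    """Simulate a pooled PCR screen of a gridded library.

    positives is a set of (plate, row, column) addresses carrying the target.
    Returns (confirmed addresses, number of PCR reactions excluding controls).
    """
    def pcr(pool):
        # Stand-in for a real PCR: positive if any clone in the pool is a target.
        return any(address in positives for address in pool)

    def wells(plates, rows=ROWS, cols=COLS):
        return [(p, r, c) for p in plates for r in rows for c in cols]

    reactions = 0
    confirmed = []
    for first in range(1, n_plates + 1, PLATES_PER_SUPERPOOL):
        plates = range(first, min(first + PLATES_PER_SUPERPOOL, n_plates + 1))
        reactions += 1  # round 1: one reaction per super-pool
        if not pcr(wells(plates)):
            continue
        # Round 2: plate, row and column pools of the positive super-pool.
        hit_plates = [p for p in plates if pcr(wells([p]))]
        hit_rows = [r for r in ROWS if pcr(wells(plates, rows=r))]
        hit_cols = [c for c in COLS if pcr(wells(plates, cols=[c]))]
        reactions += len(plates) + len(ROWS) + len(COLS)
        # Round 3: test every candidate address individually.
        for address in product(hit_plates, hit_rows, hit_cols):
            reactions += 1
            if pcr([address]):
                confirmed.append(address)
    return confirmed, reactions


found, n_reactions = screen(200, {(137, "C", 6)})
print(["%d%s%d" % a for a in found], n_reactions)
```

With a single positive clone the count is 20 + 30 + 1 reactions before controls. If one super-pool holds two positives, the plate, row and column signals combine into up to eight candidate addresses, each needing its own confirmation reaction, which is why the scheme works best when super-pools rarely contain more than one positive.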