Organization of the human genome
Human genome = nuclear genome + mitochondrial genome
Mitochondrial genome: 16,569 bp, 37 genes
Nuclear genome: 24 chromosomes (haploid), 3200 Mbp, ~30,000 genes
Human Mitochondrial Genome
• Small (16.5 kb) circular DNA
• rRNA, tRNA and protein-encoding genes (37 in total); 1 gene/0.45 kb
• Very few repeats
• No introns; 93% coding
• Genes are transcribed as multigenic (polycistronic) transcripts
• Recombination not evident
• Maternal inheritance
H strand is enriched in G; L strand is enriched in C
7S DNA – short repetitive segment of H strand attached to L strand (abortive replication); an element of the triple-stranded DNA structure
What are the mitochondrial genes?
• 24 of 37 genes are RNA coding
– 22 mt tRNAs
– 2 mt ribosomal RNAs (12S, 16S)
• 13 of 37 genes are protein coding (synthesized on ribosomes inside mitochondria):
some subunits of respiratory complexes and oxidative phosphorylation enzymes
Limited autonomy of mitochondria
Component                  mt encoded     Nuclear encoded
NADH dehydrogenase         7 subunits     >41 subunits
Succinate CoQ reductase    0 subunits     4 subunits
Cytochrome b-c1 complex    1 subunit      10 subunits
Cytochrome c oxidase       3 subunits     10 subunits
ATP synthase complex       2 subunits     14 subunits
tRNA components            22 tRNAs       none
rRNA components            2 components   none
Ribosomal proteins         none           ~80
Other mt proteins          none           mtDNA pol, RNA pol, etc.
Two overlapping genes are encoded by the same strand of mtDNA (unique example)
Two independent AUG codons lie in different reading frames relative to each other; the second stop codon is formed from TA + A (the A comes from the poly-A tail)
Mitochondrial codon table
22 tRNAs cover the 60 sense codons via third-base wobble
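The figure of 60 codons can be checked with a short sketch. It assumes the standard genetic code and the well-known vertebrate mitochondrial reassignments (UGA = Trp, AUA = Met, AGA/AGG = stop); the variable names are illustrative and not taken from the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard nuclear code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
            "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

codons = ["".join(c) for c in product(BASES, repeat=3)]
nuclear = dict(zip(codons, STANDARD))

# Vertebrate mitochondrial reassignments relative to the standard code
mito = dict(nuclear)
mito.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})

sense_nuclear = sum(aa != "*" for aa in nuclear.values())
sense_mito = sum(aa != "*" for aa in mito.values())
print(sense_nuclear, sense_mito)  # 61 60
```

With only 22 tRNAs for 60 sense codons, each tRNA has to read roughly two to four codons, which is where the third-base wobble comes in.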
Human Nuclear Genome
• 3200 Mb
• 23 (XX) or 24 (XY) linear chromosomes
• 30,000-35,000 genes; 1 gene/100 kb
• Introns in most of the genes
• 1.5% of DNA is coding
• Genes are transcribed individually
• Repetitive DNA sequences (45%)
• Recombination at least once for each chromosome
• Mendelian inheritance (X + autosomes, paternal Y)
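The two gene-density figures quoted on these slides (1 gene/0.45 kb and 1 gene/100 kb) follow directly from the genome sizes and gene counts. A minimal check, using the rounded slide values (16,569 bp and 37 genes; 3200 Mb and 30,000 genes):

```python
# Genome sizes and gene counts as quoted on the slides
mt_bp, mt_genes = 16_569, 37
nuc_bp, nuc_genes = 3_200_000_000, 30_000

mt_kb_per_gene = mt_bp / mt_genes / 1_000
nuc_kb_per_gene = nuc_bp / nuc_genes / 1_000

print(f"mitochondrial: 1 gene per {mt_kb_per_gene:.2f} kb")   # 0.45 kb
print(f"nuclear:       1 gene per {nuc_kb_per_gene:.0f} kb")  # 107 kb
```

The nuclear genome is therefore a couple of hundred times less gene-dense than the mitochondrial genome.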
Human Genome Organization
From: Dr Finbarr Hayes lec
HUMAN GENOME
• Nuclear genome: 3000 Mb, 65,000-80,000 genes
– 30%: genes and gene-related sequences (unique or moderately repetitive)
   10% coding DNA
   90% noncoding DNA: pseudogenes; gene fragments; introns, untranslated sequences, etc.
– 70%: extragenic DNA
   80% unique or low copy number
   20% moderate to highly repetitive: tandemly repeated or clustered repeats; interspersed repeats
• Mitochondrial genome: 16.6 kb, 37 genes
– Two rRNA genes
– 22 tRNA genes
– 13 polypeptide-encoding genes
Human nuclear genome
Euchromatic portion: 3000 Mb; constitutive heterochromatin: 200 Mb
Heterochromatin is distributed unevenly between chromosomes:
Chr    Total DNA (Mb)    Heterochromatic DNA (Mb)
1      279               30
2      251               3
16     104               15
17     88                3
21     45                11
Gene-poor chromosomes (With extra heterochromatin)
Short arms of acrocentric chromosomes
–13, –14, –15, –21, –22
Part of long arms of chr 1,9,16 Long arm of chromosome Y
Human genome base content
• 41% GC on average; 38% GC for chromosomes 4 and 13; 49% for chromosome 19
• Regions with wide swings in GC content (e.g. from 33.1% to 59.3%)
• GC content correlates with Giemsa staining, and gene density correlates with higher GC content
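Swings in GC content of this kind are usually visualised with a sliding window. The following is a minimal sketch on a toy sequence; the window and step sizes are arbitrary assumptions, and real analyses use windows of tens of kilobases.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window, step):
    """GC fraction in consecutive windows of fixed size."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

# Toy sequence: 20 bases without G/C followed by 20 bases of pure G/C
toy = "AT" * 10 + "GC" * 10
profile = gc_windows(toy, window=20, step=10)
print(profile)  # [0.0, 0.5, 1.0]
```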
Conspicuous depletion of the CpG dinucleotide
• Expected frequency is 0.042 (4.2%)
• Observed frequency is five times lower
The depletion is caused by methylation-dependent mutation: methylated cytosine in CpG deaminates to thymine. CpG islands are found in the regulatory regions of human genes
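The expected CpG frequency above follows from the 41% GC content: if C and G each make up 20.5% of bases, a random dinucleotide is CpG with probability 0.205 x 0.205. The sketch below reproduces that number and adds a commonly used observed/expected ratio for a sequence; the function name is hypothetical and the ratio is offered as one reasonable definition, not as the lecture's own method.

```python
gc_fraction = 0.41                 # genome-wide GC content from the slide
freq_c = freq_g = gc_fraction / 2  # assume C and G are equally frequent
expected_cpg = freq_c * freq_g     # about 0.042
observed_cpg = expected_cpg / 5    # "five times lower", per the slide

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

print(f"expected CpG frequency: {expected_cpg:.3f}")
print(f"observed CpG frequency: {observed_cpg:.4f}")
```

A ratio near 1 (or above) over a few hundred bases is the kind of signal used to flag a CpG island, whereas bulk genomic DNA gives a ratio of roughly 0.2.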
Location of CpG islands in the gene
CpG islands do NOT have a deficit of CpG dinucleotides
REPEATS!!!!
3 Main Components in Eukaryotic Genomes
DNA purified from a human does not reanneal as a simple sigmoidal curve. Instead we see a curve that is the sum of the reannealing curves of many different components
REPEATS
CoT curve is a measure of sequence complexity
NO REPEATS
C0 = the initial concentration of nucleotides; t = time in seconds
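A Cot curve can be sketched with ideal second-order reannealing kinetics, where the single-stranded fraction of one component is 1 / (1 + C0t / C0t½). The three components below (fractions and C0t½ values) are made-up illustrative numbers, not measurements of human DNA.

```python
def fraction_single_stranded(cot, cot_half):
    """Ideal second-order kinetics for one sequence component."""
    return 1.0 / (1.0 + cot / cot_half)

# Hypothetical genome: (fraction of genome, Cot1/2) for
# highly repetitive, moderately repetitive and single-copy DNA
components = [(0.10, 1e-3), (0.35, 1.0), (0.55, 1e3)]

def genome_curve(cot):
    """Sum of the component curves, weighted by genome fraction."""
    return sum(w * fraction_single_stranded(cot, h) for w, h in components)

for cot in (1e-4, 1e-2, 1.0, 1e2, 1e4):
    print(f"Cot = {cot:g}: single-stranded fraction = {genome_curve(cot):.2f}")
```

Each component on its own gives a simple sigmoid on a log(C0t) axis; the weighted sum gives the multi-step curve seen for eukaryotic DNA.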
Human Cot-1 DNA (commercial preparation)
This is human DNA which has been denatured and allowed to reanneal to a C0t value of 1.
The double-stranded component is then purified from the single-stranded component and is supplied commercially. It contains most of the human repetitive DNA but very little "single copy" DNA (unique genes). It is used to suppress background hybridization of complex probes
Satellite DNA is repetitive DNA that could be separated by buoyant density
Equilibrium density gradient centrifugation
Sheared DNA in Cesium Chloride gradient
Satellite DNA
Alpha –satellite (Centromere DNA)
Microsatellites
Minisatellites
Do you still remember what these are? If not, please refer to the previous lectures and to the book
Repetitive DNA
• Moderately repeated DNA
– Tandemly repeated rRNA, tRNA and histone genes (gene products needed in high amounts) – Large duplicated gene families – Mobile DNA (transposons) – to be discussed soon
• Simple-sequence DNA
– Tandemly repeated short sequences – Found in centromeres and telomeres (and others)
= (MINI and MICROSATELLITES)
Human Mobile DNA (transposons)
• Moves within the genome
• LINEs (long interspersed nuclear elements)
– L1, L2, L3: LINEs make up ~21% of human DNA (~1,000,000 copies)
• SINEs (short interspersed nuclear elements)
– Alu is ~10.7% of human DNA (1,200,000 copies)
– MIR, MIR3 are 3% of human DNA (500,000 copies)
• LTR elements (long terminal repeats)
– ERV and MaLR are 8% of human DNA (500,000 copies)
• DNA transposons
– MER1 (Charlie), MER2 (Tigger), others: 2.8% of human DNA (350,000 copies)
TOTAL: approx. 45% of human DNA
RNA or DNA intermediate
• Transposon moves using DNA intermediate • Retrotransposon moves using RNA intermediate
LINEs and ERVs
http://www.hos.ufl.edu/mooreweb/
Long interspersed nuclear elements (LINEs): 20% of genome
RNA binding
also endonuclease
Internal promoter
• LINE1 – active (also many truncated inactive copies)
• LINE2 – inactive
• LINE3 – inactive
LINEs prefer AT-rich euchromatic bands
In everyone’s genome, 60-100 copies of LINE1 are still capable of transposing and may occasionally cause disease by gene disruption
Mechanism of LINE repeat jumps
1. A full-length LINE transcript is generated from the promoter in the 5’ UTR
2. ORF1 and ORF2 are translated into proteins that stay bound to the LINE mRNA
3. The ORF1/ORF2/mRNA complex moves back into the nucleus
4. The product of ORF2 cuts double-stranded DNA
5. The freed 3’ end serves as a primer for reverse transcription of the LINE, starting from its 3’ UTR
ORF2 and ORF1 function
• ORF1 keeps ORF2 and LINE mRNA bound together and brings them back into the nucleus
• ORF2 (endonuclease) cuts dsDNA to provide a free 3’ end as a primer at the LINE 3’ UTR
• ORF2 (reverse transcriptase) makes a cDNA copy of the LINE mRNA, which becomes integrated into chromosomal DNA (as it is attached to it through the freed 3’ end)
TTTT/A is the cleavage site of the ORF2 endonuclease, which is why integration prefers AT-rich regions
LINE replication is not a very efficient process
The reverse transcriptase of LINE elements is a “weak” enzyme (it has low processivity)
Many insertions are truncated (such copies cannot copy themselves further). Most insertions are only 900 bp (instead of 6.1 kb); only 1 in 100 insertions is successful
Illustration to full-size LINEs and their fossil derivates
Short interspersed nuclear elements (SINEs): 13% of genome
• Non-autonomous (no reverse transcriptase)
• 100-400 bp long
• No open reading frames
• Derived from tRNA (transcribed by RNA pol III, so the internal promoter is retained in the copy)
• Share sequences with 3’ ends of LINEs
• Depend on LINE machinery for their movement
Alu elements
• Derived from signal recognition particle 7SL RNA
• Do not share their 3’ end with a LINE
• Internal promoter is active but requires appropriate flanking sequence for activation, so an Alu is active only if it is lucky with its integration site
• Integrate in GC-rich sequences
• Only active SINE in the human genome
Mark A. Batzer and Prescott L.Deininger
As Alu repeats do not have open reading frames, they have to use the reverse transcriptase and endonuclease provided by LINE repeats or other transposons
After integration, Alu copies rapidly mutate at their 24 CpG sites
Alignment of Alu-subfamily consensus sequences. Mark A. Batzer and Prescott L.Deininger
The expansion of Alu-elements in primate lineage
Mark A. Batzer and Prescott L.Deininger
Potential Alu-mediated damage to human genome
Insertional mutagenesis
Alu-mediated unequal recombination
Diseases that are sometimes caused by de novo Alu integration
• Neurofibromatosis (Schwann cell tumors)
• Haemophilia
• Breast cancer
• Apert syndrome (distortions of the head and face and webbing of the hands and feet)
• Cholinesterase deficiency (congenital myasthenic syndrome)
• Complement deficiency (hereditary angioedema)
Diseases that are sometimes caused by Alu-mediated unequal recombination
• Insulin-resistant diabetes type II (insulin receptor)
• Lesch–Nyhan syndrome (overproduction of uric acid leading to a neurologic syndrome)
• Tay–Sachs disease
• Complement component C3 deficiency
• Familial hypercholesterolaemia
• α-thalassaemia
• Several types of cancer, including Ewing sarcoma, breast cancer, acute myelogenous leukaemia
Positive role of Alu repeats in evolution
Insertion of the repeat near a gene may change its expression pattern or its gene structure, or lead to alternatively spliced mRNA isoforms
LTRs contain promoters; Alu repeats contain TF binding sites
Human repeat distribution depends on GC content of integration sites
Alu paradox
• Alu repeats are found in GC-rich (gene-rich) regions more often than in AT-rich regions
• De novo integration of Alu repeats happens in AT-rich areas (as they hijack the ORF2 product of LINEs)
Alu repeats are subject to positive selection
(as they CREATE new genes) by supplying genome segments that are ready to become genes, with promoter-like elements and exon-like boundaries. They are also GC rich themselves, so they transform AT-rich regions into GC-rich ones
LTR transposons
• Any transposon flanked by long terminal repeats
• Include DNA-based transposons and retrotransposons
DNA-based transposons contain a transposase; they are already silent in the human genome and remain only as fossils (Charlie and Tigger types)
Endogenous retroviral sequences (ERVs) contain gag and pol genes
Only HERV-K still looks capable of moving
DNA transposons and retrotransposons
LINE
SINE
Kazazian, Science, Vol 303, Issue 5664
Human RNA genes (non-coding RNA transcripts)
• 3000 RNA genes in the human genome (rough estimate)
• rRNA
• tRNA
• Small nuclear RNA
• Small nucleolar RNA
• SRP RNA
• MicroRNA
• Antisense RNA
• Non-coding gene mRNA isoforms; RNAs from transcribed pseudogenes
Lecturer's note on the 3000 figure: THIS IS NOT TRUE; MY OPINION IS CLOSE TO 100,000
miRNA and antisense RNA are underestimated; “other non-coding RNA” are not represented
rRNA genes (1200 genes)
18S, 5.8S and 28S are encoded by single transcription units;
Located in 5 clusters: Chr. 13,14,15, 21,22
5S is in tandem arrays, largest is on Chr. 1q41-42
All this is to increase a gene dosage
tRNA genes
(497 nuclear genes + 324 putative pseudogenes)
• Humans have fewer tRNA genes than the worm (584), but more than the fly (284)
• The frog X. laevis has thousands of tRNA genes
• The number of tRNA genes correlates with the size of the oocytes: in large oocytes a lot of protein needs to be synthesized simultaneously
• 49 families according to codon recognition (should be 61, one for every coding triplet); the paradox is resolved by codon wobble
• Very rough correlation between tRNA gene number and amino acid frequency in proteins
• 280 out of 497 genes are on Chr. 6, most clustered in the same 4 Mb region; others are also more or less clustered (Chr. 1 and 7)
• All chromosomes carry at least one tRNA gene; Chr. 22 and Y are the exceptions
Representation of aminoacids by human tRNA (examples)
Amino acid       Frequency    Number of tRNAs
Alanine          7.06%        40
Leucine          9.95%        35
Tryptophan       1.30%        7
Valine           6.12%        44
Aspartate        4.78%        10
Cysteine         2.25%        30
Histidine        2.56%        12
Selenocysteine   <0.01%       1
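How rough the correlation is can be gauged from the eight examples in this table. The sketch below computes a plain Pearson coefficient; it treats the "<0.01%" entry for selenocysteine as 0.01, which is an assumption, and eight points are far too few for any firm conclusion.

```python
from math import sqrt

# Values from the table above; selenocysteine "<0.01%" is entered as 0.01
freq = [7.06, 9.95, 1.30, 6.12, 4.78, 2.25, 2.56, 0.01]
trna = [40, 35, 7, 44, 10, 30, 12, 1]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(freq, trna)
print(f"Pearson r = {r:.2f}")
```

Cysteine (2.25%, 30 genes) and aspartate (4.78%, 10 genes) are the obvious outliers that keep the relationship from being tight.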
Small nuclear RNA (snRNA)
• Uridine rich
• Numbered U1, U2, U3, etc.
• Include spliceosomal RNAs U6 (44 genes) and U1 (16 genes)
• Sometimes clustered as very irregular or almost perfect arrays, e.g. the RNU1 locus at 1p36 and RNU2 at 17q21
• For U6 snRNA, 1135 fragmentary/pseudogene sequences have been identified
Small nucleolar RNA (snoRNA)
• Employed in the nucleolus to guide site-specific base modifications in rRNA
• Can also modify U6 RNA
• snoRNA genes are often found in the introns of other genes
• Generally not clustered, except the SNURF-SNRPN unit on 15q, which is possibly involved in Prader-Willi syndrome
• Two classes: C/D box snoRNAs and H/ACA snoRNAs
C/D box: site-specific 2’-O-ribose methylation of rRNA (105-107 sites). H/ACA: site-specific pseudouridylation (95 sites)
SRP RNA (7SL RNA)
The protein export machinery of the endoplasmic reticulum binds a protein-RNA complex (Signal Recognition Particle) that contains 7SL RNA
Four 7SL genes, 500 7SL pseudogenes, and all the Alu repeats, which are derived from the 7SL gene
Micro RNA (miRNA)
• A family of 21–25-nucleotide small RNAs that negatively regulate gene expression at the post-transcriptional level
• Primary transcripts of miRNAs are processed sequentially by two RNase III enzymes, Drosha and Dicer, into a small, imperfect dsRNA duplex (miRNA:miRNA*) that contains the mature miRNA strand plus its complementary strand (miRNA*)
• The RNA-induced silencing complex (RISC) operates through the miRNA:miRNA* duplex and Ago proteins
RNase III enzyme Drosha
This form is exported from the nucleus By Exportin-5
Dicer cleaves microRNAs into their mature form
miRNA incorporated into effector complexes
Elizabeth P Murchison and Gregory J Hannon
miRNA is recognized by the PAZ domain of an Ago protein, and incorporated into RISC
facilitates transfer of miRNAs into RISC.
ss miRNA
Depending on RISC components, RISC may target homologous mRNA for cleavage, or stall mRNA translation
Non-coding mRNA with poly(A) tail transcribed by RNA pol II
• Mid-to-large size mRNA
• For most of them the function is unknown
• Often overexpressed in tumors
• 7SK RNA decreases the rate of RNA pol II elongation and inhibits the activity of CDK9/cyclin T complexes
• SRA RNA: co-activator of steroid receptors
• XIST RNA: X-chromosome inactivation in female cells
Antisense RNA
• TSIX regulates the XIST gene
• Antisense regulation of imprinted genes
• aHIF regulates hypoxia-inducible factor (HIF) 1alpha and HIF-2alpha
• Makorin-2 gene as an antisense to the RAF1 oncogene
• RFP2 CLL candidate gene and RFP2OS transcript
• Antisense beta myosin heavy chain RNA switches myosin heavy chain gene expression from myosin beta to myosin alpha in heart muscle
Polypeptide encoding genes
In the human genome, clusters of gene-rich regions are separated by gene deserts. Chr. 19 has the highest gene density; Chr. 13 and Y show the lowest gene density
Total gene number is estimated at 30,000-40,000, with an average gene size of 27 kb
Hundreds of human genes share homology with bacterial genes
Some more statistics
• Gene density 1/100 kb (varies widely)
• On average 9 exons per gene
• 363 exons in the titin gene
• Many genes are intronless
• Largest intron is 800 kb (WWOX gene)
• Smallest introns: 10 bp
• Average 5’ UTR 0.2-0.3 kb
• Average 3’ UTR 0.77 kb, but underestimated
• Largest protein: titin, 38,138 aa
INTRONLESS GENES
• Interferon genes
• Histone genes
• Many ribonuclease genes
• Heat shock protein genes
• Many G-protein coupled receptors
• Some genes with HMG boxes
• Various neurotransmitter receptors and hormone receptors
Smallest human genes
Percentages describe exon content to the length of the gene
Typical human genes
Extra Large human genes
IG genes are shown as germline genes, before rearrangements
Transcription of long introns is costly: in highly expressed genes, introns are 14 times shorter than in genes expressed at low levels
Castillo-Davis et al., 2002
Presumable functions of human genes
HUMAN genes and their homology to genes from other organisms
Why do we humans, kings of nature, have so few genes?
Human – 30,000 genes; Drosophila – 13,000; Nematode – 19,000
The potential diversity of the proteome and transcriptome is so great that there is no need to increase the number of genes
Gene families
• Functionally identical genes
-- Recently duplicated genes (alpha-globins); -- Histone genes (86 members, some are identical) -- Ubiquitin-encoding genes (some are in polycistronic transcription units)
• Functionally similar genes
usually also arise by duplication, then diverge
• Functionally related genes
belong to the same pathway or encode subunits of a protein complex (usually unrelated in sequence)
Chromosomal distribution of human histone genes
Bidirectional and partially overlapping genes
• Not very common in the human genome, as a density of 1 gene/100 kb allows genes to be loosely spaced
• Provide a possibility for common regulation of a gene pair
• Partially overlapping genes are usually encoded by opposite DNA strands
• Found in gene-dense areas, such as the HLA class III complex on 6p21.3
• Could represent a sense-antisense pair in which one gene encodes an mRNA and the other is non-coding
HLA class III complex on 6p21.3: an example of tightly packed genes
MHC class III genes encode complement proteins C4A and C4B, C2 and factor B, and tumour necrosis factors
Plus some immunologically irrelevant genes: genes encoding 21-hydroxylase, RNA helicase, casein kinase, heat shock protein 70, sialidase
An example of complex human gene locus INK4a-ARF
From: Prof. Gordon Peters website
Genes within genes
Intron 26 of the neurofibromatosis gene (NF1) encodes:
OMGP (oligodendrocyte myelin glycoprotein), EVI2A and EVI2B (homologues of ecotropic viral integration sites in mouse)
Gene families
• Classical gene families (conserved overall)
Histones, alpha- and beta-globins
• Gene families with large conserved domains (other parts may be poorly conserved)
HLH/bZIP box transcription factors
• Gene families with short conserved motifs
e.g. DEAD box (Asp-Glu-Ala-Asp), WD repeat
Example of human gene families clustered together
CS = chorionic somatomammotropin
four placenta-specific genes, primates only
serum albumin
alpha-albumin
vitamin D-binding protein
Example of human protein motifs
DEAD box proteins are involved in mRNA splicing and translation initiation; they have 8 conserved boxes, of which DEAD is the most evident
WD proteins take part in a variety of regulatory functions; GH (Gly-His) should be at a distance of 23-41 aa from WD (Trp-Asp)
Gene superfamilies
• Proteins that are functionally related in a general sense, but show only weak homology
• Immunoglobulin superfamily (IG genes, T- cell receptor genes, HLA-genes….) • Globin superfamily (myoglobin, alpha and betaglobins, neuroglobin etc….) • G-protein coupled receptor superfamily (seven transmembrane domains, but low homology) And so on….
Illustration to gene superfamily
Major mechanisms of gene family spreading
• Ancient gene or chromosomal segment duplications
– Tandem duplications
– Duplications with gene transfer to another chromosome
• Retrotransposition events (processed copies with no introns only)
Fig 33
Finished HG has 1.5% interchromosomal and 2% intrachromosomal segmental duplications. The duplications are 10–50 kb long and highly homologous.
Human Gene Families extended recently by gene duplication
Some regions of genome are more prone to rearrangements than others
Chromosome 22q
Human pseudogenes
Non-processed pseudogenes
Contain introns; arise by duplications; frequency of transfer depends on chromosomal context (pericentromeric fragments are transferred more often)
Processed pseudogenes
Do not contain introns;
Arise by retrotransposition; Frequency of transfer depends on initial level of gene expression (Highly expressed genes are transferred more often)
Complete
Partial
Both types of pseudogenes are raw material for evolution
HLA type I cluster
Domain structure of a typical HLA type I gene Complete pseudogenes
Partial pseudogenes and their structure
NF1 gene and its pseudogenes on different chromosomes
All NF1 pseudogenes are partial; 11 of them are found in the genome
Mechanism of processed pseudogene transfer into new location
Could be very prolific: there are 95 functional ribosomal protein genes and 2090 pseudogenes
Transcription from pseudogenes
Figure labels: Chr 1 (master gene), Chr 15, Chr 7
• Partial duplication with preservation of the promoter: expression is preserved in evolution if the transcript encodes a partial (regulatory) protein, or if sites for rare transcription factors are present in both promoters
• LINE-mediated inclusion of a cDNA copy of the master gene: brought under a heterologous promoter by chance; could be antisense
Human Genome Project
The International Human Genome Consortium Initial sequencing and analysis of the human genome Nature, 409, February 15, 860-921 (2001)
Venter et al. (Celera) The Sequence of the Human Genome Science, 291, February 16, 1304-1351 (2001)
History of Human Genome Consortium
1984 to 1986 – first proposed at US DOE meetings
1988 – endorsed by US National Research Council
- creation of genetic, physical and sequence maps of the human genome
- parallel efforts in five model organisms: bacteria, yeast, worms, flies and mice
- development of supporting technology
- ethical, legal and social issues (ELSI)
1990 – Human Genome Project (NHGRI) with NIH
Later – UK, France, Japan, Germany, China, Russia
Technical development necessary for human genome completion
1. Automated capillary based DNA sequencing
2. Electronic databases GenBank, UniGene, sequence assembly software
Completed sequences
1995 – First complete bacterial genomes
2002 – About 35 bacterial genomes; 0.5-5 Mb; hundreds to 2000 genes
1996 April – Yeast (Saccharomyces cerevisiae): 12 Mb, 5,500 genes
1998 Dec. – Worm (Caenorhabditis elegans): 97 Mb, 19,000 genes
2000 March – Fly (Drosophila melanogaster): 137 Mb, 13,500 genes
2000 Dec. – Mustard (Arabidopsis thaliana): 125 Mb, 25,498 genes
2000 June – Human (Homo sapiens): 1st rough draft
2001 Feb 15/16 – Human, “working draft”: 3000 Mb, 35,000-40,000 genes
Also listed: mouse, rat, chimp
BAC-by-BAC shotgun (public sequence): a clone contig is a prerequisite
Total shotgun from the BAC end (Celera): no prerequisites
Prerequisites of human genome sequencing:
genetic and physical maps
Genetic Mapping –
based on recombination frequency (expressed as cM); Key word : co-segregation
Physical Mapping –
actual molecular distance in nucleotide base pairs (expressed as bp, kb, or Mb). Key word: contig
Genetic maps are important crutches for physical maps
Genetic Markers for Mapping
• Polymorphic markers for genetic mapping
– RFLPs: restriction fragment length polymorphisms
– SSRs: simple sequence repeats (also called microsatellites)
• For high-resolution physical mapping
– STSs: sequence tagged sites
– ESTs: expressed sequence tags
Fragment of a human genetic map
One map unit = one centimorgan (cM) = 1% recombination between loci
Physical maps
Ways to create a genetic map: analysis of large human pedigrees
The CEPH Family Panel (Centre d'Etude du Polymorphisme Humain) – 40 nuclear families
• 10 are French families
• 27 are Utah Mormon pedigrees
• 2 are Venezuelan Huntington's pedigrees
• 1 is an Old Order Amish pedigree (has bipolar affective disorder segregating)
The total number of individuals in the panel is 520, with an average sibship size of 8.
Typical task of CEPH based research
To integrate new polymorphic marker into human genetic map:
1) PCR-amplify the given marker in DNA from members of the CEPH families
2) Compare the segregation pattern of the given marker with the segregation patterns of other known (already mapped) markers
3) Infer the genetic location of the marker of interest from the known marker it co-segregates with
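Steps 2 and 3 can be sketched as a comparison of segregation patterns. In this toy example each list records which of two parental alleles (0 or 1) each of ten children inherited; the marker names and genotypes are hypothetical, and real linkage analysis uses LOD scores over many families rather than a raw count.

```python
def recombination_fraction(marker_a, marker_b):
    """Fraction of children in which two markers segregate differently."""
    if len(marker_a) != len(marker_b):
        raise ValueError("markers must be typed in the same individuals")
    recombinants = sum(a != b for a, b in zip(marker_a, marker_b))
    return recombinants / len(marker_a)

# Hypothetical already-mapped markers typed in one sibship of 10 children
known_markers = {
    "known_marker_A": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "known_marker_B": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
}
new_marker = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

theta = {name: recombination_fraction(new_marker, pattern)
         for name, pattern in known_markers.items()}
closest = min(theta, key=theta.get)

for name, value in theta.items():
    # 1% recombination = 1 cM (a fair approximation for small distances)
    print(f"{name}: theta = {value:.2f} (~{value * 100:.0f} cM)")
print("new marker co-segregates best with", closest)
```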
Ways to create a physical map:
1. Somatic cell hybrids
2. Radiation hybrids
3. Enrichment of starting DNA for library construction
- chromosome flow-sorting
- microdissection
4. Contig construction from genomic fragments
- BAC, YAC, PAC
- cosmids
cgil.uoguelph.ca/ QTL/ SomaticCellHybrids.htm
Somatic cell hybrids
Each of the resulting cell colonies will contain a full mouse genome plus a few human chromosomes
The resulting colonies are stable
Monochromosomal hybrids are the most useful
24 colonies = 24 PCRs = chromosomal location of your sample
Radiation hybrids
Whole-genome radiation hybrids
RH maps are constructed by typing a panel of hybrids with a set of human DNA markers
Only a PROPORTION of the pieces of the broken human chromosomes will integrate into rodent chromosomes
RH panels available before human genome sequence
GeneBridge 4 panel
• 93 human-hamster cell lines
• Each contains 32% of the human genome on average
• Average fragment size is 25 Mb
Stanford G3 panel
• 83 human-hamster cell lines
• Each contains 16% of the human genome on average
• Average fragment size is 2.4 Mb; allows finer mapping
Both operate via databases on the corresponding central servers
Creation of representative clone libraries
In shotgun cloning we hope that randomly picked fragments will cover every piece of the genome
Cover the whole genome with easy-to-handle bacterial clones, then screen for the one that you need
Contigs and contiguous maps
www.biozentrum.uni-wuerzburg.de/ .../weenie/fig1.html
Shotgun cloning (clone everything, hope for the best)
How many clones would be required to have a complete library of the human genome?
N = ln(1 - P) / ln(1 - f)
where N = number of clones required, P = probability of recovering any particular sequence, f = average fraction of the genome per clone (insert size / genome size)
For the human genome of 3 x 10^9 bp: the pUC18 plasmid accepts a 5,000 bp insert => f = 5000 / (3 x 10^9)
Set P = 0.999 to be on the safe side, i.e. any particular gene segment is present with 99.9% probability
Then N = 4.1 x 10^6 plasmid clones are required
You will need fewer clones if the average insert size is larger
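The calculation above is easy to reproduce and to repeat for larger inserts. The sketch below uses the same formula; the cosmid and BAC insert sizes (45 kb and 150 kb) are taken from the capacities quoted elsewhere in this lecture, and the function name is illustrative.

```python
from math import ceil, log, log1p

def clones_required(genome_bp, insert_bp, p):
    """N = ln(1 - P) / ln(1 - f), with f = insert size / genome size."""
    f = insert_bp / genome_bp
    return ceil(log(1 - p) / log1p(-f))  # log1p(-f) = ln(1 - f), accurate for tiny f

GENOME_BP = 3e9
n_plasmid = clones_required(GENOME_BP, 5_000, 0.999)
n_cosmid = clones_required(GENOME_BP, 45_000, 0.999)
n_bac = clones_required(GENOME_BP, 150_000, 0.999)

print(f"plasmid (5 kb):  {n_plasmid:,} clones")
print(f"cosmid (45 kb):  {n_cosmid:,} clones")
print(f"BAC (150 kb):    {n_bac:,} clones")
```

Because f is tiny, ln(1 - f) is approximately -f, so N scales almost inversely with insert size: a 30-fold larger insert needs roughly 30-fold fewer clones.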
ww2.mcgill.ca/biology/undergra/ c200a/f07-16.gif
Partial restriction with Sau3A
4-bp sequence GATC with sticky ends
Partial digestion of this region of DNA would yield a variety of overlapping fragments (blue) ~ 20 kb long. Use of such overlapping fragments increases the probability that all sequences in the genomic DNA will be represented in a Lambda library.
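Why the digestion has to be partial can be seen from the expected spacing of a 4-bp site. The sketch assumes equal base frequencies and random sequence, which is only an approximation for human DNA; the helper function is illustrative.

```python
SITE = "GATC"  # Sau3A recognition sequence

# Under equal base frequencies a 4-bp site occurs about once per 4^4 bp
expected_spacing = 4 ** len(SITE)            # 256 bp
target_fragment = 20_000                     # ~20 kb fragments, per the slide
sites_per_fragment = target_fragment / expected_spacing
fraction_of_sites_cut = 1 / sites_per_fragment

def find_sites(seq, site=SITE):
    """Start positions of every occurrence of the site in seq."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

print(expected_spacing, round(sites_per_fragment), f"{fraction_of_sites_cut:.3f}")
print(find_sites("AAGATCTTGATCGATC"))  # [2, 8, 12]
```

A complete digest would give fragments of a few hundred base pairs; cutting only about one site in 78 gives fragments near 20 kb, and because a different subset of sites is cut in each molecule the fragments overlap.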
Gridded library in a 96 well plate
For human genome sequencing:
• Underlying map: YAC map
– YACs are too large for sequencing, are often rearranged and are difficult to handle
• Map subjected to sequencing: BAC/PAC map
• The most important and interesting regions were mapped by cosmid-based maps
Yeast artificial chromosome (YAC)
YAC vector capacity is 150 kb - 2 Mb (in yeast cells)
The YAC vectors contain:
1) yeast centromere (CEN) and yeast telomeres (TEL)
2) autonomously replicating sequence (ARS) = origin of replication
3) URA3 gene involved in uracil synthesis, for positive selection of yeast cells with the YAC (in a URA- host strain)
4) a bacterial replication origin (ORI) and bacterial selection marker (Amp), to propagate the empty vector
ARS
URA
www.indstate.edu/thcme/ mwking/yac.gif
BAC (Bacterial Artificial Chromosome) capacity = 100-200 kb
www.labs.roslin.ac.uk/ jwilliam/bacinfo.html
Cosmid vectors
Cosmid = plasmid that contains cos site (packaging signal) of the lambda phage
Vector capacity = 42-45 kb
Vector size = 5 kb
Phage-derived advantages: Size selection, large capacity
Plasmid-derived advantages: Plasmid minipreps, growing as a colony
www.web-books.com/MoBio/ Free/images/Ch9A6.gif
http://www.epicentre.com/f5_3/f5_3pw3.gif
Cosmid map in PDE4a human region
For human genome sequencing:
YAC, BAC/PAC and cosmid maps were verified by STS maps. STS maps included:
1) Lots of polymorphic STR markers (genetic maps)
2) Lots of short sequences mapped on RH panels or on somatic cell hybrids
3) Lots of ESTs (mostly representing genes)
For human genome sequencing:
• BACs/PACs representing verified contigs were subcloned into M13 vectors (short inserts) and randomly sequenced
• The resulting sequences of every BAC/PAC clone were assembled with PHRED/PHRAP software
– PHRED analyses raw sequence traces and provides a quality score for each base position (estimating the degree of confidence)
– PHRAP performs the sequence assembly itself
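The quality score mentioned above is conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion; the function names and the example read are illustrative.

```python
from math import log10

def phred_to_error(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error(q)):,} bases")

# Expected number of miscalled bases in a short hypothetical read
read_qualities = [30, 30, 20, 10]
expected_errors = sum(phred_to_error(q) for q in read_qualities)
print(f"expected errors in read: {expected_errors:.3f}")
```

Q20 therefore corresponds to 99% confidence in a base call and Q30 to 99.9%.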
PHRED base calling program that works with different image process settings
most accurate PHRED quality scores
PHRAP aligns sequenced clones
Problems with genomic sequence alignments
• High copy repeats (Alu, LINEs) • Segmental duplications
Sometimes blocks are very large (>200Kb) Highly Similar (>97%) No characteristic sequences
Duplication detection
From. Dr. Vicky Choi lectures
5-10 copies
10-20 copies
Over-representation of Celera reads in the duplicated regions of clones
~40 copies
Human gene maps
Gene prediction
Signal based: Starts, stops, splicing signals, promoters GRAIL, GENEFINDER
Content-based: ORF, Codon preference by the organism, EST coverage
GRAIL algorithm
1. Generate the list of all exon candidates
-- translational start, splice donor, splice acceptor, translational stop
-- 1000s of candidate exons per 10 kb sequence
2. Remove improbable exon candidates by a set of 30 rules (95% of exons removed)
3. Evaluate exon candidates by neural network
-- exon candidate scores as output node
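Step 1 can be illustrated with a toy signal scanner. This is not GRAIL's implementation: it uses only the minimal consensus signals (ATG, GT donor, AG acceptor, three stop codons) on one strand, and the function names are hypothetical. It does show why the candidate list grows so quickly, since every acceptor can pair with every downstream donor.

```python
import re

# Minimal consensus signals; real predictors use much richer models
SIGNALS = {
    "start": "ATG",
    "donor": "GT",
    "acceptor": "AG",
    "stop": "TAA|TAG|TGA",
}

def find_signals(seq):
    """Positions of each signal; the lookahead allows overlapping matches."""
    seq = seq.upper()
    return {name: [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
            for name, pattern in SIGNALS.items()}

def internal_exon_candidates(seq, min_len=3):
    """All (start, end) pairs running from just after an AG to a downstream GT."""
    signals = find_signals(seq)
    candidates = []
    for acceptor in signals["acceptor"]:
        exon_start = acceptor + 2          # first base after the AG
        for donor in signals["donor"]:
            if donor - exon_start >= min_len:
                candidates.append((exon_start, donor))
    return candidates

toy = "CCAGATGGCCGTAAGT"
print(find_signals(toy))
print(internal_exon_candidates(toy))  # [(4, 10), (4, 14)]
```

Even this 16-base toy yields two candidates; the filtering rules and neural network in steps 2 and 3 exist to prune that combinatorial excess.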
Gene predictions are inaccurate