
Genome Annotation

Shared by: HC111129104154
Posted: 11/29/2011

What we are going to discuss

• Finding RNA-only genes

• Gene prediction

– Prokaryotes vs. eukaryotes

– Introns and exons

– Transcription signals

– ESTs

• Functional annotation

• Biochemical pathways and subsystems

• Metabolic reconstruction of whole organisms

Genome Overview

• What’s in a genome?

– Protein coding genes.

• In long open reading frames

• ORFs interrupted by introns in eukaryotes

• Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic

genome

– RNA-only genes

• Transfer RNA, ribosomal RNA, snoRNAs (guide ribosomal and transfer RNA

maturation), intron splicing, guiding mRNAs to the membrane for translation, gene

regulation—this is a growing list

– Gene control sequences

• Promoters

• Regulatory elements

– Transposable elements, both active and defective

• DNA transposons and retrotransposons

• Many types and sizes

– Repeated sequences.

• Centromeres and telomeres

• Many with unknown (or no) function

– Unique sequences that have no obvious function

• As a general rule, each part of a genomic sequence has only one function: protein-coding gene, RNA gene, control signal, transposable element, repeat sequence, or perhaps no function at all. Sequence elements overlap only slightly, if at all.

RNA Genes

• The most universal genes, such as tRNA and rRNA, are very

conserved and thus easy to detect. Finding them first removes some

areas of the genome from further consideration.

– One easy approach to finding common RNA genes is just looking for

sequence homology with related species: a BLAST search will find most

of them quite easily

• Functional RNAs are characterized by secondary structure caused

by base pairing within the molecule.

– Determining the folding pattern is a matter of testing many possibilities

to find the one with the minimum free energy, which is the most stable

structure.

• The free energy calculations are in turn based on experiments where short

synthetic RNA molecules are melted

– Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.

• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.

RNA Structures

• RNA differs from DNA in having fairly

common G-U base pairs. Also, many

functional RNAs have unusual modified

bases such as pseudouridine and

inosine.

• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intensive because it breaks the usual rule that RNA base pairs follow a nested pattern

– But pseudoknots seem to be fairly rare.

• Essentially, RNA folding programs start

with all possible short sequences, then

build to larger ones, adding the

contribution of each structural element.

– There is an element of dynamic

programming here as well.

– And, “stochastic context-free grammars”,

something I really don’t want to approach

right now!
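The "start short, build to larger" idea can be sketched with the simplest possible scoring. This is only an illustration: it counts base pairs (a Nussinov-style recursion) instead of computing free energies, so it is not what real folding programs do. The function names and the minimum hairpin loop of 3 bases are assumptions made for the example.

```python
# Toy RNA folding by dynamic programming: maximize the number of nested base pairs.
# Assumption: pairs are scored +1 each; real programs sum measured free energies.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in PAIRS


def max_base_pairs(seq, min_loop=3):
    """Largest number of nested pairs, with at least min_loop unpaired bases in a hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Solve short subsequences first, then build up to longer ones.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

Because every pair encloses or sits beside other pairs, this recursion only ever finds nested structures, which is exactly why pseudoknots fall outside it.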

Finding tRNAs

• tRNAs have a highly conserved

structure, with 3 main stem-and-

loop structures that form a

cloverleaf structure, and several

conserved bases. Finding such

sequences is a matter of looking

in the DNA for the proper features

located the proper distance apart.

• Looking for such sequences is

well-suited to a decision tree, a

series of steps that the sequence

must pass.

• In addition, a score is kept, rating

how well the sequence passed

each step. This allows a more

stringent analysis later on, to

eliminate false positives.

tRNAscan Decision Tree

• tRNAscan is estimated to have an error rate of 1 in 3 million bases.

• This is very suitable for prokaryotes, whose genomes are approximately this size.

Prokaryotic Genes

• Gene finding in prokaryotes is relatively simple compared to

eukaryotes:

– no introns, so all genes are in open reading frames starting at a

start codon and ending at a stop codon

– most of the DNA is involved in coding for proteins.

• Thus, you can achieve 100% sensitivity, if you don’t mind false positives, by simply listing all possible ORFs above a certain size.

– There is a problem in that it is not clear how many short ORFs (say

less than 100 bp) are real genes.
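A minimal sketch of "list all possible ORFs above a certain size", scanning all six reading frames. The function names, the tuple layout, and the choice of ATG, GTG and TTG as start codons are assumptions for illustration; coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
# Naive ORF listing: first start codon to the next in-frame stop, both strands.

STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_len=100):
    """Return (strand, start, end) for every ORF of at least min_len bases.

    start is the first base of the start codon and end is one past the stop
    codon, both counted on the strand that was scanned.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon in STARTS:
                    start = pos  # keep the first (longest) candidate start
                elif start is not None and codon in STOPS:
                    if pos + 3 - start >= min_len:
                        orfs.append((strand, start, pos + 3))
                    start = None
    return orfs
```

Lowering min_len raises sensitivity and adds false positives, which is the trade-off described above.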

• If you compare predicted genes with actual genes, you can classify each base according to whether it is:

– true positive (TP): predicted correctly to be in a gene

– true negative (TN): predicted correctly to not be in a gene

– false positive (FP): predicted to be in a gene but actually not

– false negative (FN): predicted to not be in a gene but actually is within a gene.

• The sensitivity (Sn) of a prediction is the fraction of bases in real genes that are predicted to be within genes: Sn = TP / (TP + FN)

• The specificity (Sp) is the fraction of bases predicted to be in a gene that actually are: Sp = TP / (TP + FP)

• Both of these parameters need to be optimized.
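The two formulas are easy to check by hand on a small example. The sketch below assumes the prediction and the true annotation are given as equal-length lists with one flag per base (1 = in a gene); the function name is made up for the example.

```python
# Base-level sensitivity and specificity, as defined on this slide.

def base_level_accuracy(predicted, actual):
    """Return (Sn, Sp) from per-base flags; 1/True means 'inside a gene'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    sn = tp / (tp + fn) if tp + fn else 0.0  # fraction of real gene bases found
    sp = tp / (tp + fp) if tp + fp else 0.0  # fraction of predicted bases that are real
    return sn, sp


predicted = [1, 1, 1, 1, 0, 0, 0, 0]
actual = [0, 1, 1, 1, 1, 1, 0, 0]
print(base_level_accuracy(predicted, actual))  # TP=3, FP=1, FN=2 -> (0.6, 0.75)
```

Note that Sp as defined here ignores true negatives; in other fields the same quantity is usually called precision.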

General Considerations

• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly

common, and a few others are occasionally used.

– Remember that start codons are also used internally: the actual start codon may not be

the first one in the ORF.

• The stop codons are the same as in eukaryotes: TGA, TAA, TAG

– stop codons are (almost) absolute: except for a few cases of programmed frameshifts

and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end

of protein translation.

• Genes can overlap by a small amount. Not much, but a few codons of overlap

is common enough so that you can’t just eliminate overlaps as impossible.

• Cross-species homology works well for many genes. It is very unlikely that non-

coding sequence will be conserved.

– But, a significant minority of genes (say 20%) are unique to a given species.

• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences)

are often found just upstream from the start codon

– however, some aren’t recognizable

– genes in operons don’t always have a separate ribosome binding site for each gene

Compositional Methods

• The frequency of various codons is different in coding regions

as compared to non-coding regions.

– This extends to G-C content, dinucleotide frequencies, and other

measures of composition. Dicodons (groups of 6 bases) are often

used

– Well documented experimentally.

• The composition varies between different proteins of course,

and it is affected within a species by the amounts of the various

tRNAs present

– horizontally transferred genes can also confuse things: they tend to

have compositions that reflect their original species.

– A second group with unusual compositions are highly expressed

genes.

GeneMark

• GeneMark uses fifth order Markov chains to examine dicodons. That is, every

base is evaluated in terms of its probability given the previous 5 bases.

– P(a|x1x2x3x4x5), the probability that the sixth base in the sequence is a given that the

bases preceding it are x1x2x3x4x5, so the final sequence is x1x2x3x4x5a.

– The necessary parameters are obtained by looking at pentamers (5-mers) within

known genes and counting the number of times each base appears in the sixth

position. This is the training set. Possible use of pseudocounts here.

• GeneMark pays attention to reading frame also. Each reading frame gets its

own set of statistics. Thus there is a separate P1(a|x1x2x3x4x5), P2(a|x1x2x3x4x5),

and P3(a|x1x2x3x4x5), where 1, 2, and 3 are the reading frames.

– Based on the position of the stop codon, each base in an ORF has a unique codon

position.

– Non-coding regions are assumed to have the same statistics for all frames.

– The final probability is given as the probability that it is coding for a specified reading

frame.

• A 96 base sliding window is moved across the genome, scoring all possible

reading frames. Start and stop codons are not accurately predicted, especially

with overlapping genes--they need to be identified separately.
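A rough sketch of the frame-specific fifth-order Markov chain idea, not GeneMark's actual code. Assumptions: training genes are given in frame (first base at codon position 0), sequences contain only A, C, G, T, and a pseudocount of 1 is added to every count. The function names are hypothetical.

```python
import math
from collections import defaultdict

ORDER = 5      # each base is conditioned on the previous 5 bases
BASES = "ACGT"


def train(genes, pseudocount=1.0):
    """Count base-after-pentamer occurrences separately for each codon position."""
    counts = [defaultdict(lambda: dict.fromkeys(BASES, pseudocount)) for _ in range(3)]
    for gene in genes:
        for i in range(ORDER, len(gene)):
            counts[i % 3][gene[i - ORDER:i]][gene[i]] += 1
    return counts


def log_prob(counts, seq, frame):
    """Log-probability of seq, assuming seq[0] sits at codon position `frame`."""
    total = 0.0
    for i in range(ORDER, len(seq)):
        table = counts[(i + frame) % 3][seq[i - ORDER:i]]
        total += math.log(table[seq[i]] / sum(table.values()))
    return total


# Toy training set: one artificial "gene" made of repeated GCT codons.
model = train(["ATG" + "GCT" * 8 + "TAA"])
window = "GCT" * 5
print([round(log_prob(model, window, f), 2) for f in range(3)])
```

In this toy case the window scores highest in frame 0, the frame it was trained in; contexts never seen in a given frame fall back to the uniform pseudocount table.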

GLIMMER

• GLIMMER also uses Markov chains, but they vary from zeroth order (i.e. GC content) to eighth

order (what is the probability of a base given the previous 8 bases?).

– The point of this is to help get around the need for huge sets of training data while avoiding pseudocounts.

– Called “interpolated Markov models”

• GLIMMER selects training data from a genome sequence by picking non-overlapping long ORFs,

which are almost all genes.

– Note that high GC-content genomes need “long” defined differently than low GC genomes, since random stop

codons are rarer.

• GLIMMER builds its Markov models from the lowest order up. At each step, there must be at least

400 observations to accept the model as valid.

– If there are too few observations, the model is compared to the next-lower-order model, using a chi-square test.

• If the new model isn’t significantly different from the lower order model, it is discarded.

• If the new model is significantly different, it is weighted based on the number of observations and the significance level.

– For example, if there are fewer than 400 observations of x1x2x3x4x5, the P(a|x1x2x3x4x5) for each base a is tested against P(a|x2x3x4x5) probabilities.

• After all parameters are obtained, only the highest order model is used for any given subsequence.

• Each ORF longer than a minimum is scored (as opposed to using a sliding window that ignores

ORFs)

• New versions don’t require that the “given” bases be adjacent to the base they are scored with.

ORPHEUS

• ORPHEUS uses Markov models of codon frequency, based on a set of high-

confidence (i.e. highly conserved) genes. However, ORPHEUS also looks for

ribosome binding sites.

– The score for a given codon abc is based on the frequency of this codon compared to the frequencies of the individual bases. This score is then summed for all codons in the training set and used as a parameter for the Markov model.

• Each ORF in the genome is scored for the correct reading frame (as set by the

stop codon) and for the other 2 forward, incorrect reading frames. If the correct

frame score exceeds the incorrect frame scores by a certain amount, this ORF is

accepted as protein-coding.

• After a good ORF is found, it is extended 5’ to find possible start codons (but only

allowing 6 bases of overlap with another gene).

• Ribosome binding sites are then defined, based on genes that have only 1 possible

start codon. Twenty bases upstream from the start sites are aligned.

– RBS are not an exact distance upstream from the start codon.

– The RBS scoring matrix derived from this is used to locate RBS for other genes.

• The search is done progressively, starting with the longest ORFs and working

towards the smaller ones. This avoids a lot of overlap problems.

GeneMark.hmm

• More Markov chains, but here, the probabilities are based on overall

length of the ORF.

– A true Markov model only considers the previous state, so this is a semi-

Markov model.

• It has been found that the length of coding regions can be modelled

with a gamma distribution, and the length of non-coding regions can be

modelled with an exponential distribution. (just empirical observations,

not based on theory).

• GeneMark.hmm changes the probability that a base is in a coding

region depending on the length of the coding region defined to that

point.

• It also looks for ribosome binding sites.

Eukaryotic Gene Prediction

• Some fundamental differences between

prokaryotes and eukaryotes:

• There is lots of non-coding DNA in eukaryotes.

– First step: find repeated sequences and RNA

genes

– Note that eukaryotes have 3 main RNA

polymerases. RNA polymerase 2 (pol2)

transcribes all protein-coding genes, while pol1

and pol3 transcribe various RNA-only genes.

• most eukaryotic genes are split into exons and

introns.

• Only 1 gene per transcript in eukaryotes.

• No ribosome binding sites: translation starts at

the first ATG in the mRNA

– thus, in eukaryotic genomes, searching for the

transcription start site (TSS) makes sense.

• Many fewer eukaryotic genomes have been

sequenced

Exons and Introns

• Size distribution of exons varies

according to position in the gene. It is

also quite different between plants and

animals.

• Exons are generally shorter than

prokaryotic ORFs, as short as 10 bp.

– Note that the leading exon and the trailing

exon always contain some non-coding

bases, and sometimes they are entirely

non-coding.

– Exon-intron boundaries can occur within a

codon as well as between codons.

• Introns can be incredibly long, with some

human introns over 400,000 bp.

Minimum size is about 50 bp.

• Many genes have alternate splicing

patterns: a sequence that is an exon in

one tissue might be an intron in another

tissue.

More Exon-Intron

• Each gene has a transcription start

site, but promoters and other

features are not well conserved, as

compared to coding sequences.

• Splicing signals are not absolute

(especially given alternative splicing),

and they also vary widely.

– In general, introns start with GT and end with AG, and have a splice acceptor region just upstream from the end.

• There are also the relatively rare (<

1%) U12 introns, which are removed

by different spliceosomes than the

usual U2 introns. The U12 introns

start with AT and end in AC.









Human on left, Arabidopsis on right

Predicting Exons and Introns

• Exon sequences can often be identified by sequence

conservation, at least roughly.

• Dicodon statistics, as used for prokaryotes, are also useful

– eukaryotic genomes tend to contain many isochores, regions of

different GC content, and composition statistics can vary between

isochores.

• The initial and terminal exons contain untranslated regions, and

thus special methods are needed to detect them.

• Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision tree, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.

– In general, sites are more likely to be correct if predicted by multiple

methods

– Experimental data from ESTs can be very helpful here.

ESTs

• Experimental information about intron/exon boundaries is mostly obtained by

analyzing expressed sequence tags (ESTs).

• EST production starts out by extracting mRNA from a specific tissue, then

reverse-transcribing it to make double-stranded cDNA, then cloning the cDNA

into a plasmid vector.

– After the clone is produced, it is sequenced from one or both ends, just a single time.

– A 5’ EST, sequenced from the 5’ end, usually contains at least some protein-coding portion.

– A 3’ EST, sequenced from the 3’ end, is often 3’ untranslated region, which is less conserved across species lines.

• Single-pass sequencing leads to an imperfect sequence, but BLAST can generally locate its position in the genome exactly.

– Also, lots of redundancy in an EST library, especially with highly expressed sequences.

• ESTs provide evidence that a given sequence has been expressed.

• They also show which sequences are exons, since introns have already been

spliced out of the mRNA.

• Large numbers deposited in dbEST, part of NCBI.

– The UniGene set organizes ESTs from individual genes to remove a lot of redundancy.

Finding the Transcription Start Site

• The basic idea is to first create a model of transcription start sites based on

experimentally-determined starts, then devise ways to score sequences relative to

this model.

• Work by Bucher in 1990 produced scoring matrices for the GC box, CCAAT box, TATA box, and RNA initiation/cap site (Inr). (These are listed 5’ to 3’, moving from upstream toward the TSS itself.)

– Only vertebrates have GC and CCAAT boxes

– not all genes have recognizable TATA boxes

– the cap signal is quite short, and thus noisy.

Eukaryotic ab initio Gene Prediction

• Based on hidden Markov model (HMM)

• As you move along the DNA sequence, a

given nucleotide can be in an exon or an

intron or in an intergenic region.

• The oversimplified model on this slide doesn’t have the “non-gene” state

• Use a training set of known genes (from the same or closely related species) to determine transition and emission probabilities.



Very simple HMM: each

base is either in an intron

or an exon, and gets

emitted with different

frequencies depending on

which state it is in.



GeneMark scoring of the likelihood that each nucleotide is in an intron, based on an HMM.
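The very simple two-state HMM can be decoded with the Viterbi algorithm, which finds the single most probable exon/intron labelling of a sequence. This is only a sketch: the start, transition and emission probabilities below are invented for illustration (GC-rich exons, AT-rich introns), whereas a real gene finder would estimate them from a training set.

```python
import math

STATES = ("exon", "intron")

# Hypothetical parameters, chosen only to make the example work.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
        "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}


def viterbi(seq, start=START, trans=TRANS, emit=EMIT):
    """Most probable state path for a non-empty sequence (computed in log space)."""
    scores = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state to have come from when entering state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = scores[prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("GCGCGCGCGC" + "ATATATATAT"))
```

With these made-up numbers the GC-rich half is labelled exon and the AT-rich half intron, because ten AT bases favor the intron state by more than the cost of one state switch.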

HMM Model with Intron Phases

• A more realistic model: you can move from the

non-gene state (N) to either a singleton exon

(Es: a gene with just one exon ) or to an initial

exon (Ei).

• From Es you move back to N.

• From Ei you can move to an intron, which can be

in any of 3 different phases.

– Intron and exon phases designate whether the exon/intron boundary splits a codon: in phase 0 the boundary is between codons; phase 1 splits the codon between the first and second bases, and phase 2 splits the codon between the second and third bases.

– Also, exon/intron boundaries don’t split stop codons, which necessitates the I1T etc. intron states.

• Then back and forth between introns and exons, until you reach a terminal exon (Et), then back to the intergenic state (N).

• SNAP: Korf (2004) BMC Bioinformatics 5:59.

A more realistic model, from SNAP.
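Intron phase follows directly from how many coding bases precede the intron. A small sketch, assuming the input is the number of coding bases in each exon (untranslated bases excluded) in 5' to 3' order; the function name is made up.

```python
def intron_phases(exon_coding_lengths):
    """Phase of each intron: coding bases seen so far, modulo 3.

    0 = boundary falls between codons,
    1 = after the first base of a codon,
    2 = after the second base of a codon.
    """
    phases = []
    coding_so_far = 0
    for length in exon_coding_lengths[:-1]:  # no intron follows the last exon
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


# Three exons with 100, 50 and 30 coding bases (180 in total, 60 codons).
print(intron_phases([100, 50, 30]))  # [1, 0]
```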

Codon Bias within Exons

• Depending on the GC content of the organism as well as other, less well defined characteristics, the frequency with which different synonymous codons are used can vary widely. This makes it necessary to train the HMM gene finder with a set of genes from the same or a closely related species.

At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa

Exon/Intron Boundaries and Start Codons

• Gene finders use HMMs that look for signals in the DNA by applying a “weight matrix” to each nucleotide based on the nucleotides around it. Thus, the HMM is considering more than just the immediately preceding nucleotide.

Sequence logos around (b) the intron splice donor site (usually GT) and (c) the ATG translation start codon, in four well-studied eukaryotes.
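A weight matrix can be sketched as a table of log-odds scores, one column per position in the signal. Everything specific here is an assumption for illustration: the four example donor sites are invented, the background is a uniform 0.25 per base, and a pseudocount of 1 is used. Real gene finders train on many known sites.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix from aligned example sites of equal length."""
    pwm = []
    for col in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def best_hit(pwm, seq):
    """Return (score, offset) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (sum(pwm[i][seq[pos + i]] for i in range(width)), pos)
        for pos in range(len(seq) - width + 1)
    )


# Hypothetical splice donor examples (intron side, starting with GT).
donor_pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"])
score, offset = best_hit(donor_pwm, "CCCAGGTAAGTCCC")
print(offset)  # 5: the window GTAAGT
```

Each position is scored independently of its neighbors here, which is the main limitation of a plain weight matrix.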

Some Results with SNAP

• Here, sensitivity (SN) and specificity (SP) are listed for:

1. Whether a given nucleotide is contained in an exon

2. Whether a given predicted exon has exactly the same boundaries as

a real exon

3. Whether a given gene has exactly the same intron/exon structure

and boundaries as the actual gene.

Discriminant Analysis

• Scoring sequences for the presence of eukaryotic

promoters uses several techniques, including hidden

Markov models, neural networks (which we will

discuss later), and other scoring schemes.

• Discriminant analysis is a statistical technique for

combining scores from several different parameters

and drawing a line that discriminates between “good”

and “bad”.

• Each factor is considered an independent dimension on a multi-dimensional plot.

– As opposed to just adding up scores for each factor

– or, using individual scores as part of yes/no decisions

• Each sequence from a training set is plotted, knowing in advance which sequences are genuine promoters and which are not.

• Using a least-squares fitting method, draw the line (a hyperplane really) that best separates the two groups.

– This is linear discriminant analysis

• Quadratic discriminant analysis draws a curved (quadratic) boundary, such as a parabola, instead. Sometimes this works better.

Several factors used to score promoter sequences. This is part of a neural network model, but the factors are common to many programs.
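A minimal sketch of linear discriminant analysis for two factors, written in Fisher's formulation (project onto the direction that best separates the class means relative to the within-class scatter). Assumptions: only two features, equal weight for both classes, and training points invented for the example. The function names are hypothetical.

```python
def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_lda(good, bad):
    """Return (w, threshold); a point x is called 'good' when w . x > threshold."""
    mg, mb = mean2(good), mean2(bad)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((good, mg), (bad, mb)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    dmx, dmy = mg[0] - mb[0], mg[1] - mb[1]
    # w = S^-1 (mean_good - mean_bad), using the 2x2 inverse written out by hand.
    w = ((syy * dmx - sxy * dmy) / det, (sxx * dmy - sxy * dmx) / det)
    # Put the decision line halfway between the projected class means.
    threshold = 0.5 * (w[0] * (mg[0] + mb[0]) + w[1] * (mg[1] + mb[1]))
    return w, threshold


def is_good(point, w, threshold):
    return w[0] * point[0] + w[1] * point[1] > threshold


# Invented training scores for two factors.
promoters = [(2.0, 2.1), (2.5, 1.9), (3.0, 2.6), (2.2, 3.0)]
non_promoters = [(0.0, 0.2), (0.5, -0.1), (1.0, 0.6), (0.3, 1.0)]
w, threshold = fit_lda(promoters, non_promoters)
print(is_good((2.4, 2.4), w, threshold), is_good((0.4, 0.4), w, threshold))
```

The line w . x = threshold is the hyperplane mentioned above; with more factors the same calculation needs a general matrix inverse.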

Discriminant Analysis

• Illustrated here for 2 factors, but of course there can be many more.

• The quadratic discriminant works much better in this case.

• The position of each sequence in a scan of a region can be scored

according to where it falls on the plot.

• Support Vector Machines (SVM) are a fancier way of doing this: they can

generate a much more complex curve than a hyperplane to separate the

groups.

Annotation

• Once genes have been identified, we need to assign them names

and functions.

– In well-studied genomes, such as Drosophila, there are many already-named genes, some of which are quite whimsical. They often reflect the mutant phenotype, e.g. white eyes. A mutant whose wings are held at an unusual angle: Frodo (“lowered of the wings”).

– But in general, gene names from genome projects tend to be descriptions of function. For example, the gene for glucose 6-phosphate dehydrogenase is just called that in bacteria, but it is “Zwischenferment” (a German word) in Drosophila.

• Who is going to do the annotations? There are a lot of genes, and no

one is an expert in all of them.

– One approach: use amateurs who are trained to follow certain guidelines

and have easy access to as much useful information as possible.

Problem: inconsistent results

– Another approach: have experts in specific genes annotate all examples

of that gene. Problem: getting experts for all genes and keeping them

interested.

– Yet another: do as much automated annotation as possible, with trained

personnel examining only the hard cases. Problem: identifying the hard

cases.

More Annotation

• Need for experimental evidence. All gene identification is based on

experimental work: biochemistry, genetics, etc.

– Most annotation is thus based on logic like “Gene X in my organism is similar to

gene Y in another organism that has been experimentally determined to have

such-and-such a function.”

– How similar is “similar”? Are there other functions that might use similar

proteins?

• Gene function predictions vary in their reliability: how well does the current

gene match previously discovered genes?

• We need gene names that are computer-recognizable. This means using a controlled vocabulary: only certain words and punctuation are used, and standard genes are named the same way in all organisms.

– Gene Ontology descriptions are useful, but they are not detailed

enough, and they tend to focus on human genes at the expense of

bacterial genes.

– Enzyme Commission (E.C.) numbers are very useful for enzymes because they describe a function precisely.

– Otherwise, you either follow the conventions of the group you are

working with or try to mimic the best BLAST hits

Confidence in Name Assignment

• The basic hierarchy:

– Confident assignment. We are almost certain we know what this gene

is, based on its similarity to other genes.

• If all of the top hits are high quality (say, better than 35% identical amino acids and

within 20% of the same length), and they all have similar names, a gene name can be

confidently given.

– Some uncertainty, often with regard to exact enzyme or transporter

specificity. Names are often called “putative” here.

– We know it belongs to a gene family, or it contains a known domain, but

function is unclear

– Conserved hypothetical genes. Found in other species, but with no

known function.

– Hypothetical genes. The gene caller predicts a gene, but there is no

match to any gene in another species.

• But in fact these ideas are only loosely applied across many

different annotation systems, and it is common to find highly similar

genes given slightly different names.

– Also, sometimes “hypothetical” is used too freely. It is always correct to

call a gene hypothetical, but it doesn’t convey any useful information.
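The "confident assignment" rule can be written down as a simple filter over the top BLAST hits. This is a sketch under assumptions: hits are given as (name, percent identity, subject length) tuples, and "similar names" is simplified to "identical after lowercasing". The thresholds are the ones quoted above; the function name is made up.

```python
def confident_name(query_len, hits, min_identity=35.0, max_len_diff=0.20):
    """Return a gene name if every top hit is high quality and all names agree, else None."""
    if not hits:
        return None
    for _name, identity, subject_len in hits:
        if identity <= min_identity:
            return None  # a weak hit among the top hits
        if abs(subject_len - query_len) > max_len_diff * query_len:
            return None  # length differs by more than 20%
    names = {name.lower() for name, _identity, _length in hits}
    return names.pop() if len(names) == 1 else None


top_hits = [("enolase", 62.0, 310), ("Enolase", 48.5, 290)]
print(confident_name(300, top_hits))  # enolase
```

Anything that returns None would drop to one of the lower levels of the hierarchy (putative, family only, conserved hypothetical, hypothetical).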

Gene Ontology

• One of my “rules of biology” I tell the introductory students is that quite often there is

more than one word used to describe the same phenomenon, and the same word is

often used to describe completely different phenomena

– The citric acid cycle is also the tricarboxylic acid cycle and the Krebs cycle

– “nucleus” of a cell and an atom

• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.

• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For

example, GO:0005102 is “receptor binding”.

• There are 3 root terms: biological process, cellular component, and molecular function.

A gene product will probably be described by GO terms from each of these

“ontologies”. (ontology is a branch of philosophy concerned with the nature of being,

and the basic categories of being and their relationships.)

– For instance, cytochrome c is described with the molecular function term “oxidoreductase

activity”, the biological process terms “oxidative phosphorylation” and “induction of cell

death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner

membrane”

• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a

tree. This means simply that each term can have more than one parent term, but the

direction of parent to child (i.e. less specific to more specific) is always maintained.
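The difference between a tree and a directed acyclic graph shows up as soon as you collect a term's ancestors. In the sketch below the term names and parent links form a toy graph made up for illustration; they are not copied from the real GO files.

```python
# Toy ontology: child term -> list of parent terms (illustrative only).
PARENTS = {
    "mitochondrial inner membrane": ["mitochondrial membrane", "organelle inner membrane"],
    "mitochondrial membrane": ["organelle membrane"],
    "organelle inner membrane": ["organelle membrane"],
    "organelle membrane": ["cellular component"],
}


def ancestors(term, parents=PARENTS):
    """All less specific terms reachable by following child-to-parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:  # a term can be reached by more than one route
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("mitochondrial inner membrane")))
```

A gene product annotated with a specific term is implicitly annotated with every ancestor, which is why choosing a term that is too specific over-interprets the data.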

More GO

• Cellular component describes what larger structure the gene product is part of. For

example, “ribosome” or “endoplasmic reticulum” or “cytoplasm”.

• Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. They are always described as activities: the enzyme adenylate cyclase is given the term “adenylate cyclase activity”

• Biological process describes the higher level activity that the molecular function

contributes to. For example, “signal transduction” or “mannose transport”



• GO doesn’t go above the level of the cell, and it doesn’t deal with cell types.

– It also doesn’t describe disease states or abnormal functions (cancer, for example).

– It also doesn’t describe individual protein domains or gene structure.

• Terms range in a hierarchy from very specific to very general. During annotation, the

trick is to find terms that are as specific as possible without over-interpreting the data.

This can be tricky with unfamiliar gene functions.



• My opinion is that GO is a great tool, but hard to do well. And, it doesn’t quite get

down to the level of exactly what the gene does. We really do want to name a gene

“cytochrome c”, and not just use GO terms as descriptions.

Enzyme Nomenclature

• Enzyme functions: which reactants are converted to which products

– Across many species, the enzymes that perform a specific function are usually

evolutionarily related. However, this isn’t necessarily true. There are cases of two

entirely different enzymes evolving similar functions.

– Often, two or more gene products in a genome will have the same E.C. number.

• Enzyme functions are given unique numbers by the Enzyme Commission.

– E.C. numbers are four integers separated by dots. The left-most number is the

least specific

– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose

components indicate the following groups of enzymes:

• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)

• EC 3.4 are hydrolases that act on peptide bonds

• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a

polypeptide

• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

• Top level E.C. numbers:

– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer

– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between

molecules.

– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.

– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule

– E.C. 5: isomerases: rearrangements of atoms within a molecule

– E.C. 6: ligases: joining two molecules using energy from ATP
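Because E.C. numbers are hierarchical, a few lines of code can expand one into its increasingly specific groups. A small sketch: the function names are made up, the input is assumed to look like "EC 3.4.11.4" or "3.4.11.4", and the class table holds only the six top-level classes listed on this slide.

```python
TOP_LEVEL = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_lineage(ec):
    """Expand an E.C. number into its groups, least specific first."""
    parts = ec.upper().replace("EC", "").strip().split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    return TOP_LEVEL[int(ec_lineage(ec)[0])]


print(ec_lineage("EC 3.4.11.4"))  # ['3', '3.4', '3.4.11', '3.4.11.4']
print(ec_class("EC 3.4.11.4"))    # hydrolases
```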

Information Used in Annotation

• BLAST searches

• HMM models of specific genes or gene families (Pfam, TIGRfam,

FIGfam).

• Sequence motifs and domains. If the gene is not a good match to

previously known genes, these provide useful clues.

• Cellular location predictions, especially for transmembrane proteins.

• Genomic neighbors, especially in bacteria, where related functions

are often found together in operons and divergons (genes

transcribed in opposite directions that use a common control region).

• Biochemical pathway/subsystem information. If an organism has

most of the genes needed to perform a function, any missing

functions are probably present too.

– Also, experimental data about an organism’s capacities can be used to

decide whether the relevant functions are present in the genome.

Transmembrane Predictions

• Integral membrane proteins contain amino acid

sequences that go through the membrane one or

several times.

– There are also peripheral membrane proteins that stick

to the hydrophilic head groups by ionic and polar

interactions

– There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.

• There are 2 main protein structures that cross

membranes.

– Most are alpha helices, and in proteins that span

multiple times, these alpha helices are packed together

in a coiled-coil. Length = 15-30 amino acids.

– Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let water and small hydrophilic molecules diffuse across the bacterial outer membrane.

Hydrophobicity and Amphipathy

• Membrane interiors are hydrophobic, so the simplest way of finding

membrane-spanning regions is to look for relatively hydrophobic regions.

• There are several measures of amino acid hydrophobicity available,

based on partitioning in water vs. solvent or on crystallography of

membrane proteins. No one scale dominates prediction models.

• However, beta barrels and coiled-coils of alpha helices have interior

regions that don’t need to be hydrophobic because they don’t interact

with the hydrophobic fatty acid chains of the membrane.

– Thus, many membrane-spanning regions are amphipathic: they

have a hydrophobic side and a hydrophilic side.

– The helical wheel is a simple way of visualizing this. It is a view

looking down the helix. If most of the hydrophobic residues fall on

one side, the sequence is likely to be membrane-spanning.
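The simplest hydrophobicity scan is a sliding-window average. The sketch below uses the Kyte-Doolittle hydropathy scale, one of the several scales available; the 19-residue window and the 1.6 cutoff are commonly quoted settings for that scale, used here as assumptions, and the function names are made up. It does not model amphipathic helices, so it will miss segments that are hydrophobic on one face only.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, cutoff=1.6):
    """Start positions of windows hydrophobic enough to be membrane-spanning candidates."""
    return [i for i, avg in enumerate(hydropathy_profile(protein, window)) if avg >= cutoff]


# Artificial example: a 19-leucine stretch flanked by charged residues.
toy = "DDDDD" + "L" * 19 + "KKKKK"
print(candidate_tm_windows(toy))
```

Overlapping windows that pass the cutoff would normally be merged into a single predicted segment.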

HMM Prediction of Transmembrane Regions

• Hidden Markov models seem to do a

good job predicting transmembrane

regions.

• The states are: loops inside the cell,

loops outside the cell, and

transmembrane regions.

– In addition, the cap amino acids (at the membrane/aqueous interface) can be a state, and it is possible to include globular domains either inside or outside the cell.

• The HMM is circular, allowing for

multiple passes through the

membrane.

– Many of the states allow transition back to themselves: there is more than one amino acid in the membrane interior, for example.

• The model is parameterized using

known membrane proteins (from X-

ray crystallography).

• The model pictured here is TMHMM.

Biochemical Pathways and Co-localization



• Operon structure is often

maintained over fairly large

taxonomic regions.

– Sometimes gene order is altered,

and sometimes one or more

enzymes are missing.

– But in general, this phenomenon

allows recognition or verification

that widely diverged enzymes do

in fact have the same function.

• This is an operon that contains

part of the glycolytic pathway.

– 1: phosphoglycerate mutase

– 2: triosephosphate isomerase

– 3: enolase

– 4: phosphoglycerate kinase

– 5: glyceraldehyde 3-phosphate

dehydrogenase

– 6: central glycolytic gene

regulator

Alternate Pathways

• There are often alternate ways

of going through a pathway.

– Often dependent on

taxonomic group (but beware

of horizontal gene transfer).

– Reversible pathways often

have irreversible steps that

need alternate enzymes to

get around. And some

species will only have the

pathway functioning in one

direction.

• This pathway is glycolysis and

gluconeogenesis in Bacillus

megaterium. The colored

boxes indicate enzymes that

are present.

– Both glycolysis and

gluconeogenesis are present

– Several alternative enzymes

are not found here.

BIOLOG

• BIOLOG is a company that performs batteries of tests on

bacteria. The idea is to develop a complete metabolic

profile for the organism.

– They are grown in microtiter plates with standard growth media

supplemented or substituted with various possible nutrients or

growth inhibitors. For example, carbon sources, nitrogen

sources, phosphate sources, various osmotic strengths and pHs

– Growth is checked over several days

– Strain comparison or individual data

• The yellow triangles in each well position are growth

curves.

– Red = strain A grew better than strain B; green is the opposite

– Outlined boxes were significant by the company’s standards

BIOLOG Results

