FILTER and complexity-2011

FILTERING OUT LOW COMPLEXITY SEQUENCES

Short repeats and low complexity sequences, such as glutamine-rich regions, confound
most database searching methods.

For BLAST, the random model against which the significance of segment pair scores is
evaluated assumes that at each position, each residue has a probability of occurring which is
proportional to its composition in the database as a whole.

Low complexity or highly repetitive sequences are inconsistent with this assumption.
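
A minimal sketch of what this assumption implies, in Python. The toy database, the function names and the 10-residue glutamine run are all hypothetical, and this shows the independence model only, not BLAST's actual score statistics.

```python
import math
from collections import Counter

def background_frequencies(sequences):
    """Residue frequencies pooled over a whole collection of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def log2_probability(segment, freqs):
    """log2 probability of a segment when every position is drawn
    independently from the background frequencies."""
    return sum(math.log2(freqs[residue]) for residue in segment)

# Hypothetical toy "database": each of the 20 amino acids appears once.
toy_db = ["ACDEFGHIKL", "MNPQRSTVWY"]
freqs = background_frequencies(toy_db)

poly_q = "Q" * 10  # a glutamine run, as found in glutamine-rich regions
print(log2_probability(poly_q, freqs))
```

With 20 equally frequent residues the model gives this run a probability of 20^-10, roughly one start position in 10^13. Under the model, a match between two such runs therefore looks far too unlikely to be chance, even though glutamine-rich regions occur in many unrelated proteins.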

Suspect this problem when the number of significant segment pair scores is much higher
than you would expect. Either the output is enormous, or the output size limits cut it off
long before all the segments are displayed.

The x filter (Claverie and States, Computers Chem. 17: 191-201 (1993)) masks short
repeats.

        Xnu replaces statistically significant tandem repeats in protein sequences with X
        characters. If a resulting protein sequence is used as a query for a BLAST search, the
        regions with X characters are ignored.

The s filter (Wootton and Federhen, Computers Chem. 17: 149-163 (1993)) masks low
complexity sequences.

        Seg replaces low complexity regions in protein sequences with X characters. If a
        resulting protein sequence is used as a query for a BLAST search, the regions with X
        characters are ignored.

        DUST is used for nucleotide sequences (NCBI).

When you run BLAST with filtering on, masked regions are excluded from the search.
These regions are replaced with X's in the output to let you identify the regions that were
excluded.

Here is the query sequence from the example session aligned to a filtered copy of itself
(the filtered copy is the upper line of each pair):

  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120
 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120

121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180
121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180

181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
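
One way to check how much of a query was filtered is to list the runs of X in the filtered copy. This is a sketch only; the function names are hypothetical and are not part of BLAST, seg or xnu.

```python
import re

def masked_runs(filtered, mask_char="X"):
    """Return 1-based, inclusive (start, end) pairs, one per run of mask_char."""
    pattern = re.escape(mask_char) + "+"
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, filtered)]

def masked_fraction(filtered, mask_char="X"):
    """Fraction of positions that carry the mask character."""
    return filtered.count(mask_char) / len(filtered)

# Residues 1-60 of the filtered query shown above.
filtered = "MAAKIFCLIM" + "X" * 12 + "IFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ"

print(masked_runs(filtered))
print(masked_fraction(filtered))
```

Note that an X that was already in the unfiltered sequence (an unknown residue) cannot be told apart from a masked position this way.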

                 Always check the filter status


More recently (Aug 2010), web sites mark the filtered regions differently: they do NOT use them
to locate similarity, but they DO SHOW the filtered region (in lower case) in the resulting
alignments.


  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMllglsasaatasIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
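
The two display conventions can be converted into each other. A small sketch, assuming the filtered regions are known as 1-based inclusive intervals; the function names are hypothetical.

```python
import re

def soft_mask(sequence, regions):
    """Lower-case each 1-based, inclusive (start, end) region;
    everything else is left in upper case."""
    residues = list(sequence.upper())
    for start, end in regions:
        residues[start - 1:end] = [r.lower() for r in residues[start - 1:end]]
    return "".join(residues)

def soft_to_hard(soft_masked, mask_char="X"):
    """Replace every lower-case (soft-masked) residue with mask_char."""
    return re.sub(r"[a-z]", mask_char, soft_masked)

# Residues 1-60 of the unfiltered query; residues 11-22 are the filtered region.
query = "MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ"
soft = soft_mask(query, [(11, 22)])

print(soft)                # lower case marks the filtered region
print(soft_to_hard(soft))  # the same region as X's
```

The lower-case form keeps the original residues, so nothing is lost; the X form cannot be converted back.
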

Illustration of a technical definition of ‘complexity’ (from JC Wootton, Simple sequences of protein
and DNA; in DNA and Protein Sequence Analysis, eds CJ Rawlings and MJ Bishop, 1997)

Complexity       (compositional complexity) is a computed property that is independent of pattern or
                 periodicity.
Patterns         are usually analysed by the content and spacing of their residues and k-grams
                 (k-tuples, k-mers, k-words).
Periodicity      is the repetition of a residue type or k-gram at a constant interval.

Local compositional complexity is represented by an ordered list of numbers
(known as a complexity state vector). For example, a 5-nucleotide window has 6 possible
complexity states. One of them, {3,1,1,0}, has 12 different compositions, and each of
those compositions has 20 possible sequences.

(Figure: the 6 complexity states of a 5-nucleotide window, the 12 compositions of
complexity state 4, and the 20 sequences of composition 6.)

        Complexity State      Composition        Sequence
        1. {5,0,0,0}           1. (T3,C,A)        1. CCCAG
        2. {4,1,0,0}           2. (T3,C,G)        2. CCCGA
        3. {3,2,0,0}           3. (T3,A,G)        3. CCAGC
        4. {3,1,1,0}           4. (C3,T,A)        4. CCGAC
        5. {2,2,1,0}           5. (C3,T,G)        5. CCACG
        6. {2,1,1,1}           6. (C3,A,G)        6. CCGCA
                               7. (A3,T,C)        7. CACCG
                               8. (A3,T,G)        8. CAGCC
                               9. (A3,C,G)        9. CACGC
                              10. (G3,T,C)       10. ACCCG
                              11. (G3,T,A)       11. ACCGC
                              12. (G3,C,A)       12. ACGCC
                                                 13. CGCCA
                                                 14. CGCAC
                                                 15. CGACC
                                                 16. GCCCA
                                                 17. GCCAC
                                                 18. GCACC
                                                 19. AGCCC
                                                 20. GACCC
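
The complexity state of a window, and the list of all possible states, can be computed directly. A minimal Python sketch; the function names are hypothetical, and the alphabet is assumed to be the four nucleotides.

```python
from collections import Counter

def complexity_state(window, alphabet="ACGT"):
    """Residue counts of the window, sorted from largest to smallest."""
    counts = Counter(window)
    return tuple(sorted((counts[r] for r in alphabet), reverse=True))

def all_states(length, alphabet_size):
    """Every way to split `length` into `alphabet_size` non-increasing counts."""
    def split(remaining, slots, largest):
        if slots == 0:
            if remaining == 0:
                yield ()
            return
        # Try the largest allowed count first so states come out in order.
        for first in range(min(remaining, largest), -1, -1):
            for rest in split(remaining - first, slots - 1, first):
                yield (first,) + rest
    return list(split(length, alphabet_size, length))

print(complexity_state("CCCAG"))  # the first sequence in the figure
print(len(all_states(5, 4)))      # number of states for a 5-nucleotide window
```

Every one of the 20 sequences listed in the figure maps to the same state, because the state ignores both the order of the residues and which residue type carries which count.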


For any one complexity state, all compositions have the
same number of possible sequences, and this number
provides the basis of complexity measures.
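
These counts follow from standard counting formulas. The sketch below (hypothetical function names) computes, for each state of a 5-nucleotide window, the number of compositions and the number of sequences per composition.

```python
from collections import Counter
from math import factorial

def sequences_per_composition(state):
    """Multinomial coefficient L! / (n1! n2! ... nN!) for one composition."""
    total = factorial(sum(state))
    for n in state:
        total //= factorial(n)
    return total

def compositions_per_state(state):
    """Number of ways to assign the counts in `state` to distinct residue types."""
    total = factorial(len(state))
    for repeats in Counter(state).values():
        total //= factorial(repeats)
    return total

states = [(5, 0, 0, 0), (4, 1, 0, 0), (3, 2, 0, 0),
          (3, 1, 1, 0), (2, 2, 1, 0), (2, 1, 1, 1)]
for state in states:
    print(state, compositions_per_state(state), sequences_per_composition(state))
```

As a check, compositions times sequences, summed over the six states, gives 4^5 = 1024, the total number of 5-nucleotide sequences.
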

Based on the complexity state vector of a sequence window, the local compositional complexity is
defined as the information needed per position, given the window’s composition, to specify a
particular residue sequence. To express complexity in frequently used information units, logs may
be taken to base 2 for bits, or base e for nats.
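
Read as a formula, this definition is K = (1/L) log(L! / (n1! n2! ... nN!)), where L is the window length and the n's are the residue counts. A minimal sketch under that reading; the function name is hypothetical, and this is not the seg program's own code.

```python
import math
from collections import Counter

def local_complexity(window, base=2):
    """(1/L) * log_base( L! / (n1! n2! ... nN!) ) for one sequence window."""
    length = len(window)
    # lgamma(n + 1) equals ln(n!), which avoids huge factorials for long windows.
    ln_sequences = math.lgamma(length + 1) - sum(
        math.lgamma(n + 1) for n in Counter(window).values()
    )
    return ln_sequences / (length * math.log(base))

print(local_complexity("CCCAG"))          # bits per position
print(local_complexity("CCCAG", math.e))  # nats per position
print(local_complexity("CCCCC"))          # a homopolymer window
```

For CCCAG the composition allows 20 sequences, so the value is log2(20)/5 bits per position. A homopolymer allows only one sequence and scores zero.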

				