Index Structure for String Databases

									Project Results

    Alexandra Martinez

Computational Molecular Biology
  CISE, University of Florida
         Spring 2004
   Outline

• Project Overview
• Implementation
• Results
• Benchmark
• Future Work
    Project Overview
• Map the strings of the database into an integer space:
      - Frequency Vector
      - Vector of Wavelet Coefficients

• Use a distance function in this integer space that
  is a lower bound on the edit distance:
      - Frequency Distance
      - Wavelet Distance

• Cluster the vectors of consecutive substrings into
  Minimum Bounding Rectangles (MBRs).
• Obtain an array of MBRs for different resolutions.
       Implementation
• Test database: Plant Genomics Database (Kishore).
• Index built over nucleotide sequences.
• Programming done in Java, so the code is platform independent.
• Interaction with the DB goes through JDBC/ODBC.
• Index stores two vectors for each nucleotide sequence in the DB:
      - Frequency Vector
      - Wavelet Vector
• String matching is based on the maximum of:
      - Frequency Distance
      - Wavelet Distance
• WT vectors of consecutive substrings are clustered into Minimum
  Bounding Rectangles (MBRs) at different resolutions.
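
The clustering step could be sketched in Java roughly as follows. This is an illustration only: the class name `MbrSketch`, the fixed-capacity grouping, and the use of plain `int[]` vectors are assumptions made for the example, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: an axis-aligned minimum bounding rectangle over integer vectors. */
final class MbrSketch {
    final int[] low;   // per-dimension minimum
    final int[] high;  // per-dimension maximum

    MbrSketch(int[] low, int[] high) {
        this.low = low;
        this.high = high;
    }

    /** Smallest box containing all given vectors (non-empty list, equal dimensions assumed). */
    static MbrSketch of(List<int[]> vectors) {
        int[] low = vectors.get(0).clone();
        int[] high = vectors.get(0).clone();
        for (int[] v : vectors) {
            for (int d = 0; d < low.length; d++) {
                low[d] = Math.min(low[d], v[d]);
                high[d] = Math.max(high[d], v[d]);
            }
        }
        return new MbrSketch(low, high);
    }

    /** Groups the vectors of consecutive substrings into boxes of at most `capacity` vectors. */
    static List<MbrSketch> cluster(List<int[]> consecutive, int capacity) {
        List<MbrSketch> boxes = new ArrayList<>();
        for (int start = 0; start < consecutive.size(); start += capacity) {
            int end = Math.min(start + capacity, consecutive.size());
            boxes.add(of(consecutive.subList(start, end)));
        }
        return boxes;
    }
}
```

Calling `MbrSketch.cluster(vectors, c)` once per window size would give one row of boxes per resolution.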
Frequency Distance Calculation

\[ \mathrm{posDist} = \sum_{i \,:\, u_i > v_i} (u_i - v_i), \qquad \mathrm{negDist} = \sum_{i \,:\, u_i < v_i} (v_i - u_i) \]

     FD1(u, v) = max { posDist, negDist }
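
A minimal Java sketch of this calculation, assuming the two frequency vectors are given as `int[]` arrays of equal length. The class and method names are hypothetical.

```java
/** Hypothetical helper computing FD1 on two frequency vectors. */
final class FrequencyDistanceSketch {
    static int fd1(int[] u, int[] v) {
        if (u.length != v.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        int posDist = 0; // sum of (u_i - v_i) over dimensions where u_i > v_i
        int negDist = 0; // sum of (v_i - u_i) over dimensions where u_i < v_i
        for (int i = 0; i < u.length; i++) {
            if (u[i] > v[i]) {
                posDist += u[i] - v[i];
            } else {
                negDist += v[i] - u[i]; // contributes 0 when u_i == v_i
            }
        }
        return Math.max(posDist, negDist);
    }
}
```

For u = [4, 0, 2, 2] and v = [2, 2, 1, 2] this gives posDist = 3, negDist = 2, so FD1 = 3.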
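Wavelet Vector Computation

\[ A_{k,i} = \begin{cases} f(c_i) & k = 0 \\ A_{k-1,2i} + A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases} \]

\[ B_{k,i} = \begin{cases} 0 & k = 0 \\ A_{k-1,2i} - A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases} \]

\[ 0 \le i \le n/2^k - 1 \]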
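
The recurrence can be sketched in Java as below. The sketch assumes a nucleotide alphabet ordered A, C, G, T and a string length that is a power of two. The class and method names are hypothetical.

```java
/** Hypothetical helper: computes the top-level wavelet coefficients (A, B) of a string. */
final class WaveletVectorSketch {
    private static final String ALPHABET = "ACGT"; // assumed symbol order

    /** Returns {A, B} at level k = log2(n); n = s.length() must be a power of two. */
    static int[][] topCoefficients(String s) {
        int n = s.length();
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int sigma = ALPHABET.length();
        int[][] a = new int[n][sigma]; // level 0: A_{0,i} = f(c_i)
        int[][] b = new int[n][sigma]; // level 0: B_{0,i} = 0
        for (int i = 0; i < n; i++) {
            int idx = ALPHABET.indexOf(s.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("unexpected symbol: " + s.charAt(i));
            }
            a[i][idx] = 1;
        }
        // Each pass halves the number of vectors, combining neighbours 2i and 2i+1.
        for (int len = n; len > 1; len /= 2) {
            int[][] nextA = new int[len / 2][sigma];
            int[][] nextB = new int[len / 2][sigma];
            for (int i = 0; i < len / 2; i++) {
                for (int d = 0; d < sigma; d++) {
                    nextA[i][d] = a[2 * i][d] + a[2 * i + 1][d];
                    nextB[i][d] = a[2 * i][d] - a[2 * i + 1][d];
                }
            }
            a = nextA;
            b = nextB;
        }
        return new int[][] { a[0], b[0] };
    }
}
```

For s = TCAC this yields A = [1, 2, 0, 1] and B = [-1, 0, 0, 1].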
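Wavelet Distance Calculation

\[ \mathrm{pos} = \sum_{i \,:\, a_{1,i} > a_{2,i}} (a_{1,i} - a_{2,i}) + \sum_{i \,:\, b_{1,i} > b_{2,i}} (b_{1,i} - b_{2,i}) \]

\[ \mathrm{neg} = \sum_{i \,:\, a_{1,i} < a_{2,i}} (a_{2,i} - a_{1,i}) + \sum_{i \,:\, b_{1,i} < b_{2,i}} (b_{2,i} - b_{1,i}) \]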
Maximum Frequency Distance
Calculation

FD(s1,s2) =
  max { FD1(f (s1), f (s2)), FD2(ψ(s1),ψ(s2)) }


FD1 is the Frequency Distance
FD2 is the Wavelet Distance
DB Augmented with Vectors
Output Samples

• Species Name: allium_cepa
• Gene Type: ces_a9
• See output file with vectors & MBRs
• See WT vectors for substrings of
  different sizes: 16, 32, 64
• See some query results for:
      - Populus tremuloides (5)
      - Arabidopsis thaliana (1)
      - Allium cepa (1)
   Performance of Dist Functions

[Figure: WT & Freq Distances (y-axis, 0 to 4000) plotted against Edit Distance (x-axis, 0 to 4000). Series: Wavelet Dist, Frequency Dist, Max Freq Dist. Trend lines: Linear (Max Freq Dist), Linear (Frequency Dist), 2 per. Mov. Avg. (Wavelet Dist).]
Comparing Query Results

• Joint tests:
      - 5 queries from Populus tremuloides (a6-a10)
      - 1 query from Arabidopsis thaliana (a1)
      - 1 query from Allium cepa (a31)
Populus tremuloides, ces_a6

• Query from Populus tremuloides, ces_a6
• Error for WT search = 5%
• Query results:
      - # common sequences in output set = 2
      - # common species in output set = 4
      - # common genus in output set = 4
• See comparative table

Wavelet Index                                         K-gram Index
populus_tremuloides                  ces_a6   0.0     amborella_trichopoda                 ces_a1   1
helianthus_annuus                    ces_a3   16.0    beta_vulgaris                        ces_a11  1
helianthus_annuus                    ces_a5   16.0    beta_vulgaris                        ces_a17  1
beta_vulgaris                        ces_a10  20.0    ceratopteris_richardii               ces_a2   1
nuphar_advena                        ces_a4   21.0    lpomoea_nil                          ces_a8   0
poncirus_trifoliata                  ces_a6   24.0    lycopersicon_esculentum              ces_a8   1
prunus_persica                       ces_a9   24.0    medicago_truncatula                  ces_a6   1
poncirus_trifoliata                  ces_a5   25.0    poncirus_trifoliata                  ces_a13  1
cycas_rumphii                        ces_a3   26.0    populus_tremula_populus_tremuloides  ces_a4   1
prunus_persica                       ces_a4   26.0    populus_tremuloides                  ces_a6   1
helianthus_annuus                    ces_a4   27.0    prunus_persica                       ces_a4   1
capsicum_annuum                      ces_a12  28.0    prunus_persica                       ces_a14  1
vitis_vinifera                       ces_a3   28.0    solanum_tuberosum                    ces_a9   1
vitis_vinifera                       ces_a7   28.0    triticum_monococcum                  ces_a15  1
allium_cepa                          ces_a60  30.0    zea_mays                             ces_a6   1
vitis_vinifera                       ces_a2   30.0    zinnia_elegans                       ces_a5   1
saccharum_sp                         ces_a3   32.0    nicotiana_benthamiana                ces_a1   1
                                                      pennisetum_glaucum                   ces_a1   1

17 matching sequences retrieved                       18 matching sequences retrieved
Populus tremuloides, ces_a7

• Query from Populus tremuloides, ces_a7
• Error for WT search = 8%
• Query results:
      - # common sequences in output set = 5
      - # common species in output set = 3
      - # common genus in output set = 3
• See comparative table

Wavelet Index                                         K-gram Index
populus_tremuloides                  ces_a7   0.0     cucumis_sativus                      ces_a2   2
populus_tremuloides                  ces_a8   0.0     populus_balsamifera_trichocarpa      ces_a14  2
populus_tremuloides                  ces_a5   1.0     populus_tremuloides                  ces_a5   0
eucalyptus_tereticornis              ces_a2   23.0    populus_tremuloides                  ces_a3   2
sorghum_propinquum                   ces_a10  26.0    sorghum_propinquum                   ces_a10  1
populus_tremula_populus_tremuloides  ces_a2   27.0    zinnia_elegans                       ces_a13  2
ceratodon_purpureus                  ces_a1   28.5    hordeum_vulgare                      ces_a1   2
ceratodon_purpureus                  ces_a3   29.0    populus_tremuloides                  ces_a7   1
lotus_corniculatus                   ces_a7   30.0    populus_tremuloides                  ces_a8   1
triticum_aestivum                    ces_a7   31.0    saccharum_sp                         ces_a1   2
saccharum_sp                         ces_a1   31.5    lpomoea_nil                          ces_a3   2
poncirus_trifoliata                  ces_a13  32.0
lotus_corniculatus                   ces_a6   33.0

13 matching sequences retrieved                       11 matching sequences retrieved
Populus tremuloides, ces_a8

• Query from Populus tremuloides, ces_a8
• Error for WT search = 10%
• Query results:
      - # common sequences in output set = 4
      - # common species in output set = 8
      - # common genus in output set = 8
• See comparative table

Wavelet Index                                         K-gram Index
populus_tremuloides                  ces_a7   0.0     allium_cepa                          ces_a4   3
populus_tremuloides                  ces_a8   0.0     allium_cepa                          ces_a43  3
populus_tremuloides                  ces_a5   1.0     allium_cepa                          ces_a52  3
eucalyptus_tereticornis              ces_a2   23.0    allium_cepa                          ces_a1   3
sorghum_propinquum                   ces_a10  26.0    antirrhinum_majus                    ces_a4   3
populus_tremula_populus_tremuloides  ces_a2   27.0    capsicum_annuum                      ces_a5   3
ceratodon_purpureus                  ces_a1   28.5    ceratodon_purpureus                  ces_a3   3
ceratodon_purpureus                  ces_a3   29.0    eucalyptus_grandis                   ces_a2   3
lotus_corniculatus                   ces_a7   30.0    gossypium_hirsutum                   ces_a7   3
triticum_aestivum                    ces_a7   31.0    helianthus_annuus                    ces_a7   3
saccharum_sp                         ces_a1   31.5    hordeum_vulgare                      ces_a7   3
poncirus_trifoliata                  ces_a13  32.0    lpomoea_nil                          ces_a8   3
lotus_corniculatus                   ces_a6   33.0    lycopersicon_esculentum              ces_a1   2
eucalyptus_grandis                   ces_a9   36.0    pennisetum_glaucum                   ces_a1   3
lpomoea_trifida                      ces_a2   36.0    pinus_taeda                          ces_a1   3
lpomoea_nil                          ces_a2   37.0    poncirus_trifoliata                  ces_a4   3
triticum_turgidum_durum              ces_a5   38.0    poncirus_trifoliata                  ces_a5   3
citrus_sinensis                      ces_a17  40.5    poncirus_trifoliata                  ces_a6   3
alstroemeria_peruviana               ces_a1   41.0    poncirus_trifoliata                  ces_a9   3
sorghum_propinquum                   ces_a5   41.0    populus_tremuloides                  ces_a5   0
triticum_aestivum                    ces_a2   41.0    populus_tremuloides                  ces_a7   0
populus_balsamifera_trichocarpa      ces_a5   43.0    populus_tremuloides                  ces_a8   0
                                                      populus_tremuloides                  ces_a10  3
                                                      populus_tremuloides                  ces_a3   3
                                                      populus_tremuloides                  ces_a5   2
                                                      populus_tremuloides                  ces_a8   3
                                                      prunus_persica                       ces_a5   3
                                                      saccharum_officinarum                ces_a2   2
                                                      sorghum_bicolor                      ces_a4   3
                                                      sorghum_bicolor                      ces_a5   3
                                                      triphysaria_versicolor               ces_a1   3
                                                      triticum_aestivum                    ces_a6   2
                                                      vitis_vinifera                       ces_a4   3
                                                      lycopersicon_esculentum              ces_a2   3
                                                      triticum_turgidum_durum              ces_a2   3

22 matching sequences retrieved                       35 matching sequences retrieved
Populus tremuloides, ces_a9

• Query from Populus tremuloides, ces_a9
• Error for WT search = 5%
• Query results:
      - # common sequences in output set = 1
      - # common species in output set = 5
      - # common genus in output set = 7
• See comparative table

Wavelet Index                                         K-gram Index
populus_tremuloides                  ces_a9   0.0     allium_cepa                          ces_a29  0
pinus_taeda                          ces_a10  13.0    cicer_arietinum                      ces_a1   1
prunus_persica                       ces_a5   20.0    citrus_sinensis                      ces_a1   1
rosa_hybrid                          ces_a10  20.0    gossypium_barbadense                 ces_a1   1
allium_cepa                          ces_a33  21.0    gossypium_barbadense                 ces_a2   1
citrus_sinensis                      ces_a12  21.0    gossypium_herbaceum                  ces_a3   1
lpomoea_nil                          ces_a8   21.0    gossypium_hirsutum                   ces_a5   1
populus_deltoides                    ces_a1   22.0    gossypium_raimondii                  ces_a3   1
prunus_persica                       ces_a18  22.0    helianthus_annuus                    ces_a9   1
prunus_persica                       ces_a10  26.0    helianthus_annuus                    ces_a10  1
gossypium_arboreum                   ces_a9   27.0    lpomoea_trifida                      ces_a3   1
hedyotis_terminalis                  ces_a3   27.0    lactuca_sativa                       ces_a5   0
solanum_tuberosum                    ces_a5   27.0    lactuca_sativa                       ces_a7   1
solanum_tuberosum                    ces_a8   27.0    lactuca_sativa                       ces_a8   0
brassica_napus                       ces_a10  28.0    leymus_chinensis                     ces_a1   0
secale_cereale                       ces_a2   28.5    pinus_taeda                          ces_a6   1
                                                      poncirus_trifoliata                  ces_a8   1
                                                      poncirus_trifoliata                  ces_a10  1
                                                      populus_tremuloides                  ces_a4   1
                                                      populus_tremuloides                  ces_a9   0
                                                      prunus_persica                       ces_a9   1
                                                      prunus_persica                       ces_a13  1
                                                      sorghum_propinquum                   ces_a3   1
                                                      triticum_aestivum                    ces_a4   0
                                                      zea_mays                             ces_a10  1
                                                      zinnia_elegans                       ces_a2   1

16 matching sequences retrieved                       26 matching sequences retrieved
Populus tremuloides, ces_a10

• Query from Populus tremuloides, ces_a10
• Error for WT search = 8%
• Query results:
      - # common sequences in output set = 2
      - # common species in output set = 4
      - # common genus in output set = 4
• See comparative table

Wavelet Index                                         K-gram Index
populus_tremuloides                  ces_a10  0.0     allium_cepa                          ces_a19  3
allium_cepa                          ces_a58  31.0    allium_cepa                          ces_a23  2
allium_cepa                          ces_a49  36.5    allium_cepa                          ces_a39  2
saccharum_officinarum                ces_a2   41.0    citrus_sinensis                      ces_a5   3
poncirus_trifoliata                  ces_a3   47.0    oryza_minuta                         ces_a2   3
allium_cepa                          ces_a2   48.0    pinus_radiata                        ces_a2   3
amborella_trichopoda                 ces_a2   49.0    pinus_radiata                        ces_a6   3
allium_cepa                          ces_a32  50.0    pinus_taeda                          ces_a2   1
allium_cepa                          ces_a50  50.0    pinus_taeda                          ces_a3   3
allium_cepa                          ces_a47  50.0    populus_tremuloides                  ces_a10  0
pinus_taeda                          ces_a1   51.0    saccharum_sp                         ces_a6   3
arachis_hypogaea                     ces_a1   51.0    saccharum_sp                         ces_a10  3
helianthus_annuus                    ces_a9   53.0    triticum_turgidum_durum              ces_a4   3
allium_cepa                          ces_a23  56.0    vitis_aestivalis                     ces_a1   2
beta_vulgaris                        ces_a3   56.0    allium_cepa                          ces_a17  3
pinus_radiata                        ces_a7   56.0    triticum_aestivum                    ces_a8   3
prunus_armeniaca                     ces_a4   56.0

17 matching sequences retrieved                       16 matching sequences retrieved
Allium cepa, ces_a31

• Query from Allium cepa, ces_a31
• Error for WT search = 6%
• Query results:
      - # common sequences in output set = 3
      - # common species in output set = 3
      - # common genus in output set = 4
• See comparative table

Wavelet Index                                         K-gram Index
allium_cepa                          ces_a31  0.0     allium_cepa                          ces_a31  1
mesembryanthemum_crystallinum        ces_a8   20.5    allium_cepa                          ces_a59  1
allium_cepa                          ces_a36  26.0    allium_cepa                          ces_a62  1
allium_cepa                          ces_a23  30.0    amborella_trichopoda                 ces_a2   1
prunus_armeniaca                     ces_a3   30.0    antirrhinum_majus                    ces_a3   1
glycine_clandestina                  ces_a1   32.0    cicer_arietinum                      ces_a1   1
apium_graveolens                     ces_a1   33.0    citrus_sinensis                      ces_a12  1
citrus_sinensis                      ces_a22  33.0    citrus_sinensis                      ces_a22  1
prunus_persica                       ces_a6   34.0    dictyostelium_discoideum             ces_a1   1
gossypium_arboreum                   ces_a4   38.0    gossypioides_kirkii                  ces_a3   0
lactuca_sativa                       ces_a6   38.0    gossypium_barbadense                 ces_a1   0
lactuca_sativa                       ces_a4   39.0    gossypium_barbadense                 ces_a2   1
arachis_hypogaea                     ces_a1   41.0    gossypium_herbaceum                  ces_a3   0
capsicum_annuum                      ces_a9   40.0    gossypium_hirsutum                   ces_a10  1
pinus_radiata                        ces_a4   40.0    gossypium_raimondii                  ces_a3   0
prunus_armeniaca                     ces_a5   40.0    thespesia_lampas                     ces_a1   1
                                                      thespesia_thespesioides              ces_a1   1
                                                      capsicum_annuum                      ces_a9   1

16 matching sequences retrieved                       18 matching sequences retrieved
Arabidopsis thaliana, ces_a1

• Query from Arabidopsis thaliana, ces_a1
• Error for WT search = 6%
• Query results:
      - # common sequences in output set = 0
      - # common species in output set = 0
      - # common genus in output set = 0
• See comparative table

Wavelet Index                                         K-gram Index
arabidopsis_thaliana                 ces_a1   0.0     allium_cepa                          ces_a53  3
eschscholzia_californica             ces_a7   15.0    sorghum_propinquum                   ces_a2   3
amborella_trichopoda                 ces_a1   17.0
glycine_max                          ces_a3   18.0
rosa_hybrid                          ces_a6   20.0
lpomoea_nil                          ces_a8   22.0
citrus_sinensis                      ces_a18  23.5
populus_tremuloides                  ces_a3   24.0
lpomoea_nil                          ces_a7   25.0
solanum_tuberosum                    ces_a4   25.0
triticum_aestivum                    ces_a4   25.0
brassica_napus                       ces_a10  26.0
nicotiana_tabacum                    ces_a4   27.0
citrus_sinensis                      ces_a13  32.0
citrus_sinensis                      ces_a15  31.0
eschscholzia_californica             ces_a5   31.0
lycopersicon_esculentum              ces_a2   32.0
mesembryanthemum_crystallinum        ces_a5   32.0
glycine_max                          ces_a9   33.0
rosa_hybrid                          ces_a9   33.0
zinnia_elegans                       ces_a15  33.5

21 matching sequences retrieved                       2 matching sequences retrieved
   Future Work

• Add a query capability to the web interface of
  the database that makes use of this index.
• Improve performance for substring matching.
      - Strategy: use local information
         - Compute vectors of substrings rather than vectors
           of the entire sequence.
         - Group vectors into MBRs.
         - Compute distance to MBRs.

• Tackle the superstring matching problem.
     References

T. Kahveci, A. K. Singh. Efficient Index Structures for String
     Databases. VLDB 2001: 351-360.
THANKS
APPENDIX
Frequency Vectors
    Frequency Vector

• s: string from alphabet Σ = {α1, ..., ασ}
• ni: number of occurrences of αi in s
      (1 ≤ i ≤ σ)
• Define the frequency vector of s as
      f(s) = [n1, ..., nσ]
• Example:
   s = AATGATAG
   f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
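
As a rough Java illustration of this definition, the sketch below counts symbol occurrences. The alphabet is passed in as a string whose character order fixes the vector's dimensions. The class and method names are hypothetical.

```java
/** Hypothetical helper: builds the frequency vector f(s) over a given alphabet. */
final class FrequencyVectorSketch {
    static int[] frequencyVector(String s, String alphabet) {
        int[] counts = new int[alphabet.length()];
        for (int i = 0; i < s.length(); i++) {
            int idx = alphabet.indexOf(s.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + s.charAt(i));
            }
            counts[idx]++; // n_i: occurrences of the i-th alphabet symbol
        }
        return counts;
    }
}
```

With the alphabet "ACGT", `frequencyVector("AATGATAG", "ACGT")` returns [4, 0, 2, 2].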
    Frequency Distance (FD1):
    A Lower Bound on the ED
• Define FD1(u, v) as the minimum number of steps needed
  to go from u to v (or vice versa) by moving to a
  neighbor point at each step.
• Two points u and v in σ-dimensional space are neighbors if
  one of them can be obtained from the other by a
  single edit operation.
      Frequency Distance:
      Example
s = AATGATAG => f(s) = [4, 0, 2, 2]
t = ACTTAGC  => f(t) = [2, 2, 1, 2]
     pos = (4-2) + (2-1) = 3
     neg = (2-0) = 2
     FD1(f(s), f(t)) = 3
     ED(s, t) = 4

[Figure: f(s) and f(t) shown as points, with FD1(f(t), f(s)) between them.]

• FD1(f(s), f(t)) = max{pos, neg}
• FD1(f(s), f(t)) ≤ ED(s, t)
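
To check the lower-bound property on the example, the edit distance can be computed with the standard dynamic-programming recurrence. The Java sketch below assumes unit costs for insertion, deletion, and substitution. The class name is hypothetical.

```java
/** Hypothetical helper: unit-cost edit distance via dynamic programming with two rolling rows. */
final class EditDistanceSketch {
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning the empty prefix of s into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i; // turning s[0..i) into the empty string takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int substitute = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = cur[j - 1] + 1;
                cur[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[t.length()];
    }
}
```

For s = AATGATAG and t = ACTTAGC this returns 4, which is not below the frequency distance of 3 computed in the example.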
Wavelet Vectors
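    Wavelet Transformation:
    String Decomposition

[Figure: decomposition of the string over position i and level k. Each entry of ψ(s) holds a first wavelet coefficient (A) and a second wavelet coefficient (B).]

\[ A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}, \qquad B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1} \]

\[ 0 < k \le \log_2 n, \qquad 0 \le i \le n/2^k - 1 \]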
     Wavelet Distance (FD2):
     A Lower Bound on the ED

• Maximum Frequency Distance: FD(s1, s2) =
  max { FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
              Wavelet Transformation:
              Example
s      = T C A C,   n = |s| = 4
ψ_0(s) = [ v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3} ]
       = [ (A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3}) ]
       = [ (f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0) ]
       = [ ([0,0,0,1], 0), ([0,1,0,0], 0), ([1,0,0,0], 0), ([0,1,0,0], 0) ]

ψ_1(s) = [ ([0,1,0,1], [0,-1,0,1]), ([1,1,0,0], [1,-1,0,0]) ]

ψ_2(s) = [ ([1,2,0,1], [-1,0,0,1]) ]
             first wavelet   second wavelet
             coefficient     coefficient
MRS Index Structure
      MRS Index Creation

[Figure sequence: a window of size w = 2^a slides over string s1. Each window position is transformed into a vector, and the vectors are enclosed in an MBR. The window slides c times per MBR, where c = box capacity.]

• Result: MBRs containing wavelet coefficients of substrings of s1.
• T_{a,1}: tree of MBRs for a resolution of w = 2^a over s1.
Using Different Resolutions

[Figure: over the same string s1, tree T_{a,1} is built with window size w = 2^a and tree T_{a+1,1} with window size w = 2^(a+1).]
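                     MRS Index Structure

[Figure: grid of index trees. Columns are the database strings j, 1 ≤ j ≤ d; rows are the resolution levels i, a ≤ i ≤ b. Cell T_{i,j} is the index for the j-th string and window size 2^i.]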
Range Queries
1. Partition the query string into subqueries at various
   resolutions available in our index.
2. Perform a partial range query for each subquery on
   the corresponding row of the index structure, and refine ε.
3. Disk pages corresponding to the last result set are read, and
   postprocessing is done to eliminate false retrievals.

[Figure: index rows w = 2^4, 2^5, 2^6, 2^7 over strings s1, s2, ..., sd. A query q of length 208 is split into q1, q2, q3 of lengths 16, 64, 128, searched on rows w = 2^4, 2^6, 2^7 with radii ε1 = ε, ε2 ≤ ε1, ε3 ≤ ε2.]
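
The partitioning in step 1 could look roughly like the Java sketch below. It assumes the index holds window sizes 2^a through 2^b and decomposes the query length greedily into those sizes. A remainder shorter than 2^a is simply left out, which is an assumption of this sketch and not something the slides specify. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: splits a query length into window sizes available in the index. */
final class QueryPartitionSketch {
    /** Returns subquery lengths in ascending order, using window sizes 2^a .. 2^b only. */
    static List<Integer> partition(int queryLength, int a, int b) {
        List<Integer> parts = new ArrayList<>();
        int remaining = queryLength;
        for (int i = b; i >= a; i--) {
            int w = 1 << i;
            while (remaining >= w) { // the largest size may repeat; smaller sizes fit at most once
                parts.add(w);
                remaining -= w;
            }
        }
        Collections.reverse(parts); // smallest window first, matching q1, q2, q3 in the figure
        return parts;
    }
}
```

For a query of length 208 and resolutions 2^4 to 2^7, `partition(208, 4, 7)` returns [16, 64, 128].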

								