# Project Results - Index Structure for String Databases

```
Project Results

Alexandra Martinez

Computational Molecular Biology
CISE, University of Florida
Spring 2004
Outline

• Project Overview
• Implementation
• Results
• Benchmark
• Future Work
Project Overview
• Map the strings of the database into an integer space.
  - Frequency vector
  - Vector of wavelet coefficients
• Use a distance function in this integer space that is a lower bound on the edit distance.
  - Frequency distance
  - Wavelet distance
• Cluster the vectors of consecutive substrings into Minimum Bounding Rectangles (MBRs).
• Obtain an array of MBRs for different resolutions.
Implementation
• Test database: Plant Genomics Database (Kishore).
• Index built over nucleotide sequences.
• Programmed in Java (platform independent).
• Interacts with the database through JDBC/ODBC.
• The index stores two vectors for each nucleotide sequence in the database:
  - Frequency vector
  - Wavelet vector
• String matching is based on the maximum of:
  - Frequency distance
  - Wavelet distance
• WT vectors of consecutive substrings are clustered into Minimum Bounding Rectangles (MBRs) at different resolutions.
Frequency Distance Calculation

posDist = \sum_{i : u_i > v_i} (u_i - v_i)
negDist = \sum_{i : u_i < v_i} (v_i - u_i)

FD1(u, v) = max { posDist, negDist }
Wavelet Vector Computation

A_{k,i} = f(c_i)                         if k = 0
A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}      if 0 < k ≤ log2 n

B_{k,i} = 0                              if k = 0
B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}      if 0 < k ≤ log2 n

0 ≤ i ≤ (n / 2^k) - 1
Wavelet Distance Calculation

pos terms:  \sum_{i : a_{1,i} > a_{2,i}} (a_{1,i} - a_{2,i}),   \sum_{i : b_{1,i} > b_{2,i}} (b_{1,i} - b_{2,i})

neg terms:  \sum_{i : a_{1,i} < a_{2,i}} (a_{2,i} - a_{1,i}),   \sum_{i : b_{1,i} < b_{2,i}} (b_{2,i} - b_{1,i})
Maximum Frequency Distance Calculation

FD(s1, s2) = max { FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }

FD1 is the Frequency Distance
FD2 is the Wavelet Distance
DB Augmented with Vectors
Output Samples

• Species name: allium_cepa
• Gene type: ces_a9
• See output file with vectors & MBRs
• See WT vectors for substrings of different sizes: 16, 32, 64
• See some query results for:
  - Populus tremuloides (5)
  - Arabidopsis thaliana (1)
  - Allium cepa (1)
Performance of Dist Functions
[Figure: WT & Freq Distances (y-axis, 0 to 4000) plotted against Edit Distance (x-axis, 0 to 4000). Series: Wavelet Dist, Frequency Dist, Max Freq Dist, with linear trend lines for Max Freq Dist and Frequency Dist and a 2-period moving average for Wavelet Dist.]
Comparing Query Results

• Joint tests:
  - 5 queries from Populus tremuloides (a6-a10)
  - 1 query from Arabidopsis thaliana (a1)
  - 1 query from Allium cepa (a31)
Populus tremuloides, ces_a6

• Query from Populus tremuloides, ces_a6
  - Error for WT search = 5%
• Query results:
  - # common sequences in output set = 2
  - # common species in output set = 4
  - # common genus in output set = 4
• See comparative table

Wavelet Index (17 matching sequences retrieved)
populus_tremuloides                  ces_a6    0.0
helianthus_annuus                    ces_a3    16.0
helianthus_annuus                    ces_a5    16.0
beta_vulgaris                        ces_a10   20.0
(unreadable)                         ces_a4    21.0
poncirus_trifoliata                  ces_a6    24.0
prunus_persica                       ces_a9    24.0
poncirus_trifoliata                  ces_a5    25.0
cycas_rumphii                        ces_a3    26.0
prunus_persica                       ces_a4    26.0
helianthus_annuus                    ces_a4    27.0
capsicum_annuum                      ces_a12   28.0
vitis_vinifera                       ces_a3    28.0
vitis_vinifera                       ces_a7    28.0
allium_cepa                          ces_a60   30.0
vitis_vinifera                       ces_a2    30.0
saccharum_sp                         ces_a3    32.0

K-gram Index (18 matching sequences retrieved)
amborella_trichopoda                 ces_a1    1
beta_vulgaris                        ces_a11   1
beta_vulgaris                        ces_a17   1
ceratopteris_richardii               ces_a2    1
lpomoea_nil                          ces_a8    0
lycopersicon_esculentum              ces_a8    1
medicago_truncatula                  ces_a6    1
poncirus_trifoliata                  ces_a13   1
populus_tremula_populus_tremuloides  ces_a4    1
populus_tremuloides                  ces_a6    1
prunus_persica                       ces_a4    1
prunus_persica                       ces_a14   1
solanum_tuberosum                    ces_a9    1
triticum_monococcum                  ces_a15   1
zea_mays                             ces_a6    1
zinnia_elegans                       ces_a5    1
nicotiana_benthamiana                ces_a1    1
pennisetum_glaucum                   ces_a1    1
Populus tremuloides, ces_a7

• Query from Populus tremuloides, ces_a7
  - Error for WT search = 8%
• Query results:
  - # common sequences in output set = 5
  - # common species in output set = 3
  - # common genus in output set = 3
• See comparative table

Wavelet Index (13 matching sequences retrieved)
populus_tremuloides                  ces_a7    0.0
populus_tremuloides                  ces_a8    0.0
populus_tremuloides                  ces_a5    1.0
eucalyptus_tereticornis              ces_a2    23.0
sorghum_propinquum                   ces_a10   26.0
populus_tremula_populus_tremuloides  ces_a2    27.0
ceratodon_purpureus                  ces_a1    28.5
ceratodon_purpureus                  ces_a3    29.0
lotus_corniculatus                   ces_a7    30.0
triticum_aestivum                    ces_a7    31.0
saccharum_sp                         ces_a1    31.5
poncirus_trifoliata                  ces_a13   32.0
lotus_corniculatus                   ces_a6    33.0

K-gram Index (11 matching sequences retrieved)
cucumis_sativus                      ces_a2    2
populus_balsamifera_trichocarpa      ces_a14   2
populus_tremuloides                  ces_a5    0
populus_tremuloides                  ces_a3    2
sorghum_propinquum                   ces_a10   1
zinnia_elegans                       ces_a13   2
hordeum_vulgare                      ces_a1    2
populus_tremuloides                  ces_a7    1
populus_tremuloides                  ces_a8    1
saccharum_sp                         ces_a1    2
lpomoea_nil                          ces_a3    2
Populus tremuloides, ces_a8

• Query from Populus tremuloides, ces_a8
  - Error for WT search = 10%
• Query results:
  - # common sequences in output set = 4
  - # common species in output set = 8
  - # common genus in output set = 8
• See comparative table

Wavelet Index (22 matching sequences retrieved)
populus_tremuloides                  ces_a7    0.0
populus_tremuloides                  ces_a8    0.0
populus_tremuloides                  ces_a5    1.0
eucalyptus_tereticornis              ces_a2    23.0
sorghum_propinquum                   ces_a10   26.0
populus_tremula_populus_tremuloides  ces_a2    27.0
ceratodon_purpureus                  ces_a1    28.5
ceratodon_purpureus                  ces_a3    29.0
lotus_corniculatus                   ces_a7    30.0
triticum_aestivum                    ces_a7    31.0
saccharum_sp                         ces_a1    31.5
poncirus_trifoliata                  ces_a13   32.0
lotus_corniculatus                   ces_a6    33.0
eucalyptus_grandis                   ces_a9    36.0
lpomoea_trifida                      ces_a2    36.0
lpomoea_nil                          ces_a2    37.0
triticum_turgidum_durum              ces_a5    38.0
citrus_sinensis                      ces_a17   40.5
alstroemeria_peruviana               ces_a1    41.0
sorghum_propinquum                   ces_a5    41.0
triticum_aestivum                    ces_a2    41.0
populus_balsamifera_trichocarpa      ces_a5    43.0

K-gram Index (35 matching sequences retrieved)
allium_cepa                          ces_a4    3
allium_cepa                          ces_a43   3
allium_cepa                          ces_a52   3
allium_cepa                          ces_a1    3
antirrhinum_majus                    ces_a4    3
capsicum_annuum                      ces_a5    3
ceratodon_purpureus                  ces_a3    3
eucalyptus_grandis                   ces_a2    3
gossypium_hirsutum                   ces_a7    3
helianthus_annuus                    ces_a7    3
hordeum_vulgare                      ces_a7    3
lpomoea_nil                          ces_a8    3
lycopersicon_esculentum              ces_a1    2
pennisetum_glaucum                   ces_a1    3
pinus_taeda                          ces_a1    3
poncirus_trifoliata                  ces_a4    3
poncirus_trifoliata                  ces_a5    3
poncirus_trifoliata                  ces_a6    3
poncirus_trifoliata                  ces_a9    3
populus_tremuloides                  ces_a5    0
populus_tremuloides                  ces_a7    0
populus_tremuloides                  ces_a8    0
populus_tremuloides                  ces_a10   3
populus_tremuloides                  ces_a3    3
populus_tremuloides                  ces_a5    2
populus_tremuloides                  ces_a8    3
prunus_persica                       ces_a5    3
saccharum_officinarum                ces_a2    2
sorghum_bicolor                      ces_a4    3
sorghum_bicolor                      ces_a5    3
triphysaria_versicolor               ces_a1    3
triticum_aestivum                    ces_a6    2
vitis_vinifera                       ces_a4    3
lycopersicon_esculentum              ces_a2    3
triticum_turgidum_durum              ces_a2    3
Populus tremuloides, ces_a9

• Query from Populus tremuloides, ces_a9
  - Error for WT search = 5%
• Query results:
  - # common sequences in output set = 1
  - # common species in output set = 5
  - # common genus in output set = 7
• See comparative table

Wavelet Index (16 matching sequences retrieved)
populus_tremuloides                  ces_a9    0.0
pinus_taeda                          ces_a10   13.0
prunus_persica                       ces_a5    20.0
rosa_hybrid                          ces_a10   20.0
allium_cepa                          ces_a33   21.0
citrus_sinensis                      ces_a12   21.0
lpomoea_nil                          ces_a8    21.0
populus_deltoides                    ces_a1    22.0
prunus_persica                       ces_a18   22.0
prunus_persica                       ces_a10   26.0
gossypium_arboreum                   ces_a9    27.0
hedyotis_terminalis                  ces_a3    27.0
solanum_tuberosum                    ces_a5    27.0
solanum_tuberosum                    ces_a8    27.0
brassica_napus                       ces_a10   28.0
secale_cereale                       ces_a2    28.5

K-gram Index (26 matching sequences retrieved)
allium_cepa                          ces_a29   0
cicer_arietinum                      ces_a1    1
citrus_sinensis                      ces_a1    1
(unreadable)                         (unreadable)  1
gossypium_barbadense                 ces_a2    1
gossypium_herbaceum                  ces_a3    1
gossypium_hirsutum                   ces_a5    1
gossypium_raimondii                  ces_a3    1
helianthus_annuus                    ces_a9    1
helianthus_annuus                    ces_a10   1
lpomoea_trifida                      ces_a3    1
lactuca_sativa                       ces_a5    0
lactuca_sativa                       ces_a7    1
lactuca_sativa                       ces_a8    0
leymus_chinensis                     ces_a1    0
pinus_taeda                          ces_a6    1
poncirus_trifoliata                  ces_a8    1
poncirus_trifoliata                  ces_a10   1
populus_tremuloides                  ces_a4    1
populus_tremuloides                  ces_a9    0
prunus_persica                       ces_a9    1
prunus_persica                       ces_a13   1
sorghum_propinquum                   ces_a3    1
triticum_aestivum                    ces_a4    0
zea_mays                             ces_a10   1
zinnia_elegans                       ces_a2    1
Populus tremuloides, ces_a10

• Query from Populus tremuloides, ces_a10
  - Error for WT search = 8%
• Query results:
  - # common sequences in output set = 2
  - # common species in output set = 4
  - # common genus in output set = 4
• See comparative table

Wavelet Index (17 matching sequences retrieved)
populus_tremuloides                  ces_a10   0.0
allium_cepa                          ces_a58   31.0
allium_cepa                          ces_a49   36.5
saccharum_officinarum                ces_a2    41.0
poncirus_trifoliata                  ces_a3    47.0
(3 rows garbled in the source; recoverable fragments: allium_cepa, allium_cepa, ces_a32 50.0)
allium_cepa                          ces_a50   50.0
allium_cepa                          ces_a47   50.0
pinus_taeda                          ces_a1    51.0
arachis_hypogaea                     ces_a1    51.0
helianthus_annuus                    ces_a9    53.0
allium_cepa                          ces_a23   56.0
beta_vulgaris                        ces_a3    56.0
pinus_radiata                        ces_a7    56.0
prunus_armeniaca                     ces_a4    56.0

K-gram Index (16 matching sequences retrieved)
allium_cepa                          ces_a19   3
allium_cepa                          ces_a23   2
allium_cepa                          ces_a39   2
citrus_sinensis                      ces_a5    3
oryza_minuta                         ces_a2    3
(3 rows garbled in the source; recoverable fragments: pinus_taeda, ces_a2, ces_a6, ces_a2, values 3 and 1)
pinus_taeda                          ces_a3    3
populus_tremuloides                  ces_a10   0
saccharum_sp                         ces_a6    3
saccharum_sp                         ces_a10   3
triticum_turgidum_durum              ces_a4    3
vitis_aestivalis                     ces_a1    2
allium_cepa                          ces_a17   3
triticum_aestivum                    ces_a8    3
Allium cepa, ces_a31

• Query from Allium cepa, ces_a31
  - Error for WT search = 6%
• Query results:
  - # common sequences in output set = 3
  - # common species in output set = 3
  - # common genus in output set = 4
• See comparative table

Wavelet Index (16 matching sequences retrieved)
allium_cepa                          ces_a31   0.0
mesembryanthemum_crystallinum        ces_a8    20.5
allium_cepa                          ces_a36   26.0
allium_cepa                          ces_a23   30.0
prunus_armeniaca                     ces_a3    30.0
glycine_clandestina                  ces_a1    32.0
apium_graveolens                     ces_a1    33.0
citrus_sinensis                      ces_a22   33.0
prunus_persica                       ces_a6    34.0
gossypium_arboreum                   ces_a4    38.0
lactuca_sativa                       ces_a6    38.0
lactuca_sativa                       ces_a4    39.0
arachis_hypogaea                     ces_a1    41.0
capsicum_annuum                      ces_a9    40.0
pinus_radiata                        ces_a4    40.0
prunus_armeniaca                     ces_a5    40.0

K-gram Index (18 matching sequences retrieved)
allium_cepa                          ces_a31   1
allium_cepa                          ces_a59   1
allium_cepa                          ces_a62   1
amborella_trichopoda                 ces_a2    1
antirrhinum_majus                    ces_a3    1
cicer_arietinum                      ces_a1    1
citrus_sinensis                      ces_a12   1
citrus_sinensis                      ces_a22   1
dictyostelium_discoideum             ces_a1    1
gossypioides_kirkii                  ces_a3    0
gossypium_barbadense                 ces_a1    0
gossypium_barbadense                 ces_a2    1
gossypium_herbaceum                  ces_a3    0
gossypium_hirsutum                   ces_a10   1
gossypium_raimondii                  ces_a3    0
thespesia_lampas                     ces_a1    1
thespesia_thespesioides              ces_a1    1
capsicum_annuum                      ces_a9    1
Arabidopsis thaliana, ces_a1

• Query from Arabidopsis thaliana, ces_a1
  - Error for WT search = 6%
• Query results:
  - # common sequences in output set = 0
  - # common species in output set = 0
  - # common genus in output set = 0
• See comparative table

Wavelet Index (21 matching sequences retrieved)
arabidopsis_thaliana                 ces_a1    0.0
eschscholzia_californica             ces_a7    15.0
amborella_trichopoda                 ces_a1    17.0
glycine_max                          ces_a3    18.0
rosa_hybrid                          ces_a6    20.0
lpomoea_nil                          ces_a8    22.0
citrus_sinensis                      ces_a18   23.5
populus_tremuloides                  ces_a3    24.0
lpomoea_nil                          ces_a7    25.0
solanum_tuberosum                    ces_a4    25.0
triticum_aestivum                    ces_a4    25.0
brassica_napus                       ces_a10   26.0
nicotiana_tabacum                    ces_a4    27.0
citrus_sinensis                      ces_a13   32.0
citrus_sinensis                      ces_a15   31.0
eschscholzia_californica             ces_a5    31.0
lycopersicon_esculentum              ces_a2    32.0
mesembryanthemum_crystallinum        ces_a5    32.0
glycine_max                          ces_a9    33.0
rosa_hybrid                          ces_a9    33.0
zinnia_elegans                       ces_a15   33.5

K-gram Index (2 matching sequences retrieved)
allium_cepa                          ces_a53   3
sorghum_propinquum                   ces_a2    3
Future Work

• Add a query capability to the web interface of the database that makes use of this index.
• Improve performance for substring matching.
  - Strategy: use local information
    - Compute vectors of substrings rather than vectors of the entire sequence.
    - Group vectors into MBRs.
    - Compute distance to MBRs.
• Tackle the superstring matching problem.
References

T. Kahveci, A. K. Singh. Efficient Index Structures for String
Databases. VLDB 2001: 351-360.
THANKS
APPENDIX
Frequency Vectors
Frequency Vector

• s: string from alphabet Σ = {α_1, ..., α_σ}
• n_i: number of occurrences of α_i in s (1 ≤ i ≤ σ)
• Define the frequency vector of s as f(s) = [n_1, ..., n_σ]
• Example:
  s = AATGATAG
  f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]
Frequency Distance (FD1): A Lower Bound on the ED
• Define FD1(u, v) as the minimum number of steps needed to go from u to v (or vice versa) by moving to a neighbor point at each step.
• Two points u and v in σ-dimensional space are neighbors if one of them can be obtained from the other by a single edit operation.
Frequency Distance: Example
s = AATGATAG => f(s) = [4, 0, 2, 2]
t = ACTTAGC  => f(t) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD1(f(s), f(t)) = max{pos, neg} = 3
ED(s, t) = 4

FD1(f(s), f(t)) ≤ ED(s, t)
[Figure: the points f(s) and f(t), with FD1(f(t), f(s)) marked between them]
Wavelet Vectors
Wavelet Transformation: String Decomposition
[Figure: ψ(s) shown as the first and second wavelet coefficients, indexed by position i and level k]

A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}      0 < k ≤ log2 n
B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}      0 ≤ i ≤ (n / 2^k) - 1
Wavelet Distance (FD2): A Lower Bound on the ED

• Maximum Frequency Distance: FD(s1, s2) = max { FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
Wavelet Transformation: Example
s = TCAC, n = |s| = 4

ψ_0(s) = [ v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3} ]
       = [ (A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3}) ]
       = [ (f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0) ]
       = [ ([0,0,0,1], 0), ([0,1,0,0], 0), ([1,0,0,0], 0), ([0,1,0,0], 0) ]

ψ_1(s) = [ ([0,1,0,1], [0,-1,0,1]), ([1,1,0,0], [1,-1,0,0]) ]

ψ_2(s) = [ ([1,2,0,1], [-1,0,0,1]) ]

In ψ_2(s), [1,2,0,1] is the first wavelet coefficient and [-1,0,0,1] is the second wavelet coefficient.
MRS Index Structure
MRS Index Creation
[Figure sequence: a window of width w = 2^a is placed on string s1 and transformed, and the result is added to an MBR. The window slides c times (c = box capacity) per MBR, giving MBRs that contain the wavelet coefficients of substrings of s1. Together they form T_{a,1}, the tree of MBRs for a resolution of w = 2^a over s1.]
Using Different Resolutions
[Figure: trees T_{a,1} (w = 2^a) and T_{a+1,1} (w = 2^(a+1)) over s1.]
MRS Index Structure
[Figure: grid of indexes over the database strings j (1 ≤ j ≤ d) and the resolution levels i (a ≤ i ≤ b); T_{i,j} is the index for the j-th string and window size 2^i.]
Range Queries
1. Partition the query string into subqueries at various resolutions available in our index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε.
3. Disk pages corresponding to the last result set are read, and postprocessing is done to eliminate false retrievals.

[Figure: MRS index grid with columns s1, s2, ..., sd and rows w = 2^4, 2^5, 2^6, 2^7. A query q of length 208 is partitioned into q1 (length 16), q2 (length 64), and q3 (length 128), searched at rows w = 2^4, 2^6, and 2^7 with radii ε1 = ε, ε2 ≤ ε1, ε3 ≤ ε2.]

```
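
The frequency vector and FD1 definitions from the appendix can be checked with a few lines of code. The sketch below is in Python for brevity (the project itself was written in Java); the function names `frequency_vector` and `fd1` and the alphabet order `ACGT` are assumptions made for this illustration, not names from the project.

```python
from collections import Counter

# Assumed nucleotide alphabet, in the [A, C, G, T] order used on the slides.
ALPHABET = "ACGT"


def frequency_vector(s, alphabet=ALPHABET):
    """Return f(s) = [n_1, ..., n_sigma]: how often each letter occurs in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    """Frequency distance FD1(u, v) = max(posDist, negDist)."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos_dist, neg_dist)


# Worked example from the "Frequency Distance: Example" slide.
f_s = frequency_vector("AATGATAG")
f_t = frequency_vector("ACTTAGC")
print(f_s, f_t, fd1(f_s, f_t))  # [4, 0, 2, 2] [2, 2, 1, 2] 3
```

Because FD1 is a lower bound on the edit distance, a candidate whose FD1 to the query already exceeds the search radius can be discarded without computing the edit distance.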
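
The wavelet recurrence (A as the sum and B as the difference of the two child A vectors) can be sketched the same way. This is a minimal illustration in Python, assuming the string length is a power of two and using a zero vector for B at level 0; `one_hot` and `wavelet_levels` are hypothetical names, not functions from the project.

```python
# Assumed nucleotide alphabet, in the [A, C, G, T] order used on the slides.
ALPHABET = "ACGT"


def one_hot(ch, alphabet=ALPHABET):
    """Frequency vector f(c) of a single character."""
    return [1 if ch == a else 0 for a in alphabet]


def wavelet_levels(s, alphabet=ALPHABET):
    """Return [psi_0, ..., psi_log2(n)]; each level is a list of (A, B) pairs."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(s) is a power of two")
    zero = [0] * len(alphabet)
    # Level 0: A_{0,i} = f(c_i), B_{0,i} = 0.
    level = [(one_hot(ch, alphabet), zero) for ch in s]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            a = [x + y for x, y in zip(left, right)]  # A_{k,i}
            b = [x - y for x, y in zip(left, right)]  # B_{k,i}
            nxt.append((a, b))
        level = nxt
        levels.append(level)
    return levels


# Worked example from the "Wavelet Transformation: Example" slide.
levels = wavelet_levels("TCAC")
print(levels[2])  # [([1, 2, 0, 1], [-1, 0, 0, 1])]
```

The top level holds a single pair: the first coefficient equals the frequency vector of the whole string, and the second is the difference between the frequency vectors of its two halves.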