```
Fast Sub-Sequence Matching in Time-Series Databases

Michael Käser

May 2008

Outline

• Time-series databases
• Building an index
• Evaluation of the method
• Conclusion

The paper
• Published in 1994
• Awarded “Best paper” at SIGMOD 1994

Definition of time-series databases
• Each row is a sequence of numbers
• Sequence lengths can vary

• How does this differ from other sequence data such as text or DNA?
  ◦ The values are samples of a continuous signal, taken at regular intervals
  ◦ They are not discrete symbols

Applications for time-series databases

• Financial data
• Astronomical data
• Weather data
• Sociological data
• …many more

Example database

City             21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH (Zurich)       8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5
BE (Bern)         8.3    7.8   10.4   13.8   12.8   13.1   14.8   12.1
RE (Reykjavik)    6.8    7.5    7.9    8.9    8.0    5.6    3.7    5.0
BR (Brasilia)    22.4   22.0   20.6   21.2   21.7   21.5   21.3   22.3

[Line chart of the four series: horizontal axis 4/21 to 4/27, vertical axis 0 to 25]

Searching

• A query on the database has two properties:
  ◦ Query sequence: R
  ◦ Query distance: ε

• Queries can be categorized by their distance and by the length of R

Query distance

• Allows searching for similar data
• A distance of 0 is an exact search

• Distances between sequences are calculated using the Euclidean distance function:

$$D(S, Q) = \sqrt{\sum_{i=1}^{l} \left( S[i] - Q[i] \right)^{2}}$$
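
As a minimal sketch of this distance calculation, the Python function below computes the Euclidean distance between two sequences of equal length. The function name `euclidean_distance` is an assumption made for illustration; the input values are the Zurich and Bern rows of the example database.

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


# Zurich and Bern rows of the example database.
zurich = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]
bern = [8.3, 7.8, 10.4, 13.8, 12.8, 13.1, 14.8, 12.1]

print(euclidean_distance(zurich, bern))    # roughly 3.97
print(euclidean_distance(zurich, zurich))  # 0.0, an exact match
```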

Length of query

• Same length as data: Searching is easy
• Shorter than data: Do a comparison at every possible offset

       21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH      8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5
BE      8.3    7.8   10.4   13.8   12.8   13.1   14.8   12.1
RE      6.8    7.5    7.9    8.9    8.0    5.6    3.7    5.0
BR     22.4   22.0   20.6   21.2   21.7   21.5   21.3   22.3

Qry     7.4    7.0    8.1   10.5   10.3   10.6   13.1   11.8

[On the slide, the query row is overlaid on each data row in turn]
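
The comparison at every possible offset can be sketched as a plain sequential scan. This is only an illustration of the baseline that the index is meant to beat; the function name `sequential_scan`, the three-value query and the threshold of 0.5 are hypothetical.

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


def sequential_scan(data, query, eps):
    """Return (offset, distance) for every window of data within eps of query."""
    matches = []
    for offset in range(len(data) - len(query) + 1):
        window = data[offset:offset + len(query)]
        distance = euclidean_distance(window, query)
        if distance <= eps:
            matches.append((offset, distance))
    return matches


zurich = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]
query = [11.8, 11.4, 11.5]  # hypothetical short query

print(sequential_scan(zurich, query, 0.5))  # [(3, 0.0)]
```

Every query touches every offset of every sequence, which is why this approach becomes slow on large databases.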
What should be achieved

• Sequential searching on the sequences is slow
• The new search method should:
  ◦ Improve performance for all query types
  ◦ Require little space overhead
  ◦ Not miss any matching sequences
  ◦ (It may generate a few false alarms)

How it is achieved

• Step 1: Extract information from the sequences
• Step 2: Add support for short queries
• Step 3: Store in an efficient data structure
• Step 4: Query the index

Step 1: Extracting features

• Compress the information of a complete sequence into a smaller number of “features”
• The number of features f should be defined in advance
• Transform each sequence to a point in the f-dimensional feature space

Discrete Fourier Transformation

• Transforms a sequence into another sequence of the same length
• Each element of the transformed sequence holds information about all elements of the original sequence
• Transformed elements are complex numbers

$$X_F = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} x_i \exp\left( -j \, 2\pi F i / n \right)$$
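
A direct, unoptimised sketch of this transform in Python follows. It assumes the 1/√n normalisation and the negative exponent of the usual DFT definition; the function name `dft` is chosen for illustration, and a real system would use a fast Fourier transform instead of this quadratic loop.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of x, normalised by 1/sqrt(n)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * f * i / n) for i in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


# A constant sequence has all of its energy in the first coefficient.
coefficients = dft([1.0, 1.0, 1.0, 1.0])
print(abs(coefficients[0]))  # 2.0
```

With this normalisation the transform preserves the sum of squared values, so distances between full transformed sequences equal distances between the original sequences.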

DFT for feature extraction

• Cut off the transformed sequence after f elements
• Use the amplitude of each complex number
• The distance between transformed sequences is never larger than the original distance, so no match is missed

$$D(S, Q) = \sqrt{\sum_{i=1}^{l} \left( S[i] - Q[i] \right)^{2}}$$
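
The sketch below illustrates this bound on the Zurich and Bern rows of the example database. It assumes f = 2 features and a DFT normalised by 1/√n; the helper names `dft`, `features` and `euclidean_distance` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of x, normalised by 1/sqrt(n)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def features(x, f):
    """Amplitudes of the first f DFT coefficients of x."""
    return [abs(c) for c in dft(x)[:f]]


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


zurich = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]
bern = [8.3, 7.8, 10.4, 13.8, 12.8, 13.1, 14.8, 12.1]

original = euclidean_distance(zurich, bern)
reduced = euclidean_distance(features(zurich, 2), features(bern, 2))
print(reduced <= original)  # True
```

Because the feature distance can only underestimate the true distance, a search in feature space may return false alarms but cannot drop a genuine match.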

Extracting features in the example

       21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH      8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5
BE      8.3    7.8   10.4   13.8   12.8   13.1   14.8   12.1
RE      6.8    7.5    7.9    8.9    8.0    5.6    3.7    5.0
BR     22.4   22.0   20.6   21.2   21.7   21.5   21.3   22.3

       F0     F1               |F1|   DstOriginal   DstTransformed
ZH     27.5    1.22 + 2.80i    3.05          0              0
BE     30.6    0.66 + 3.82i    3.87       15.8           10.0
RE     18.3   -2.70 - 0.01i    2.70     224.04           85.9
BR     57.0    0.49 - 0.28i    0.56     976.18          870.9

Both distance columns are measured from ZH and hold squared distances (15.8, for example, is the sum of the squared differences between the ZH and BE rows).

Step 2: Extend index for subsequences

• Define a minimum query length w
• Use a sliding window over the original data
• At each window position extract features
• All transformed points of the subsequences form the trail of a sequence in the feature space

Generating trails in the example

       21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH      8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5

Offset   F0     F1
0        14.4   0.9
1        16.1   2.3
2        18.5   1.6
3        20.0   0.2
4        20.9   1.1
5        22.1   1.1
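
The trail of the Zurich row can be reproduced with a short sliding-window sketch. The window length w = 3 and f = 2 amplitude features are assumptions inferred from the six offsets in the table; the function names `dft`, `features` and `trail` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of x, normalised by 1/sqrt(n)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def features(x, f):
    """Amplitudes of the first f DFT coefficients of x."""
    return [abs(c) for c in dft(x)[:f]]


def trail(x, w, f):
    """Feature point of every length-w window of x, ordered by offset."""
    return [features(x[o:o + w], f) for o in range(len(x) - w + 1)]


zurich = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]

for offset, (f0, f1) in enumerate(trail(zurich, w=3, f=2)):
    print(offset, round(f0, 1), round(f1, 1))
# 0 14.4 0.9
# 1 16.1 2.3
# 2 18.5 1.6
# 3 20.0 0.2
# 4 20.9 1.1
# 5 22.1 1.1
```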

Example of trails
[Plot: trails of Zurich, Bern, Reykjavik and Brasilia in the feature space; horizontal axis 0.0 to 40.0, vertical axis 0.0 to 3.5]

Step 3: Storage of trails

• Storing all the points in a trail requires a lot of space
• Searching in all the points is much slower than pure sequential searching
• An efficient data structure for spatial data has to be used

The R-Tree

• Data structure for storing multi-dimensional areas (i.e. rectangles)
• Content is in the leaf nodes
• Other nodes are minimum bounding rectangles around their child nodes
• Rectangles can overlap
• Good algorithms for inserting and deleting exist

R-tree example
[Figure: rectangles R3 to R8 grouped under the bounding rectangles R1 and R2]

Root:       R1            R2
Children:   R3  R4  R5    R6  R7  R8

Using the R-tree to store the trails

• Split each trail into a number of sub-trails
• Put a rectangle around each sub-trail
• Save it together with the sequence id and offsets

• How should the trails be split?
  ◦ A fixed number of points per sub-trail is not optimal
  ◦ Use an adaptive algorithm that minimizes the number of disk accesses
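
To make the bounding rectangles concrete, the sketch below splits a trail into sub-trails and computes one rectangle per sub-trail. It deliberately uses the simple fixed-size split, which the slide describes as not optimal; the adaptive algorithm is not implemented here. The names `bounding_rectangle` and `split_fixed` are hypothetical, and the trail points are the rounded values from the Zurich example.

```python
def bounding_rectangle(points):
    """Minimum bounding rectangle of points as (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high


def split_fixed(trail, size):
    """Split trail into chunks of `size` points.

    Returns one (first_offset, last_offset, rectangle) entry per chunk.
    """
    entries = []
    for start in range(0, len(trail), size):
        chunk = trail[start:start + size]
        entries.append((start, start + len(chunk) - 1, bounding_rectangle(chunk)))
    return entries


# Trail of the Zurich row (F0, F1), rounded as in the example table.
zurich_trail = [(14.4, 0.9), (16.1, 2.3), (18.5, 1.6),
                (20.0, 0.2), (20.9, 1.1), (22.1, 1.1)]

for entry in split_fixed(zurich_trail, 3):
    print(entry)
# (0, 2, ((14.4, 0.9), (18.5, 2.3)))
# (3, 5, ((20.0, 0.2), (22.1, 1.1)))
```

Only the rectangles and their offset ranges need to be stored in the index, rather than every point of the trail.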

Example: Selecting sub-trails
[Plot illustrating the selection of sub-trails; horizontal axis 7.0 to 15.0, vertical axis 0.0 to 2.5]

Step 4: Querying the index

• Use only the first w elements of the query
  ◦ Extract the features of the query
  ◦ Represent it as a circle around the feature point with the query distance as radius
  ◦ Intersect with R-tree nodes
  ◦ Add the offsets associated with each matching child node to the result set
  ◦ Recalculate every distance in the result set and discard the false alarms
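
The intersection test between the query circle and a bounding rectangle can be sketched as follows. A linear scan over a list of rectangles stands in for the R-tree traversal, which is a simplification; the entry layout, the function names and the query values are hypothetical.

```python
import math


def min_distance_to_rectangle(point, low, high):
    """Smallest Euclidean distance from point to the rectangle [low, high]."""
    total = 0.0
    for p, lo, hi in zip(point, low, high):
        if p < lo:
            total += (lo - p) ** 2
        elif p > hi:
            total += (p - hi) ** 2
    return math.sqrt(total)


def candidate_entries(entries, query_point, eps):
    """Entries whose rectangle intersects the circle of radius eps."""
    return [
        (seq_id, first, last)
        for seq_id, first, last, (low, high) in entries
        if min_distance_to_rectangle(query_point, low, high) <= eps
    ]


# Hypothetical index entries: (sequence id, first offset, last offset, rectangle).
entries = [
    ("ZH", 0, 2, ((14.4, 0.9), (18.5, 2.3))),
    ("ZH", 3, 5, ((20.0, 0.2), (22.1, 1.1))),
]

print(candidate_entries(entries, query_point=(19.0, 1.0), eps=0.6))
# [('ZH', 0, 2)]
```

Each candidate only says that a match may lie somewhere in the listed offset range, so the true distance still has to be computed on the original data afterwards.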

Better method

• Split the query into p parts of length w
• Do a query for each part
• Merge the results
• The query distance can be reduced to ε/√p
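
The reduced query distance follows from a short argument, sketched here under the assumption that D is the Euclidean distance and that the query Q and a matching subsequence S are cut into p aligned parts Q_k and S_k of length w:

```latex
\begin{align*}
D(S, Q)^2 &= \sum_{k=1}^{p} D(S_k, Q_k)^2 \\
D(S, Q) \le \varepsilon
  &\;\Rightarrow\; p \cdot \min_{k} D(S_k, Q_k)^2
     \le \sum_{k=1}^{p} D(S_k, Q_k)^2 \le \varepsilon^2 \\
  &\;\Rightarrow\; \min_{k} D(S_k, Q_k) \le \frac{\varepsilon}{\sqrt{p}}
\end{align*}
```

At least one part of every true match therefore lies within ε/√p of the corresponding query part, so searching each part with the smaller radius and merging the results misses nothing.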

Evaluation

• Tested on a real database with 329,000 points
• Minimum query length w of 512

• Queries of length 512 were 3 to 100 times faster
• Longer queries were 2 to 40 times faster

• Index size was 5 KB

Conclusion

• The proposed method works fast for real-world data
• Influential paper
• A lot of research is based on it:
  ◦ Reducing false alarms
  ◦ Adding constraints to the query
  ◦ Streaming time series
  ◦ Improvements in R-trees
  ◦ …many more (250 citations)
