         Fast Sub-Sequence Matching in Time-Series Databases

             Michael Käser




                                  May 2008
Outline

 • Time-series databases
 • Building an index
 • Answering queries
 • Evaluation of the method
 • Conclusion




The paper
 • Published in 1994
 • Awarded “Best paper” at SIGMOD 1994




Definition of time-series databases
 • Each row is a sequence of numbers
 • Sequence length can vary

 • How does this differ from other sequence data such as text or DNA?
     - Values are samples of a continuous signal, taken at a fixed interval
     - Values are not discrete symbols


Applications for time-series databases

 • Financial data
 • Astronomical data
 • Weather data
 • Sociological data
 • …many more




Example database

                  21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH (Zurich)        8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5
BE (Bern)          8.3    7.8   10.4   13.8   12.8   13.1   14.8   12.1
RE (Reykjavik)     6.8    7.5    7.9    8.9    8.0    5.6    3.7    5.0
BR (Brasilia)     22.4   22.0   20.6   21.2   21.7   21.5   21.3   22.3

[Figure: line chart of the four sequences from 4/21 to 4/28, vertical axis 0 to 25]

Searching

 • A query on the database has two properties:
     - Query sequence: R
     - Query distance: ε

 • Queries can be categorized by their distance and by the length of R




Query distance

 • Allows searching for similar data
 • A distance of 0 is an exact search

 • Distances between two sequences S and Q of length l are calculated
   using the Euclidean distance function:

   $$D(S, Q) = \sqrt{\sum_{i=1}^{l} (S_i - Q_i)^2}$$

Length of query

 • Same length as data: searching is easy
 • Shorter than data: do a comparison at every possible offset

            21/4   22/4   23/4   24/4   25/4   26/4   27/4   28/4
ZH           8.8    7.3    8.8   11.8   11.4   11.5   13.3   13.5
BE           8.3    7.8   10.4   13.8   12.8   13.1   14.8   12.1
RE           6.8    7.5    7.9    8.9    8.0    5.6    3.7    5.0
BR          22.4   22.0   20.6   21.2   21.7   21.5   21.3   22.3

Qry          7.4    7.0    8.1   10.5   10.3   10.6   13.1   11.8

[Figure: the query values are aligned under each sequence; for ZH, a shorter query (7.4, 7.0, 8.1, 10.5) is shown at successive offsets]
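
The comparison at every possible offset amounts to a sequential scan. A minimal sketch in Python using the example data: the function names, the short four-value query (the first half of the Qry row) and the query distance of 2.5 are assumptions chosen for illustration.

```python
import math

def euclidean_distance(s, q):
    """Euclidean distance between two sequences of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def sequential_scan(database, query, eps):
    """Compare the query with every window of every sequence.

    Returns (name, offset, distance) for each window within eps.
    """
    matches = []
    m = len(query)
    for name, seq in database.items():
        for offset in range(len(seq) - m + 1):
            d = euclidean_distance(seq[offset:offset + m], query)
            if d <= eps:
                matches.append((name, offset, d))
    return matches

database = {
    "ZH": [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5],
    "BE": [8.3, 7.8, 10.4, 13.8, 12.8, 13.1, 14.8, 12.1],
    "RE": [6.8, 7.5, 7.9, 8.9, 8.0, 5.6, 3.7, 5.0],
    "BR": [22.4, 22.0, 20.6, 21.2, 21.7, 21.5, 21.3, 22.3],
}
short_query = [7.4, 7.0, 8.1, 10.5]  # hypothetical: first four Qry values
matches = sequential_scan(database, short_query, eps=2.5)
for name, offset, d in matches:
    print(f"{name} offset {offset}: distance {d:.2f}")
```

Every query touches all windows of all sequences, which is the cost an index is meant to avoid.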
What should be achieved

 • Sequential searching over the sequences is slow
 • The new search method should:
     - Improve performance for all query types
     - Require little space overhead
     - Not miss any matching sequences
     - (But it may generate a few false alarms)




How it is achieved

 • Step 1: Extract features from the sequences
 • Step 2: Add support for short queries
 • Step 3: Store in an efficient data structure
 • Step 4: Query the index




Step 1: Extracting features

 • Compress the information of a complete sequence into a small number of “features”
 • The number of features f is fixed in advance
 • Transform each sequence to a point in the f-dimensional feature space




Discrete Fourier Transformation

 • Transforms a sequence into another sequence of the same length
 • Each element of the transformed sequence holds information about all
   elements of the original sequence
 • Transformed elements are complex numbers

   $$X_F = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} x_i \exp(-j 2 \pi F i / n)$$
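
A direct, unoptimized evaluation of this formula with only the Python standard library might look as follows. The function name `dft` is an assumption, and a real system would use an FFT routine rather than this quadratic loop.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform with 1/sqrt(n) normalization."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * F * i / n) for i in range(n))
        / math.sqrt(n)
        for F in range(n)
    ]

x = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]  # Zurich row of the example
X = dft(x)

# With this normalization the transform preserves energy (Parseval's theorem).
energy_time = sum(v ** 2 for v in x)
energy_freq = sum(abs(c) ** 2 for c in X)
print(len(X), energy_time, energy_freq)
```

The energy check is what makes distances in the transformed domain comparable to distances in the original domain.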

DFT for feature extraction

 • Cut off the transformed sequence after f elements
 • Use the amplitude of each complex number
 • The distance between transformed sequences is never larger than the
   original distance, so no true match is lost

   $$D(S, Q) = \sqrt{\sum_{i=1}^{l} (S_i - Q_i)^2}$$
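
The lower-bounding property can be checked numerically. The sketch below is an illustration under assumptions: the helper names are hypothetical, f = 2 is chosen arbitrarily, and the DFT uses the 1/sqrt(n) normalization.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform with 1/sqrt(n) normalization."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * F * i / n) for i in range(n))
        / math.sqrt(n)
        for F in range(n)
    ]

def features(x, f):
    """Amplitudes of the first f DFT coefficients."""
    return [abs(c) for c in dft(x)[:f]]

def euclidean_distance(s, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

sequences = {
    "ZH": [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5],
    "BE": [8.3, 7.8, 10.4, 13.8, 12.8, 13.1, 14.8, 12.1],
    "RE": [6.8, 7.5, 7.9, 8.9, 8.0, 5.6, 3.7, 5.0],
    "BR": [22.4, 22.0, 20.6, 21.2, 21.7, 21.5, 21.3, 22.3],
}
f = 2  # assumed number of features
reference = sequences["ZH"]
for name, seq in sequences.items():
    d_original = euclidean_distance(reference, seq)
    d_features = euclidean_distance(features(reference, f), features(seq, f))
    # Dropping coefficients and taking amplitudes can only shrink the distance.
    assert d_features <= d_original + 1e-9
    print(f"{name}: original {d_original:.2f}, features {d_features:.2f}")
```

Because the feature distance never exceeds the true distance, a range search in feature space cannot miss a true match; it can only return extra candidates (false alarms).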


Extracting features in the example
            21/4         22/4   23/4        24/4            25/4       26/4    27/4      28/4

ZH          8.8          7.3    8.8          11.8           11.4       11.5     13.3      13.5
BE          8.3          7.8    10.4         13.8           12.8       13.1     14.8      12.1
RE          6.8          7.5    7.9           8.9           8.0        5.6      3.7        5.0
BR          22.4         22.0   20.6         21.2           21.7       21.5     21.3      22.3

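            F0      F1 (complex)      |F1|     DstOriginal    DstTransformed
ZH          27.5     1.22 + 2.80i     3.05          0               0
BE          30.6     0.66 + 3.82i     3.87         15.8            10.0
RE          18.3    -2.70 - 0.01i     2.70        224.04           85.9
BR          57.0     0.49 - 0.28i     0.56        976.18          870.9

Distances are measured relative to ZH. DstOriginal is the sum of squared differences, i.e. the squared Euclidean distance.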

Step 2: Extend index for subsequences

 • Define a minimum query length w
 • Use a sliding window of length w over the original data
 • At each window position, extract the features
 • The transformed points of all subsequences form the trail of a sequence
   in the feature space




Generating trails in the example
           21/4   22/4       23/4   24/4            25/4   26/4   27/4   28/4

ZH         8.8    7.3        8.8     11.8           11.4   11.5   13.3   13.5


                         Offset            F0        F1

                         0               14.4        0.9
                         1               16.1        2.3
                         2               18.5        1.6
                         3               20.0        0.2
                         4               20.9        1.1
                         5               22.1        1.1
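
The trail can be recomputed with a short sketch. It assumes a window length of w = 3 (inferred from the six offsets over eight values, not stated on the slide), f = 2 features, and a DFT with 1/sqrt(n) normalization; the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform with 1/sqrt(n) normalization."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * F * i / n) for i in range(n))
        / math.sqrt(n)
        for F in range(n)
    ]

def trail(seq, w, f):
    """Feature point (amplitudes of the first f coefficients) of every window."""
    return [
        [abs(c) for c in dft(seq[offset:offset + w])[:f]]
        for offset in range(len(seq) - w + 1)
    ]

zurich = [8.8, 7.3, 8.8, 11.8, 11.4, 11.5, 13.3, 13.5]
zurich_trail = trail(zurich, w=3, f=2)
for offset, (f0, f1) in enumerate(zurich_trail):
    print(f"{offset}  {f0:5.1f}  {f1:4.1f}")
```

Under these assumptions the printed values agree with the table to one decimal place.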


Example of trails

[Figure: trails of Zurich, Bern, Reykjavik and Brasilia in the feature space; horizontal axis 0.0 to 40.0, vertical axis 0.0 to 3.5]

Step 3: Storage of trails

 • Storing all the points of a trail requires a lot of space
 • Searching through all the points is much slower than pure sequential
   searching
 • An efficient data structure for spatial data has to be used




The R-Tree

 • Data structure for storing multi-dimensional areas (i.e. rectangles)
 • Content is in the leaf nodes
 • Other nodes are minimum bounding rectangles around their child nodes
 • Rectangles can overlap
 • Good algorithms for inserting and deleting exist

R-tree example

[Figure: rectangles R3 to R8 in the plane, enclosed by the bounding rectangles R1 and R2, with the corresponding tree:]

                     R1                  R2

                R3   R4   R5        R6   R7   R8

Using the R-tree to store the trails

 • Split each trail into a number of sub-trails
 • Enclose each sub-trail in a minimum bounding rectangle
 • Store the rectangle together with the sequence id and the offsets

 • How should the trails be split?
     - A fixed number of points per sub-trail is not optimal
     - Use an adaptive algorithm that minimizes the number
           of disk accesses
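
The slides do not spell out the adaptive algorithm. The sketch below shows one plausible greedy heuristic: a point joins the current sub-trail unless that would raise the cost per point of its bounding rectangle. The cost function (product of side lengths plus 0.5, divided by the number of points), the scaling of features to the unit square, and all names are assumptions for illustration only.

```python
def bounding_rectangle(points):
    """Minimum bounding rectangle of a list of points as (low, high) corners."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def marginal_cost(points):
    """Assumed cost: product of (side length + 0.5), divided by the point count."""
    low, high = bounding_rectangle(points)
    cost = 1.0
    for lo, hi in zip(low, high):
        cost *= (hi - lo) + 0.5
    return cost / len(points)

def split_trail(seq_id, trail):
    """Greedily cut a trail into sub-trails; return one record per sub-trail."""
    groups = [[0]]  # offsets belonging to each sub-trail
    for offset in range(1, len(trail)):
        current = [trail[o] for o in groups[-1]]
        if marginal_cost(current + [trail[offset]]) > marginal_cost(current):
            groups.append([offset])  # cheaper to start a new sub-trail
        else:
            groups[-1].append(offset)
    records = []
    for offsets in groups:
        low, high = bounding_rectangle([trail[o] for o in offsets])
        records.append({"seq_id": seq_id, "first_offset": offsets[0],
                        "last_offset": offsets[-1], "low": low, "high": high})
    return records

# Hypothetical trail in the unit square: two clusters of slowly moving points.
example_trail = [(0.00, 0.00), (0.05, 0.02), (0.10, 0.05),
                 (0.90, 0.80), (0.92, 0.85), (0.95, 0.90)]
records = split_trail("demo", example_trail)
for r in records:
    print(r)
```

Each record holds what the slide lists: a rectangle plus the sequence id and the offset range it covers.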

Example: Selecting sub-trails

[Figure: a trail in the feature space divided into sub-trails; horizontal axis 7.0 to 15.0, vertical axis 0.0 to 2.5]

Step 4: Querying the index

 • Use only the first w elements of the query
     - Extract the features of the query
     - Represent it as a circle around the feature point, with the
           query distance as radius
     - Intersect the circle with the R-tree nodes
     - Add the offsets associated with each matching child
           node to the result set
     - Recalculate the true distance for every entry in the result set and
           discard false alarms
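
The intersection test between the query circle and a bounding rectangle can be sketched as below. A flat list of records stands in for the R-tree, and the record layout, names and values are hypothetical.

```python
import math

def distance_to_rectangle(point, low, high):
    """Smallest Euclidean distance from a point to an axis-aligned rectangle."""
    total = 0.0
    for p, lo, hi in zip(point, low, high):
        if p < lo:
            total += (lo - p) ** 2
        elif p > hi:
            total += (p - hi) ** 2
    return math.sqrt(total)

def candidate_records(records, query_point, eps):
    """Records whose rectangle intersects the query circle of radius eps."""
    return [
        r for r in records
        if distance_to_rectangle(query_point, r["low"], r["high"]) <= eps
    ]

# Hypothetical sub-trail records; a real index would descend the R-tree instead.
records = [
    {"seq_id": "A", "first_offset": 0, "last_offset": 4,
     "low": (0.0, 0.0), "high": (1.0, 1.0)},
    {"seq_id": "B", "first_offset": 0, "last_offset": 7,
     "low": (5.0, 5.0), "high": (6.0, 6.0)},
]
hits = candidate_records(records, query_point=(2.0, 1.0), eps=1.5)
print([r["seq_id"] for r in hits])
```

A hit only says that some window in the recorded offset range may match; the true distances still have to be computed on the original data to discard false alarms.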


Better method

 • Split the query into p parts of length w
 • Do a query for each part
 • Merge the results
 • The query distance can be reduced to ε/√p
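
The reduced radius follows from a short argument, sketched here under the assumption that the query Q is cut into p disjoint pieces q_1, ..., q_p of length w and that s_1, ..., s_p are the aligned pieces of a matching subsequence S:

```latex
D(Q, S)^2 = \sum_{k=1}^{p} D(q_k, s_k)^2 \le \varepsilon^2
\quad\Longrightarrow\quad
\min_{k} D(q_k, s_k)^2 \le \frac{\varepsilon^2}{p}
\quad\Longrightarrow\quad
\min_{k} D(q_k, s_k) \le \frac{\varepsilon}{\sqrt{p}}
```

At least one piece of every true match therefore lies within ε/√p of the corresponding query piece, so taking the union of the p sub-query results misses no match.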




Evaluation

 • Tested on a real database with 329’000 points
 • Minimum query length w of 512

 • Queries of length 512 were 3 to 100 times faster
 • Longer queries were 2 to 40 times faster

 • Index size was 5 KB

Conclusion

 • The proposed method is fast on real-world data
 • Influential paper
 • A lot of research is based on it
     - Reducing false alarms
     - Adding constraints to the query
     - Streaming time series
     - Improvements in R-trees
     - …many more (250 citations)



           Your questions?


