Docstoc

Indexing

Document Sample
Indexing Powered By Docstoc
					CS232A: Database System
       Principles

      INDEXING



                          1
         Indexing

Given condition on attribute find qualified
  records               Qualified records
Attr = value     ?                value
                                 value
                                  value


Condition may also be
• Attr>value
• Attr>=value                                 2
                       Indexing
• Data Stuctures used for quickly locating tuples that
  meet a specific type of condition
  – Equality condition: find Movie tuples where Director=X
  – Other conditions possible, eg, range conditions: find
    Employee tuples where Salary>40 AND Salary<50
• Many types of indexes. Evaluate them on
  –   Access time
  –   Insertion time
  –   Deletion time
  –   Disk Space needed (esp. as it effects access time)
Topics


• Conventional indexes
• B-trees
• Hashing schemes




                         4
           Terms and Distinctions
• Primary index
                                  A Dense Primary Index
   – the index on the attribute
     (a.k.a. search key) that                 10
     determines the               10          20
     sequencing of the table      20          30
                                  30          40
• Secondary index                 40          50
   – index on any other           50
                                              70
                                              80
     attribute                    70          90
• Dense index                     80          100
                                  90          120
   – every value of the
                                  100
     indexed attribute appears
                                  120
     in the index                                   Sequential
                                  140
• Sparse index                    150                  File
   – many values do not
     appear
           Dense and Sparse Primary
                   Indexes
Dense Primary Index                                  Sparse Primary Index
                       10
                                                                   10
   10                  20                     10
                                                                   20
   20                  30                     30
                       40                                          30
   30                                         50                   40
   40                  50                     80
                       70                                          50
   50                  80
                                              100                  70
   70                  90                     140                  80
   80                                         160                  90
                       100
   90                  120                    200                  100
                                                                   120
   100
   120                                     Find the index record with largest
   140                                     value that is less or equal to the
   150
                                           value we are looking.
+ can tell if a value exists without       + less index space
  accessing file (consider projection)
+ better access to overflow records      more + and - in a while
Sparse vs. Dense Tradeoff

• Sparse: Less index space per record
          can keep more of index in memory
• Dense: Can tell if any record exists
          without accessing file

(Later:
   – sparse better for insertions
   – dense needed for secondary indexes)


                                           7
             Multi-Level Indexes
                                     10
                               10
                        10           20
                               30
                                     30
                        100
• Treat the index as    250
                               50    40
                               80
  a file and build an   400    100
                                     50
                                     70
  index on it           600    140   80
• “Two levels are       750    160   90
                        920          100
  usually sufficient.   1000
                               200
                                     120
                               250
  More than three              270
  levels are rare.”            300
• Q: Can we build a            350
  dense second level           400
                               460
  index for a dense            500
  index ?                      550
        A Note on Pointers

• Record pointers consist of block pointer
  and position of record in the block
• Using the block pointer only saves
  space at no extra disk accesses cost
    Representation of Duplicate
     Values in Primary Indexes
                                 10
• Index may point to       10    10

  first instance of each   40
                           70
                                 10
                                 40
  value only               100   40
                                 70
                                 70
                                 70
                                 100
                                 120
      Deletion from Dense Index
                         Delete 40, 80

• Deletion from dense     Header
                                           Header
  primary index file                       10
                          10
  with no duplicate
                                           20
                          20               30
  values is handled in    30
  the same way with                        50
                                           70
  deletion from a         50
  sequential file         70               90
                                           100
• Q: What about           90               120
  deletion from dense
  primary index with
                               Lists of available entries
  duplicates
    Deletion from Sparse Index
                         Delete 40   Header

• if the deleted entry        10     10
                                     20
                              30
  does not appear in          50
                                     30

  the index do nothing        80     50
                              100    70
                              140    80
                              160    90
                              200    100
                                     120
     Deletion from Sparse Index
               (cont‟d)
                                 Delete 30   Header
• if the deleted entry                       10
                                10
  does not appear in the        40
                                             20

  index do nothing              50           40
• if the deleted entry          80           50

  appears in the index          100          70
                                140          80
  replace it with the next      160          90
  search-key value              200          100
                                             120
   – comment: we could leave
     the deleted value in the
     index assuming that no
     part of the system may
     assume it still exists
     without checking the
     block
     Deletion from Sparse Index
               (cont‟d)
                              Delete 40, then 30
                              Header           Header
• if the deleted entry                         10
                                10
  does not appear in the                       20
  index do nothing              50
• if the deleted entry          80             50
  appears in the index          100            70
                                140            80
  replace it with the next      160            90
  search-key value              200            100
                                               120
• unless the next search
  key value has its own
  index entry. In this case
  delete the entry
       Insertion in Sparse Index
                       Insert 35
                                   Header

• if no new block is       10      10
                                   20
  created then do
                           30
                                   30
                           50      35
  nothing                  80      50
                           100     70
                           140     80
                           160     90
                           200     100
                                   120
        Insertion in Sparse Index
                                Insert 15
                                            Header

• if no new block is                 10     10
                                            20
  created then do nothing            30
                                            30
• else create overflow               50
                                     80
  record                                    50
                                    100     70
   – Reorganize periodically        140     80
   – Could we claim space of        160     90
     next block?                    200     100
   – How often do we                        120
     reorganize and how
     much expensive it is?
   – B-trees offer convincing
     answers

                                                     16
Secondary indexes
                              Sequence
                              field

                        30
 File not sorted on     50
 secondary search key   20
                        70
                        80
                        40
                        100
                         10
                        90
                        60

                                         17
Secondary indexes
                               Sequence
   • Sparse index              field

             30
                         30
             20
                         50
             80          20
            100          70
             90          80
             ...         40
                         100
                          10
  does not make sense!   90
                         60

                                          18
Secondary indexes
                                          Sequence
     • Dense index                        field

                  10                30
                  20                50
    10            30
                  40                20
    50
                                    70
    90
    ...           50
                  60
                                    80
  sparse          70
                                    40
   high           ...               100
   level                             10
                                    90
                                    60
First level has to be dense,
next levels are sparse (as usual)                    19
Duplicate values & secondary indexes
                           20
                           10
                           20
                           40
                           10
                           40
                           10
                           40
                           30
                           40


                                       20
Duplicate values & secondary indexes
  one option...
                   10      20
                   10      10
                   10
                           20
Problem:           20
                           40
                   20
excess overhead!   30      10
  • disk space     40      40
                   40
  • search time            10
                   40      40
                   40
                   ...     30
                           40


                                       21
Duplicate values & secondary indexes
another option: lists of pointers
                    10              20
                                    10
                                    20
   Problem:         20
                                    40
   variable size                    10
   records in       30
                                    40
   index!           40              10
                                    40
                                    30
                                    40


                                         22
Duplicate values & secondary indexes
                                     20               
               10                    10
               20
               30                    20
               40                    40

               50                    10
               60                    40
               ...
                                     10               
                                     40
  Yet another idea :                 30               

  Chain records with same key?       40               

   Problems:
   • Need to add fields to records, messes up maintenance
   • Need to follow chain to know records
                                                            23
Duplicate values & secondary indexes
                           20
      10                   10
      20
      30                   20
      40                   40
      50                   10
      60                   40
      ...
                           10
                           40
                           30
                           40
               buckets
                                       24
Why “bucket” idea is useful
• Enables the processing of queries working
     with pointers only.
• Very common technique in Information
     Retrieval

Indexes             Records
Name: primary       EMP (name,dept,year,...)
Dept: secondary
Year: secondary
                                               25
   Advantage of Buckets: Process
    Queries Using Pointers Only
   Find employees of the Toys dept with 4 years in the company
         SELECT Name FROM Employee
         WHERE Dept=“Toys” AND Year=4
    Dept Index                                      Year Index
Toys                 Aaron     Suits    4                        1
PCs                  Helen     Pens     3                        2
Pens                 Jack      PCs      4                        3
Suits                Jim       Toys     4                        4
                     Joe       Toys     3
                     Nick      PCs      2
                     Walt      Toys     5
                     Yannis    Pens     1

                      Intersect toy bucket and
                      2nd Floor bucket to get
                      set of matching EMP’s
This idea used in
     text information retrieval
                             Documents
  cat
                            ...the cat is
                                 fat ...
  dog
                            ...my cat and my
                            dog like each
                            other...
                            ...Fido the
Buckets known as                 dog ...
Inverted lists
                                            27
Information Retrieval (IR) Queries

• Find articles with “cat” and “dog”
   – Intersect inverted lists
• Find articles with “cat” or “dog”
   – Union inverted lists
• Find articles with “cat” and not “dog”
   – Subtract list of dog pointers from list of cat pointers

• Find articles with “cat” in title
• Find articles with “cat” and “dog”
           within 5 words
                                                         28
Common technique:
         more info in inverted list


cat     Title 5
                                  d1

       Author 10
       Abstract 57
                      d3     d2


dog     Title   100
        Title   12

                                       29
Posting: an entry in inverted list.
         Represents occurrence of
         term in article

Size of a list:   1     Rare words or
  (in postings)          mis-spellings

                  106   Common words

Size of a posting: 10-15 bits     (compressed)


                                             30
 Vector space model


       w1 w2 w3 w4 w5 w6 w7 …
 DOC = <1  0 0 1    1 0 0 …>

 Query= <0   0   1    1   0   0   0 …>

PRODUCT =             1 + ……. = score


                                        31
• Tricks to weigh scores + normalize

e.g.: Match on common word not as
      useful as match on rare words...




                                         32
    Summary of Indexing So Far
• Basic topics in conventional indexes
  – multiple levels
  – sparse/dense
  – duplicate keys and buckets
  – deletion/insertion similar to sequential files
• Advantages
  – simple algorithms
  – index is sequential file
• Disadvantages
  – eventually sequentiality is lost because of
    overflows, reorganizations are needed
Example          Index   (sequential)
                  10
                  20                    39
                  30                    31
                  33                    35
    continuous                          36
                  40
                  50
                  60                    32
                                        38
    free space                          34
                  70
                  80
                  90               overflow area
                                  (not sequential)

                                                34
Outline:


• Conventional indexes
• B-Trees              NEXT
• Hashing schemes




                               35
• NEXT: Another type of index
  – Give up on sequentiality of index
  – Try to get “balance”




                                        36
     3
     5
     11

     30
           30
     35

     100
     101
                              B+Tree Example




     110
                 100
                       Root




     120
     130

     150
     156   120
     179   150
           180
     180
                              n=3




     200
37
Sample non-leaf




                     81
               57



                          95
to keys   to keys              to keys   to keys
< 57      57 k<81             81k<95   95


                                                   38
     To record     57
     with key 57
                                                       Sample leaf node:




     To record     81
     with key 81
     To record
     with key 85   95
                                  From non-leaf node

                   to next leaf
                   in sequence




39
In textbook‟s notation           n=3


Leaf:
                         30 35
        30
        35



Non-leaf:
                         30
        30




                                       40
Size of nodes:   n+1 pointers
                 n keys       (fixed)




                                        41
Non-root nodes have to be at least
half-full

• Use at least

  Non-leaf:      (n+1)/2 pointers

  Leaf:          (n+1)/2 pointers to data



                                         42
n=3

           Full node   min. node

Non-leaf


            150
            120

            180




                          30
Leaf        11




                         30
                         35
             5
             3




                                   43
B+tree rules        tree of order n


(1) All leaves at same lowest level
           (balanced tree)
(2) Pointers in leaves point to records
           except for “sequence pointer”




                                           44
(3) Number of pointers/keys for B+tree

              Max Max Min           Min
              ptrs keys ptrsdata   keys
  Non-leaf    n+1   n   (n+1)/2 (n+1)/2- 1
 (non-root)
    Leaf      n+1   n   (n+1)/2   (n+1)/2
 (non-root)
    Root      n+1   n       1          1




                                                 45
Insert into B+tree


(a) simple case
  – space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root



                              46
(a) Insert key = 32     n=3




                  100
           30
   11



                30
                31
                32
    3
    5




                              47
(a) Insert key = 7                   n=3




                               100
                         30
                     7
               57
              11



                              30
                              31
   3
   5



               3




                                           48
                 100
                 160
     150
     156   120
           150
                       (c) Insert key = 160




     179
           180



     160
     179
           180


     180
                       n=3




     200
49
(d) New root, insert 45        n=3


         new root




                    30
            10
            20
            30




                          40
        10
        12



                    20
                    25


                          30
                          32
                          40



                               40
                               45
2
1

3




                                     50
Deletion from B+tree


(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf




                                       51
(b) Coalesce with sibling
                                 n=4
  – Delete 50




                 100
                 10
                 40
          40
          20
          10

          30




                            40
                            50

                                       52
(c) Redistribute keys
                             n=4
  – Delete 50




                 40 35
                 100
                 10


                        35
          20
          10

          30
          35




                        40
                        50
                                   53
(d) Non-leaf coalese
                                      n=4
  – Delete 37




                       25
new root


           40
           25
           20
           10




                            30
                            40
                        30


                                 30
                                 37

                                       40
           10
           14

                20
                22

                        25
                        26




                                       45
  3
  1




                                            54
B+tree deletions in practice


– Often, coalescing is not implemented
  – Too hard and not worth it!




                                         55
Comparison: B-trees vs. static
            indexed sequential file

Ref #1: Held & Stonebraker
        “B-Trees Re-examined”
        CACM, Feb. 1978




                                      56
Ref # 1 claims:
 - Concurrency control harder in B-Trees
  - B-tree consumes more space

For their comparison:
  block = 512 bytes
  key = pointer = 4 bytes
  4 data records per block


                                           57
Example: 1 block static index
                                  k1
                                                  1 data
                                                  block
                k1                k2
 127 keys       k2
                k3                k3



 (127+1)4 = 512 Bytes
 -> pointers in index implicit!        up to 127
                                       contigous blocks

                                                  58
Example: 1 block B-tree
                k1                k1
                                                  1 data
                                                  block
                k2
                                  k2
                ...
63 keys
               k63
                                  k3
                -
                      next
63x(4+4)+8 = 512 Bytes
-> pointers needed in B-tree           up to 63
  blocks because index and data        blocks
  are not contiguous
                                                  59
Size comparison             Ref. #1

    Static Index                  B-tree
# data                         # data
blocks       height            blocks    height

2 -> 127              2   2 -> 63                 2
128 -> 16,129         3   64 -> 3968              3
16,130 -> 2,048,383   4   3969 -> 250,047         4
                          250,048 -> 15,752,961   5




                                                      60
Ref. #1 analysis claims
 • For an 8,000 block file,
      after 32,000 inserts
      after 16,000 lookups
   Static index saves enough accesses
      to allow for reorganization


Ref. #1 conclusion   Static index better!!

                                         61
Ref #2: M. Stonebraker,
        “Retrospective on a database
         system,” TODS, June 1980


Ref. #2 conclusion     B-trees better!!




                                          62
Ref. #2 conclusion          B-trees better!!

• DBA does not know when to reorganize
  – Self-administration is important target
• DBA does not know how full to load
    pages of new index




                                               63
Ref. #2 conclusion          B-trees better!!

• Buffering
  – B-tree: has fixed buffer requirements
  – Static index: large & variable size
                   buffers needed due to
                  overflow




                                               64
• Speaking of buffering…
    Is LRU a good policy for B+tree buffers?

  Of course not!
  Should try to keep root in memory
   at all times
      (and perhaps some nodes from second level)




                                                   65
Interesting problem:
 For B+tree, how large should n be?


                  …




                   n is number of keys / node


                                                66
              Assumptions
• You have the right to set the disk page size
  for the disk where a B-tree will reside.
• Compute the optimum page size n assuming
  that
  – The items are 4 bytes long and the pointers are
    also 4 bytes long.
  – Time to read a node from disk is 12+.003n
  – Time to process a block in memory is unimportant
  – B+tree is full (I.e., every page has the maximum
    number of items and pointers
Can get:
 f(n) = time to find a record

    f(n)




                 nopt           n




                                    68
 FIND nopt by f‟(n) = 0

  Answer should be nopt = “few hundred”



 What happens to nopt as
      • Disk gets faster?
      • CPU get faster?


                                      69
Variation on B+tree: B-tree (no +)

• Idea:
  – Avoid duplicate keys
  – Have record pointers in non-leaf nodes




                                             70
              K1 P1       K2 P2       K3 P3


          to record      to record        to record
          with K1        with K2          with K3
to keys            to keys         to keys            to keys
< K1              K1<x<K2         K2<x<k3             >k3




                                                                71
 B-tree example                                      n=2
• sequence pointers
  not useful now!




                                   125
 (but keep space for simplicity)




                                   65
                                   105




                                                145
                                                165
                  25
                  45




                                   85



                                         120




                                                           170
                                   100
                                         110


                                               130
                                               140
                                                     150
                                                     160

                                                           180
                            70
   10
   20
            30
            40
                   50
                   60


                            80
                                   90




                                                            72
 So, for B-trees:

                  MAX                   MIN
           Tree   Rec Keys      Tree     Rec        Keys
           Ptrs   Ptrs           Ptrs    Ptrs
Non-leaf
non-root   n+1    n     n    (n+1)/2 (n+1)/2-1 (n+1)/2-1
Leaf
non-root   1      n     n       1       (n+1)/2   (n+1)/2
Root
non-leaf   n+1    n     n       2          1          1
Root
Leaf       1      n     n       1          1          1


                                                          73
Tradeoffs:
 B-trees have marginally faster average lookup
 than B+trees (assuming the height does not
 change)

 in B-tree, non-leaf & leaf different sizes
 Smaller fan-out
 in B-tree, deletion more complicated


           B+trees preferred!
                                               74
Example:
  - Pointers       4 bytes
  - Keys           4 bytes
  - Blocks         100 bytes (just example)
  - Look at full   2 level tree




                                              75
B-tree:
Root has 8 keys + 8 record pointers
                   + 9 son pointers
          = 8x4 + 8x4 + 9x4 = 100 bytes

Each of 9 sons: 12 rec. pointers (+12 keys)
          = 12x(4+4) + 4 = 100 bytes

2-level B-tree, Max # records =
           12x9 + 8 = 116

                                         76
B+tree:

Root has 12 keys + 13 son pointers
          = 12x4 + 13x4 = 100 bytes

Each of 13 sons: 12 rec. ptrs (+12 keys)
          = 12x(4 +4) + 4 = 100 bytes

2-level B+tree, Max # records
          = 13x12 = 156


                                           77
So...                                        8 records


          B+                      B

 ooooooooooooo               ooooooooo
        156 records            108 records
                              Total = 116
• Conclusion:
  – For fixed block size,
  – B+ tree is better because it is bushier
                                                    78
Outline/summary
• Conventional Indexes
     • Sparse vs. dense
     • Primary vs. secondary
• B trees
     • B+trees vs. B-trees
     • B+trees vs. indexed sequential
• Hashing schemes              --> Next


                                          79
                   Hashing
                                                     Records

• hash function h(key)         key        h(key)
  returns address of
  bucket or record
• for secondary index                          Buckets     Records
  buckets are required
                         key     h(key)            key
• if the keys for a
  specific hash value
  do not fit into one
  page the bucket is a
  linked list of pages
 Example hash function


• Key = „x1 x2 … xn‟ n byte character string
• Have b buckets
• h: add x1 + x2 + ….. xn
  –   compute sum modulo b




                                          81
 This may not be best function …
 Read Knuth Vol. 3 if you really
    need to select a good function.


Good hash      Expected number of
 function:        keys/bucket is the
                  same for all buckets



                                         82
Within a bucket:


• Do we keep keys sorted?

• Yes, if CPU time critical
   & Inserts/Deletes not too frequent




                                        83
Next: example to illustrate
         inserts, overflows, deletes


         h(K)




                                       84
EXAMPLE 2 records/bucket


INSERT:    0
               d
h(a) = 1   1
               a       e
h(b) = 2       c
           2
               b
h(c) = 1
           3
h(d) = 0
h(e) = 1

                           85
EXAMPLE: deletion

Delete:    0
               a
   e       1
               b                  d
   f           c d
   c       2
               e
           3   f     maybe move
               g       “g” up




                                      86
Rule of thumb:
• Try to keep space utilization
  between 50% and 80%
    Utilization =    # keys used
                  total # keys that fit

• If < 50%, wasting space
• If > 80%, overflows significant
          depends on how good hash
          function is & on # keys/bucket
                                           87
How do we cope with growth?


 • Overflows and reorganizations
 • Dynamic hashing

            • Extensible
            • Linear



                                   88
Extensible hashing: two ideas


(a) Use i of b bits output by hash function
                 b
  h(K)     00110101

           use i  grows over time….



                                          89
(b) Use directory
                    .
                    .
h(K)[0-i ]          .   to bucket
                    .
                    .
                    .




                                    90
Example: h(k) is 4 bits; 2 keys/bucket
                          i=2
                  1
                                     00
  i= 1            0001
                                     01

                                     10
                   1 2
                   1001              11

              1010 1100

                  1 2      New directory
Insert 1010       1100


                                           91
Example continued   2
                    0000
         i= 2       0001
          00

          01        1 2
                    0001 0111
          10
                    0111
          11        2
                    1001
                    1010
        Insert:
                    2
         0111       1100
         0000
                                92
Example continued
                                    i=3
                        0000 2            000
     i= 2               0001
                                          001
     00
                        0111 2
                                          010
     01
                                          011
     10
                        1001 3
                        1001              100
     11

                    1010 1001 2 3         101

                         1010             110
   Insert:
   1001                 1100 2            111


                                            93
Extensible hashing: deletion


   • No merging of blocks
   • Merge blocks
        and cut directory if possible
        (Reverse insert procedure)




                                        94
Deletion example:


• Run thru insert example in reverse!




                                        95
Summary            Extensible hashing
+   Can handle growing files
       - with less wasted space
       - with no full reorganizations
-   Indirection
       (Not bad if directory in memory)
-   Directory doubles in size
       (Now it fits, now it does not)

                                          96
Linear hashing
• Another dynamic hashing scheme
Two ideas:
                                           b
(a) Use i low order bits of hash
                                    01110101
                                   grows       i

 (b) File grows linearly




                                                   97
Example b=4 bits,                 i =2, 2 keys/bucket
               0101             • insert 0101
                                • can have overflow chains!

                                                    Future
                                                    growth
0000           0101                                 buckets

1010           1111
00             01         10        11
            m = 01 (max used block)

     Rule     If h(k)[i ]  m, then
                   look at bucket h(k)[i ]
                  else, look at bucket h(k)[i ] - 2i -1
                                                              98
Example b=4 bits,          i =2, 2 keys/bucket
          0101         • insert 0101


                                       Future
                                       growth
0000      0101      1010      1111     buckets
        0101
1010      1111
00        01         10        11
       m = 01 (max used block)
           10
           11



                                                 99
Example Continued: How to grow beyond this?


            i=2 3


 0000      0101      1010         1111         0101
           0101                                0101
0 00      0 01        010     0 11       100   101
                                                 ...
100       101         110     111
        m = 11 (max used block)
            100
            101

                                               100
 When do we expand file?

• Keep track of:   # used slots       =U
                   total # of slots


• If U > threshold then increase m
           (and maybe i )


                                       101
Summary        Linear Hashing

+   Can handle growing files
      - with less wasted space
      - with no full reorganizations

+   No indirection like extensible hashing

-    Can still have overflow chains

                                             102
Example: BAD CASE

                 Very full



    Very empty               Need to move
                             m here…
                             Would waste
                             space...




                                        103
Summary

Hashing
  - How it works
  - Dynamic hashing
        - Extensible
        - Linear



                       104
Next:


• Indexing vs Hashing
• Index definition in SQL
• Multiple key access




                            105
Indexing vs Hashing
• Hashing good for probes given key
    e.g.,     SELECT …
              FROM R
              WHERE R.A = 5




                                      106
Indexing vs Hashing
• INDEXING (Including B Trees) good for
     Range Searches:
     e.g.,    SELECT
              FROM R
              WHERE R.A > 5



                                      107
Index definition in SQL


• Create index name on rel (attr)
• Create unique index name on rel (attr)

                 defines candidate key

• Drop INDEX name


                                           108
Note CANNOT SPECIFY TYPE OF INDEX
        (e.g. B-tree, Hashing, …)
     OR PARAMETERS
        (e.g. Load Factor, Size of Hash,...)


     ... at least in SQL...




                                               109
Note ATTRIBUTE LIST  MULTIKEY INDEX
                            (next)
     e.g., CREATE INDEX foo ON R(A,B,C)




                                     110
Multi-key Index

Motivation: Find records where
            DEPT = “Toy” AND SAL > 50k




                                    111
Strategy I:

• Use one index, say Dept.
• Get all Dept = “Toy” records
           and check their salary


                 I1




                                    112
Strategy II:


• Use 2 Indexes; Manipulate Pointers

Toy                               Sal
                                  > 50k




                                       113
Strategy III:


• Multiple Key Index
                       I2
One idea:
                I1     I3




                            114
Example
          10k
          15k
 Art      17k      Example
Sales     21k      Record
 Toy

          12k       Name=Joe
Dept      15k       DEPT=Sales
Index     15k       SAL=15k
          19k
          Salary
          Index

                             115
For which queries is this index good?


 Find   RECs   Dept = “Sales”   SAL=20k
 Find   RECs   Dept = “Sales”   SAL > 20k
 Find   RECs   Dept = “Sales”
 Find   RECs   SAL = 20k




                                       116
Interesting application:


• Geographic Data
   y
                       DATA:
                           <X1,Y1, Attributes>
                x          <X2,Y2, Attributes>


                             ...
                                           117
Queries:


• What city is at <Xi,Yi>?
• What is within 5 miles from <Xi,Yi>?
• Which is closest point to <Xi,Yi>?




                                         118
Example                                                                         a
                                             40           i        e   d
                                                      h
                                             30                             b
                         10      20                       n        f
                                             20   l           o             c
                                             10    j               g
          25             15 35         20                 m
                                                   k

                                                              10       20
 5        h i        g    f      d e    c   a b


     15         15                          • Search points near f
                                            • Search points near b
j k       l      m        n o
                                                                            119
Queries


•   Find   points   with Yi > 20
•   Find   points   with Xi < 5
•   Find   points   “close” to i = <12,38>
•   Find   points   “close” to b = <7,24>




                                             120
• Many types of geographic index
    structures have been suggested
    • Quad Trees
    • R Trees




                                     121
Two more types of multi key indexes


• Grid
• Partitioned hash




                                  122
Grid Index
                           Key 2
                  X1 X2      ……            Xn
         V1
         V2
Key 1
             Vn

                    To records with key1=V3, key2=X2


                                                 123
CLAIM

• Can quickly find records with
  – key 1 = Vi  Key 2 = Xj
  – key 1 = Vi
  – key 2 = Xj

• And also ranges….
  – E.g., key 1  Vi  key 2 < Xj


                                    124
 But there is a catch with Grid Indexes!

  • How is Grid Index stored on disk?

           V1        V2        V3
Like
Array...
           X1
           X2
           X3
           X4

                  X1




                            X2
                  X2
                  X3
                  X4

                            X1

                            X3
                            X4
Problem:
• Need regularity so we can compute
     position of <Vi,Xj> entry
                                        125
Solution: Use Indirection

      X1 X2 X3              Buckets
                       --
 V1                    --
                       --
 V2                    --
                       --
 V3                    --       *Grid only
                       --
 V4                    --
                       --       contains
                                pointers to
                                buckets
             --   --
             --   --
 Buckets     --   --


                                              126
 With indirection:


• Grid can be regular without wasting space
• We do have price of indirection




                                         127
Can also index grid on value ranges


Salary                      Grid
     0-20K         1
    20K-50K        2
      50K-         3
               8




                        1    2       3
Linear Scale           Toy Sales Personnel


                                         128
Grid files


+   Good for multiple-key search
-   Space, management overhead
        (nothing is free)

-   Need partitioning ranges that evenly
      split keys




                                           129
Partitioned hash function

Idea:   010110 1110010

Key1       h1     h2        Key2




                                   130
EX:
h1(toy)     =0        000
h1(sales)   =1        001      <Fred>
h1(art)     =1        010
  .                   011
  .
h2(10k)     =01       100
h2(20k)     =11       101   <Joe><Sally>
h2(30k)     =01       110
h2(40k)     =00       111
  .
  .
            <Fred,toy,10k>,<Joe,sales,10k>
 Insert     <Sally,art,30k>
                                             131
h1(toy)   =0          000      <Fred>
h1(sales) =1          001    <Joe><Jan>
h1(art)   =1          010      <Mary>
  .                   011
  .
h2(10k)   =01         100      <Sally>
h2(20k)   =11         101
h2(30k)   =01         110    <Tom><Bill>
h2(40k)   =00         111      <Andy>
  .
  .
• Find Emp. with Dept. = Sales  Sal=40k

                                           132
h1(toy)   =0          000     <Fred>
h1(sales) =1          001   <Joe><Jan>
h1(art)   =1          010     <Mary>
  .                   011
  .
h2(10k)   =01         100     <Sally>
h2(20k)   =11         101
h2(30k)   =01         110   <Tom><Bill>
h2(40k)   =00         111     <Andy>
  .
  .
• Find Emp. with Sal=30k            look here

                                           133
h1(toy)   =0          000      <Fred>
h1(sales) =1          001    <Joe><Jan>
h1(art)   =1          010      <Mary>
  .                   011
  .
h2(10k)   =01         100      <Sally>
h2(20k)   =11         101
h2(30k)   =01         110    <Tom><Bill>
h2(40k)   =00         111      <Andy>
  .
  .
• Find Emp. with Dept. = Sales    look here


                                              134
Summary
Post hashing discussion:
 - Indexing vs. Hashing
 - SQL Index Definition
 - Multiple Key Access
     - Multi Key Index
            Variations: Grid, Geo Data
     - Partitioned Hash


                                         135
Reading Chapter 5
• Skim the following sections:
  – 5.3.6, 5.3.7, 5.3.8
  – 5.4.2, 5.4.3, 5.4.4
• Read the rest




                                 136
The   BIG picture….
• Chapters 2 & 3: Storage, records, blocks...
• Chapter 4 & 5: Access Mechanisms
                 - Indexes
                 - B trees
                 - Hashing
                 - Multi key
• Chapter 6 & 7: Query Processing NEXT



                                                137

				
DOCUMENT INFO