CSE 326
Hashing

David Kaplan

Dept of Computer Science & Engineering
Autumn 2001
    Reminder: Dictionary ADT
     Dictionary operations
           create
           destroy
           insert
           find
           delete

     Stores values associated with user-specified keys
           values may be any (homogeneous) type
           keys may be any (homogeneous) comparable type

     [Figure: insert("Adrien", "Roller-blade demon") and find("Adrien") against a
      dictionary holding Adrien (roller-blade demon), Donald (l33t haxtor),
      Hannah (C++ guru), Dave (older than dirt), …]

Hashing                    CSE 326 Autumn 2001                                        2
    Dictionary Implementations So Far

                                        Insert      Find        Delete
           Unsorted list                O(1)        O(n)        O(n)
           Trees                        O(log n)    O(log n)    O(log n)
           Sorted array                 O(n)        O(log n)    O(n)
           Array special case           O(1)        O(1)        O(1)
             (known keys in {1, … , K})




Hashing                    CSE 326 Autumn 2001                    3
    ADT Legalities:
    A Digression on Keys
     Methods are the contract between an ADT and the
     outside agent (client code)
           Ex: Dictionary contract is {insert, find, delete}
           Ex: Priority Q contract is {insert, deleteMin}

     Keys are the currency used in transactions between
     an outside agent and ADT
           Ex: insert(key), find(key), delete(key)

     So …
           How about O(1) insert/find/delete for any key type?


Hashing                     CSE 326 Autumn 2001                   4
    Hash Table Goal:
    Key as Index
     We can access a record by integer index, as in a[5].
     We want to access a record by key, as in a[“Hannah”].

     [Figure: array indexed by integer (a[2] = Adrien: roller-blade demon,
      a[5] = Hannah: C++ guru) next to the same records accessed directly by key
      (a[“Adrien”], a[“Hannah”]).]




Hashing                         CSE 326 Autumn 2001                               5
    Hash Table Approach

     [Figure: keys (Hannah, Dave, Adrien, Donald, Ed) fed through a function f(x)
      that picks each key’s slot in the table.]

    But… is there a problem with this pipe-dream?

Hashing                CSE 326 Autumn 2001          6
    Hash Table
    Dictionary Data Structure
     Hash function: maps keys to integers
          Result:
               Can quickly find the right spot for a given entry

     Unordered and sparse table
          Result:
               Cannot efficiently list all entries
               Cannot efficiently find min, max, ordered ranges

     [Figure: keys (Hannah, Dave, Adrien, Donald, Ed) hashed by f(x) into a
      sparse, unordered table.]




Hashing                        CSE 326 Autumn 2001              7
    Hash Table Taxonomy
     [Figure: keys (Hannah, Dave, Adrien, Donald, Ed) pass through the hash
      function f(x) into the hash table; two keys landing in the same slot
      is a collision.]

             load factor λ = (# of entries in table) / tableSize




Hashing             CSE 326 Autumn 2001            8
    Agenda:
    Hash Table Design Decisions
      What should the hash function be?

      What should the table size be?

      How should we resolve collisions?




Hashing             CSE 326 Autumn 2001    9
    Hash Function
     Hash function maps a key to a table index
          Value & find(const Key & key) {
            int index = hash(key) % tableSize;  // map the hash value into the table range
            return Table[index];
          }




Hashing               CSE 326 Autumn 2001        10
    What Makes A Good Hash Function?
     Fast runtime
           O(1) and fast in practical terms

     Distributes the data evenly
           hash(a) % size ≠ hash(b) % size

     Uses the whole hash table
           for all 0 ≤ i < size, there exists a key k such that hash(k) % size = i



Hashing                  CSE 326 Autumn 2001                 11
     Good Hash Function for
     Integer Keys
     Choose
           tableSize is prime
           hash(n) = n

     Example (table slots 0 … 6):
           tableSize = 7

          insert(4)
          insert(17)
          find(12)
          insert(9)
          delete(17)


Hashing                     CSE 326 Autumn 2001       12
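     One possible trace of this example (each key n lands in slot hash(n) % tableSize = n % 7):
          insert(4):   4 % 7 = 4  → 4 goes in slot 4
          insert(17):  17 % 7 = 3 → 17 goes in slot 3
          find(12):    12 % 7 = 5 → slot 5 is empty, so 12 is not found
          insert(9):   9 % 7 = 2  → 9 goes in slot 2
          delete(17):  17 % 7 = 3 → remove 17 from slot 3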
     Good Hash Function for Strings?
     Let s = s1s2s3s4…sn: choose
           hash(s) = s1 + s2·128 + s3·128² + s4·128³ + … + sn·128^(n−1)
           Think of the string as a base 128 (aka radix 128) number

     Problems:
           hash(“really, really big”) = well… something really, really big

           hash(“one thing”) % 128 = hash(“other thing”) % 128
             (mod 128, only the first character survives)




Hashing                     CSE 326 Autumn 2001                       13
    String Hashing
    Issues and Techniques
     Minimize collisions
           Make tableSize and radix relatively prime
              Typically, make tableSize not a multiple of 128


     Simplify computation
           Use Horner’s Rule
          int hash(const std::string & s) {
            int h = 0;
            // Horner’s rule: h = ((…(s[n-1]·128 + s[n-2])·128 + …)·128 + s[0]) % tableSize
            for (int i = (int)s.length() - 1; i >= 0; i--) {
              h = (s[i] + 128*h) % tableSize;   // take the mod at each step to keep h small
            }
            return h;
          }



Hashing                     CSE 326 Autumn 2001                 14
     Good Hashing:
     Multiplication Method
     Hash function is defined by the table size plus a parameter A
          hA(k) = ⌊size * (k*A mod 1)⌋    where 0 < A < 1

     Example: size = 10, A = 0.485
          hA(50) = ⌊10 * (50*0.485 mod 1)⌋
                 = ⌊10 * (24.25 mod 1)⌋ = ⌊10 * 0.25⌋ = ⌊2.5⌋ = 2

           no restriction on size!
           when building a static table, we can try several values of A
           more computationally intensive than a single mod



Hashing                    CSE 326 Autumn 2001                       15
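     A minimal C++ sketch of the multiplication method described above; the function
     name and the use of std::fmod/std::floor are illustrative assumptions, not from
     the slides.

          #include <cmath>

          // Multiplication method: hA(k) = floor(size * (k*A mod 1)), with 0 < A < 1.
          int multHash(int k, int size, double A) {
              double frac = std::fmod(k * A, 1.0);              // fractional part of k*A
              return static_cast<int>(std::floor(size * frac));
          }

          // Slide example: multHash(50, 10, 0.485)
          //   50*0.485 = 24.25, fractional part 0.25, 10*0.25 = 2.5, floor -> 2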
    Hashing Dilemma
     Suppose your Worst Enemy 1) knows your hash function; 2) gets
     to decide which keys to send you?

     Faced with this enticing possibility, Worst Enemy decides to:
         a) Send you keys which maximize collisions for your hash function.
         b) Take a nap.

     Moral: No single hash function can protect you!

     Faced with this dilemma, you:
         a) Give up and use a linked list for your Dictionary.
         b) Drop out of software, and choose a career in fast foods.
         c) Run and hide.
         d) Proceed to the next slide, in hope of a better alternative.

Hashing                      CSE 326 Autumn 2001                          16
    Universal Hashing1
     Suppose we have a set K of possible keys, and a finite set H of hash
     functions that map keys to entries in a hashtable of size m.

     [Figure: keys k1, k2 in K mapped by some h chosen from H (e.g. hi, hj)
      into table slots 0 … m-1.]
    Definition:
    H is a universal collection of hash functions if and only if …
         For any two keys k1, k2 in K, there are at most |H|/m functions in H for
         which h(k1) = h(k2).

     So … if we randomly choose a hash function from H, our chances of collision
      are no more than if we get to choose hash table entries at random!

                              1Motivation:   see previous slide (or visit http://www.burgerking.com/jobs)
Hashing                      CSE 326 Autumn 2001                                         17
    Random Hashing – Not!
     How can we “randomly choose a hash function”?
         Certainly we cannot randomly choose hash functions at runtime,
            interspersed amongst the inserts, finds, deletes! Why not?


      We can, however, randomly choose a hash function each time
       we initialize a new hashtable.

     Conclusions
         Worst Enemy never knows which hash function we will choose –
            neither do we!
           No single input (set of keys) can always evoke worst-case behavior




Hashing                      CSE 326 Autumn 2001                         18
    Good Hashing:
    Universal Hash Function A (UHFa)
     Parameterized by prime table size and vector:
        a = <a0 a1 … ar> where 0 <= ai < size

     Represent each key as r + 1 integers where ki < size
         size = 11, key = 39752 ==> <3,9,7,5,2>
         size = 29, key = “hello world” ==>
          <8,5,12,12,15,23,15,18,12,4>

                   ha(k) = ( a0·k0 + a1·k1 + … + ar·kr ) mod size


Hashing                CSE 326 Autumn 2001              19
    UHFa: Example
      Context: hash strings of length 3 in a table of size 131,
      using ASCII codes as the ki (‘x’ = 120, ‘y’ = 121, ‘z’ = 122)

          let a = <35, 100, 21>
          ha(“xyz”) = (35*120 + 100*121 + 21*122) % 131
                    = 129




Hashing                CSE 326 Autumn 2001                20
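     A small C++ sketch of UHFa for string keys, treating each character’s ASCII
     code as ki; the function name and types are assumptions for illustration. On
     the slide’s example it reproduces 129.

          #include <cstddef>
          #include <string>
          #include <vector>

          // UHFa: ha(k) = (a0*k0 + a1*k1 + … + ar*kr) mod size, with 0 <= ai < size.
          int uhfA(const std::string & key, const std::vector<int> & a, int size) {
              long long sum = 0;
              for (std::size_t i = 0; i < key.size() && i < a.size(); ++i) {
                  sum += static_cast<long long>(a[i]) * static_cast<unsigned char>(key[i]);
              }
              return static_cast<int>(sum % size);
          }

          // Slide example: uhfA("xyz", {35, 100, 21}, 131)
          //   = (35*120 + 100*121 + 21*122) % 131 = 18862 % 131 = 129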
    Thinking about UHFa
     Strengths:
           works on any type as long as you can form ki’s
           if we’re building a static table, we can try many
            values of the hash vector <a>
           random <a> has guaranteed good properties no
            matter what we’re hashing


     Weaknesses
           must choose prime table size larger than any ki


Hashing                  CSE 326 Autumn 2001                21
    Good Hashing:
    Universal Hash Function 2 (UHF2)
     Parameterized by j, a, and b:
           j * size should fit into an int
           a and b must be less than size

            hj,a,b(k) = ((ak + b) mod (j*size))/j




Hashing                 CSE 326 Autumn 2001    22
    UHF2 : Example
     Context: hash integers in a table of size 16

          let j = 32, a = 100, b = 200
          hj,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32
                        = (100200 % 512) / 32
                        = 360 / 32
                        = 11




Hashing                 CSE 326 Autumn 2001              23
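     A minimal C++ sketch of UHF2 as defined two slides back; the function name and
     the use of long long (so a*k and j*size fit) are illustrative assumptions.

          // UHF2: h_{j,a,b}(k) = ((a*k + b) mod (j*size)) / j, with a, b < size.
          int uhf2(int k, int j, int a, int b, int size) {
              long long m = static_cast<long long>(j) * size;      // j*size must fit in an int per the slide
              long long r = (static_cast<long long>(a) * k + b) % m;
              return static_cast<int>(r / j);
          }

          // Slide example: uhf2(1000, 32, 100, 200, 16)
          //   = ((100*1000 + 200) % 512) / 32 = 360 / 32 = 11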
    Thinking about UHF2
     Strengths
           if we’re building a static table, we can try many
            parameter values
           random a,b has guaranteed good properties no
            matter what we’re hashing
           can choose any size table
           very efficient if j and size are powers of 2 (why?)

     Weaknesses
           need to turn non-integer keys into integers


Hashing                  CSE 326 Autumn 2001                24
     Hash Function Summary
     Goals of a hash function
             reproducible mapping from key to table index
             evenly distribute keys across the table
             separate commonly occurring keys (neighboring keys?)
             fast runtime


     Some hash function candidates
             h(n) = n % size
             h(n) = string as base 128 number % size
             Multiplication hash: compute percentage through the table
             Universal hash function A: dot product with random vector
             Universal hash function 2: next pseudo-random number


Hashing                     CSE 326 Autumn 2001                      25
    Hash Function Design Considerations
      Know what your keys are
      Study how your keys are distributed
      Try to include all important information in a
       key in the construction of its hash
      Try to make “neighboring” keys hash to very
       different places
      Prune the features used to create the hash
       until it runs “fast enough” (very application
       dependent)


Hashing             CSE 326 Autumn 2001           26
    Handling Collisions
     Pigeonhole principle says we can’t avoid all collisions
           try to hash without collision n keys into m slots with n > m
           try to put 6 pigeons into 5 holes


     What do we do when two keys hash to the same entry?
           Separate Chaining: put a little dictionary in each entry
           Open Addressing: pick a next entry to try within hashtable

     Terminology madness :-(
           Separate Chaining sometimes called Open Hashing
           Open Addressing sometimes called Closed Hashing

Hashing                    CSE 326 Autumn 2001                      27
     Separate Chaining
     Put a little dictionary at each entry
           Commonly, an unordered linked list (chain)
           Or, choose another Dictionary type as appropriate
            (search tree, hashtable, etc.)

     Properties
           λ can be greater than 1
           performance degrades with length of chains
           an alternate Dictionary type (e.g. search tree, hashtable)
            can speed up the secondary search

     [Figure: table slots 0–6 with chains; h(a) = h(d), so a and d share one chain,
      h(e) = h(b), so e and b share another, and c sits alone.]

Hashing                    CSE 326 Autumn 2001                28
    Separate Chaining Code
     void insert(const Key & k, const Value & v) {
       findBucket(k).insert(k,v);
     }

     Value & find(const Key & k) {
       return findBucket(k).find(k);
     }

     void delete(const Key & k) {
       findBucket(k).delete(k);
     }


          [private]
          Dictionary & findBucket(const Key & k) {
            return table[hash(k)%table.size];
          }


Hashing                  CSE 326 Autumn 2001         29
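     For comparison, a self-contained separate-chaining sketch in C++ using std::list
     chains and std::hash; the class and member names are assumptions for
     illustration, a concrete stand-in for the per-bucket Dictionary used above.

          #include <functional>
          #include <list>
          #include <string>
          #include <vector>

          // Minimal separate-chaining table: each bucket is an unordered chain.
          class ChainedTable {
          public:
              explicit ChainedTable(std::size_t size) : table(size) {}

              void insert(const std::string & key, int value) {
                  table[bucket(key)].push_back({key, value});
              }

              // Returns a pointer to the value, or nullptr if the key is absent.
              int * find(const std::string & key) {
                  for (auto & e : table[bucket(key)])
                      if (e.key == key) return &e.value;
                  return nullptr;
              }

              void erase(const std::string & key) {
                  table[bucket(key)].remove_if(
                      [&](const Entry & e) { return e.key == key; });
              }

          private:
              struct Entry { std::string key; int value; };
              std::size_t bucket(const std::string & key) const {
                  return std::hash<std::string>{}(key) % table.size();
              }
              std::vector<std::list<Entry>> table;
          };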
    Load Factor in Separate Chaining
     Search cost
           unsuccessful search:
            traverse one whole chain: about λ links on average

           successful search:
            traverse about half of one chain: about 1 + λ/2 links on average

     Desired load factor:
           a small constant (λ ≈ 1), so the average search stays O(1)



Hashing                 CSE 326 Autumn 2001   30
    Open Addressing
     Allow one key at each table entry
           two objects that hash to the same spot can’t both go there
           first one there gets the spot
           next one must go in another spot

     Properties
           λ ≤ 1
           performance degrades with difficulty of finding the right spot

     [Figure: table slots 0–6 holding a, d, e, b, c; h(a) = h(d) and h(e) = h(b),
      so d and b were pushed into later slots.]




Hashing                     CSE 326 Autumn 2001                 31
     Probing
     Requires collision resolution function f(i)

     Probing how to:
             First probe - given a key k, hash to h(k)
             Second probe - if h(k) is occupied, try h(k) + f(1)
             Third probe - if h(k) + f(1) is occupied, try h(k) + f(2)
             And so forth
     Probing properties
             we force f(0) = 0
             ith probe is to (h(k) + f(i)) mod size
             if i reaches size - 1, the probe has failed
             depending on f(), the probe may fail sooner
             long sequences of probes are costly!

Hashing                      CSE 326 Autumn 2001                          32
    Linear Probing
     f(i) = i
     Probe sequence is
             h(k) mod size
             (h(k) + 1) mod size
             (h(k) + 2) mod size
             …
                       bool findEntry(const Key & k, Entry *& entry) {
                         int probePoint = hash(k) % size;
                         do {
                           entry = &table[probePoint];
                           probePoint = (probePoint + 1) % size;   // advance to the next slot
                         } while (!entry->isEmpty() && entry->key != k);
                         return !entry->isEmpty();
                       }



Hashing                     CSE 326 Autumn 2001                  33
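     A hedged sketch of how insert could sit on top of findEntry above; the Entry
     layout shown here is an assumption, and it relies on the table being rehashed
     before it fills so the probe loop always terminates.

          // Assumed slot layout for the probing code above (illustrative).
          struct Entry {
              Key   key;
              Value value;
              bool  empty = true;
              bool  isEmpty() const { return empty; }
          };

          // insert via linear probing: the probe loop stops at either the key's
          // existing entry or the first empty slot; fill in whichever it found.
          void insert(const Key & k, const Value & v) {
              Entry * entry;
              findEntry(k, entry);
              entry->key   = k;
              entry->value = v;
              entry->empty = false;   // assumes the table never becomes completely full
          }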
Linear Probing Example
   insert(76)  76%7 = 6 → slot 6                          (1 probe)
   insert(93)  93%7 = 2 → slot 2                          (1 probe)
   insert(40)  40%7 = 5 → slot 5                          (1 probe)
   insert(47)  47%7 = 5; slots 5, 6 full → slot 0         (3 probes)
   insert(10)  10%7 = 3 → slot 3                          (1 probe)
   insert(55)  55%7 = 6; slots 6, 0 full → slot 1         (3 probes)

   Final table:  0: 47   1: 55   2: 93   3: 10   4: —   5: 40   6: 76
    Load Factor in Linear Probing
     For any λ < 1, linear probing will find an empty slot
     Search cost (for large table sizes)
           successful search:     (1/2) · (1 + 1/(1 − λ))

           unsuccessful search:   (1/2) · (1 + 1/(1 − λ)²)

     Linear probing suffers from primary clustering
     Performance quickly degrades for λ > 1/2


Hashing                    CSE 326 Autumn 2001               35
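     Plugging numbers into the formulas above (a quick check, not from the slides):
     at λ = 1/2 a successful search costs about (1/2)(1 + 2) = 1.5 probes and an
     unsuccessful search about (1/2)(1 + 4) = 2.5 probes; at λ = 0.9 these grow to
     about 5.5 and 50.5 probes, which is why performance degrades past λ = 1/2.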
    Quadratic Probing
     f(i) = i²
     Probe sequence:
             h(k) mod size
             (h(k) + 1) mod size
             (h(k) + 4) mod size
             (h(k) + 9) mod size
             …
                        bool findEntry(const Key & k, Entry *& entry) {
                          int probePoint = hash(k) % size, i = 0;
                          do {
                            entry = &table[probePoint];
                            i++;
                            // h(k) + i² = h(k) + (i-1)² + (2i - 1), so add 2i - 1 each step
                            probePoint = (probePoint + (2*i - 1)) % size;
                          } while (!entry->isEmpty() && entry->key != k);
                          return !entry->isEmpty();
                        }

Hashing                     CSE 326 Autumn 2001                 36
Good Quadratic Probing Example
   insert(76)  76%7 = 6 → slot 6                             (1 probe)
   insert(40)  40%7 = 5 → slot 5                             (1 probe)
   insert(48)  48%7 = 6; slot 6 full → (6+1)%7 = 0           (2 probes)
   insert(5)   5%7 = 5; slots 5, 6 full → (5+4)%7 = 2        (3 probes)
   insert(55)  55%7 = 6; slots 6, 0 full → (6+4)%7 = 3       (3 probes)

   Final table:  0: 48   1: —   2: 5   3: 55   4: —   5: 40   6: 76
Bad Quadratic Probing Example
   insert(76)  76%7 = 6 → slot 6       (1 probe)
   insert(93)  93%7 = 2 → slot 2       (1 probe)
   insert(40)  40%7 = 5 → slot 5       (1 probe)
   insert(35)  35%7 = 0 → slot 0       (1 probe)
   insert(47)  47%7 = 5, but the probe sequence only ever visits slots 5, 6, 2, 0,
               all of which are full — the empty slots 1, 3, 4 are never reached  (∞ probes)

   Final table:  0: 35   1: —   2: 93   3: —   4: —   5: 40   6: 76
     Quadratic Probing Succeeds
     for λ ≤ ½
     If size is prime and λ ≤ ½, then quadratic probing will
     find an empty slot in size/2 probes or fewer.
           show for all 0 ≤ i, j ≤ size/2 and i ≠ j:
             (h(x) + i²) mod size ≠ (h(x) + j²) mod size
           by contradiction: suppose that for some such i, j:
             (h(x) + i²) mod size = (h(x) + j²) mod size
             i² mod size = j² mod size
             (i² - j²) mod size = 0
             [(i + j)(i - j)] mod size = 0
           since size is prime, it must divide (i + j) or (i - j)
           but how can i + j = 0 or i + j = size when
             i ≠ j and i, j ≤ size/2?
           same argument for (i - j) mod size = 0


Hashing                 CSE 326 Autumn 2001               39
    Quadratic Probing May Fail
    for λ > ½
      For any i larger than size/2, there is some j
          smaller than i that adds with i to equal size
          (or a multiple of size). D’oh!




Hashing                CSE 326 Autumn 2001            40
    Load Factor in Quadratic Probing
      For any λ ≤ ½, quadratic probing will find an
       empty slot
      For λ > ½, quadratic probing may or may not find a slot
      Quadratic probing does not suffer from primary
       clustering
      Quadratic probing does suffer from secondary
       clustering
           How could we possibly solve this?



Hashing                 CSE 326 Autumn 2001      41
    Double Hashing
     f(i) = i·hash2(k)
     Probe sequence:
          h1(k) mod size
          (h1(k) + 1·h2(k)) mod size
          (h1(k) + 2·h2(k)) mod size
           …
                    bool findEntry(const Key & k, Entry *& entry) {
                      int probePoint = hash1(k) % size, delta = hash2(k);
                      do {
                        entry = &table[probePoint];
                        probePoint = (probePoint + delta) % size;   // step by hash2(k) each time
                      } while (!entry->isEmpty() && entry->key != k);
                      return !entry->isEmpty();
                    }


Hashing                CSE 326 Autumn 2001                  42
    A Good Double Hash Function…
     … is quick to evaluate.
     … differs from the original hash function.
     … never evaluates to 0 (mod size).

     One good choice:
      Choose a prime p < size
      Let hash2(k)= p - (k mod p)



Hashing              CSE 326 Autumn 2001          43
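     A small sketch of this choice of hash2 in C++ (the function name is
     illustrative, and k is assumed non-negative):

          // Secondary hash for double hashing: hash2(k) = p - (k mod p), p a prime < size.
          // The result is always in [1, p], so it is never 0 (mod size).
          int hash2(int k, int p) {
              return p - (k % p);
          }

          // With p = 5 (as in the next slide):
          //   hash2(47, 5) = 5 - (47 % 5) = 3      hash2(55, 5) = 5 - (55 % 5) = 5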
Double Hashing Example (p = 5)
   insert(76)  76%7 = 6 → slot 6                                          (1 probe)
   insert(93)  93%7 = 2 → slot 2                                          (1 probe)
   insert(40)  40%7 = 5 → slot 5                                          (1 probe)
   insert(47)  47%7 = 5 full; hash2(47) = 5 - (47%5) = 3 → (5+3)%7 = 1    (2 probes)
   insert(10)  10%7 = 3 → slot 3                                          (1 probe)
   insert(55)  55%7 = 6 full; hash2(55) = 5 - (55%5) = 5 → (6+5)%7 = 4    (2 probes)

   Final table:  0: —   1: 47   2: 93   3: 10   4: 55   5: 40   6: 76
    Load Factor in Double Hashing
     For any λ < 1, double hashing will find an empty slot
     (given appropriate table size and hash2)

     Search cost appears to approach optimal (random hash):
           successful search:     (1/λ) · ln(1/(1 − λ))

           unsuccessful search:   1/(1 − λ)

     No primary clustering and no secondary clustering

     One extra hash calculation

Hashing                 CSE 326 Autumn 2001              45
    Deletion in Open Addressing
     [Example: keys 0, 1, 2, 7 stored with linear probing (7%7 = 0, so 7 ended up
      in slot 3). After delete(2), find(7) stops at the now-empty slot 2 —
      where is it?!]

     Must use lazy deletion!
     On insertion, treat a (lazily) deleted item as an empty slot



Hashing                   CSE 326 Autumn 2001                     46
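     One common way to realize lazy deletion, as a hedged sketch (the slot layout,
     names, and the use of h(k) = k with linear probing are assumptions): each slot
     remembers whether it is empty, full, or lazily deleted; find probes past
     deleted slots, while insert may reuse them.

          // Each slot is empty, full, or lazily deleted (a "tombstone").
          enum class SlotState { Empty, Full, Deleted };

          struct Slot {
              int       key;
              SlotState state = SlotState::Empty;
          };

          // A Deleted slot does NOT stop the probe -- the key may live further on.
          bool lazyFind(const Slot table[], int size, int key) {
              int i = key % size;                         // h(k) = k, linear probing
              for (int probes = 0; probes < size; ++probes) {
                  if (table[i].state == SlotState::Empty) return false;             // true miss
                  if (table[i].state == SlotState::Full && table[i].key == key) return true;
                  i = (i + 1) % size;                     // step past tombstones and mismatches
              }
              return false;                               // probed the whole table
          }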
    The Squished Pigeon Principle
      Insert using Open Addressing cannot work with λ ≥ 1.
      Insert using Open Addressing with quadratic probing
       may not work with λ > ½.
      With Separate Chaining or Open Addressing, large
          load factors lead to poor performance!

     How can we relieve the pressure on the pigeons?
           Hint: what happens when we overrun array storage in a
            {queue, stack, heap}?
           What else must happen with a hashtable?



Hashing                   CSE 326 Autumn 2001                   47
     Rehashing
     When the load factor λ gets “too large” (over some constant
     threshold), rehash all elements into a new, larger table:
           takes O(n), but amortized O(1) as long as we (just about)
              double table size on the resize
             spreads keys back out, may drastically improve performance
             gives us a chance to retune parameterized hash functions
             avoids failure for Open Addressing techniques
             allows arbitrarily large tables starting from a small table
             clears out lazily deleted items




Hashing                     CSE 326 Autumn 2001                     48
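     A sketch of the rehash step under simplifying assumptions (a chained table of
     non-negative int keys, doubling the bucket count; the names are illustrative):

          #include <list>
          #include <vector>

          // Rehash: allocate a roughly twice-as-large table and reinsert every key.
          // O(n) for this one operation, amortized O(1) per insert if we keep doubling.
          void rehash(std::vector<std::list<int>> & table) {
              std::vector<std::list<int>> bigger(2 * table.size());   // ideally the next prime
              for (const auto & bucket : table)
                  for (int key : bucket)
                      bigger[key % bigger.size()].push_back(key);     // re-spread the keys
              table.swap(bigger);
          }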
    Case Study
     Spelling dictionary
           30,000 words
           static
           arbitrary(ish) preprocessing time

     Goals
           fast spell checking
           minimal storage

     Practical notes
           almost all searches are successful – Why?
           words average about 8 characters in length
           30,000 words at 8 bytes/word ~ .25 MB
           pointers are 4 bytes
           there are many regularities in the structure of English words


Hashing                    CSE 326 Autumn 2001                         49
    Case Study:
    Design Considerations
     Possible Solutions
           sorted array + binary search
           Separate Chaining
           Open Addressing + linear probing


     Issues
           Which data structure should we use?
           Which type of hash function should we use?



Hashing                 CSE 326 Autumn 2001              50
    Case Study:
    Storage
     Assume words are strings and entries are
     pointers to strings

     [Figure: the three candidate layouts — array + binary search,
      Separate Chaining, Open Addressing — each pointing into the word storage.]

How many pointers
does each use?

Hashing                   CSE 326 Autumn 2001              51
    Case Study:
    Analysis
                             Storage                      Time
               Binary search n pointers + words           log2 n ≈ 15 probes per
                             = 360KB                      access, worst case
           Separate Chaining n + n/λ pointers + words     1 + λ/2 probes per access on
                             (λ = 1 → 600KB)              average (λ = 1 → 1.5 probes)
            Open Addressing  n/λ pointers + words         (1 + 1/(1 − λ))/2 probes per
                             (λ = 0.5 → 480KB)            access on average
                                                          (λ = 0.5 → 1.5 probes)

      What to do, what to do? …

Hashing                    CSE 326 Autumn 2001                   52
    Perfect Hashing
     When we know the entire key set in advance …
           Examples: programming language keywords, CD-
            ROM file list, spelling dictionary, etc.


     … then perfect hashing lets us achieve:
           Worst-case O(1) time complexity!
           Worst-case O(n) space complexity!




Hashing                  CSE 326 Autumn 2001           53
    Perfect Hashing Technique
         Static set of n known keys
         Separate chaining, two-level hash
         Primary hash table size = n
         jth secondary hash table size = nj²
          (where nj keys hash to slot j in the primary hash table)
         Universal hash functions in all hash tables
         Conduct (a few!) random trials,
          until we get collision-free hash functions

     [Figure: primary hash table with slots 0–6, each slot pointing to its own
      secondary hash table.]


Hashing                    CSE 326 Autumn 2001                            54
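     A hedged, simplified C++ sketch of the two-level construction described above,
     for a static set of distinct non-negative int keys; the universal hash family
     ((a*k + b) mod p) mod m, the prime p, and all names are assumptions chosen for
     illustration, not the course’s implementation.

          #include <algorithm>
          #include <cstdlib>
          #include <vector>

          const long long P = 2147483647LL;          // a prime larger than any key (2^31 - 1)

          struct UnivHash {                          // h(k) = ((a*k + b) mod p) mod m
              long long a = 1, b = 0; int m = 1;
              int operator()(int k) const { return (int)(((a * k + b) % P) % m); }
              void randomize(int size) {
                  a = 1 + std::rand() % (P - 1); b = std::rand() % P; m = size;
              }
          };

          struct PerfectHash {
              UnivHash primary;                         // primary table of size n
              std::vector<UnivHash> secHash;            // one hash per secondary table
              std::vector<std::vector<int>> secTable;   // jth table has size nj^2; -1 = empty

              void build(const std::vector<int> & keys) {
                  int n = (int)keys.size();
                  primary.randomize(n);
                  std::vector<std::vector<int>> buckets(n);
                  for (int k : keys) buckets[primary(k)].push_back(k);

                  secHash.resize(n); secTable.resize(n);
                  for (int j = 0; j < n; ++j) {
                      int nj = (int)buckets[j].size();
                      bool collisionFree = false;
                      while (!collisionFree) {          // a few random trials, in expectation
                          secHash[j].randomize(std::max(1, nj * nj));
                          secTable[j].assign(secHash[j].m, -1);
                          collisionFree = true;
                          for (int k : buckets[j]) {
                              int & slot = secTable[j][secHash[j](k)];
                              if (slot != -1) { collisionFree = false; break; }   // collision: retry
                              slot = k;
                          }
                      }
                  }
              }

              bool contains(int k) const {              // worst-case O(1): two hashes, one compare
                  int j = primary(k);
                  return secTable[j][secHash[j](k)] == k;
              }
          };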
    Perfect Hashing Theorems1
     Theorem: If we store n keys in a hash table of size n2 using a randomly
     chosen universal hash function, then the probability of any collision is < ½.

     Theorem: If we store n keys in a hash table of size m = n using a randomly
     chosen universal hash function, then
           E[ Σ_{j=0}^{m-1} nj² ] < 2n
     where nj is the number of keys hashing to slot j.

     Corollary: If we store n keys in a hash table of size m=n using a randomly
     chosen universal hash function and we set the size of each secondary hash
     table to mj=nj2, then:
      a) The expected amount of storage required for all secondary hash tables is less than 2n.
      b) The probability that the total storage used for all secondary hash tables exceeds 4n is
         less than ½.
                               1Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein
Hashing                          CSE 326 Autumn 2001                                      55
    Perfect Hashing Conclusions
     Perfect hashing theorems set tight expected bounds on
     sizes and collision behavior of all the hash tables (primary
     and all secondaries).

      Conduct a few random trials of universal hash
     functions, by simply varying UHF parameters, until we get
     a set of UHFs and associated table sizes which deliver …
         Worst-case O(1) time complexity!
         Worst-case O(n) space complexity!



Hashing                CSE 326 Autumn 2001                 56
    Extendible Hashing:
    Cost of a Database Query




  I/O to CPU ratio is 300-to-1!

Hashing                     CSE 326 Autumn 2001   57
    Extendible Hashing
     Hashing technique for huge data sets
           optimizes to reduce disk accesses
           each hash bucket fits on one disk block
           better than B-Trees if order is not important – why?


     Table contains
           buckets, each fitting in one disk block, with the data
           a directory that fits in one disk block is used to hash
            to the correct bucket


Hashing                  CSE 326 Autumn 2001                 58
    Extendible Hash Table
      Directory entry: key prefix (first k bits) and a pointer to the bucket
       with all keys starting with its prefix
      Each block contains keys matching on their first j ≤ k bits, plus the data
       associated with each key

                                                           directory for k = 3
                  000   001       010      011      100    101    110     111


           (2)           (2)                (3)            (3)           (2)
          00001         01001              10001          10101         11001
          00011         01011              10011          10110         11011
          00100         01100                             10111         11100
          00110                                                         11110


Hashing                       CSE 326 Autumn 2001                        59
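     A hedged sketch of the directory lookup only (splitting is omitted; the names,
     std::uint32_t keys, and the keyBits parameter are assumptions): the first k bits
     of a key index the directory, and that entry points to a bucket whose keys agree
     on their first j ≤ k bits.

          #include <cstdint>
          #include <vector>

          struct Bucket {
              int localDepth;                      // j: bits this bucket actually discriminates on
              std::vector<std::uint32_t> keys;     // all keys here agree on their first j bits
          };

          struct Directory {
              int globalDepth;                     // k: bits used to index the directory
              std::vector<Bucket *> entry;         // 2^k pointers; several may share one bucket

              // assumes 1 <= globalDepth <= keyBits
              Bucket * lookup(std::uint32_t key, int keyBits = 32) const {
                  std::uint32_t prefix = key >> (keyBits - globalDepth);   // first k bits of the key
                  return entry[prefix];
              }
          };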
Inserting (easy case)

              000   001     010   011     100    101    110    111

       (2)           (2)           (3)           (3)           (2)
      00001         01001         10001         10101         11001
      00011         01011         10011         10110         11100
      00100         01100                       10111         11110
      00110

                insert(11011)

              000   001     010   011     100    101    110    111

       (2)           (2)           (3)           (3)           (2)
      00001         01001         10001         10101         11001
      00011         01011         10011         10110         11011
      00100         01100                       10111         11100
      00110                                                   11110
Splitting a Leaf

                    000     001     010    011      100    101     110    111

            (2)              (2)            (3)            (3)            (2)
           00001            01001          10001          10101          11001
           00011            01011          10011          10110          11011
           00100            01100                         10111          11100
           00110                                                         11110

                           insert(11000)
          000   001        010    011     100      101    110     111

   (2)              (2)            (3)            (3)            (3)          (3)
  00001            01001          10001          10101          11000        11100
  00011            01011          10011          10110          11001        11110
  00100            01100                         10111          11011
  00110
    Splitting the Directory
     1.   insert(10010)
                                                    00      01    10     11
           But, no room to insert and
           no adoption!
                                                     (2)          (2)          (2)
     2.   Solution: Expand directory                01101        10000        11001
                                                                 10001        11110
     3.   Then, it’s just a normal split.                        10011
                                                                 10111



                                            000 001 010 011 100 101 110 111




Hashing                       CSE 326 Autumn 2001                             62
    If Extendible Hashing Doesn’t Cut It
     Store only pointers to the items
          + (potentially) much smaller M
          + fewer items in the directory
          – one extra disk access!
     Rehash
          + potentially better distribution over the buckets
          + fewer unnecessary items in the directory
          – can’t solve the problem if there’s simply too much data

     What if these don’t work?
           use a B-Tree to store the directory!



Hashing                    CSE 326 Autumn 2001                        63
    Hash Wrap
     Collision resolution
      • Separate Chaining
           Expand beyond hashtable via secondary Dictionaries
           Allows λ > 1
      • Open Addressing
           Expand within hashtable
           Secondary probing: {linear, quadratic, double hash}
           λ ≤ 1 (by definition!)
           λ ≤ ½ (by preference!)

     Rehashing
         Tunes up hashtable when λ crosses the line

     Hash functions
          Simple integer hash: prime table size
          Multiplication method
          Universal hashing guarantees no (always) bad input

     Perfect hashing
          Requires known, fixed keyset
          Achieves O(1) time, O(n) space - guaranteed!

     Extendible hashing
          For disk-based data
          Combine with B-tree directory if needed

Hashing                        CSE 326 Autumn 2001                          64