# Hashing CSE 326

```
CSE 326: Hashing

David Kaplan
Dept of Computer Science & Engineering
Autumn 2001

Dictionary ADT
[diagram: client calls insert, find, create, destroy; example entry
"Hannah" -> "C++ guru, l33t haxtor, Roller-blade demon, Older than dirt"]

Stores values associated with user-specified keys
- values may be any (homogeneous) type
- keys may be any (homogeneous) comparable type

Hashing                    CSE 326 Autumn 2001                                        2
Dictionary Implementations So Far
                    Insert    Find      Delete
Unsorted list       O(1)      O(n)      O(n)
Trees               O(log n)  O(log n)  O(log n)
Sorted array        O(n)      O(log n)  O(n)
Array               O(1)      O(1)      O(1)
  (special case: known keys in {1, ..., K})

A Digression on Keys
Methods are the contract between an ADT and the
outside agent (client code)
- Ex: Dictionary contract is {insert, find, delete}
- Ex: Priority Q contract is {insert, deleteMin}

Keys are the currency used in transactions between
the client and the ADT
- Ex: insert(key), find(key), delete(key)

So ...
- How about O(1) insert/find/delete for any key type?

Hash Table Goal:
Key as Index
We can access a record as a[5]; we want to access a record as a["Hannah"].
[diagram: array indexed by the integer 5 vs. array indexed by "Hannah",
both holding the record "Hannah: C++ guru"]
Hash Table Approach

[diagram: keys Hannah, Dave, Donald, Ed mapped into table slots]

But... is there a problem with this pipe-dream?

Hash Table
Dictionary Data Structure
Hash function: maps keys to integers
[diagram: keys Hannah, Dave, Donald, Ed mapped by f(x) into an
unordered and sparse table]

Result:
- Can quickly find the right spot for a given entry

Result:
- Cannot efficiently list all entries
- Cannot efficiently find min, max, ordered ranges
Hash Table Taxonomy
[diagram: keys Hannah, Dave, Donald, Ed pass through the hash function
into the table; two keys collide]

load factor λ = (# of entries in table) / tableSize

Agenda:
Hash Table Design Decisions
 What should the hash function be?

 What should the table size be?

 How should we resolve collisions?

Hash Function
Hash function maps a key to a table index
Value & find(const Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];   // collision handling comes later
}

What Makes A Good Hash Function?
Fast runtime
- O(1) and fast in practical terms

Distributes the data evenly
- ideally, hash(a) % size != hash(b) % size for a != b

Uses the whole hash table
- for all 0 <= i < size, there exists a k such that hash(k) % size = i

Good Hash Function for
Integer Keys
Choose
- tableSize is prime
- hash(n) = n

Example: tableSize = 7
[diagram: table with slots 0..6]
insert(4)     // slot 4 % 7 = 4
insert(17)    // slot 17 % 7 = 3
find(12)      // probe slot 12 % 7 = 5
insert(9)     // slot 9 % 7 = 2
delete(17)    // slot 3

Good Hash Function for Strings?
Let s = s1 s2 s3 s4 ... sn: choose
- hash(s) = s1 + s2*128 + s3*128^2 + s4*128^3 + ... + sn*128^(n-1)
- Think of the string as a base 128 (aka radix 128) number

Problems:
- hash("really, really big") = well... something really, really big
- hash("one thing") % 128 = hash("other thing") % 128
  (mod 128, only the first character survives, so those two collide)

String Hashing
Issues and Techniques
Minimize collisions
 Make tableSize and radix relatively prime
Typically, make tableSize not a multiple of 128

Simplify computation
 Use Horner’s Rule
int hash(const string & s) {
    int h = 0;
    // Horner's Rule: s1 + 128*(s2 + 128*(s3 + ...)), taken mod tableSize
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128*h) % tableSize;
    }
    return h;
}

Good Hashing:
Multiplication Method
Hash function is defined by table size plus a parameter A
hA(k) = floor(size * ((k*A) mod 1)) where 0 < A < 1

Example: size = 10, A = 0.485
hA(50) = floor(10 * ((50*0.485) mod 1))
       = floor(10 * (24.25 mod 1)) = floor(10 * 0.25) = floor(2.5) = 2

- no restriction on size!
- when building a static table, we can try several values of A
- more computationally intensive than a single mod
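
As a sketch (not from the slides, names illustrative), the multiplication
method above can be coded directly; size and A here are the example's values:

#include <cmath>

// Multiplication method: take the fractional part of k*A, scale by size.
int multHash(int k, int size, double A) {
    double frac = std::fmod(k * A, 1.0);   // (k*A) mod 1
    return (int)std::floor(size * frac);   // scale up to a table index
}
// e.g. multHash(50, 10, 0.485) yields 2, matching the example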

Hashing Dilemma
What if your Worst Enemy gets to see your hash function first, and then gets
to decide which keys to send you?

Faced with this enticing possibility, Worst Enemy decides to:
a) Send you keys which maximize collisions for your hash function.
b) Take a nap.

Moral: No single hash function can protect you!

Faced with this dilemma, you:
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing¹
Suppose we have a set K of possible keys, and a finite set H of hash
functions that map keys to entries in a hashtable of size m.
[diagram: keys k1, k2 in K mapped by some h in H to slots 0 .. m-1]

Definition:
H is a universal collection of hash functions if and only if ...
For any two keys k1, k2 in K, there are at most |H|/m functions in H for
which h(k1) = h(k2).

- So ... if we randomly choose a hash function from H, our chances of collision
  are no more than if we get to choose hash table entries at random!

¹Motivation: see previous slide (or visit http://www.burgerking.com/jobs)
Random Hashing – Not!
How can we “randomly choose a hash function”?
 Certainly we cannot randomly choose hash functions at runtime,
interspersed amongst the inserts, finds, deletes! Why not?

 We can, however, randomly choose a hash function each time
we initialize a new hashtable.

Conclusions
 Worst Enemy never knows which hash function we will choose –
neither do we!
 No single input (set of keys) can always evoke worst-case behavior

Good Hashing:
Universal Hash Function A (UHFa)
Parameterized by prime table size and vector:
a = <a0 a1 … ar> where 0 <= ai < size

Represent each key as r + 1 integers where ki < size
 size = 11, key = 39752 ==> <3,9,7,5,2>
 size = 29, key = “hello world” ==>
<8,5,12,12,15,23,15,18,12,4>

ha(k) = (a0*k0 + a1*k1 + ... + ar*kr) mod size

UHFa: Example
 Context: hash strings of length 3 in a table of size 131

let a = <35, 100, 21>
ha(“xyz”) = (35*120 + 100*121 + 21*122) % 131
= 129
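
A sketch of UHFa in C++ (assuming, as in the example, that the ki are the
characters' ASCII codes; names are illustrative):

#include <string>
#include <vector>

// UHFa: dot product of the key's components with vector a, then mod size.
int uhfA(const std::string & key, const std::vector<int> & a, int size) {
    long long sum = 0;
    for (std::size_t i = 0; i < key.size(); ++i)
        sum += (long long)a[i] * (unsigned char)key[i];   // ai * ki
    return (int)(sum % size);
}
// e.g. uhfA("xyz", {35, 100, 21}, 131) yields 129, matching the example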

Strengths:
 works on any type as long as you can form ki’s
 if we’re building a static table, we can try many
values of the hash vector <a>
 random <a> has guaranteed good properties no
matter what we’re hashing

Weaknesses
 must choose prime table size larger than any ki

Good Hashing:
Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:
 j * size should fit into an int
 a and b must be less than size

hj,a,b(k) = ((ak + b) mod (j*size))/j

UHF2 : Example
Context: hash integers in a table of size 16

let j = 32, a = 100, b = 200
hj,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32
= (100200 % 512) / 32
= 360 / 32
= 11
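
The computation above, as a quick sketch (function name is illustrative):

// UHF2: integer division by j keeps the result below size.
int uhf2(int k, int j, int a, int b, int size) {
    return ((a * k + b) % (j * size)) / j;
}
// e.g. uhf2(1000, 32, 100, 200, 16) yields 11, matching the example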

Strengths
 if we’re building a static table, we can try many
parameter values
 random a,b has guaranteed good properties no
matter what we’re hashing
 can choose any size table
 very efficient if j and size are powers of 2 (why?)

Weaknesses
 need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function
   reproducible mapping from key to table index
   evenly distribute keys across the table
   separate commonly occurring keys (neighboring keys?)
   fast runtime

Some hash function candidates
   h(n) = n % size
   h(n) = string as base 128 number % size
   Multiplication hash: compute percentage through the table
   Universal hash function A: dot product with random vector
   Universal hash function 2: next pseudo-random number

Hash Function Design Considerations
 Know what your keys are
 Study how your keys are distributed
 Try to include all important information in a
key in the construction of its hash
 Try to make “neighboring” keys hash to very
different places
 Prune the features used to create the hash
until it runs “fast enough” (very application
dependent)

Handling Collisions
Pigeonhole principle says we can’t avoid all collisions
 try to hash without collision n keys into m slots with n > m
 try to put 6 pigeons into 5 holes

What do we do when two keys hash to the same entry?
 Separate Chaining: put a little dictionary in each entry
 Open Addressing: pick a next entry to try within hashtable

 Separate Chaining sometimes called Open Hashing
 Open Addressing sometimes called Closed Hashing

Separate Chaining
Put a little dictionary at each entry (chain)
- Or, choose another Dictionary type as appropriate (search tree,
  hashtable, etc.)

[diagram: table slots 0..6 with h(a) = h(d) and h(e) = h(b);
one slot chains a -> d, another chains e -> b, another holds c]

Properties
- λ can be greater than 1
- performance degrades with length of chains
- Alternate Dictionary type (e.g. search tree, hashtable) can speed up
  secondary search
Separate Chaining Code
void insert(const Key & k, const Value & v) {
    findBucket(k).insert(k, v);
}

Value & find(const Key & k) {
    return findBucket(k).find(k);
}

void remove(const Key & k) {   // "delete" is a reserved word in C++
    findBucket(k).remove(k);
}

[private]
Dictionary & findBucket(const Key & k) {
    return table[hash(k) % table.size];
}

Search cost (Separate Chaining)
- unsuccessful search: λ (expected chain length examined)

- successful search: 1 + λ/2 (expected)
Open Addressing
Allow one key at each table entry
- two objects that hash to the same spot can't both go there
- first one there gets the spot
- next one must go in another spot

[diagram: slots 0..6 with h(a) = h(d) and h(e) = h(b);
d and b are bumped to the slots after a and e]

Properties
- λ <= 1
- performance degrades with difficulty of finding right spot
Probing
Requires collision resolution function f(i)

Probing how to:
   First probe - given a key k, hash to h(k)
   Second probe - if h(k) is occupied, try h(k) + f(1)
   Third probe - if h(k) + f(1) is occupied, try h(k) + f(2)
   And so forth
Probing properties
   we force f(0) = 0
   ith probe is to (h(k) + f(i)) mod size
   if i reaches size - 1, the probe has failed
   depending on f(), the probe may fail sooner
   long sequences of probes are costly!

Linear Probing
f(i) = i
Probe sequence is
- h(k) mod size
- (h(k) + 1) mod size
- (h(k) + 2) mod size
- ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k) % size;
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + 1) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();   // note: loops forever if the table is full
}

Linear Probing Example
insert(76)  insert(93)  insert(40)  insert(47)  insert(10)  insert(55)
76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5    10%7 = 3    55%7 = 6
[size-7 table after each insert; 47 probes slots 5, 6, 0 and lands in 0;
55 probes slots 6, 0, 1 and lands in 1]
final table:  0:47  1:55  2:93  3:10  4:--  5:40  6:76
probes:     1           1           1           3           1           3
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes)
- successful search:   (1/2)(1 + 1/(1 - λ))
- unsuccessful search: (1/2)(1 + 1/(1 - λ)^2)

Linear probing suffers from primary clustering
Performance quickly degrades for λ > 1/2

Quadratic Probing
f(i) = i^2
Probe sequence:
- h(k) mod size
- (h(k) + 1) mod size
- (h(k) + 4) mod size
- (h(k) + 9) mod size
- ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k) % size, i = 0;
    do {
        entry = &table[probePoint];
        i++;
        probePoint = (probePoint + (2*i - 1)) % size;   // +1, +3, +5... sums to i^2
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

Quadratic Probing Example (succeeds)
insert(76)  insert(40)  insert(48)  insert(5)   insert(55)
76%7 = 6    40%7 = 5    48%7 = 6    5%7 = 5     55%7 = 6
[size-7 table; 48 probes 6, 0; 5 probes 5, 6, 2; 55 probes 6, 0, 3]
final table:  0:48  1:--  2:5  3:55  4:--  5:40  6:76
probes:     1           1           2           3           3

Quadratic Probing Example (fails)
insert(76)  insert(93)  insert(40)  insert(35)  insert(47)
76%7 = 6    93%7 = 2    40%7 = 5    35%7 = 0    47%7 = 5
[size-7 table; 47 probes 5, 6, 2, 0, 0, 2, ... and cycles without
ever reaching an empty slot]
probes:     1           1           1           1           (never terminates!)
Quadratic Probing: for λ <= 1/2
If size is prime and λ <= 1/2, then quadratic probing will
find an empty slot in size/2 probes or fewer.
- show: for all 0 <= i, j <= size/2 with i != j,
  (h(x) + i^2) mod size != (h(x) + j^2) mod size
- by contradiction: suppose that for some such i, j:
  (h(x) + i^2) mod size = (h(x) + j^2) mod size
  => i^2 mod size = j^2 mod size
  => (i^2 - j^2) mod size = 0
  => [(i + j)(i - j)] mod size = 0
- since size is prime, it must divide (i + j) or (i - j);
  but how can i + j = 0 or i + j = size when i != j and i, j <= size/2?
- the same argument rules out (i - j) mod size = 0
Quadratic Probing: for λ > 1/2
- For any i larger than size/2, there is some j smaller than i that
  adds with i to equal size (or a multiple of size), so probes i and j
  hit the same slot. D'oh!

Quadratic Probing Summary
- For any λ <= 1/2, quadratic probing will find an empty slot
- For λ > 1/2, quadratic probing may find a slot
- Quadratic probing does not suffer from primary clustering
- Quadratic probing does suffer from secondary clustering
  (keys that hash to the same spot follow the same probe sequence)
- How could we possibly solve this?

Double Hashing
f(i) = i * hash2(k)
Probe sequence:
- h1(k) mod size
- (h1(k) + 1*h2(k)) mod size
- (h1(k) + 2*h2(k)) mod size
- ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash1(k) % size, delta = hash2(k);
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + delta) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

A Good Double Hash Function…
… is quick to evaluate.
… differs from the original hash function.
… never evaluates to 0 (mod size).

One good choice:
Choose a prime p < size
Let hash2(k)= p - (k mod p)
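
That choice, as a one-line sketch (p must be prime and less than size):

// hash2(k) = p - (k % p) lies in [1, p], so it never evaluates to
// 0 (mod size) as long as p < size.
int hash2(int k, int p) {
    return p - (k % p);
}
// e.g. with p = 5: hash2(47, 5) = 3 and hash2(55, 5) = 5,
// the deltas used in the example on the next slide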

Double Hashing Example (p = 5)
insert(76)  insert(93)  insert(40)  insert(47)        insert(10)  insert(55)
76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5          10%7 = 3    55%7 = 6
                                    5 - (47%5) = 3                5 - (55%5) = 5
[size-7 table; 47 probes 5, then (5+3)%7 = 1; 55 probes 6, then (6+5)%7 = 4]
final table:  0:--  1:47  2:93  3:10  4:55  5:40  6:76
probes:     1           1           1           2                 1           2
For any λ < 1, double hashing will find an empty slot
(given appropriate table size and hash2)

Search cost appears to approach optimal (random hash):
- successful search:   (1/λ) ln(1/(1 - λ))
- unsuccessful search: 1/(1 - λ)

No primary clustering and no secondary clustering

One extra hash calculation

Open Addressing Deletion
delete(2)    find(7)
[diagram: deleting 2 empties a slot in the middle of 7's probe sequence,
so a later find(7) stops at the empty slot - Where is it?!]
- Must use lazy deletion!
- On insertion, treat a (lazily) deleted item as an empty slot
The Squished Pigeon Principle
- Insert using Open Addressing cannot work with λ = 1,
  and may not work (for quadratic probing) with λ > 1/2.
- With Separate Chaining or Open Addressing, large λ degrades performance.

How can we relieve the pressure on the pigeons?
- Hint: what happens when we overrun array storage in a {queue, stack, heap}?
- What else must happen with a hashtable?

Rehashing
When the load factor λ gets "too large" (over some constant
threshold), rehash all elements into a new, larger table:
- takes O(n), but amortized O(1) as long as we (just about)
  double table size on the resize
- spreads keys back out, may drastically improve performance
- gives us a chance to retune parameterized hash functions
- avoids failure for Open Addressing techniques
- allows arbitrarily large tables starting from a small table
- clears out lazily deleted items
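
A minimal sketch of the rehash step for a separate-chaining table of integer
keys (the structure here is illustrative, not the course's code):

#include <list>
#include <vector>

// Rehash: move every key into a table of (just about) double the size.
void rehash(std::vector<std::list<int>> & table) {
    std::vector<std::list<int>> grown(table.size() * 2);
    for (const auto & chain : table)
        for (int k : chain)
            grown[k % grown.size()].push_back(k);   // re-hash each key
    table.swap(grown);
}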

Case Study
Spelling dictionary:
- 30,000 words
- static
- arbitrary(ish) preprocessing time

Goals:
- fast spell checking
- minimal storage

Practical notes:
- almost all searches are successful - Why?
- words average about 8 characters in length
- 30,000 words at 8 bytes/word ~ .25 MB
- pointers are 4 bytes
- there are many regularities in the structure of English words
Case Study:
Design Considerations
Possible Solutions
 sorted array + binary search
 Separate Chaining
 Open Addressing + linear probing

Issues
 Which data structure should we use?
 Which type of hash function should we use?

Case Study:
Storage
Assume words are strings and entries are pointers to strings

[diagram: a sorted array of pointers (binary search) vs. a Separate
Chaining table, both pointing to the word strings]

How many pointers does each use?
Case Study:
Analysis
                    storage                 time
Binary search       n pointers + words      log2 n ~ 15 probes per access,
                    = 360KB                 worst case
Separate Chaining   n + n/λ pointers        1 + λ/2 probes per access on average
                    + words                 (λ = 1 => 1.5 probes)
                    (λ = 1 => 600KB)
Open Addressing     n/λ pointers + words    (1 + 1/(1 - λ))/2 probes per access
                    (λ = 0.5 => 480KB)      on average (λ = 0.5 => 1.5 probes)

What to do, what to do? ...
Perfect Hashing
When we know the entire key set in advance …
 Examples: programming language keywords, CD-
ROM file list, spelling dictionary, etc.

… then perfect hashing lets us achieve:
 Worst-case O(1) time complexity!
 Worst-case O(n) space complexity!

Perfect Hashing Technique
- Static set of n known keys
- Separate chaining, two-level hash
- Primary hash table size = n
- jth secondary hash table size = nj^2
  (where nj keys hash to slot j in primary hash table)
- Universal hash functions in all hash tables
- Conduct (a few!) random trials, until we get collision-free
  hash functions

[diagram: primary hash table with slots 0..6, each slot pointing
to its own secondary hash table]
Perfect Hashing Theorems1
Theorem: If we store n keys in a hash table of size n2 using a randomly
chosen universal hash function, then the probability of any collision is < ½.

Theorem: If we store n keys in a hash table of size m=n using a randomly
chosen universal hash function, then
 m1 2 
E  n j   2n
 j 0 
where nj is the number of keys hashing to slot j.

Corollary: If we store n keys in a hash table of size m=n using a randomly
chosen universal hash function and we set the size of each secondary hash
table to mj=nj2, then:
a) The expected amount of storage required for all secondary hash tables is less than 2n.
b) The probability that the total storage used for all secondary hash tables exceeds 4n is
less than ½.
¹Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein
Perfect Hashing Conclusions
Perfect hashing theorems set tight expected bounds on
sizes and collision behavior of all the hash tables (primary
and all secondaries).

 Conduct a few random trials of universal hash
functions, by simply varying UHF parameters, until we get
a set of UHFs and associated table sizes which deliver …
 Worst-case O(1) time complexity!
 Worst-case O(n) space complexity!

Extendible Hashing:
Cost of a Database Query

I/O to CPU ratio is 300-to-1!

Extendible Hashing
Hashing technique for huge data sets
 optimizes to reduce disk accesses
 each hash bucket fits on one disk block
 better than B-Trees if order is not important – why?

Table contains
 buckets, each fitting in one disk block, with the data
 a directory that fits in one disk block is used to hash
to the correct bucket
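
The directory lookup can be sketched as follows (hypothetical structure;
a real system would read the bucket from its disk block):

#include <cstdint>

struct Bucket { /* keys and data, sized to fit one disk block */ };

// The first k bits of the 32-bit hash select the directory entry.
Bucket * lookup(std::uint32_t h, Bucket * directory[], int k) {
    std::uint32_t index = h >> (32 - k);   // top k bits (assumes 0 < k <= 32)
    return directory[index];
}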

Extendible Hash Table
- Directory entry: key prefix (first k bits) and a pointer to the bucket
  with all keys starting with its prefix
- Each block contains keys matching on first j <= k bits, plus the data
  associated with each key

directory for k = 3
000   001       010      011      100    101    110     111

(2)           (2)                (3)            (3)           (2)
00001         01001              10001          10101         11001
00011         01011              10011          10110         11011
00100         01100                             10111         11100
00110                                                         11110

Inserting (easy case)

000   001     010   011     100    101    110    111

(2)           (2)           (3)           (3)           (2)
00001         01001         10001         10101         11001
00011         01011         10011         10110         11100
00100         01100                       10111         11110
00110

insert(11011)

000   001     010   011     100    101    110    111

(2)           (2)           (3)           (3)           (2)
00001         01001         10001         10101         11001
00011         01011         10011         10110         11011
00100         01100                       10111         11100
00110                                                   11110
Splitting a Leaf

000     001     010    011      100    101     110    111

(2)              (2)            (3)            (3)            (2)
00001            01001          10001          10101          11001
00011            01011          10011          10110          11011
00100            01100                         10111          11100
00110                                                         11110

insert(11000)
000   001        010    011     100      101    110     111

(2)              (2)            (3)            (3)            (3)          (3)
00001            01001          10001          10101          11000        11100
00011            01011          10011          10110          11001        11110
00100            01100                         10111          11011
00110
Splitting the Directory
1. insert(10010)
   But, there is no room to insert, and the full bucket already
   matches on as many directory bits as we have!
2. Solution: Expand directory
   000 001 010 011 100 101 110 111
3. Then, it's just a normal split.

directory before expansion (k = 2):
00    01        10        11

(2)             (2)       (2)
01101           10000     11001
                10001     11110
                10011
                10111
If Extendible Hashing Doesn’t Cut It
Store only pointers to the items
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
Rehash
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can’t solve the problem if there’s simply too much data

What if these don’t work?
 use a B-Tree to store the directory!

Hash Wrap

Collision resolution
- Separate Chaining
  - Expand beyond hashtable via secondary Dictionaries
  - Allows λ > 1
- Open Addressing
  - Expand within hashtable
  - Secondary probing: {linear, quadratic, double hash}
  - λ <= 1 (by definition!)
  - λ <= 1/2 (by preference!)

Rehashing
- Tunes up hashtable when λ crosses the line

Hash functions
- Simple integer hash: prime table size
- Multiplication method
- Universal hashing guarantees

Perfect hashing
- Requires known, fixed keyset
- Achieves O(1) time, O(n) space - guaranteed!

Extendible hashing
- For disk-based data
- Combine with b-tree directory if needed

```