CSE 326: Hashing
David Kaplan, Dept of Computer Science & Engineering
Autumn 2001

Reminder: Dictionary ADT
Dictionary operations: create, destroy, insert, find, delete.
Stores values associated with user-specified keys:
- values may be any (homogeneous) type
- keys may be any (homogeneous) comparable type
Example: insert("Adrien", "roller-blade demon"); insert("Hannah", "C++ guru"); find("Adrien") returns "roller-blade demon".

Dictionary Implementations So Far

                      Insert     Find       Delete
Unsorted list         O(1)       O(n)       O(n)
Trees                 O(log n)   O(log n)   O(log n)
Sorted array          O(n)       O(log n)   O(n)
Array (special case:
known keys {1,...,K}) O(1)       O(1)       O(1)

ADT Legalities: A Digression on Keys
Methods are the contract between an ADT and the outside agent (client code).
- Ex: the Dictionary contract is {insert, find, delete}
- Ex: the Priority Queue contract is {insert, deleteMin}
Keys are the currency used in transactions between an outside agent and the ADT.
- Ex: insert(key), find(key), delete(key)
So... how about O(1) insert/find/delete for any key type?

Hash Table Goal: Key as Index
We can access a record as a[5]; we want to access a record as a["Hannah"].
- a[2] holds "roller-blade demon" for key "Adrien"
- a[5] holds "C++ guru" for key "Hannah"

Hash Table Approach
A function f(x) maps each key (Hannah, Dave, Adrien, Donald, Ed) to a table slot.
But... is there a problem with this pipe-dream?
Hash Table Dictionary Data Structure
Hash function: maps keys to integers.
- Result: we can quickly find the right spot for a given entry.
The table is unordered and sparse.
- Result: we cannot efficiently list all entries, and cannot efficiently find min, max, or ordered ranges.

Hash Table Taxonomy
- hash function: the map f(x) from keys to table slots
- collision: two keys that map to the same slot
- load factor: λ = (# of entries in table) / tableSize

Agenda: Hash Table Design Decisions
- What should the hash function be?
- What should the table size be?
- How should we resolve collisions?

Hash Function
A hash function maps a key to a table index:

Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
}

What Makes a Good Hash Function?
- Fast runtime: O(1), and fast in practical terms.
- Distributes the data evenly: ideally hash(a) % size != hash(b) % size for distinct keys a and b.
- Uses the whole hash table: for all 0 <= i < size, there is some k such that hash(k) % size = i.

Good Hash Function for Integer Keys
- Choose a prime tableSize.
- Let hash(n) = n.
Example: tableSize = 7; insert(4), insert(17), find(12), insert(9), delete(17).

Good Hash Function for Strings?
Let s = s1 s2 s3 s4 ... sn. Choose
    hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + ... + sn·128^(n-1)
i.e., think of the string as a base-128 (aka radix-128) number.
Problems:
- hash("really, really big") = well... something really, really big
- hash("one thing") % 128 = hash("other thing") % 128 (only the first character survives mod 128)

String Hashing Issues and Techniques
Minimize collisions:
- Make tableSize and the radix relatively prime.
- Typically, make tableSize not a multiple of 128.
Simplify computation:
- Use Horner's Rule:

int hash(String s) {
    h = 0;
    for (i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128*h) % tableSize;
    }
    return h;
}

Good Hashing: Multiplication Method
The hash function is defined by the table size plus a parameter A:
    hA(k) = floor(size · (k·A mod 1)), where 0 < A < 1
Example: size = 10, A = 0.485
    hA(50) = floor(10 · (50·0.485 mod 1))
           = floor(10 · (24.25 mod 1))
           = floor(10 · 0.25) = 2
Properties:
- no restriction on size!
- when building a static table, we can try several values of A
- more computationally intensive than a single mod

Hashing Dilemma
Suppose your Worst Enemy 1) knows your hash function, and 2) gets to decide which keys to send you. Faced with this enticing possibility, Worst Enemy decides to:
a) Send you keys which maximize collisions for your hash function.
b) Take a nap.
Moral: no single hash function can protect you!
Faced with this dilemma, you:
a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing [1]
Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m.
Definition: H is a universal collection of hash functions if and only if, for any two keys k1, k2 in K, there are at most |H|/m functions h in H for which h(k1) = h(k2).
So, if we randomly choose a hash function from H, our chances of collision are no more than if we got to choose hash table entries at random!
[1] Motivation: see previous slide (or visit http://www.burgerking.com/jobs)

Random Hashing – Not!
How can we "randomly choose a hash function"?
- We certainly cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, and deletes. (Why not?)
- We can, however, randomly choose a hash function each time we initialize a new hashtable.
Conclusions:
- Worst Enemy never knows which hash function we will choose (neither do we!)
- No single input (set of keys) can always evoke worst-case behavior.

Good Hashing: Universal Hash Function A (UHFa)
Parameterized by a prime table size and a vector a = <a0, a1, ..., ar>, where 0 <= ai < size.
Represent each key as r + 1 integers ki, each with ki < size:
- size = 11, key = 39752 ==> <3,9,7,5,2>
- size = 29, key = "hello world" ==> <8,5,12,12,15,23,15,18,12,4>
    ha(k) = (sum over i = 0..r of ai·ki) mod size

UHFa: Example
Context: hash strings of length 3 in a table of size 131, using the characters' ASCII codes; let a = <35, 100, 21>.
    ha("xyz") = (35·120 + 100·121 + 21·122) % 131 = 129

Thinking about UHFa
Strengths:
- works on any type, as long as you can form the ki's
- if we're building a static table, we can try many values of the hash vector <a>
- a random <a> has guaranteed good properties, no matter what we're hashing
Weakness:
- must choose a prime table size larger than any ki

Good Hashing: Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:
- j·size should fit into an int
- a and b must be less than size
    hj,a,b(k) = ((a·k + b) mod (j·size)) / j

UHF2: Example
Context: hash integers in a table of size 16; let j = 32, a = 100, b = 200.
    hj,a,b(1000) = ((100·1000 + 200) % (32·16)) / 32
                 = (100200 % 512) / 32
                 = 360 / 32 = 11

Thinking about UHF2
Strengths:
- if we're building a static table, we can try many parameter values
- random a and b have guaranteed good properties, no matter what we're hashing
- can choose any size table
- very efficient if j and size are powers of 2 (why?)
Weakness:
- need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function:
- reproducible mapping from key to table index
- evenly distribute keys across the table
- separate commonly occurring keys (neighboring keys?)
- fast runtime
Some hash function candidates:
- h(n) = n % size
- h(n) = (string as base-128 number) % size
- multiplication hash: compute percentage through the table
- universal hash function A: dot product with a random vector
- universal hash function 2: next pseudo-random number

Hash Function Design Considerations
- Know what your keys are.
- Study how your keys are distributed.
- Try to include all important information in a key in the construction of its hash.
- Try to make "neighboring" keys hash to very different places.
- Prune the features used to create the hash until it runs "fast enough" (very application dependent).

Handling Collisions
The pigeonhole principle says we can't avoid all collisions:
- try to hash n keys into m slots with n > m without collision
- try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
- Separate Chaining: put a little dictionary in each entry.
- Open Addressing: pick a next entry to try within the hashtable.
Terminology madness :-(
- Separate Chaining is sometimes called Open Hashing.
- Open Addressing is sometimes called Closed Hashing.

Separate Chaining
Put a little dictionary at each entry:
- commonly, an unordered linked list (chain)
- or choose another Dictionary type as appropriate (search tree, hashtable, etc.)
Example: with h(a) = h(d) and h(e) = h(b), slot 1 chains a and d, slot 3 chains e and b, and slot 5 holds c alone.
Properties:
- λ can be greater than 1
- performance degrades with the length of the chains
- an alternate Dictionary type (e.g.
search tree, hashtable) can speed up the secondary search.

Separate Chaining Code

void insert(const Key & k, const Value & v) {
    findBucket(k).insert(k, v);
}
Value & find(const Key & k) {
    return findBucket(k).find(k);
}
void delete(const Key & k) {
    findBucket(k).delete(k);
}
// private
Dictionary & findBucket(const Key & k) {
    return table[hash(k) % table.size];
}

Load Factor in Separate Chaining
Search cost (the average chain length is λ):
- unsuccessful search: examines λ entries on average
- successful search: examines about 1 + λ/2 entries on average
Desired load factor: λ ≈ 1

Open Addressing
Allow one key at each table entry:
- two objects that hash to the same spot can't both go there
- the first one there gets the spot
- the next one must go in another spot
Example: with h(a) = h(d) and h(e) = h(b), a gets slot 1 so d moves on to slot 2; e gets slot 3 so b moves on to slot 4; c gets slot 5.
Properties:
- λ <= 1
- performance degrades with the difficulty of finding the right spot

Probing
Probing requires a collision resolution function f(i). How to probe:
- First probe: given a key k, hash to h(k).
- Second probe: if h(k) is occupied, try h(k) + f(1).
- Third probe: if h(k) + f(1) is occupied, try h(k) + f(2).
- And so forth.
Probing properties:
- we force f(0) = 0
- the ith probe is to (h(k) + f(i)) mod size
- if i reaches size - 1, the probe has failed
- depending on f(), the probe may fail sooner
- long sequences of probes are costly!
Linear Probing
f(i) = i. The probe sequence is:
    h(k) mod size
    (h(k) + 1) mod size
    (h(k) + 2) mod size
    ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k);
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + 1) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

Linear Probing Example (table size 7)
- insert(76): 76 % 7 = 6, goes to slot 6 (1 probe)
- insert(93): 93 % 7 = 2, goes to slot 2 (1 probe)
- insert(40): 40 % 7 = 5, goes to slot 5 (1 probe)
- insert(47): 47 % 7 = 5, occupied; slot 6 occupied; goes to slot 0 (3 probes)
- insert(10): 10 % 7 = 3, goes to slot 3 (1 probe)
- insert(55): 55 % 7 = 6, occupied; slot 0 occupied; goes to slot 1 (3 probes)

Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot.
Search cost (for large table sizes):
    successful search:   (1/2)(1 + 1/(1 - λ))
    unsuccessful search: (1/2)(1 + 1/(1 - λ)^2)
Linear probing suffers from primary clustering; performance quickly degrades for λ > 1/2.

Quadratic Probing
f(i) = i^2. The probe sequence is:
    h(k) mod size
    (h(k) + 1) mod size
    (h(k) + 4) mod size
    (h(k) + 9) mod size
    ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k), i = 0;
    do {
        entry = &table[probePoint];
        i++;
        // adding the odd number 2*i - 1 turns h(k) + (i-1)^2 into h(k) + i^2
        probePoint = (probePoint + (2*i - 1)) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

Good Quadratic Probing Example (table size 7)
- insert(76): 76 % 7 = 6, goes to slot 6 (1 probe)
- insert(40): 40 % 7 = 5, goes to slot 5 (1 probe)
- insert(48): 48 % 7 = 6, occupied; (6 + 1) % 7 = 0, goes to slot 0 (2 probes)
- insert(5):  5 % 7 = 5, occupied; slot 6 occupied; (5 + 4) % 7 = 2, goes to slot 2 (3 probes)
- insert(55): 55 % 7 = 6, occupied; slot 0 occupied; (6 + 4) % 7 = 3, goes to slot 3 (3 probes)

Bad Quadratic Probing Example (table size 7)
- insert(76): 76 % 7 = 6, goes to slot 6 (1 probe)
- insert(93): 93 % 7 = 2, goes to slot 2 (1 probe)
- insert(40): 40 % 7 = 5, goes to slot 5 (1 probe)
- insert(35): 35 % 7 = 0, goes to slot 0 (1 probe)
- insert(47): 47 % 7 = 5, occupied; the probe sequence 5+1, 5+4, 5+9, ... (mod 7) cycles among the occupied slots 6, 2, 0 and never reaches the empty slots 1, 3, 4. The insert fails!

Quadratic Probing Succeeds for λ <= 1/2
If size is prime and λ <= 1/2, then quadratic probing will find an empty slot in
size/2 probes or fewer.
Proof sketch: show that for all 0 <= i, j <= size/2 with i ≠ j,
    (h(x) + i^2) mod size ≠ (h(x) + j^2) mod size.
By contradiction: suppose that for some such i and j,
    (h(x) + i^2) mod size = (h(x) + j^2) mod size.
Then
    i^2 mod size = j^2 mod size
    (i^2 - j^2) mod size = 0
    [(i + j)(i - j)] mod size = 0.
Since size is prime, it must divide (i + j) or (i - j). But how can i + j = 0 or i + j = size when i ≠ j and i, j <= size/2? The same argument rules out (i - j) mod size = 0.

Quadratic Probing May Fail for λ > 1/2
For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D'oh!

Load Factor in Quadratic Probing
- For any λ <= 1/2, quadratic probing will find an empty slot.
- For λ > 1/2, quadratic probing is not guaranteed to find a slot.
- Quadratic probing does not suffer from primary clustering, but it does suffer from secondary clustering: keys that hash to the same slot follow the same probe sequence.
How could we possibly solve this?

Double Hashing
f(i) = i·hash2(k). The probe sequence is:
    hash1(k) mod size
    (hash1(k) + 1·hash2(k)) mod size
    (hash1(k) + 2·hash2(k)) mod size
    ...

bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash1(k), delta = hash2(k);
    do {
        entry = &table[probePoint];
        probePoint = (probePoint + delta) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
}

A Good Double Hash Function...
- ... is quick to evaluate.
- ... differs from the original hash function.
- ... never evaluates to 0 (mod size).
One good choice: choose a prime p < size and let
    hash2(k) = p - (k mod p).

Double Hashing Example (table size 7, p = 5)
- insert(76): 76 % 7 = 6, goes to slot 6 (1 probe)
- insert(93): 93 % 7 = 2, goes to slot 2 (1 probe)
- insert(40): 40 % 7 = 5, goes to slot 5 (1 probe)
- insert(47): 47 % 7 = 5, occupied; hash2(47) = 5 - (47 % 5) = 3; (5 + 3) % 7 = 1, goes to slot 1 (2 probes)
- insert(10): 10 % 7 = 3, goes to slot 3 (1 probe)
- insert(55): 55 % 7 = 6, occupied; hash2(55) = 5 - (55 % 5) = 5; (6 + 5) % 7 = 4, goes to slot 4 (2 probes)

Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2).
Search cost appears to approach that of the optimal (random) hash:
    successful search:   (1/λ) ln(1/(1 - λ))
    unsuccessful search: 1/(1 - λ)
No primary clustering and no secondary clustering, at the cost of one extra hash calculation.

Deletion in Open Addressing
Example: key 7 hashes to slot 2, which holds key 2, so 7 is probed into slot 3. Now delete(2) empties slot 2. A later find(7) hashes to slot 2, sees an empty slot, and stops. Where is it?!
We must use lazy deletion: mark the slot as deleted rather than empty. On insertion, treat a (lazily) deleted item as an empty slot.

The Squished Pigeon Principle
- An insert using Open Addressing cannot work with λ = 1.
- An insert using Open Addressing with quadratic probing may not work with λ > 1/2.
- With Separate Chaining or Open Addressing, large load factors lead to poor performance!
How can we relieve the pressure on the pigeons?
Hint: what happens when we overrun array storage in a {queue, stack, heap}? What else must happen with a hashtable?
Rehashing
When λ gets "too large" (over some constant threshold), rehash all elements into a new, larger table:
- takes O(n), but amortized O(1) as long as we (just about) double the table size on the resize
- spreads keys back out, which may drastically improve performance
- gives us a chance to retune parameterized hash functions
- avoids failure for Open Addressing techniques
- allows arbitrarily large tables starting from a small table
- clears out lazily deleted items

Case Study: Spelling Dictionary
Practical notes:
- 30,000 words; words average about 8 characters in length
- 30,000 words at 8 bytes/word is about 0.25 MB; pointers are 4 bytes
- the dictionary is static, so preprocessing time is arbitrary(ish)
- almost all searches are successful. Why?
- there are many regularities in the structure of English words
Goals: fast spell checking with minimal storage.

Case Study: Design Considerations
Possible solutions:
- sorted array + binary search
- Separate Chaining
- Open Addressing + linear probing
Issues: which data structure should we use, and which type of hash function?

Case Study: Storage
Assume words are strings and entries are pointers to strings. How many pointers do array + binary search, Separate Chaining, and Open Addressing each use?

Case Study: Analysis

Solution           Storage                         Time
Binary search      n pointers + words              log2(n) ≈ 15 probes per access, worst case
                   (360 KB)
Separate Chaining  n + n/λ pointers + words        1 + λ/2 probes per access on average
                   (λ = 1: 600 KB)                 (λ = 1: 1.5 probes)
Open Addressing    n/λ pointers + words            (1 + 1/(1 - λ))/2 probes per access on average
                   (λ = 0.5: 480 KB)               (λ = 0.5: 1.5 probes)

What to do, what to do?

Perfect Hashing
When we know the entire key set in advance (examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.), perfect hashing lets us achieve:
Worst-case O(1) time complexity!
Worst-case O(n) space complexity!

Perfect Hashing Technique
- Static set of n known keys.
- Separate chaining, with a two-level hash.
- Primary hash table of size n.
- The jth secondary hash table has size nj^2, where nj keys hash to slot j in the primary hash table.
- Universal hash functions are used in all hash tables.
- Conduct (a few!) random trials, until we get collision-free hash functions.

Perfect Hashing Theorems [1]
Theorem: If we store n keys in a hash table of size n^2 using a randomly chosen universal hash function, then the probability of any collision is < 1/2.
Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
    E[ sum from j = 0 to m-1 of nj^2 ] < 2n,
where nj is the number of keys hashing to slot j.
Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, and we set the size of each secondary hash table to mj = nj^2, then:
a) The expected amount of storage required for all secondary hash tables is less than 2n.
b) The probability that the total storage used for all secondary hash tables exceeds 4n is less than 1/2.
[1] Introduction to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein.

Perfect Hashing Conclusions
The perfect hashing theorems set tight expected bounds on the sizes and collision behavior of all the hash tables (primary and all secondaries). Conduct a few random trials of universal hash functions, by simply varying the UHF parameters, until we get a set of UHFs and associated table sizes which deliver:
- Worst-case O(1) time complexity!
- Worst-case O(n) space complexity!

Extendible Hashing: Cost of a Database Query
The I/O-to-CPU cost ratio is 300-to-1!

Extendible Hashing
A hashing technique for huge data sets that optimizes to reduce disk accesses:
- each hash bucket fits on one disk block
- better than B-Trees if order is not important. Why?
The table contains buckets, each fitting in one disk block, holding the data. A directory that fits in one disk block is used to hash to the correct bucket.

Extendible Hash Table
A directory entry is a key prefix (the first k bits) and a pointer to the bucket holding all keys starting with that prefix. Each block contains keys matching on their first j <= k bits, plus the data associated with each key.
Example directory for k = 3 (slots 000 through 111), with bucket prefix lengths (2), (2), (3), (3), (2):
    prefix 00  (j = 2): 00001, 00011, 00100, 00110
    prefix 01  (j = 2): 01001, 01011, 01100
    prefix 100 (j = 3): 10001, 10011
    prefix 101 (j = 3): 10101, 10110, 10111
    prefix 11  (j = 2): 11001, 11100, 11110

Inserting (easy case): insert(11011)
The bucket for prefix 11 has room, so 11011 simply joins 11001, 11100, 11110 in that bucket; the directory is unchanged.

Splitting a Leaf: insert(11000)
The prefix-11 bucket (11001, 11011, 11100, 11110) is full, so it splits into two buckets with j = 3:
    prefix 110 (j = 3): 11000, 11001, 11011
    prefix 111 (j = 3): 11100, 11110
Directory entries 110 and 111 now point to different buckets.

Splitting the Directory: insert(10010)
1. With a k = 2 directory (00, 01, 10, 11) whose prefix-10 bucket already holds 10000, 10001, 10011, 10111, there is no room to insert and no adoption!
2. Solution: expand the directory to k = 3 (000 through 111).
3. Then it's just a normal leaf split.

If Extendible Hashing Doesn't Cut It
Store only pointers to the items:
  + (potentially) much smaller M
  + fewer items in the directory
  - one extra disk access!
Rehash:
  + potentially better distribution over the buckets
  + fewer unnecessary items in the directory
  - can't solve the problem if there's simply too much data
What if these don't work? Use a B-Tree to store the directory!
Hash Wrap

Hash functions:
- simple integer hash: prime table size
- multiplication method
- universal hashing: guarantees no (always) bad input
- perfect hashing: requires a known, fixed keyset; achieves O(1) time and O(n) space, guaranteed!

Collision resolution:
- Separate Chaining: expand beyond the hashtable via secondary Dictionaries; allows λ > 1
- Open Addressing: expand within the hashtable; secondary probing: {linear, quadratic, double hash}; λ <= 1 (by definition!), λ <= 1/2 (by preference!)

Rehashing: tunes up the hashtable when λ crosses the line.

Extendible hashing: for disk-based data; combine with a B-Tree directory if needed.