PAGES: 59 POSTED ON: 4/18/2011 Public Domain
Cpt S 223 Advanced Data Structures
Teddy Yap, Jr.
School of Electrical Engineering and Computer Science
Washington State University

Today's Lecture: Hash Tables
- Overview of hashing
- Hash table ADT
- Implementations
- Analysis
- Applications

Overview
- Hashing is a technique supporting insertion, deletion, and search in average-case constant time.
- Operations that require the elements to be sorted (e.g., find the minimum) are not efficiently supported.

Hash Table
- One approach: the hash table is an array of fixed size (called TableSize).
- Array elements are indexed by a key, which is mapped to an array index (0 ... TableSize - 1).
- The mapping h from key to index is the hash function, e.g., h("john") = 3.

Factors Affecting Hash Table Design
- Hash function
- Table size (usually fixed at the start)
- Collision handling scheme

Hash Table (cont'd.)
- Insert: T[h("john")] = <"john", 25000> (the data record)
- Delete: T[h("john")] = NULL
- Search: return T[h("john")]
- What if h("john") = h("joe")?

Hash Function
- h(key) ==> hash table index
- The mapping from key to array index is called a hash function; it is typically a many-to-one mapping.
- Ideally, different keys map to different indices and the keys are distributed evenly over the table.
- A collision occurs when the hash function maps two keys to the same array index.

Hash Function (cont'd.)
- A simple hash function: h(key) = key mod TableSize (assumes integer keys).
- For random keys, h() distributes keys evenly over the table. But what if TableSize = 100 and all keys are multiples of 10?
- It is better if TableSize is a prime number, not too close to a power of 2 or 10.

Hash Function for String Keys
- Approach 1: add up the character ASCII values (0-127) to produce an integer key. E.g., "abcd" = 97 + 98 + 99 + 100 = 394, so h("abcd") = 394 mod TableSize.
  - Small strings may not use all of the table: strlen(s) * 127 < TableSize.
  - Anagrams map to the same index: h("abcd") = h("dbac").
- Approach 2: treat the first 3 characters of the string as a base-27 integer (26 letters plus space): key = S[0] + (27 * S[1]) + (27^2 * S[2]).
  - Assumes the first 3 characters are randomly distributed, which is not true for English.

Hash Function for String Keys (cont'd.)
- Approach 3: use all N characters of the string as an N-digit base-K integer.
  - Choose K to be a prime number larger than the number of different digits (characters), e.g., K = 29, 31, or 37.
  - If L is the length of string s, then h(s) = (s[0] * K^(L-1) + s[1] * K^(L-2) + ... + s[L-1]) mod TableSize.
  - Use Horner's rule to compute h(s).
  - Limit L for very long strings.

Collision Resolution
- What happens when h(k1) = h(k2)? ==> A collision!
- Collision resolution strategies:
  - Chaining: store colliding keys in a linked list at the same hash table index.
  - Open addressing: store colliding keys elsewhere in the table.

Collision Resolution by Chaining (Approach #1)
- Hash table T is a vector of lists; only singly-linked lists are needed if memory is tight.
- Key k is stored in the list at T[h(k)].
- E.g., TableSize = 10 and h(k) = k mod 10; insert the first 10 perfect squares, in the sequence 0, 1, 4, 9, 16, 25, 36, 49, 64, 81.

Implementation of Chaining Hash Table
- Generic hash functions are provided for integer and string keys.
- The lists are searched with the STL algorithm find; each of these operations takes time linear in the length of the list.
- Insert allows no duplicates.
- Rehashing doubles the size of the table and reinserts the current elements (more on this later).
- All hash objects must define the == and != operators.
- A hash function is also needed to handle the Employee object type.

Collision Resolution by Chaining: Analysis
- The load factor λ of a hash table T: if N = number of elements in T and M = size of T, then λ = N / M.
- The average length of a chain is λ.
- An unsuccessful search takes O(λ) on average.
- A successful search takes O(λ / 2) on average.
- Ideally, we want λ ≈ 1 (not a function of N), i.e., TableSize ≈ the number of elements you expect to store in the table.

Collision Resolution by Open Addressing (Approach #2)
- When a collision occurs, look elsewhere in the table for an empty slot.
- Advantages over chaining:
  - No additional list structures are needed.
  - No need to allocate/deallocate memory during insertion/deletion (which is slow).
- Disadvantages:
  - Slower insertion: may need several attempts to find an empty slot.
  - The table needs to be bigger than a chaining-based table to achieve average-case constant-time performance: load factor λ ≤ 0.5.

Open Addressing: Probe Sequence
- The probe sequence is the sequence of slots searched in the hash table: h0(x), h1(x), h2(x), ...
- It needs to visit each slot exactly once.
- It needs to be repeatable (so we can find/delete what we've inserted).
- Hash function: hi(x) = (h(x) + f(i)) mod TableSize, with f(0) = 0 (the first try).

Linear Probing
- f(i) is a linear function of i, e.g., f(i) = i.
- Example with h(x) = x mod 10:
  h0(89) = (h(89) + f(0)) mod 10 = 9
  h0(18) = (h(18) + f(0)) mod 10 = 8
  h0(49) = (h(49) + f(0)) mod 10 = 9 (collision)
  h1(49) = (h(49) + f(1)) mod 10 = 0
- Insert sequence: 89, 18, 49, 58, 69.

Linear Probing: Analysis
- Probe sequences can get long.
- Primary clustering: keys tend to cluster in one part of the table, and keys that hash into the cluster are added to the end of the cluster (making it even bigger).

Linear Probing: Analysis (cont'd.)
- Expected number of probes for an insertion or unsuccessful search: (1/2)(1 + 1/(1 - λ)^2).
- Expected number of probes for a successful search: (1/2)(1 + 1/(1 - λ)).
- Example (λ = 0.5): insert/unsuccessful search, 2.5 probes; successful search, 1.5 probes.
- Example (λ = 0.9): insert/unsuccessful search, 50.5 probes; successful search, 5.5 probes.

Random Probing: Analysis
- Random probing does not suffer from clustering.
- Expected number of probes for an insertion or unsuccessful search: 1/(1 - λ).
- Expected number of probes for a successful search: (1/λ) ln(1/(1 - λ)). Examples: λ = 0.5 gives 1.4 probes; λ = 0.9 gives 2.6 probes.

Linear vs. Random Probing
- (Figure: number of probes vs. load factor λ for linear and random probing; U = unsuccessful search, S = successful search, I = insert.)

Quadratic Probing
- Avoids primary clustering.
- f(i) is quadratic in i, e.g., f(i) = i^2.
- Example (continuing the linear probing example, with 89, 18, and 49 already placed):
  h0(58) = (h(58) + f(0)) mod 10 = 8 (collision)
  h1(58) = (h(58) + f(1)) mod 10 = 9 (collision)
  h2(58) = (h(58) + f(2)) mod 10 = 2
- Insert sequence: 89, 18, 49, 58, 69.
- Question: delete 49, then find 49. Is there a problem?

Quadratic Probing: Analysis
- Difficult to analyze.
- Theorem 5.1: a new element can always be inserted into a table that is at least half empty and whose TableSize is prime.
- Otherwise, we may never find an empty slot, even if one exists.
- Ensure the table never gets half full; if it gets close, expand it.

Quadratic Probing (cont'd.)
- There are only M (TableSize) different probe sequences, which may cause "secondary clustering."
- Deletion: emptying slots can break probe sequences.
- Lazy deletion: differentiate between empty and deleted slots, and skip deleted slots when probing.
- This slows operations (it effectively increases λ).

Quadratic Probing: Implementation
- Uses lazy deletion.
- Ensures the table size is prime.
- Find: follows the quadratic probe sequence, skipping DELETED slots; no duplicates.
- Insert: no duplicates.
- Remove: no deallocation is needed.

Double Hashing
- Combine two different hash functions: f(i) = i * h2(x).
- Good choices for h2(x)?
- h2(x) should never evaluate to 0.
- A good choice: h2(x) = R - (x mod R), where R is a prime number less than TableSize.
- Previous example with R = 7:
  h0(49) = (h(49) + f(0)) mod 10 = 9 (collision)
  h1(49) = (h(49) + f(1)) mod 10 = (9 + (7 - 49 mod 7)) mod 10 = 6, since f(1) = h2(49) = 7.

Double Hashing: Analysis
- It is imperative that TableSize is prime (e.g., try inserting 23 into the previous table).
- Empirical tests show double hashing is close to random hashing.
- The extra hash function takes extra time to compute.

Rehashing
- Increase the size of the hash table when the load factor gets too high.
- Typically expand the table to twice its size (but keep the size prime).
- Reinsert the existing elements into the new hash table.

Rehashing Example
- Before: h(x) = x mod 7, λ = 0.57. Inserting 23 would raise λ to 0.71.
- After rehashing: h(x) = x mod 17, λ = 0.29.

Rehashing: Analysis
- Rehashing takes O(N) time, but happens infrequently.
- Specifically, there must have been N/2 insertions since the last rehash.
- Amortizing the O(N) cost over the N/2 prior insertions yields only constant additional time per insertion.

Rehashing: Implementation
- When to rehash: when the table is half full (λ = 0.5), when an insertion fails, or when the load factor reaches some threshold.
- Works for both chaining and open addressing (the code shows rehashing for chaining and for quadratic probing).

Hash Tables in C++ STL
- Hash tables are not part of the C++ Standard library.
- Some implementations of the STL have hash tables (e.g., SGI's STL): hash_set and hash_map.

Hash Set in SGI's STL

    #include <hash_set>

    struct eqstr
    {
        // Key equality test
        bool operator()(const char* s1, const char* s2) const
        { return strcmp(s1, s2) == 0; }
    };

    // Template parameters: key, hash function, key equality test
    void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
                const char* word)
    {
        hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
            = Set.find(word);
        cout << word << ": "
             << (it != Set.end() ? "present" : "not present") << endl;
    }

    int main()
    {
        hash_set<const char*, hash<const char*>, eqstr> Set;
        Set.insert("kiwi");
        lookup(Set, "kiwi");
    }

Hash Map in SGI's STL

    #include <hash_map>

    struct eqstr
    {
        bool operator()(const char* s1, const char* s2) const
        { return strcmp(s1, s2) == 0; }
    };

    int main()
    {
        // Template parameters: key, data, hash function, key equality test
        hash_map<const char*, int, hash<const char*>, eqstr> months;
        months["january"] = 31;
        months["february"] = 28;
        …
        months["december"] = 31;
        cout << "january -> " << months["january"] << endl;
    }

Problem with Large Tables
- What if the hash table is too large to store in main memory?
- Solution: store the hash table on disk, and minimize disk accesses.
- But collisions require disk accesses, and rehashing requires a lot of disk accesses.
- Solution: extendible hashing.

Extendible Hashing
- Store the hash table in a depth-1 tree.
- Every search takes 2 disk accesses; insertions require few disk accesses.
- Hash the keys to a long integer (hence "extendible").
- Use the first few bits of the extended keys as the keys in the root node (the "directory").
- Leaf nodes contain all extended keys starting with the bits in the associated root node key.

Extendible Hashing Example
- The extendible hash table contains N = 12 data elements.
- The first D = 2 bits of each key are used by the root node keys, giving 2^D entries in the directory.
- Each leaf contains up to M = 4 data elements, as determined by the disk page size.
- Each leaf stores the number of common starting bits (dL).

Extendible Hashing Example (cont'd.)
- After inserting 100100: the directory splits and is rewritten. Leaves not involved in the split are now pointed to by two adjacent directory entries; these leaves are not accessed.
- After inserting 000000: one leaf splits, and only two pointers change in the directory.

Extendible Hashing: Analysis
- The expected number of leaves is (N/M) * log2(e) = (N/M) * 1.44.
- The average leaf is ln 2 = 0.69 full (the same as for B-trees).
- The expected size of the directory is O(N^(1+1/M)/M).
- This is O(N/M) for large M (elements per leaf).

Hash Table Applications
- Maintaining the symbol table in compilers.
- Accessing tree or graph nodes by name, e.g., city names in Google Maps.
- Maintaining a transposition table in games: remember previous game situations and the move taken, to avoid re-computation.
- Dictionary lookups: spelling checkers, natural language understanding (word sense).

Summary
- Hash tables support fast insert and search: O(1) average-case performance.
- Deletion is possible, but degrades performance.
- Hash tables are not good if you need to maintain an ordering over the elements.
- Many applications.

Points to Remember
- Keep the table size prime.
- Keep the table size much larger than the number of inputs, to maintain λ close to 0, or at least λ < 0.5.
- There are tradeoffs between chaining and probing.
- Collision chances decrease in this order: linear probing, quadratic probing, {random probing, double hashing}.
- Rehashing is required to resize the hash table when λ exceeds 0.5.
- Hash tables are good for searching, but not if there is some order implied by the data.