# CptS 223 Advanced Data Structures


CptS 223
Teddy Yap, Jr.
School of Electrical Engineering and Computer Science
Washington State University
Today’s Lecture
Hash Tables
Overview

Hashing
Technique supporting insertion, deletion, and
search in average-case constant time
Operations requiring elements to be sorted (e.g.
find minimum) are not efficiently supported
Implementations
Analysis
Applications
Hash Table
One approach:
- Hash table is an array of fixed size (called TableSize).
- Array elements are indexed by a key, which is mapped to an array index (0 … TableSize – 1).
- The mapping (or hash function) h goes from key to index.
  - E.g., h(“john”) = 3
Factors Affecting Hash Table Design

Hash function
Table size
Usually fixed at the start
Collision handling schemes
Hash Table (cont’d.)
- Insert: T[h(“john”)] = <“john”, 25000> (key hashed to index; data record stored there)
- Delete: T[h(“john”)] = NULL
- Search: return T[h(“john”)]
- What if h(“john”) = h(“joe”)?
Hash Function
h(key) ==> hash table index

- The mapping from key to array index is called a hash function.
- Typically a many-to-one mapping
- Ideally, different keys map to different indices and keys are distributed evenly over the table.
- A collision occurs when the hash function maps two keys to the same array index.
Hash Function (cont’d.)
Simple hash function
h(key) = key mod TableSize
Assumes integer keys
For random keys, h() distributes keys
evenly over table.
What if TableSize = 100 and keys are
multiples of 10?
Better if TableSize is a prime number.
Not too close to powers of 2 or 10
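A minimal sketch of this simple modular hash (the function name is mine, not from the lecture code):

```cpp
#include <cstddef>

// Simple modular hash from the slide: h(key) = key mod TableSize.
// The caller is expected to pick a prime TableSize, as recommended above.
std::size_t hashInt(unsigned key, std::size_t tableSize) {
    return key % tableSize;
}
```

With tableSize = 100 and keys that are all multiples of 10, only indices 0, 10, …, 90 are ever produced, which is exactly the degenerate case the slide warns about.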
Hash Function for String Keys

Approach 1
Add up character ASCII values (0-127) to
produce integer keys
E.g. “abcd” = 97 + 98 + 99 + 100 = 394
h(“abcd”) = 394 mod TableSize
Small strings may not use all of table
strlen(s) * 127 < TableSize
Anagrams will map to the same index
h(“abcd”) = h(“dbac”)
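Approach 1 can be sketched as follows (function name is mine):

```cpp
#include <cstddef>
#include <string>

// Approach 1 from the slide: sum the character values, then take mod.
std::size_t asciiSumHash(const std::string& s, std::size_t tableSize) {
    std::size_t sum = 0;
    for (unsigned char c : s)
        sum += c;                 // "abcd" sums to 97+98+99+100 = 394
    return sum % tableSize;
}
```

As noted above, anagrams collide: asciiSumHash("abcd", M) equals asciiSumHash("dbac", M) for every table size M.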
Hash Function for String Keys

Approach 2
Treat first 3 characters of string as base-27
integer (26 letters plus space)
key = S[0] + (27 * S[1]) + (27² * S[2])
Assumes first 3 characters randomly
distributed
Not true for English
Hash Function for String Keys (cont’d.)

- Approach 3
  - Use all N characters of the string as an N-digit base-K integer
  - Choose K to be a prime number larger than the number of different digits (characters)
    - E.g., K = 29, 31, 37
  - If L = length of string s, then
    h(s) = (s[0]·K^(L–1) + s[1]·K^(L–2) + … + s[L–1]) mod TableSize
  - Use Horner’s rule to compute h(s)
  - Limit L for long strings
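A sketch of Approach 3 evaluated with Horner’s rule (function name and the choice K = 37, one of the primes suggested above, are mine):

```cpp
#include <cstddef>
#include <string>

// Approach 3 via Horner's rule: treat the string as a base-K number,
// taking mod at each step so intermediate values never overflow.
std::size_t hornerHash(const std::string& s, std::size_t tableSize) {
    const std::size_t K = 37;     // prime larger than the digit alphabet
    std::size_t h = 0;
    for (unsigned char c : s)
        h = (h * K + c) % tableSize;
    return h;
}
```

Unlike the ASCII-sum approach, anagrams such as "abcd" and "dbac" generally hash to different slots here, since digit position matters.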
Collision Resolution

What happens when h(k1) = h(k2)?
==> Collision!
Collision resolution strategies
Chaining
Store colliding keys in a linked list at the same hash
table index
Store colliding keys elsewhere in the table
Chaining
Collision Resolution Approach #1
Collision Resolution by Chaining
- Hash table T is a vector of lists
  - Only singly-linked lists are needed if memory is tight
- Key k is stored in the list at T[h(k)]
- E.g., TableSize = 10, h(k) = k mod 10
- Insert the first 10 perfect squares

Insertion sequence = 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
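The example above can be sketched with two small helpers (names are mine, not the lecture’s class interface):

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Build a chaining hash table with TableSize = 10 and h(k) = k mod 10.
std::vector<std::list<int>> buildChains(const std::vector<int>& keys) {
    std::vector<std::list<int>> table(10);
    for (int k : keys)
        table[k % 10].push_back(k);   // colliding keys share one list
    return table;
}

// Search walks only the single chain the key hashes to.
bool chainContains(const std::vector<std::list<int>>& table, int key) {
    const std::list<int>& chain = table[key % 10];
    return std::find(chain.begin(), chain.end(), key) != chain.end();
}
```

For the insertion sequence above, slot 9 ends up holding the chain {9, 49} and slot 6 holds {16, 36}.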
Implementation of Chaining Hash Table

(code figures omitted; notes on the lecture code)
- Generic hash functions for integer and string keys
- Uses the STL algorithm find; each of these operations takes time linear in the length of the list
- insert: no duplicates allowed
- rehash doubles the size of the table and reinserts the current elements (more on this later)
- All hash objects must define the == and != operators
- A hash function is provided to handle the Employee object type
Collision Resolution by Chaining: Analysis
- Load factor λ of a hash table T:
  - N = number of elements in T
  - M = size of T
  - λ = N / M
- Average length of a chain is λ
- Unsuccessful search: O(λ)
- Successful search: O(λ / 2)
- Ideally, we want λ ≈ 1 (not a function of N), i.e., TableSize = number of elements you expect to store in the table
Collision Resolution Approach #2: Open Addressing

- When a collision occurs, look elsewhere in the table for an empty slot.
  - No need for additional list structures
  - No need to allocate/deallocate memory during insertion/deletion (slow)
  - Slower insertion – may need several attempts to find an empty slot
  - Table needs to be bigger (than a chaining-based table) to achieve average-case constant-time performance
- Load factor λ ≤ 0.5

Probe sequence
Sequence of slots in hash table to search
h0(x), h1(x), h2(x), …
Needs to visit each slot exactly once
Needs to be repeatable (so we can find/delete
what we’ve inserted)
Hash function
hi(x) = (h(x) + f(i)) mod TableSize
f(0) = 0              ==> first try
Linear Probing

f(i) is a linear function of i.
E.g. f(i) = i
Example: h(x) = x mod TableSize
h0(89) = (h(89) + f(0)) mod 10 = 9
h0(18) = (h(18) + f(0)) mod 10 = 8
h0(49) = (h(49) + f(0)) mod 10 = 9 (X)
h1(49) = (h(49) + f(1)) mod 10 = 0
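The probe sequence above can be computed directly (function name is mine):

```cpp
// i-th slot tried by linear probing with h(x) = x mod tableSize, f(i) = i.
int linearProbe(int x, int i, int tableSize) {
    return (x % tableSize + i) % tableSize;
}
```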
Linear Probing Example

Insert sequence: 89, 18, 49, 58, 69
Linear Probing: Analysis

Probe sequences can get long.
Primary clustering
Keys tend to cluster in one part of table.
Keys that hash into cluster will be added to the
end of the cluster (making it even bigger).
Linear Probing: Analysis (cont’d.)
- Expected number of probes for insertion or unsuccessful search: ½(1 + 1/(1 – λ)²)
- Expected number of probes for successful search: ½(1 + 1/(1 – λ))
- Example (λ = 0.5)
  - Insert/unsuccessful search: 2.5 probes
  - Successful search: 1.5 probes
- Example (λ = 0.9)
  - Insert/unsuccessful search: 50.5 probes
  - Successful search: 5.5 probes
Random Probing: Analysis
- Random probing does not suffer from clustering.
- Expected number of probes for insertion or unsuccessful search: 1/(1 – λ)
- Expected number of probes for successful search: (1/λ) ln(1/(1 – λ))
- Example (successful search)
  - λ = 0.5: 1.4 probes
  - λ = 0.9: 2.6 probes
Linear vs. Random Probing

[Graph: expected number of probes vs. load factor, for both schemes; U – unsuccessful search, S – successful search, I – insert]

Quadratic Probing
- Avoids primary clustering
- E.g., f(i) = i²
Example
- h0(58) = (h(58) + f(0)) mod 10 = 8 (X)
- h1(58) = (h(58) + f(1)) mod 10 = 9 (X)
- h2(58) = (h(58) + f(2)) mod 10 = 2

Insert sequence: 89, 18, 49, 58, 69
Question: Delete 49, then find 49 – is there a problem?
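A one-line sketch of the quadratic probe computation (function name is mine):

```cpp
// i-th slot tried by quadratic probing with h(x) = x mod tableSize,
// f(i) = i * i.
int quadraticProbe(int x, int i, int tableSize) {
    return (x % tableSize + i * i) % tableSize;
}
```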
Difficult to analyze
Theorem 5.1
A new element can always be inserted into a table that is at least half empty, provided TableSize is prime.
Otherwise, insertion may never find an empty slot, even if one exists.
Ensure the table never becomes more than half full.
If it gets close, expand it.

Only M (TableSize) different probe
sequences
May cause “secondary clustering”
Deletion
Emptying slots can break probe sequences
Lazy deletion
Differentiate between empty and deleted slot
Skip deleted slots
Slows operations (effectively increases λ)
Implementation of Probing Hash Table

(code figures omitted; notes on the lecture code)
- Lazy deletion: slots are marked as empty, occupied, or deleted
- The constructor ensures the table size is prime
- find: skips DELETED slots; no duplicates
- insert: no duplicates
- remove: lazy deletion, so no deallocation is needed
Double Hashing
Combine two different hash functions
f(i) = i * h2(x)
Good choices for h2(x)?
Should never evaluate to 0
h2(x) = R – (x mod R)
R is a prime number less than TableSize
Previous example with R = 7:
- h0(49) = (h(49) + f(0)) mod 10 = 9 (X)
- h1(49) = (h(49) + f(1)) mod 10 = (9 + (7 – 49 mod 7)) mod 10 = 6
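The double-hashing probe above can be sketched as follows (function name is mine; R = 7 matches the slide’s example):

```cpp
// i-th slot tried by double hashing: f(i) = i * h2(x),
// with h2(x) = R - (x mod R) and R a prime less than tableSize.
int doubleHashProbe(int x, int i, int tableSize, int R) {
    int h2 = R - (x % R);   // always in 1..R, so never 0
    return (x % tableSize + i * h2) % tableSize;
}
```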
Double Hashing Example
Double Hashing: Analysis

Imperative that TableSize is prime.
E.g., insert 23 into previous table
Empirical tests show double hashing close
to random hashing.
Extra hash function takes extra time to
compute.
Rehashing

Increase the size of the hash table when the load factor becomes too high.
Typically expand the table to twice its size (keeping TableSize prime).
Reinsert existing elements into new hash
table
Rehashing Example

- Before: h(x) = x mod 7, λ = 0.57
- Insert 23: λ = 0.71, so the table is more than half full – rehash
- After rehashing: h(x) = x mod 17, λ = 0.29
Rehashing Analysis

Rehashing takes O(N) time.
But happens infrequently
Specifically
Must have been N/2 insertions since last
rehash
Amortizing the O(N) cost over the N/2 prior
insertions yields only constant additional time
per insertion
Rehashing Implementation

When to rehash
- When the table is half full (λ = 0.5)
- When an insertion fails
- When the load factor reaches some threshold
Works for chaining and open addressing.
Rehashing for Chaining
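A hedged sketch of rehashing a chaining table, assuming non-negative integer keys (the helper names and the trial-division prime test are mine, not the lecture code’s):

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Trial-division primality test (adequate for table-sized numbers).
bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Rehash: grow to a prime at least twice the old size and reinsert
// every element, since each key's slot depends on the table size.
std::vector<std::list<int>> rehash(const std::vector<std::list<int>>& old) {
    std::vector<std::list<int>> bigger(nextPrime(2 * old.size() + 1));
    for (const std::list<int>& chain : old)
        for (int k : chain)
            bigger[k % bigger.size()].push_back(k);
    return bigger;
}
```

Reinserting is O(N), but as argued above the cost amortizes to O(1) per insertion.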
Hash Tables in C++ STL

Hash tables were not part of the original C++
standard library.
Some implementations of the STL provide hash
tables (e.g., SGI’s STL).
hash_set
hash_map
Hash Set in SGI’s STL

```cpp
#include <hash_set>
#include <cstring>
#include <iostream>

// Key equality test
struct eqstr {
    bool operator()(const char* s1, const char* s2) const {
        return strcmp(s1, s2) == 0;
    }
};

// Template parameters: key, hash function, key equality test
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word) {
    hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
        = Set.find(word);
    cout << word << ": "
         << (it != Set.end() ? "present" : "not present")
         << endl;
}

int main() {
    hash_set<const char*, hash<const char*>, eqstr> Set;
    Set.insert("kiwi");
    lookup(Set, "kiwi");
}
```
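SGI’s hash_set was never standardized; since C++11 the standard library provides std::unordered_set (and std::unordered_map) with the same average-case guarantees. A rough modern equivalent of the lookup above, using std::string keys to avoid the pointer-equality issue that eqstr works around (the helper name is mine):

```cpp
#include <string>
#include <unordered_set>

// std::hash<std::string> and operator== are picked up automatically,
// so no hand-written hash functor or equality test is needed.
bool present(const std::unordered_set<std::string>& set,
             const std::string& word) {
    return set.find(word) != set.end();
}
```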
Hash Map in SGI’s STL

```cpp
#include <hash_map>
#include <cstring>
#include <iostream>

// Key equality test
struct eqstr {
    bool operator()(const char* s1, const char* s2) const {
        return strcmp(s1, s2) == 0;
    }
};

// Template parameters: key, data, hash function, key equality test
int main() {
    hash_map<const char*, int, hash<const char*>, eqstr> months;
    months["january"] = 31;
    months["february"] = 28;
    // …
    months["december"] = 31;
    cout << "january -> " << months["january"] << endl;
}
```
Problem with Large Tables

What if hash table is too large to store in
main memory?
Solution: Store hash table on disk.
Minimize disk accesses
But…
Collisions require disk accesses.
Rehashing requires a lot of disk accesses.

Solution: Extendible hashing
Extendible Hashing
- Store the hash table as a tree of depth 1 (a root “directory” plus leaves).
  - Every search takes 2 disk accesses.
  - Insertions require few disk accesses.
- Hash the keys to a long integer (“extendible” keys).
- Use the first few bits of the extended keys as the keys in the root node (“directory”).
- Leaf nodes contain all extended keys starting with the bits of the associated root node key.
Extendible Hashing Example
- Extendible hash table
  - Contains N = 12 data elements
  - First D = 2 bits of each key are used by the root node keys
  - 2^D entries in the directory
  - Each leaf contains up to M = 4 data elements
    - As determined by disk page size
  - Each leaf stores the number of common starting bits (dL)
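The directory lookup can be sketched as a bit extraction, assuming fixed-width keys like the 6-bit keys in this example (function name is mine):

```cpp
// The first D bits of an extended key select the directory entry.
// keyBits is the total key width (6 in the slide's example, D = 2).
int directoryIndex(unsigned key, int keyBits, int D) {
    return (key >> (keyBits - D)) & ((1u << D) - 1);
}
```

For example, key 100100 (0x24) starts with bits 10, so it falls under directory entry 2.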
Extendible Hashing Example (cont’d.)

After inserting
100100

Directory split and
rewritten

Leaves not involved in the split are now pointed to by two adjacent directory entries; these leaves are not accessed during the split.
Extendible Hashing Example (cont’d.)

After inserting
000000

One leaf splits

Only two pointers change in the directory.
Extendible Hashing Analysis

Expected number of leaves is (N/M) · log₂e ≈ (N/M) · 1.44.
The average leaf is ln 2 ≈ 0.69 full.
Same as for B-trees
Expected size of the directory is O(N^(1+1/M)/M).
O(N/M) for large M (elements per leaf)
Hash Table Applications
Maintaining symbol table in compilers
Accessing tree or graph nodes by name
E.g., city names in Google maps
Maintaining a transposition table in games
Remember previous game situations and the
move taken (avoid re-computation)
Dictionary lookups
Spelling checkers
Natural language understanding (word sense)
Summary

Hash tables support fast insert and
search.
O(1) average case performance
Not good if you need to maintain an ordering over the elements
Many applications
Points to Remember – Hash Tables
- Table size should be prime.
- Table size should be much larger than the number of inputs (to keep λ close to 0, or at least below 0.5).
- Tradeoffs between chaining vs. probing
- Collision chances decrease in this order: linear probing, quadratic probing, double hashing.