Cpt Advanced Data Structures by mikesanye


Cpt S 223
Advanced Data Structures
Teddy Yap, Jr.
School of Electrical Engineering and Computer Science
Washington State University
Today’s Lecture
      Hash Tables
Overview

Hashing
  Technique supporting insertion, deletion, and
   search in average-case constant time
  Operations requiring elements to be sorted (e.g.
   find minimum) are not efficiently supported
Hash table ADT
  Implementations
  Analysis
  Applications
Hash Table
 One approach
   Hash table is an array of fixed size (called TableSize).
   Array elements indexed by a key, which is mapped to an array index
    (0…TableSize – 1).
   Mapping (or hash function) h from key to index
   E.g. h(“john”) = 3
   [Figure: hash table array; each slot holds a key and its element value]
Factors Affecting Hash Table Design

Hash function
Table size
  Usually fixed at the start
Collision handling schemes
Hash Table (cont’d.)
 Insert
    T[h(“john”)] = <“john”, 25000>     (hash key: “john”; data record: <“john”, 25000>)
 Delete
    T[h(“john”)] = NULL
 Search
    Return T[h(“john”)]
 What if h(“john”) = h(“joe”)?
Hash Function
                             h(key) ==> hash table index


Mapping from key to array index is called
 a hash function.
  Typically, many-to-one mapping.
  Ideally, different keys map to different indices.
  Distributes keys evenly over table.
Collision occurs when hash function maps
 two keys to same array index.
Hash Function (cont’d.)
Simple hash function
  h(key) = key mod TableSize
  Assumes integer keys
For random keys, h() distributes keys
 evenly over table.
What if TableSize = 100 and keys are
 multiples of 10?
Better if TableSize is a prime number.
  Not too close to powers of 2 or 10
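
The effect of the table size can be checked directly. The sketch below is an illustration only, not code from the lecture; the helper names modHash and slotsUsedByMultiplesOfTen are made up for this example, and keys are assumed to be non-negative integers. It counts how many distinct slots the keys 0, 10, 20, … occupy under h(key) = key mod TableSize.

```cpp
#include <cstddef>
#include <set>

// Simple modular hash for non-negative integer keys.
std::size_t modHash(unsigned long key, std::size_t tableSize) {
    return key % tableSize;
}

// Number of distinct slots occupied by the keys 0, 10, 20, ..., 10*(count-1).
std::size_t slotsUsedByMultiplesOfTen(std::size_t count, std::size_t tableSize) {
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < count; ++i)
        used.insert(modHash(10 * i, tableSize));
    return used.size();
}
```

With TableSize = 100, the 100 keys 0, 10, …, 990 fall into only 10 slots; with the prime TableSize = 101 they occupy 100 distinct slots.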
Hash Function for String Keys

Approach 1
  Add up character ASCII values (0-127) to
   produce integer keys
    E.g. “abcd” = 97 + 98 + 99 + 100 = 394
    h(“abcd”) = 394 mod TableSize
  Small strings may not use all of table
    strlen(s) * 127 < TableSize
  Anagrams will map to the same index
    h(“abcd”) = h(“dbac”)
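
A minimal sketch of Approach 1, assuming plain ASCII input; the function name additiveHash is hypothetical. It shows why anagrams collide: addition ignores character order.

```cpp
#include <cstddef>
#include <string>

// Approach 1: sum the character codes, then reduce modulo the table size.
std::size_t additiveHash(const std::string& s, std::size_t tableSize) {
    std::size_t sum = 0;
    for (unsigned char c : s)
        sum += c;
    return sum % tableSize;
}
```

For a table of size 10007, additiveHash("abcd", 10007) is 394, and "dbac" maps to the same index.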
Hash Function for String Keys

Approach 2
  Treat first 3 characters of string as base-27
   integer (26 letters plus space)
  key = S[0] + (27 * S[1]) + (27^2 * S[2])
  Assumes first 3 characters randomly
   distributed
    Not true for English
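
Approach 2 might be sketched as follows. The name prefixKey is hypothetical, and the code uses raw character codes as the "digits" (an assumption; the slide does not say how letters are mapped to digit values). Strings shorter than three characters simply contribute fewer terms.

```cpp
#include <cstddef>
#include <string>

// Approach 2: key = S[0] + 27*S[1] + 27^2*S[2], using only the first 3 characters.
unsigned long prefixKey(const std::string& s) {
    unsigned long weight = 1;
    unsigned long key = 0;
    for (std::size_t i = 0; i < 3 && i < s.size(); ++i) {
        key += weight * static_cast<unsigned char>(s[i]);
        weight *= 27;
    }
    return key;
}
```

Any two strings sharing their first three characters, such as "abcdef" and "abcxyz", get the same key, which is why a skewed distribution of prefixes hurts this scheme.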
Hash Function for String Keys (cont’d.)

 Approach 3
   Use all N characters of string as
    N-digit base-K integer
   Choose K to be prime number
    larger than number of different
    digits (characters)
       E.g. K = 29, 31, 37
   If L = length of string s, then
       h(s) = ( Σ_{i=0}^{L–1} s[L–i–1] * K^i ) mod TableSize
   Use Horner’s rule to compute
    h(s)
   Limit L for long strings
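
A sketch of Approach 3 using Horner's rule, under the assumption that K = 37 is an acceptable base and that the reduction modulo TableSize may be applied at every step to avoid overflow. The name hornerHash is made up for this example; it does not limit L for long strings.

```cpp
#include <cstddef>
#include <string>

// Approach 3: treat s as a base-K number and evaluate it with Horner's rule.
// Reducing modulo tableSize at each step keeps the intermediate value small.
std::size_t hornerHash(const std::string& s, std::size_t tableSize, std::size_t K = 37) {
    std::size_t h = 0;
    for (unsigned char c : s)
        h = (h * K + c) % tableSize;
    return h;
}
```

Unlike the additive scheme, character order now matters: with TableSize = 10007, "ab" hashes to 97 * 37 + 98 = 3687 while "ba" hashes to 98 * 37 + 97 = 3723.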
Collision Resolution

What happens when h(k1) = h(k2)?
  ==> Collision!
Collision resolution strategies
  Chaining
     Store colliding keys in a linked list at the same hash
      table index
  Open addressing
     Store colliding keys elsewhere in the table
                    Chaining
Collision Resolution Approach #1
Collision Resolution by Chaining
 Hash table T is a
  vector of lists
    Only singly-linked lists
     needed if memory is
     tight
 Key k is stored in list
  at T[h(k)]
 E.g. TableSize = 10
    h(k) = k mod 10
    Insert first 10 perfect
     squares

Insertion sequence = 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
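
The perfect-squares example can be reproduced with a small chained table. This is a sketch under stated assumptions, not the lecture's implementation: the class name ChainedIntTable is hypothetical, keys are non-negative ints, h(k) = k mod TableSize, and the table never rehashes.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Separate chaining: a vector of lists, one list per table index.
class ChainedIntTable {
public:
    explicit ChainedIntTable(std::size_t tableSize) : lists_(tableSize) {}

    bool contains(int k) const {
        const std::list<int>& chain = lists_[slot(k)];
        return std::find(chain.begin(), chain.end(), k) != chain.end();
    }

    // Returns false if k is already present (no duplicates).
    bool insert(int k) {
        if (contains(k)) return false;
        lists_[slot(k)].push_front(k);
        return true;
    }

    bool remove(int k) {
        std::list<int>& chain = lists_[slot(k)];
        std::list<int>::iterator it = std::find(chain.begin(), chain.end(), k);
        if (it == chain.end()) return false;
        chain.erase(it);
        return true;
    }

    std::size_t chainLength(std::size_t index) const { return lists_[index].size(); }

private:
    // Assumes k >= 0.
    std::size_t slot(int k) const { return static_cast<std::size_t>(k) % lists_.size(); }

    std::vector<std::list<int> > lists_;
};
```

After inserting 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 into a table of size 10, indices 1, 4, 6, and 9 each hold two keys, indices 0 and 5 hold one, and indices 2, 3, 7, and 8 stay empty.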
Implementation of Chaining Hash Table




                 Generic hash functions for integer and string keys
Implementation of Chaining Hash Table
(cont’d.)
Implementation of Chaining Hash Table
(cont’d.)



                                   STL algorithm find




                       Each of these operations
                       takes time linear in the length
                       of the list.
Implementation of Chaining Hash Table
(cont’d.)




                                 No duplicates




             Doubles size of table and reinserts current
             elements (more on this later)
Implementation of Chaining Hash Table
(cont’d.)



                              All hash objects must define
                              == and != operators.




               Hash function to handle
               Employee object type
Collision Resolution by Chaining:
Analysis
 Load factor λ of a hash table T
   N = number of elements in T
   M = size of T
   λ = N / M
 Average length of a chain is λ
 Unsuccessful search O(λ)
 Successful search O(λ / 2)
 Ideally, we want λ ≈ 1 (not a function of N)
   I.e. TableSize = number of elements you expect to store
    in the table
        Open Addressing
Collision Resolution Approach #2
Collision Resolution by Open Addressing

 When a collision occurs, look elsewhere in the
  table for an empty slot.
 Advantages over chaining
  No need for additional list structures
  No need to allocate/deallocate memory during
   insertion/deletion (slow)
 Disadvantages
  Slower insertion – may need several attempts to find an
   empty slot
  Table needs to be bigger (than chaining-based table) to
   achieve average-case constant-time performance
      Load factor λ ≤ 0.5
Collision Resolution by Open Addressing

Probe sequence
  Sequence of slots in hash table to search
  h0(x), h1(x), h2(x), …
  Needs to visit each slot exactly once
  Needs to be repeatable (so we can find/delete
   what we’ve inserted)
Hash function
  h_i(x) = (h(x) + f(i)) mod TableSize, for i = 0, 1, 2, …
  f(0) = 0              ==> first try
Linear Probing

f(i) is a linear function of i.
  E.g. f(i) = i
Example: h(x) = x mod TableSize
  h0(89) = (h(89) + f(0)) mod 10 = 9
  h0(18) = (h(18) + f(0)) mod 10 = 8
  h0(49) = (h(49) + f(0)) mod 10 = 9 (X)
  h1(49) = (h(49) + f(1)) mod 10 = 0
Linear Probing Example




    Insert sequence: 89, 18, 49, 58, 69
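
The insert sequence can be traced with a few lines of code. This is a minimal sketch, assuming non-negative keys, f(i) = i, and a sentinel value for empty slots; the names EMPTY_SLOT and linearProbeInsert are invented for the example.

```cpp
#include <vector>

const int EMPTY_SLOT = -1;  // assumed sentinel; valid keys must be non-negative

// Inserts key using linear probing, f(i) = i.
// Returns the slot used, or -1 if the table is full.
int linearProbeInsert(std::vector<int>& table, int key) {
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        int slot = (key % m + i) % m;
        if (table[slot] == EMPTY_SLOT) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

For a table of size 10, the keys 89, 18, 49, 58, 69 end up in slots 9, 8, 0, 1, 2: every key after 18 lands at the end of the same growing cluster.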
Linear Probing: Analysis

Probe sequences can get long.
Primary clustering
  Keys tend to cluster in one part of table.
  Keys that hash into cluster will be added to the
   end of the cluster (making it even bigger).
Linear Probing: Analysis (cont’d.)
 Expected number of probes for insertion or unsuccessful search:
     (1/2) (1 + 1/(1 – λ)^2)
 Expected number of probes for successful search:
     (1/2) (1 + 1/(1 – λ))
 Example (λ = 0.5)
   Insert/unsuccessful search: 2.5 probes
   Successful search: 1.5 probes
 Example (λ = 0.9)
   Insert/unsuccessful search: 50.5 probes
   Successful search: 5.5 probes
Random Probing: Analysis
 Random probing does not suffer from
  clustering.
 Expected number of probes for insertion or
  unsuccessful search:

 Example
   = 0.5: 1.4 probes
   = 0.9: 2.6 probes
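
The probe counts quoted for linear and random probing follow from the standard textbook estimates: (1/2)(1 + 1/(1 – λ)^2) and (1/2)(1 + 1/(1 – λ)) for linear probing, and (1/λ) ln(1/(1 – λ)) for the random-probing figures. The sketch below just evaluates them; the function names are hypothetical, and each estimate assumes 0 < λ < 1.

```cpp
#include <cmath>

// Linear probing, insertion or unsuccessful search: (1/2)(1 + 1/(1 - lambda)^2).
double linearProbesUnsuccessful(double lambda) {
    return 0.5 * (1.0 + 1.0 / ((1.0 - lambda) * (1.0 - lambda)));
}

// Linear probing, successful search: (1/2)(1 + 1/(1 - lambda)).
double linearProbesSuccessful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

// Random probing: (1/lambda) ln(1/(1 - lambda)),
// the estimate behind the 1.4 and 2.6 probe figures.
double randomProbesLogEstimate(double lambda) {
    return std::log(1.0 / (1.0 - lambda)) / lambda;
}
```

At λ = 0.5 these give 2.5, 1.5, and about 1.39 probes; at λ = 0.9 they give 50.5, 5.5, and about 2.56 probes.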
Linear vs. Random Probing
 [Figure: number of probes vs. load factor λ for linear probing and random probing; U – unsuccessful search, S – successful search, I – insert]
Quadratic Probing

Avoids primary clustering
f(i) is quadratic in i
  E.g., f(i) = i^2
Example
  h0(58) = (h(58) + f(0)) mod 10 = 8 (X)
  h1(58) = (h(58) + f(1)) mod 10 = 9 (X)
  h2(58) = (h(58) + f(2)) mod 10 = 2
Quadratic Probing Example




    Insert sequence: 89, 18, 49, 58, 69
    Question: Delete 49, then find 49. Is there a problem?
Quadratic Probing: Analysis
Difficult to analyze
Theorem 5.1
  A new element can always be inserted into a
   table that is at least half empty, provided
   TableSize is prime.
Otherwise, may never find an empty slot,
 even if one exists.
Ensure table never gets half full.
  If close, then expand it.
Quadratic Probing (cont’d.)

Only M (TableSize) different probe
 sequences
  May cause “secondary clustering”
Deletion
  Emptying slots can break probe sequences
  Lazy deletion
    Differentiate between empty and deleted slot
    Skip deleted slots
    Slows operations (effectively increases λ)
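
Lazy deletion with quadratic probing can be sketched as below. This is an illustrative class, not the lecture's implementation: the name QuadraticProbeTable and its members are hypothetical, keys are assumed non-negative, f(i) = i^2, and the table never resizes. Deleted slots are marked rather than emptied, so searches keep probing past them.

```cpp
#include <cstddef>
#include <vector>

class QuadraticProbeTable {
    enum State { EMPTY, ACTIVE, DELETED };
    struct Cell {
        int key;
        State state;
        Cell() : key(0), state(EMPTY) {}
    };

public:
    explicit QuadraticProbeTable(std::size_t tableSize) : cells_(tableSize) {}

    // Returns false on a duplicate or if no empty slot is found.
    bool insert(int key) {
        int pos = findPos(key);
        if (pos < 0 || cells_[pos].state == ACTIVE) return false;
        cells_[pos].key = key;
        cells_[pos].state = ACTIVE;
        return true;
    }

    bool contains(int key) const { return slotOf(key) >= 0; }

    // Lazy deletion: mark the slot DELETED instead of emptying it.
    bool remove(int key) {
        int pos = slotOf(key);
        if (pos < 0) return false;
        cells_[pos].state = DELETED;
        return true;
    }

    // Slot holding key, or -1 if key is not stored.
    int slotOf(int key) const {
        int pos = findPos(key);
        return (pos >= 0 && cells_[pos].state == ACTIVE) ? pos : -1;
    }

private:
    // Follows the probe sequence h_i = (key mod M + i^2) mod M, skipping DELETED
    // slots. Returns the slot holding key, else the first EMPTY slot, else -1.
    int findPos(int key) const {
        const int m = static_cast<int>(cells_.size());
        for (int i = 0; i < m; ++i) {
            int pos = (key % m + i * i) % m;
            if (cells_[pos].state == EMPTY) return pos;
            if (cells_[pos].state == ACTIVE && cells_[pos].key == key) return pos;
        }
        return -1;
    }

    std::vector<Cell> cells_;
};
```

With TableSize = 10 and the insert sequence 89, 18, 49, 58, 69, key 49 lands in slot 0 and 69 in slot 3 after probing slots 9 and 0. Removing 49 lazily leaves slot 0 marked DELETED, so a later search for 69 still gets past it; emptying the slot outright would cut that probe sequence short.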
Quadratic Probing: Implementation
Quadratic Probing: Implementation
(cont’d.)
                        Lazy deletion
Quadratic Probing: Implementation
(cont’d.)



                         Ensures table size is prime
Quadratic Probing: Implementation
(cont’d.)
                       Find




                              Skip DELETED;
                              No duplicates
                              Quadratic probe sequence
Quadratic Probing: Implementation
(cont’d.)
                        Insert


                        No duplicates




                        Remove




                        No deallocation needed
Double Hashing
Combine two different hash functions
f(i) = i * h2(x)
Good choices for h2(x)?
  Should never evaluate to 0
  h2(x) = R – (x mod R)
    R is a prime number less than TableSize
Previous example with R = 7
  h0(49) = (h(49) + f(0)) mod 10 = 9 (X)
  h1(49) = (h(49) + f(1)) mod 10 = (9 + (7 – 49 mod 7)) mod 10 = 6
Double Hashing Example
Double Hashing: Analysis

Imperative that TableSize is prime.
  E.g., insert 23 into previous table
Empirical tests show double hashing close
 to random hashing.
Extra hash function takes extra time to
 compute.
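
The R = 7 example, and the reason a prime TableSize matters, can be checked with a short sketch. The function names are hypothetical; keys are assumed non-negative, with h(x) = x mod TableSize and h2(x) = R – (x mod R).

```cpp
#include <set>

// Second hash function: never evaluates to 0 for r >= 1.
int secondHash(int x, int r) {
    return r - (x % r);
}

// i-th probe position: (h(x) + i * h2(x)) mod tableSize.
int doubleHashProbe(int x, int i, int tableSize, int r) {
    return (x % tableSize + i * secondHash(x, r)) % tableSize;
}

// Number of distinct slots reached by the first tableSize probes for x.
int distinctSlotsProbed(int x, int tableSize, int r) {
    std::set<int> seen;
    for (int i = 0; i < tableSize; ++i)
        seen.insert(doubleHashProbe(x, i, tableSize, r));
    return static_cast<int>(seen.size());
}
```

For key 49 with TableSize = 10 and R = 7, the first two probes are slots 9 and 6. For key 23 the step is h2(23) = 5, so in a table of size 10 the probes alternate between slots 3 and 8 and never reach the rest of the table; with a prime size such as 11 the same key reaches all 11 slots.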
Rehashing

Increase the size of the hash table when
 load factor too high
Typically expand the table to twice its size
 (but still prime)
Reinsert existing elements into new hash
 table
  Rehashing Example


  [Figure: rehashing example]
    Original table: h(x) = x mod 7, λ = 0.57
    After inserting 23: λ = 0.71
    After rehashing: h(x) = x mod 17, λ = 0.29
Rehashing Analysis

Rehashing takes O(N) time.
But happens infrequently
Specifically
  Must have been N/2 insertions since last
   rehash
  Amortizing the O(N) cost over the N/2 prior
   insertions yields only constant additional time
   per insertion
Rehashing Implementation

When to rehash
  When table is half full (λ = 0.5).
  When an insertion fails.
  When load factor reaches some threshold.
Works for chaining and open addressing.
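
A rehash step might look like the sketch below, shown for a linear-probing table of non-negative ints. Everything here is an assumption made for illustration: the names VACANT, isPrime, nextPrime, and rehashLinear are invented, and the new size is taken to be the first prime at or above twice the old size.

```cpp
#include <cstddef>
#include <vector>

const int VACANT = -1;  // assumed sentinel for an empty slot

bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime greater than or equal to n.
std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Builds a table about twice as large (prime size) and reinserts every key
// with linear probing. Takes O(N) time for N stored keys.
std::vector<int> rehashLinear(const std::vector<int>& old) {
    std::vector<int> bigger(nextPrime(2 * old.size()), VACANT);
    for (std::size_t i = 0; i < old.size(); ++i) {
        if (old[i] == VACANT) continue;
        std::size_t slot = static_cast<std::size_t>(old[i]) % bigger.size();
        while (bigger[slot] != VACANT)
            slot = (slot + 1) % bigger.size();
        bigger[slot] = old[i];
    }
    return bigger;
}
```

For a table of size 7 the new size is nextPrime(14) = 17. With five illustrative keys (6, 15, 23, 24, 13) the load factor drops from 5/7 ≈ 0.71 to 5/17 ≈ 0.29.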
Rehashing for Chaining
Rehashing for Quadratic Probing
Hash Tables in C++ STL

Hash tables were not part of the C++
 Standard Library before C++11, which added
 std::unordered_set and std::unordered_map.
Some implementations of STL have hash
 tables (e.g., SGI’s STL).
  hash_set
  hash_map
Hash Set in SGI’s STL
#include <cstring>    // strcmp
#include <iostream>   // cout, endl
#include <hash_set>   // SGI STL extension, not a standard header
using namespace std;

// Key equality test: compares C strings by content, not by pointer.
struct eqstr {
  bool operator()(const char* s1, const char* s2) const {
    return strcmp(s1, s2) == 0;
  }
};

// Template arguments: key type, hash function, key equality test.
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word) {
  hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
    = Set.find(word);
  cout << word << ": "
       << (it != Set.end() ? "present" : "not present")
       << endl;
}

int main() {
  hash_set<const char*, hash<const char*>, eqstr> Set;
  Set.insert("kiwi");
  lookup(Set, "kiwi");
}
Hash Map in SGI’s STL
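#include <cstring>    // strcmp
#include <iostream>   // cout, endl
#include <hash_map>   // SGI STL extension, not a standard header
using namespace std;

// Key equality test: compares C strings by content, not by pointer.
struct eqstr {
  bool operator()(const char* s1, const char* s2) const {
    return strcmp(s1, s2) == 0;
  }
};

int main() {
  // Template arguments: key type, data type, hash function, key equality test.
  hash_map<const char*, int, hash<const char*>, eqstr> months;
  months["january"] = 31;
  months["february"] = 28;
  // …
  months["december"] = 31;
  cout << "january -> " << months["january"] << endl;
}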
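
For comparison, the same two lookups can be written with the standard unordered containers available since C++11. This is a sketch rather than a drop-in replacement for the SGI code: it stores std::string keys instead of const char*, so the default hash and equality work on string contents and no eqstr functor is needed. The helper names makeMonths and isPresent are made up for this example.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hash map from month name to number of days (only three months filled in).
std::unordered_map<std::string, int> makeMonths() {
    std::unordered_map<std::string, int> months;
    months["january"] = 31;
    months["february"] = 28;
    months["december"] = 31;
    return months;
}

// Hash set membership test.
bool isPresent(const std::unordered_set<std::string>& words, const std::string& word) {
    return words.count(word) > 0;
}
```

Calling makeMonths().at("january") yields 31, and isPresent reports whether a word such as "kiwi" was inserted into the set.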
Problem with Large Tables

What if hash table is too large to store in
 main memory?
Solution: Store hash table on disk.
  Minimize disk accesses
But…
  Collisions require disk accesses.
  Rehashing requires a lot of disk accesses.

               Solution: Extendible hashing
Extendible Hashing
 Store hash table in a depth–1 tree
  Every search takes 2 disk accesses.
  Insertions require few disk accesses.
 Hash the keys to a long integer (“extendible”)
 Use first few bits of extended keys as the keys in
  the root node (“directory”)
 Leaf nodes contain all extended keys starting
  with the bits in the associated root node key.
Extendible Hashing Example
 Extendible hash table
 Contains N = 12 data
  elements
 First D = 2 bits of key
  used by root node keys
    2^D entries in directory
 Each leaf contains up to
  M = 4 data elements
    As determined by disk
     page size
 Each leaf stores number
  of common starting bits
  (d_L)
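
Selecting a directory entry from the leading bits of a key can be sketched as follows. The function names are hypothetical, and keys are assumed to be fixed-width bit strings of keyBits bits (6 bits in this example) with 0 < d ≤ keyBits < 32.

```cpp
#include <cstdint>

// Directory entry for key: the value of its first d bits.
unsigned directoryIndex(std::uint32_t key, unsigned keyBits, unsigned d) {
    return key >> (keyBits - d);
}

// A directory that uses d bits has 2^d entries.
unsigned directorySize(unsigned d) {
    return 1u << d;
}
```

The 6-bit key 100100 (decimal 36) maps to entry 10 (index 2) when D = 2, and to entry 100 (index 4) once the directory has grown to D = 3.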
  Extendible Hashing Example (cont’d.)




After inserting
100100


Directory split and
rewritten



     Leaves not involved in split now pointed to by two adjacent directory entries.
     These leaves are not accessed.
 Extendible Hashing Example (cont’d.)




After inserting
000000


One leaf splits


Only two pointers
change in the directory
Extendible Hashing Analysis

Expected number of leaves is (N/M) *
 log_2 e ≈ (N/M) * 1.44.
Average leaf is (ln 2) = 0.69 full.
  Same as for B-trees.
Expected size of directory is O(N^(1+1/M) / M).
  O(N/M) for large M (elements per leaf)
Hash Table Applications
Maintaining symbol table in compilers
Accessing tree or graph nodes by name
  E.g., city names in Google maps
Maintaining a transposition table in games
  Remember previous game situations and the
   move taken (avoid re-computation)
Dictionary lookups
  Spelling checkers
  Natural language understanding (word sense)
Summary

Hash tables support fast insert and
 search.
  O(1) average case performance
  Deletion possible, but degrades performance
Not good if need to maintain ordering over
 elements
Many applications
Points to Remember – Hash Tables
 Table size prime
 Table size much larger than number of inputs (to
  maintain λ closer to 0 or < 0.5)
 Tradeoffs between chaining vs. probing
 Collision chances decrease in this order: linear
  probing, quadratic probing, {random probing,
  double hashing}
 Rehashing required to resize hash table at time
  when λ exceeds 0.5
 Good for searching. Not good if there is some
  order implied by data.

								