					         Chapter 9.


The Map ADT & the hash table




                                                              map




• A map models a searchable collection of key-value entries
• The main operations of a map are for searching, inserting, and
  deleting items
• Multiple entries with the same key are not allowed
• Applications:
   – address book (key=name, value = address)
   – student-record database (key=student id, value = student
     record)




The map ADT requires that each key is unique, so the
association of keys to values defines a mapping






public interface Map<K,V>{
   public interface Entry<K,V>{   //see the Entry interface below
            }

/**return the number of entries in the map*/
public int size();

/**returns true if this map contains no key-value mappings*/
public boolean isEmpty();

/**Returns the value to which the specified key is mapped, or null
  *if the map has no entry with this key*/
public V get(K key);

/**associate value with specified key in map, return old
  *value, or null if there was no entry with this key*/
public V put(K key, V value);

/**Removes the mapping for a key from this map if present, returning
  *its value (or null if there was none)*/
public V remove(K key);

/**return a set containing all the keys stored in map*/
public Set<K> keys();

/**return a set containing all the values associated
  *with the keys stored*/
public Set<V> values();

/**return a set containing all the key-value entries*/
public Set<Map.Entry<K,V>> entries();

}




public interface Entry<K,V>{

/**Compares specified object with this entry for
equality*/
public boolean equals(Object o);

/**Returns the key corresponding to this entry*/
public K getKey();

/**Returns the value corresponding to this entry*/
public V getValue();

/**Returns the hash code value for this map entry*/
public int hashCode();

/**Replaces the value corresponding to this entry with the
  *specified value*/
public V setValue(V value);


}
example




    Operation   Output     Map

    isEmpty()   true       Ø
    put(5,A)    null       (5,A)
    put(7,B)    null       (5,A),(7,B)
    put(2,C)    null       (5,A),(7,B),(2,C)
    put(8,D)    null       (5,A),(7,B),(2,C),(8,D)
    put(2,E)    C          (5,A),(7,B),(2,E),(8,D)
    get(7)      B          (5,A),(7,B),(2,E),(8,D)
    get(4)      null       (5,A),(7,B),(2,E),(8,D)
    get(2)      E          (5,A),(7,B),(2,E),(8,D)
    size()      4          (5,A),(7,B),(2,E),(8,D)
    remove(5)   A          (7,B),(2,E),(8,D)
    remove(2)   E          (7,B),(8,D)
    get(2)      null       (7,B),(8,D)
    isEmpty()   false      (7,B),(8,D)
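
The trace above can be replayed against a real map. This is a small sketch that uses java.util.HashMap as a stand-in for the Map ADT (an assumption: the slides define their own interface, but HashMap's put, get and remove follow the same return conventions). The class name MapTrace is made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapTrace {
    // Replays the operations from the table and collects each output in order.
    static List<Object> run() {
        Map<Integer, String> m = new HashMap<>();
        List<Object> out = new ArrayList<>();
        out.add(m.isEmpty());     // true
        out.add(m.put(5, "A"));   // null (no previous entry for key 5)
        out.add(m.put(7, "B"));   // null
        out.add(m.put(2, "C"));   // null
        out.add(m.put(8, "D"));   // null
        out.add(m.put(2, "E"));   // C (old value is returned and replaced)
        out.add(m.get(7));        // B
        out.add(m.get(4));        // null (no such key)
        out.add(m.get(2));        // E
        out.add(m.size());        // 4
        out.add(m.remove(5));     // A
        out.add(m.remove(2));     // E
        out.add(m.get(2));        // null
        out.add(m.isEmpty());     // false
        return out;
    }
}
```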


           A Simple List-Based Map

 • We can implement a map using an unsorted list
   – We store the items of the map in a list S (based on a
     doubly-linked list), in arbitrary order



[Figure: doubly-linked list S, from head to tail, holding the entries (9,c), (6,g), (5,a), (8,r)]
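
A minimal sketch of such a list-based map, assuming java.util.LinkedList (a doubly-linked list) as the underlying list S. The class name ListMap and its private Entry class are made up for illustration. Every operation scans the list from the head, which is where the linear running time comes from.

```java
import java.util.Iterator;
import java.util.LinkedList;

class ListMap<K, V> {
    // One key-value pair stored in the list.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>> entries = new LinkedList<>();

    // Scan the list for the key; null if it is not there.
    V get(K key) {
        for (Entry<K, V> e : entries) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    // Replace the value if the key exists, otherwise append a new entry.
    V put(K key, V value) {
        for (Entry<K, V> e : entries) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        entries.addLast(new Entry<>(key, value));
        return null;
    }

    // Unlink the entry with this key, returning its value (or null).
    V remove(K key) {
        Iterator<Entry<K, V>> it = entries.iterator();
        while (it.hasNext()) {
            Entry<K, V> e = it.next();
            if (e.key.equals(key)) {
                it.remove();
                return e.value;
            }
        }
        return null;
    }

    int size() { return entries.size(); }

    boolean isEmpty() { return entries.isEmpty(); }
}
```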



         Performance of a List-Based Map

• Performance:
   – put, get and remove take O(n) time since in the worst
      case (the item is not found) we traverse the entire
      sequence to look for an item with the given key
• The unsorted list implementation is effective only for maps
  of small size.
• All of the fundamental operations take O(n) time.
• Would like something faster…




hash table

A hash table consists of two major components: a bucket array and a hash function.
Performance is expected to be O(1).

bucket array




• A bucket array is an array A of size N
• A[i] is a bucket, i.e. a collection of <key,value> pairs
• N is the capacity of A
• <k,e> is inserted in A[k]
     • if keys are well distributed between 0 .. N-1
• if keys are unique integers in range 0 .. N-1
  then each bucket holds at most one entry.
     • consequently O(1) for get, insert, delete
• downside: space is proportional to N
     • if N is much larger than n (number of entries) we waste space
• downside: keys must be integers in range 0 .. N-1
     • this may not be the case (think matriculation number)
    0      1      2      3      4      5      6      7      8      9      10

         (1,D)         (3,C)                (6,A)  (7,Q)
                       (3,F)                (6,C)
                       (3,Z)




Bucket array of size 11 for the entries (1,D), (3,C), (3,F), (3,Z),
(6,A), (6,C) and (7,Q)

If the hashed keys are unique integers in the range [0,10] then each bucket
holds at most one entry. Otherwise we have a collision and need
to deal with it.
collision


            When two different entries map to the same
            bucket we have a collision.
            It’s good to avoid collisions.
hash function


A hash function maps each key to an integer
           in the range [0,N-1]

           Given entry <k,e> … h(k) is the index into the bucket array



                       store entry <k,e> in A[h(k)]

h is a good hash function if
• h maps keys so as to minimise collisions
• h is easy to compute/program
• h is fast to compute


                                    h(k) has two actions
                                    1. map k to a hash code
                                    2. map hash code into range [0,N-1]
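
The two actions can be sketched in a couple of lines. This assumes the hash code comes from Java's built-in hashCode() and that compression is a plain modulus. The class and method names (HashIndex, index) are made up for illustration.

```java
class HashIndex {
    // Action 1: key.hashCode() maps the key to an int hash code (it may be negative).
    // Action 2: floorMod maps that hash code into the range [0, n-1].
    static int index(Object key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }
}
```

Math.floorMod is used rather than % because % can return a negative result when the hash code is negative.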
hash codes in java



                     Every Java object has a hashCode() method that returns an int
                     But care should be taken as this might not be “good”
a bit of maths … that you know (af2)



Let A and B be sets
• A function is
    • a mapping from elements of A
    • to elements of B
• and is a subset of AxB
    • i.e. can be defined by a set of tuples!




                  $f : A \rightarrow B$

 $\forall x\,[\,x \in A \rightarrow \exists y\,[\,y \in B \wedge \langle x, y \rangle \in f\,]\,]$

$f : A \rightarrow B$

• A is the domain
• B is the codomain
• f(x) = y
     • y is the image of x
     • x is a preimage of y
• There may be more than one preimage of y
• There is only one image of x
     • otherwise not a function
• There may be an element in the codomain with no preimage
• Range of f is the set of all images of A
     • the set of all results
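Injection (aka one-to-one, 1-1)


    $\forall x \forall y\,[\,(f(x) = f(y)) \rightarrow x = y\,]$

[Figure: two mapping diagrams on the domain {a, b, c, d}. Left, into {u, v, w, x, y, z}: an injection. Right, into {x, y, z}: not an injection.]

                                  If an injection then preimages are unique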
                             Ideally we want our hash function to
                             • be injective (no collisions)
                             • have a small codomain and range
                                  • may need to compress range
back to ads2
                                                    hash code & hash function


Just to clear this up (but let’s not make too big a deal about it) …




       We assume hash code is an integer in the codomain
       Hash function brings hash codes into the range [0,N-1]




         We will examine just a few hash functions, acting on strings
Polynomial hash codes                                 hash code & hash function




      Assume we have a key s that is a character String



          Here is a really dumb hash code


                   public int dumbHash(String s){
                     int code = 0;
                     for (int i=0;i<s.length();i++) code = code + s.charAt(i);
                     return code;
                   }

                                 What would we get for
                                    • dumbHash("spot")
                                    • dumbHash("pots")
                                    • dumbHash("tops")
                                    • dumbHash("post")
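
A quick way to check: the four strings are anagrams, so summing their character codes gives the same total each time (115 + 112 + 111 + 116 = 454 for s, p, o, t). The sketch below wraps dumbHash in a class (DumbHash, a made-up name) so it can be run on its own.

```java
class DumbHash {
    // Adds up the character codes; the order of the characters is ignored.
    static int dumbHash(String s) {
        int code = 0;
        for (int i = 0; i < s.length(); i++) code = code + s.charAt(i);
        return code;
    }
}
```

All four calls return 454, so four different keys collide on a single hash code.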
    Take into consideration the “position” of elements of the key



        $h = s_0 a^0 + s_1 a^1 + s_2 a^2 + \dots + s_{n-1} a^{n-1}$




      So, this doesn’t look any different from an every-day number
      It’s to the base   a and the coefficients are the components of the key




                        Good values for   a   appear to be 33, 37, 39, 41
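
A sketch of a polynomial hash code, evaluated with Horner's rule. It assumes 64-bit long arithmetic that is allowed to overflow silently, and it takes the base a as a parameter. The names PolyHash and polyHash are made up for illustration. The first character gets the lowest power of a, so the loop runs from the last character back to the first.

```java
class PolyHash {
    // Computes s[0]*a^0 + s[1]*a^1 + ... + s[n-1]*a^(n-1) using Horner's rule.
    // Overflow wraps around, which is acceptable for a hash code.
    static long polyHash(String s, long a) {
        long h = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            h = h * a + s.charAt(i);
        }
        return h;
    }
}
```

Unlike dumbHash, position now matters: with a = 33, "spot" and "pots" get different hash codes.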




          Small scale experiments on unix dictionary
          • a = 33
          • 25104 words/strings
          • minimum hash value -9165468936209580338
          • maximum hash value 8952279818009261254
          • collision count 7




 Yikes! Look at that range!!!!
Cyclic shift hash codes                         hash code & hash function




                          Start moving bits around




 Thanks to Arash Partow
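
The code from these slides did not survive, so here is a sketch of one common cyclic shift hash code, not necessarily the variant that was shown. It assumes a 5-bit left rotation of a 32-bit int before adding each character. The class name CyclicShiftHash is made up for illustration.

```java
class CyclicShiftHash {
    // Rotates the running hash left by 5 bits, then adds the next character.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = Integer.rotateLeft(h, 5); // same as (h << 5) | (h >>> 27)
            h += s.charAt(i);
        }
        return h;
    }
}
```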
Compression Functions                             hash code & hash function




    So, you think you’ve found something that produces a good hash code …
             How do we compress its range to fit into our machine?



    Assume we want to limit storage to buckets in range [0,N-1]




             The division method




                            $i = \mathrm{hash}(key) \bmod N$

             int i = (int) Math.floorMod(hash(s), (long) N);  // floorMod keeps i in [0,N-1] even if hash(s) is negative
             S[i] = s;

             NOTE: keep N prime



                                           … ideally, but there may be collisions 



          The multiply add and divide (MAD) method



             $i = (a \cdot \mathrm{hash}(key) + b) \bmod N$

                 • N is prime
                 • a > 1 is a scaling factor
                 • b ≥ 0 is a shift
                 • a % N ≠ 0
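
Both compression methods fit in a few lines. This is a sketch that assumes the hash code is a long (it may be negative) and that the caller picks a, b and a prime N. The names Compression, division and mad are made up for illustration.

```java
class Compression {
    // Division method: hash code mod N, always in [0, n-1].
    static int division(long hashCode, int n) {
        return (int) Math.floorMod(hashCode, (long) n);
    }

    // MAD method: (a * hashCode + b) mod N.
    // The hash code is reduced mod n first, which gives the same result
    // and keeps a * h small enough not to overflow for moderate a and n.
    static int mad(long hashCode, long a, long b, int n) {
        long h = Math.floorMod(hashCode, (long) n);
        return (int) Math.floorMod(a * h + b, (long) n);
    }
}
```

For example, with N = 7, a = 3 and b = 1, a hash code of 10 is sent to bucket (3·10 + 1) mod 7 = 3.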
Collision handling schemes



                    Separate Chaining




         bucket[i] is a small map
         • implemented as a list




                                    bucket[i] should be a short list
                                    It may be sorted
                                    It might be something other than a list




          Let N be the number of buckets and n the number of entries stored
                             load factor     is   n/N



            Upside:
            • simple



           Downside:
           • requires auxiliary data structures (to resolve collisions)
           • this may put additional burden on space
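
A minimal sketch of separate chaining, assuming each bucket is an unsorted singly linked chain and the index is hashCode() mod N. The class name ChainedMap and its Node class are made up for illustration, and there is no resizing.

```java
class ChainedMap<K, V> {
    // One node of a bucket's chain.
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedMap(int capacity) {
        buckets = (Node<K, V>[]) new Node[capacity];
    }

    private int index(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Walk the chain in the key's bucket.
    V get(K key) {
        for (Node<K, V> n = buckets[index(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }

    // Update in place if the key is present, else add at the head of the chain.
    V put(K key, V value) {
        int i = index(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        buckets[i] = new Node<>(key, value, buckets[i]);
        size++;
        return null;
    }

    // Unlink the node with this key from its chain.
    V remove(K key) {
        int i = index(key);
        Node<K, V> prev = null;
        for (Node<K, V> n = buckets[i]; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) buckets[i] = n.next;
                else prev.next = n.next;
                size--;
                return n.value;
            }
        }
        return null;
    }

    int size() { return size; }

    // Load factor n/N: entries stored divided by number of buckets.
    double loadFactor() { return (double) size / buckets.length; }
}
```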
                             Open Addressing



Linear Probing



 i = hash(key);
 bucket[i] != null;
 collision!




                      Try next bucket[(i+1) % N]



                               Try next bucket[(i+2) % N]




                                             Try next bucket[(i+N-1) % N]




       What happens with get(key)?



                   1.   i = hash(key);
                   2.   bucket[i] == key … found, return 
                   3.   bucket[i] == null … not found, return 
                   4.   bucket[i] != null and bucket[i] != key
                           i = (i+1) % N
                           goto 2




   “Linear Probing” gets its name because accessing a bucket is viewed as a probe


  What happens with remove(key)?



                 We have a special marker “removed”


                        1. i = hash(key);
                        2. bucket[i] == key … found 
                              bucket[i] = “removed”
                              return
                        3. bucket[i] == null … not found 
                              return
                        4. bucket[i] != null and bucket[i] != key
                              i = (i+1) % N
                              goto 2
  What happens with put(key,e)?


                 1. Free location j = -1;
                 2. i = hash(key);
                 3. bucket[i] == key … found
                       update bucket[i]
                       return
                 4. bucket[i] == “removed”
                       j = i;
                       i = (i+1) % N
                       goto 3
                 5. bucket[i] != null && bucket[i] != key
                       i = (i+1) % N
                       goto 3
                 6. bucket[i] == null // search stops
                       if (j > -1) bucket[j] = <key,e>
                       if (j == -1) bucket[i] = <key,e>
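
The get, remove and put steps for linear probing can be put together as a small Java class. This is a sketch under a few assumptions: a fixed capacity with no resizing, hash(key) taken as hashCode() mod N, and a probe counter so the loops stop after N probes on a full table (the goto loops in the steps have no such guard). It also remembers the first “removed” slot it passes rather than the last, which keeps later searches shorter. The class name LinearProbingMap is made up for illustration.

```java
@SuppressWarnings("unchecked")
class LinearProbingMap<K, V> {
    private static final Object REMOVED = new Object(); // the special "removed" marker
    private final Object[] keys;
    private final Object[] values;
    private int size = 0;

    LinearProbingMap(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    V get(K key) {
        int i = hash(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return null;                 // empty bucket: not found
            if (keys[i] != REMOVED && keys[i].equals(key)) return (V) values[i];
            i = (i + 1) % keys.length;                        // probe the next bucket
        }
        return null;                                          // every bucket was probed
    }

    V put(K key, V value) {
        int free = -1;                                        // first reusable slot seen
        int i = hash(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {                            // search stops here
                if (free == -1) free = i;
                break;
            }
            if (keys[i] == REMOVED) {
                if (free == -1) free = i;
            } else if (keys[i].equals(key)) {                 // found: update in place
                V old = (V) values[i];
                values[i] = value;
                return old;
            }
            i = (i + 1) % keys.length;
        }
        if (free == -1) throw new IllegalStateException("table is full");
        keys[free] = key;
        values[free] = value;
        size++;
        return null;
    }

    V remove(K key) {
        int i = hash(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return null;                 // not found
            if (keys[i] != REMOVED && keys[i].equals(key)) {
                V old = (V) values[i];
                keys[i] = REMOVED;                            // leave a marker, not null
                values[i] = null;
                size--;
                return old;
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }

    int size() { return size; }
}
```

The marker matters: if remove simply set the bucket back to null, a later get for a key stored further along the same probe run would stop too early and report it missing.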


  So?



         Advantages
         • saves space as bucket[i] is only a bucket for a single entry
         • that is, no additional data structures


                 Disadvantages
                 • removals are complicated
                 • put is complicated
                 • if there are collisions entries might clump together
                 • search can then degenerate from O(1) down to O(N)




           We might use linear probing when memory is tight and we
           want FAST access
Quadratic Probing




         Quadratic probing


                             iteratively try ….

                             • bucket[(i + f(j)) % N]

                             where

                             • i = hash(key)
                             • j = 0,1,2,…
                             • f(j) = j*j
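
To see which buckets get tried, here is a sketch that lists the first few probes for a given starting index i and capacity N. The names QuadraticProbe and sequence are made up for illustration.

```java
class QuadraticProbe {
    // Returns the first `count` buckets tried: (i + j*j) % n for j = 0, 1, 2, ...
    static int[] sequence(int i, int n, int count) {
        int[] out = new int[count];
        for (int j = 0; j < count; j++) {
            out[j] = (int) ((i + (long) j * j) % n); // long avoids overflow in j*j
        }
        return out;
    }
}
```

With i = 3 and N = 11 the first five probes are buckets 3, 4, 7, 1, 8, so the probes spread out instead of marching through neighbouring buckets.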
Double Hashing




    We have a secondary hash function (call it g)



                 i = hash(key) and collision at bucket[i]




                          Try bucket[(i + j*g(key)) % N] for j = 1, 2, …

                          Where g(key) = q - (key % q)
                          Where q is a prime number < N
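
A sketch of the probe sequence for double hashing with integer keys. It assumes the primary hash is simply key mod N and that the j-th probe steps j times g(key) away from the starting bucket. The names DoubleHashProbe, g and sequence are made up for illustration.

```java
class DoubleHashProbe {
    // Secondary hash: q - (key mod q), always in the range [1, q], so never zero.
    static int g(int key, int q) {
        return q - Math.floorMod(key, q);
    }

    // Returns the first `count` buckets tried: (i + j*g(key)) % n for j = 0, 1, 2, ...
    static int[] sequence(int key, int n, int q, int count) {
        int i = Math.floorMod(key, n);   // assumed primary hash
        int step = g(key, q);
        int[] out = new int[count];
        for (int j = 0; j < count; j++) {
            out[j] = (int) ((i + (long) j * step) % n);
        }
        return out;
    }
}
```

For key = 18, N = 13 and q = 7: i = 5 and g(18) = 7 - 4 = 3, so the buckets tried are 5, 8, 11, 1, 4. Two keys that collide at bucket i will usually have different step sizes, so they do not follow each other around the table.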
So?




 Open addressing saves space, but is complicated, and may be slower



In experiments chaining is competitive or faster, depending on load factor




      If memory is not an issue:
      • we recommend chaining with a low load factor

				